Modifs

[ancetre.git] / closedgenomes.tex
diff --git a/closedgenomes.tex b/closedgenomes.tex

index d468f102bc25e700ba30a8981f56bcafdc561323..4b7ecd4ab2948f06645a141f33769f6192a10681 100644 (file)
--- a/closedgenomes.tex
+++ b/closedgenomes.tex
@@ -1,8 +1,8 @@
-The approache is further based on the ability to decide how far is each 
+The approach is further based on the ability to decide how far is each 
  genome from each others. To achieve this, we combine XXX metrics which are 
  detailed in this part.
  
-\subsection{Core SNP based metric} 
+\subsection{Core SNP based Metric} 
  Due to the definition of the core genome, for each element $\dot{x}$ 
  in this set, there is a gene $x \in \dot{x}$ in each genome. 
  Let us consider a class 
@@ -14,12 +14,12 @@ either a high metric or a very low metric}
  %1/ On SNPs of the core genome strict
  All the $y$ are thus aligned 
  thanks to a global alignment tool. The SNPs may thus be extracted.
-For each genome, one can thus compute the vector of boolean values 
-memorizing at index $i$ wether the SNP $i$ is present in one of its gene 
-(postive value) or  not (null value). 
+For each genome, one can thus compute the vector of Boolean values 
+memorizing at index $i$ whether the SNP $i$ is present in one of its genes 
+(positive value) or not (null value). 
  A Hamming distance between two vectors allows to build the distance 
  between two genes. 
-This metric is further refered as to $m_S$.
+This metric is hereafter referred to as $m_S$.
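
The $m_S$ computation described in this hunk can be sketched as follows. This is a minimal illustration, assuming each genome has already been reduced to a Boolean vector indexed by SNP; the function name `hamming` and the toy vectors are hypothetical and do not come from the source.

```python
def hamming(u, v):
    """Count the positions where two equal-length Boolean vectors differ."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    return sum(a != b for a, b in zip(u, v))


# Hypothetical SNP presence vectors for two genomes (5 SNP positions).
genome_a = [True, False, True, True, False]
genome_b = [True, True, False, True, False]

# Assumed reading of m_S: Hamming distance between the two vectors.
m_s = hamming(genome_a, genome_b)
```

The more the two vectors differ, the larger the value, which matches the remark in the comment that follows.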
  
  % the more differences, the larger the value
  
@@ -28,22 +28,59 @@ This metric is further refered as to $m_S$.
  The $m_S$ method does not consider genes to have the same incidence in the 
  metric value. A gene with many SNPs has a larger influence in 
  the metric computation than a gene with fewer ones. 
-The metric further refered as to $m_{|S|}$ gives the same weight to each gene
+The metric hereafter referred to as $m_{|S|}$ gives the same weight to each gene
  without considering the number of SNP it contains. 
  
  % the more differences, the larger the value
  
-
-%3/ On gene content (symmetric difference)
-The third metric consider the symetric difference $\Delta$ 
-between the two sets $G_1$ and $G_2$ of genes.
+\subsection{Symmetric Difference based Metric}
+The third metric considers the symmetric difference $\Delta$ 
+between the two sets $G_1$ and $G_2$ of genes, recalled hereafter:
  $$
  G_1\Delta G_2 = 
-(G1\cup G_2)\setminus (G1\cap G_2) = (G1\setminus G_2)\cup(G_2\setminus G1) 
+(G_1\cup G_2)\setminus (G_1\cap G_2) = (G_1\setminus G_2)\cup(G_2\setminus G_1).
  $$
-\end{document}
+The cardinality of $G_1\Delta G_2$ gives the metric.
+This metric is hereafter referred to as $m_{\Delta}$.
+
+In practice, let $k$ be the number of equivalence classes. By definition of the pan genome, $k$ is equal to the cardinality of this set.
+If we only consider which genes belong to a genome, \textit{i.e.}, if we abstract away the positions at which each gene appears, the genome may be 
+memorized as a vector of $k$ Boolean values. The element at index $i$, $0 \le i \le k-1$, is true if and only if the $i$-th gene of the pan genome belongs to this 
+genome.  
+The metric $m_{\Delta}$ is equal to the Hamming distance between the two corresponding  
+vectors of Boolean values.
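
The equality between the cardinality of the symmetric difference and the Hamming distance of the Boolean vectors can be checked on a toy case. The sketch below assumes genes are identified by the label of their equivalence class; the names `membership_vector`, `m_delta`, and the sample sets are hypothetical.

```python
def membership_vector(genome_genes, pan_genome):
    """Boolean vector: entry i is True iff the i-th pan genome class is in the genome."""
    return [g in genome_genes for g in pan_genome]


def m_delta(g1, g2):
    """Cardinality of the symmetric difference of two gene sets."""
    return len(g1 ^ g2)


# Hypothetical ordered pan genome (k = 5 classes) and two genomes.
pan_genome = ["a", "b", "c", "d", "e"]
g1 = {"a", "b", "c"}
g2 = {"b", "c", "d", "e"}

v1 = membership_vector(g1, pan_genome)
v2 = membership_vector(g2, pan_genome)

# Hamming distance between the two membership vectors.
hamming_distance = sum(x != y for x, y in zip(v1, v2))
```

Both computations count the classes present in exactly one of the two genomes, so `m_delta(g1, g2)` and `hamming_distance` agree on this example.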
+
+% the more differences, the larger the value
+
+
  
  % 4/ Using EPFL method
  
-% 5/ On size of the biggest syntheny bloc
-% 6/ On average size of syntheny blocs
-% 7/ On number of syntheny blocs.
+\subsection{Adjacency based Metric}
+Following~\cite{23424133}, a sequence 
+of all the adjacencies that are present in 
+at least one genome is computed. This sequence
+is augmented with the pan genome content.
+Then, each genome is compared 
+with such a sequence and a Boolean vector is produced with the following rule. 
+%If the element $i$ in the sequence of adjacencies or content is present 
+%in the genome, the 
+
+
+
+
+
+
+\subsection{Shared Synteny based Metric}
+Given two genomes abstracted as sequences of classes, it is classical
+to compute all the maximum shared synteny chains. 
+
+% Careful here: the fewer differences, the larger the value
+Three metrics can then be derived from such a set of shared synteny chains:
+\begin{itemize}
+\item let $m_{Y}$ be the metric that returns the 
+length of the longest chain;
+\item let $m_{\overline{Y}}$ be the metric that returns the 
+average length of the synteny chains;
+\item finally, let $m_{|Y|}$ be the metric that returns the 
+number of synteny chains.
+\end{itemize}
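
The three synteny based values listed above can be sketched as follows, assuming the set of shared synteny chains has already been computed and is given as a list of chains, each chain being a list of class labels. The function name `synteny_metrics`, the convention chosen for an empty set of chains, and the sample data are assumptions made for illustration only.

```python
from statistics import mean


def synteny_metrics(chains):
    """Return (m_Y, m_Ybar, m_|Y|): longest chain length, mean length, chain count."""
    lengths = [len(chain) for chain in chains]
    if not lengths:
        # Assumed convention: no shared chain gives zero for all three values.
        return 0, 0.0, 0
    return max(lengths), mean(lengths), len(lengths)


# Hypothetical shared synteny chains between two genomes.
chains = [["a", "b", "c"], ["e", "f"], ["h", "i", "j", "k"]]
m_y, m_y_bar, m_y_count = synteny_metrics(chains)
```

Unlike the previous metrics, larger values of $m_{Y}$ and $m_{\overline{Y}}$ indicate closer genomes, as the comment before the list points out.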