still on update section 4

[chloroplast13.git] / classEquiv.tex
diff --git a/classEquiv.tex b/classEquiv.tex

index 00b227d0e30be5eff890d6dc5fe33a4f27ff5357..f3d3ed13e5f19a1161dcc58dc14fbbae6fa70bc5 100644 (file)
--- a/classEquiv.tex
+++ b/classEquiv.tex
@@ -1,7 +1,6 @@
-
  The first method, described below, considers NCBI annotations and uses
  a distance-based similarity measure. We start with the following
-preliminary Definition:
+preliminary definition:
  
  \begin{definition}
  \label{def1}
@@ -13,25 +12,31 @@ all   $x,y\in  A^{\ast}$,   we   will  say   that  $x\sim_{d,T}y$   if
  $d(x,y)\leqslant T$.
  \end{definition}
  
-\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package (Needleman-Wunch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$.
+%\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$.
+
+Let be given a \emph{similarity} threshold $T$  and a distance $d$
+(Needleman-Wunch released by EMBL for instance).
+The method begins by building  an undirected graph 
+between all the DNA~sequences $g$ of the set  of genomes as follows:
+there is  an edge between $g_{i}$ and $g_{j}$
+if  $g_i \sim_{d,T} g_j$ is established.
+This graph is further denoted as the ``similarity'' graph.
+
+We thus consider that the pair of two coding sequences 
+$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ and 
+$g_j$  belong in the same 
+connected component (CC), \textit{i.e.} if there is a path between $g_i$ 
+and $g_j$ in the similarity graph. It is not hard to see that this relation is an
+equivalence relation whereas $\sim$ is not.
  
-The method begins by building  an undirected graph based on similarity
-rates $r_{ij}$ between DNA~sequences $g_{i}$ and $g_{j}$ (\emph{i.e.},
-$r_{ij}=\Delta\left(g_{i},g_{j}\right)$).  In this latter graph, nodes
-are  constituted by all  the coding  sequences of  the set  of genomes
-under consideration, and there is  an edge between $g_{i}$ and $g_{j}$
-if the  similarity rate  $r_{ij}$ is greater  than a  given similarity
-threshold. The  Connected Components (CC) of  the ``similarity'' graph
-are thus computed.
  
-This process also results in an equivalence relation between sequences
-in the  same CC  based on Definition~\ref{def1}.   Any class  for this
-relation   is  called   ``gene''  here,   where   its  representatives
+Any class for this relation   is  called   ``gene'' 
+here,   where   its  representatives
  (DNA~sequences)  are the ``alleles''  of this  gene.  Thus  this first
  method   produces   for   each    genome   $G$,   which   is   a   set
  $\left\{g_{1}^G,...,g_{m_G}^G\right\}$    of   $m_{G}$    DNA   coding
  sequences, the  projection of each sequence according  to $\pi$, where
-$\pi$ maps each sequence into its gene (class) according to $\sim$. In
+$\pi$ maps each sequence into its gene (class) according to $\mathcal{R}$. In
  other     words,      a     genome     $G$      is     mapped     into
  $\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$.    Note    that    a
  projected genome has no duplicated gene since it is a set.
@@ -42,8 +47,26 @@ union) of their projected  genomes.  We then consider the intersection
  of  all the  projected genomes,  which  is the  set of  all the  genes
  $\dot{x}$  such  that   each  genome  has  at  least   one  allele  in
  $\dot{x}$. The  pan genome is computed  similarly as the  union of all
-the projected  genomes. However  such approach suffers  from producing
-too small core genomes,  for any chosen similarity threshold, compared
+the projected  genomes. 
+
+\begin{figure}
+\begin{center}
+\includegraphics[scale=0.5]{stats.png}
+\end{center}
+\caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan}
+\end{figure}
+
+The number of genes in the core genome and in the pan genome are 
+represented in  Figure~\ref{Fig:sim:core:pan} with respect to the 
+threshold value. 
+First of all, the higher is the threshold, 
+the smaller the connected components are. In other words, the number 
+of alleles of one gene is small if the threshold is high.
+When the threshold is high, the number of genes and the size of 
+pan genome is high too. However due to the construction method of the
+core genome,  this set of genes has few elements in such a  situation.  
+This approach even suffers from producing
+too small core genomes (of size 0 or 1),  for any chosen similarity threshold, compared
  to   what  is   usually   expected  by   biologists  regarding   these
  chloroplasts. We are  then left with the following  questions: how can
  we improve the confidence put in  the produced core? Can we thus guess