Modifications made in the two first sections

[chloroplast13.git] / classEquiv.tex
diff --git a/classEquiv.tex b/classEquiv.tex

index 41abda49eafb820da73bfe905fea10d0361d7be4..00b227d0e30be5eff890d6dc5fe33a4f27ff5357 100644
--- a/classEquiv.tex
+++ b/classEquiv.tex
@@ -1,28 +1,50 @@
-Identifying  core genes  is important  to understand  evolutionary and
-functional phylogenies. Therefore, in this work we present two methods
-to build a  genes content evolutionary tree. More  precisely, we focus
-on   the    following   questions   considering    a   collection   of
-99~chloroplasts  annotated from  NCBI \cite{Sayers01012011} and  Dogma
-\cite{RDogma} : how can we identify the best core genome and what
-is the evolutionary scenario of these chloroplasts.
-Two methods are considered here. The first one is based on NCBI annotation, it is explained below.
-We start by the following definition.
+
+The first method, described below, considers NCBI annotations and uses
+a distance-based similarity measure. We start with the following
+preliminary definition:
+
  \begin{definition}
  \label{def1}
  \begin{definition}
  \label{def1}
-Let $A=\{A,T,C,G\}$ be the nucleotides alphabet, and $A^\ast$ be the set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let $d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on $A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if $d(x,y)\leqslant T$. 
+Let $A=\{A,T,C,G\}$  be the nucleotide alphabet,  and  $A^\ast$ be the
+set  of finite  words on  $A$  (\emph{i.e.}, of  DNA sequences).   Let
+$d:A^{\ast}\times   A^{\ast}\rightarrow[0,1]$   be   a   distance   on
+$A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
+all   $x,y\in  A^{\ast}$,   we   will  say   that  $x\sim_{d,T}y$   if
+$d(x,y)\leqslant T$.
  \end{definition}
  
  \end{definition}
  
-\noindent$\sim_{d,T}$ is obviously an equivalence relation. When $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package (Needleman-Wunch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$. The method starts by building an undirected graph based on
-the similarity rates $r_{ij}$  between sequences $g_{i}$ and $g_{j}$ (\emph{i.e.}, $r_{ij}=\Delta(g_{i},g_{j})$).
-In this latter, nodes are constituted by all the coding sequences of the set of genomes under consideration, and there is an edge between $g_{i}$ and $g_{j}$ if the 
-similarity rate $r_{ij}$ is
-greater than the given similarity threshold. The Connected Components
-(CC) of the ``similarity'' graph are thus computed.
-This produces an equivalence 
-relation between sequences in the same CC based on Definition~\ref{def1}.
-Any class for this relation is called ``gene'' here, where its representatives (DNA sequences) are the ``alleles'' of this gene. Thus this first method produces for each genome $G$, which is a set $\{g_{1}^G,...,g_{m_G}^G\}$ of $m_{G}$ DNA coding sequences, the projection of each sequence according to $\pi$, where $\pi$ maps each sequence
-into its gene (class) according to $\sim$. In other words, $G$ is mapped into $\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\}$.  
-Remark that a projected genome has no duplicated gene, as it is a set. The core  genome (resp. the pan genome) of $G_{1}$ and $G_{2}$ is defined thus as the intersection (resp. as the union) of these projected genomes.\\
-We then consider the intersection of all the projected genomes, which is the set of all the genes $\dot{x}$
-such that each genome has at least one allele in $\dot{x}$. The pan genome is computed similarly as the union of all the projected genomes. However such approach suffers from producing too small core genomes, 
-for any chosen similarity threshold, compared to what is usually waited by biologists regarding these chloroplasts. We are then left with the following questions: how can we improve the confidence put in the produced core? Can we thus guess the evolution scenario of these genomes?
-\ No newline at end of file
+\noindent $\sim_{d,T}$ is obviously an equivalence relation and, when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the EMBOSS package (Needleman-Wunsch, released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$.
+
+The method begins by building  an undirected graph based on similarity
+rates $r_{ij}$ between DNA~sequences $g_{i}$ and $g_{j}$ (\emph{i.e.},
+$r_{ij}=\Delta\left(g_{i},g_{j}\right)$). In this graph, the nodes are
+all the coding sequences of the set of genomes under consideration,
+and there is an edge between $g_{i}$ and $g_{j}$ if the similarity
+rate $r_{ij}$ is greater than a given similarity threshold. The
+Connected Components (CC) of this ``similarity'' graph are then
+computed.
+
+This process also results in an equivalence relation between sequences
+in the same CC, based on Definition~\ref{def1}. Any class of this
+relation is called a ``gene'' here, and its representatives
+(DNA~sequences) are the ``alleles'' of this gene. Thus, for each
+genome $G$, which is a set
+$\left\{g_{1}^G,\ldots,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
+sequences, this first method produces the projection of each sequence
+according to $\pi$, where $\pi$ maps each sequence into its gene
+(class) according to $\sim$. In other words, a genome $G$ is mapped into
+$\left\{\pi(g_{1}^G),\ldots,\pi(g_{m_G}^G)\right\}$. Note that a
+projected genome has no duplicated gene since it is a set.
+
+Consequently, the core  genome (resp.  the pan genome)  of two genomes
+$G_{1}$  and $G_{2}$  is defined  as  the intersection  (resp. as  the
+union) of their projected  genomes.  We then consider the intersection
+of  all the  projected genomes,  which  is the  set of  all the  genes
+$\dot{x}$  such  that   each  genome  has  at  least   one  allele  in
+$\dot{x}$. The  pan genome is computed  similarly as the  union of all
+the projected genomes. However, such an approach produces core genomes
+that are too small, for any chosen similarity threshold, compared to
+what is usually expected by biologists regarding these chloroplasts.
+We are then left with the following questions: how can we improve the
+confidence put in the produced core? Can we thus guess the
+evolutionary scenario of these genomes?
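
The pipeline described in the new text (similarity graph, connected components, projection $\pi$, then core and pan genomes) can be sketched as follows. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names are hypothetical, `toy_similarity` is a made-up stand-in for the Needleman-Wunsch score $\Delta$, and the connected components are computed with a simple union-find. The threshold 0.9 on the similarity corresponds to $T=0.1$ for $d=1-\Delta$, and an edge is drawn when $d\leqslant T$, following Definition~\ref{def1}.

```python
from itertools import combinations


def toy_similarity(a, b):
    """Hypothetical stand-in for Delta: fraction of matching positions."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0
    return sum(x == y for x, y in zip(a, b)) / longest


def gene_classes(sequences, similarity, threshold):
    """Map each sequence to a representative of its connected component."""
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(sequences, 2):
        # Edge when r_ij >= threshold, i.e. d = 1 - r_ij <= T.
        if similarity(a, b) >= threshold:
            parent[find(a)] = find(b)
    return {s: find(s) for s in sequences}


def project_genomes(genomes, similarity, threshold=0.9):
    """Apply pi: replace each sequence of each genome by its gene (class)."""
    all_seqs = sorted({s for seqs in genomes.values() for s in seqs})
    pi = gene_classes(all_seqs, similarity, threshold)
    return {name: {pi[s] for s in seqs} for name, seqs in genomes.items()}


def core_and_pan(projected):
    """Core = intersection, pan = union of all projected genomes."""
    gene_sets = list(projected.values())
    if not gene_sets:
        return set(), set()
    return set.intersection(*gene_sets), set.union(*gene_sets)


# Made-up toy data: three "genomes" of two coding sequences each.
genomes = {
    "G1": ["AAAAAAAAAA", "CCCCCCCCCC"],
    "G2": ["AAAAAAAAAT", "GGGGGGGGGG"],
    "G3": ["AAAAAAAATT", "CCCCCCCCCC"],
}
projected = project_genomes(genomes, toy_similarity, threshold=0.9)
core, pan = core_and_pan(projected)
print(len(core), len(pan))  # prints: 1 3
```

In this toy data, `AAAAAAAAAA` and `AAAAAAAATT` have similarity 0.8, below the threshold, yet they end up in the same gene because both are linked to `AAAAAAAAAT`. Grouping by connected components therefore chains pairwise similarities, which is worth keeping in mind when choosing the threshold.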