update section two

[chloroplast13.git] / classEquiv.tex
diff --git a/classEquiv.tex b/classEquiv.tex

index b77c51ac3bddcd20373d15cbd79facfd35360925..41abda49eafb820da73bfe905fea10d0361d7be4 100644
--- a/classEquiv.tex
+++ b/classEquiv.tex
@@ -1,55 +1,28 @@
-This step considers as input the set 
-$\{((g_1,g_2),r_{12}), (g_1,g_3),r_{13}), (g_{n-1},g{n}),r_{n-1.n})\}$ of 
-$\frac{n(n-1)}{2}$ elements. 
-Each one $(g_i,g_j),r_{ij})$ where $i < j$, 
-is a pair that gives the similarity rate $r_{ij}$ between the two genes  
-$g_{i}$ and $g_{j}$.
-
-The first step of this stage consists in building the following non-oriented
-graph further denoted as to \emph{similarity graph}.
-In this one, the vertices are the genes. There is an edge between 
-$g_{i}$ and $g_{j}$ if the rate $r_{ij}$ is greater than a given similarity 
-threshold $t$.
-
-We then define the relation $\sim$  such that
-$ x \sim y$ if $x$ and $y$ belong in the same connected component.
-Mathematically speaking, it is obvious that this 
-defines an equivalence relation. 
-Let $\dot{x}= \{y  | x \sim y\}$
-denotes the equivalence class to which $x$ belongs.
-All the genes which are  equivalent to each other
-are also elements of the same equivalence class.
-Let us then consider the set of all equivalence classes of the set of genes 
-by $\sim$, denoted $X/\sim = \{\dot{x} | x \textrm{ is a gene}\}$. 
-defined by $\pi(x) = \dot{x}$
-which maps each gene  into it respective equivalence class by $\sim$.
-
-
-
-
-For each genome $[g_l,\ldots,g{l+m}]$, the second step computes 
-the projection of each gene according to $\pi$. 
-The resulting genome  which is 
-$$
-[\pi(g_l),\ldots,\pi(g{l+m})]
-$$ 
-is again of size $m$.
-
-Intuitively speaking, for two genes $g_i$ and $g_j$ 
-in the same equivalence class, there is path from  $g_i$ and $g_j$.
-It signifies that  each evolution step 
-(represented by an edge in the similarity graph) 
-has produced a gene s.t. the similarity with the previous one 
-is greater than $t$. 
-Genes $g_i$ and $g_j$ may thus have a common ancestor.
-
-
-We compute the core genome as follow.
-Each genome is projected according to $\pi$. We then consider the 
-intersection of all the projected genomes which are considered as sets of genes
-and not as sequences of genes.
-This results as the set of all the class $\dot{x}$
-such that each genome has an gene $x$ in  $\dot{x}$.
-The pan genome is computed similarly: the union of all the 
-projected genomes in computed here.
-
+Identifying core genes is important for understanding evolutionary and
+functional phylogenies. In this work we therefore present two methods
+for building a gene-content evolutionary tree. More precisely, given a
+collection of 99~chloroplasts annotated from NCBI
+\cite{Sayers01012011} and Dogma \cite{RDogma}, we focus on the following
+questions: how can we identify the best core genome, and what is the
+evolutionary scenario of these chloroplasts?
+Two methods are considered here. The first one, which is based on the NCBI annotation, is explained below.
+We start with the following definition.
+\begin{definition}
+\label{def1}
+Let $\mathcal{A}=\{\mathrm{A},\mathrm{T},\mathrm{C},\mathrm{G}\}$ be the nucleotide alphabet, and $\mathcal{A}^\ast$ be the set of finite words on $\mathcal{A}$ (\emph{i.e.}, of DNA sequences). Let $d:\mathcal{A}^{\ast}\times \mathcal{A}^{\ast}\rightarrow[0,1]$ be a distance on $\mathcal{A}^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For all $x,y\in \mathcal{A}^{\ast}$, we say that $x\sim_{d,T}y$ if $d(x,y)\leqslant T$.
+\end{definition}
+
+\noindent The relation $\sim_{d,T}$ is reflexive and symmetric but, in general, not transitive; the equivalence relation used below is its transitive closure, whose classes are the connected components of the graph described next. When $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the EMBOSS package (Needleman-Wunsch released by EMBL), we will simply denote by $\sim$ the transitive closure of $\sim_{d,0.1}$. The method starts by building an undirected graph based on
+the similarity rates $r_{ij}$ between sequences $g_{i}$ and $g_{j}$ (\emph{i.e.}, $r_{ij}=\Delta(g_{i},g_{j})$).
+In this graph, the nodes are all the coding sequences of the set of genomes under consideration, and there is an edge between $g_{i}$ and $g_{j}$ if the
+similarity rate $r_{ij}$ is
+greater than the given similarity threshold. The Connected Components
+(CC) of this ``similarity'' graph are then computed.
+Two sequences are declared equivalent when they lie in the same CC, which
+yields an equivalence relation between sequences based on Definition~\ref{def1}.
+Any class of this relation is called a ``gene'' here, and its representatives (DNA sequences) are the ``alleles'' of this gene. This first method thus computes, for each genome $G$, which is a set $\{g_{1}^G,\ldots,g_{m_G}^G\}$ of $m_{G}$ DNA coding sequences, the projection of each sequence according to $\pi$, where $\pi$ maps each sequence
+into its gene (class) according to $\sim$. In other words, $G$ is mapped into $\{\pi(g_{1}^G),\ldots,\pi(g_{m_G}^G)\}$.
+Note that a projected genome has no duplicated gene, as it is a set. The core genome (resp. the pan genome) of $G_{1}$ and $G_{2}$ is thus defined as the intersection (resp. the union) of their projected genomes.\\
+More generally, the core genome of the whole collection is the intersection of all the projected genomes, that is, the set of all the genes $\dot{x}=\pi(x)$
+such that each genome has at least one allele in $\dot{x}$. The pan genome is computed similarly, as the union of all the projected genomes. However, this approach suffers from producing core genomes that are too small,
+for any chosen similarity threshold, compared to what biologists usually expect for these chloroplasts. We are then left with the following questions: how can we improve the confidence placed in the produced core? Can we thus infer the evolutionary scenario of these genomes?
+\ No newline at end of file
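
The method described in the added text (similarity graph, connected components, projection by $\pi$, then intersection and union) can be sketched as follows. This is only a minimal illustration, not the authors' implementation: the function names `gene_classes` and `core_and_pan`, the sequence identifiers, and the toy similarity scores are all hypothetical, and the scores are invented rather than produced by EMBOSS. The threshold 0.9 is assumed to correspond to $d=1-\Delta\leqslant 0.1$.

```python
from itertools import combinations


def gene_classes(sequences, similarity, threshold):
    """Return a dict mapping each sequence to a class label (the map pi).

    Classes are the connected components of the graph that has an edge
    between two sequences when their similarity exceeds the threshold.
    Components are computed with a small union-find structure.
    """
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for a, b in combinations(sequences, 2):
        if similarity(a, b) > threshold:
            parent[find(a)] = find(b)  # merge the two components

    return {s: find(s) for s in sequences}


def core_and_pan(genomes, pi):
    """Project each genome to a set of classes, then intersect and unite."""
    projected = [{pi[s] for s in genome} for genome in genomes.values()]
    return set.intersection(*projected), set.union(*projected)


# Toy data (hypothetical identifiers and scores, for illustration only).
toy_scores = {
    frozenset(("a1", "a2")): 0.95,
    frozenset(("a2", "a3")): 0.95,
    frozenset(("a1", "a3")): 0.80,  # below threshold, linked through a2
    frozenset(("b1", "b2")): 0.92,
}


def toy_similarity(x, y):
    return toy_scores.get(frozenset((x, y)), 0.0)


genomes = {"G1": ["a1", "b1", "c1"], "G2": ["a2", "b2"], "G3": ["a3"]}
sequences = [s for genome in genomes.values() for s in genome]

pi = gene_classes(sequences, toy_similarity, threshold=0.9)
core, pan = core_and_pan(genomes, pi)
```

In this toy run, `a1` and `a3` fall in the same class even though their direct similarity is below the threshold, because both are linked to `a2`; this is the transitive effect of taking connected components. Comparing all pairs is quadratic in the number of sequences, which mirrors the pairwise similarity rates $r_{ij}$ of the text.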