
Identifying core genes is important for understanding evolutionary and
functional phylogenies. In this work we therefore present two methods
to build a gene-content evolutionary tree. More precisely, we consider
a collection of 99~chloroplasts annotated from NCBI \cite{Sayers01012011}
and Dogma \cite{RDogma}, and focus on the following questions: how can we
identify the best core genome, and what is the evolutionary scenario of
these chloroplasts?
Two methods are considered here. The first one, based on the NCBI annotation, is explained below.
We start with the following definition.
\begin{definition}
\label{def1}
Let $A=\{A,T,C,G\}$ be the nucleotide alphabet, and let $A^\ast$ be the set of finite words over $A$ (\emph{i.e.}, of DNA sequences). Let $d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on $A^{\ast}$, and let $T\in[0,1]$ be a given value called the threshold. For all $x,y\in A^{\ast}$, we say that $x\sim_{d,T}y$ if $d(x,y)\leqslant T$.
\end{definition}

\noindent The relation $\sim_{d,T}$ is reflexive and symmetric, but it is not transitive in general: $d(x,y)\leqslant T$ and $d(y,z)\leqslant T$ only guarantee $d(x,z)\leqslant 2T$. We therefore work with its transitive closure, which is an equivalence relation and which we still denote by $\sim_{d,T}$. When $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded in the EMBOSS package (Needleman-Wunsch, released by EMBL), we simply denote $\sim_{d,0.1}$ by $\sim$. The method starts by building an undirected graph based on
the similarity rates $r_{ij}$ between sequences $g_{i}$ and $g_{j}$ (\emph{i.e.}, $r_{ij}=\Delta(g_{i},g_{j})$).
In this graph, the nodes are all the coding sequences of the genomes under consideration, and there is an edge between $g_{i}$ and $g_{j}$ if the
similarity rate $r_{ij}$ is at least the similarity threshold $1-T$,
that is, if $d(g_{i},g_{j})\leqslant T$. The connected components
(CC) of this ``similarity'' graph are then computed.
Two sequences are declared equivalent when they belong to the same CC,
which yields the equivalence relation induced by Definition~\ref{def1}.
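
The graph construction and the extraction of connected components can be illustrated by the following minimal Python sketch. It is not the implementation used in this work: the function \texttt{toy\_similarity} is a hypothetical stand-in for $\Delta$ (a position-wise identity ratio, not the Needleman-Wunsch score of EMBOSS), the names \texttt{gene\_classes} and \texttt{threshold} are ours, and an edge is assumed to be created when the similarity is at least the threshold $0.9$ (\emph{i.e.}, $T=0.1$).

```python
from itertools import combinations


def toy_similarity(a: str, b: str) -> float:
    """Hypothetical stand-in for Delta: fraction of identical positions.

    This is NOT the Needleman-Wunsch score; it only keeps the sketch
    self-contained.
    """
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def gene_classes(sequences, similarity=toy_similarity, threshold=0.9):
    """Return the connected components of the similarity graph.

    Nodes are indices into `sequences`; nodes i and j are linked when
    similarity(sequences[i], sequences[j]) >= threshold (assumed convention).
    Components are found with a union-find structure.
    """
    parent = list(range(len(sequences)))

    def find(i):
        # Follow parent links up to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in combinations(range(len(sequences)), 2):
        if similarity(sequences[i], sequences[j]) >= threshold:
            parent[find(i)] = find(j)  # merge the two components

    classes = {}
    for i in range(len(sequences)):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# Toy data: sequences 0 and 2 are not directly similar (8 matches out of 10),
# but both are similar to sequence 1, so they end up in the same class.
toy = ["AAAAAAAAAA", "AAAAAAAAAT", "AAAAAAAATT", "CCCCCCCCCC"]
classes = gene_classes(toy)
```

On this toy input, the first three sequences form one class although the first and the third are not linked by an edge, which shows why classes are taken as connected components rather than as sets of pairwise similar sequences.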
Any class of this relation is called a ``gene'' here, and its representatives (DNA sequences) are the ``alleles'' of this gene. For each genome $G$, seen as a set $\{g_{1}^G,\ldots,g_{m_G}^G\}$ of $m_{G}$ DNA coding sequences, this first method thus produces the projection of each sequence by $\pi$, where $\pi$ maps each sequence
to its gene (class) according to $\sim$. In other words, $G$ is mapped to $\{\pi(g_{1}^G),\ldots,\pi(g_{m_G}^G)\}$.
Note that a projected genome has no duplicated gene, as it is a set. The core genome (resp. the pan genome) of two genomes $G_{1}$ and $G_{2}$ is then defined as the intersection (resp. the union) of their projected genomes.
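
The projection $\pi$ and the resulting core and pan genomes can be sketched in Python as follows. This is only an illustration under stated assumptions: the mapping \texttt{gene\_of} plays the role of $\pi$ and is supposed to be already available (for instance from the connected components), and the sequence identifiers and class labels below are hypothetical toy values, not data from this study.

```python
def project(genome, gene_of):
    """Map a genome (iterable of sequence ids) to its set of gene classes.

    `gene_of` plays the role of pi: sequence id -> class label.
    The result is a set, so duplicated genes collapse to one element.
    """
    return {gene_of[seq] for seq in genome}


def core_and_pan(genomes, gene_of):
    """Return (core, pan): intersection and union of the projected genomes.

    `genomes` must contain at least one genome.
    """
    projected = [project(g, gene_of) for g in genomes]
    core = set.intersection(*projected)
    pan = set.union(*projected)
    return core, pan


# Hypothetical toy data: two genomes, five sequences, three classes.
gene_of = {"a1": "c1", "a2": "c1", "a3": "c2", "b1": "c1", "b2": "c3"}
G1 = ["a1", "a2", "a3"]  # projects to {"c1", "c2"}: a1 and a2 are alleles of c1
G2 = ["b1", "b2"]        # projects to {"c1", "c3"}
core, pan = core_and_pan([G1, G2], gene_of)
```

Here the two alleles \texttt{a1} and \texttt{a2} of the same gene are merged by the projection, the core genome reduces to the single shared class, and the pan genome contains the three classes.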
We then consider the intersection of all the projected genomes, which is the set of all the genes $\dot{x}$
such that each genome has at least one allele in $\dot{x}$. The pan genome is computed similarly, as the union of all the projected genomes. However, for any chosen similarity threshold, this approach produces core genomes
that are too small compared to what biologists usually expect for these chloroplasts. We are then left with the following questions: how can we improve the confidence placed in the produced core? Can we then infer the evolutionary scenario of these genomes?