classEquiv.tex

   1
   2 The first method, described below, considers NCBI annotations and uses
   3 a distance-based similarity measure. We start with the following
   4 preliminary Definition:
   5
   6 \begin{definition}
   7 \label{def1}
   8 Let $A=\{A,T,C,G\}$  be the nucleotides alphabet, and  $A^\ast$ be the
   9 set  of finite  words on  $A$  (\emph{i.e.}, of  DNA sequences).   Let
  10 $d:A^{\ast}\times   A^{\ast}\rightarrow[0,1]$   be   a   distance   on
  11 $A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
  12 all   $x,y\in  A^{\ast}$,   we   will  say   that  $x\sim_{d,T}y$   if
  13 $d(x,y)\leqslant T$.
  14 \end{definition}
  15
  16 \noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package (Needleman-Wunch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$.
  17
  18 The method begins by building  an undirected graph based on similarity
  19 rates $r_{ij}$ between DNA~sequences $g_{i}$ and $g_{j}$ (\emph{i.e.},
  20 $r_{ij}=\Delta\left(g_{i},g_{j}\right)$).  In this latter graph, nodes
  21 are  constituted by all  the coding  sequences of  the set  of genomes
  22 under consideration, and there is  an edge between $g_{i}$ and $g_{j}$
  23 if the  similarity rate  $r_{ij}$ is greater  than a  given similarity
  24 threshold. The  Connected Components (CC) of  the ``similarity'' graph
  25 are thus computed.
  26
  27 This process also results in an equivalence relation between sequences
  28 in the  same CC  based on Definition~\ref{def1}.   Any class  for this
  29 relation   is  called   ``gene''  here,   where   its  representatives
  30 (DNA~sequences)  are the ``alleles''  of this  gene.  Thus  this first
  31 method   produces   for   each    genome   $G$,   which   is   a   set
  32 $\left\{g_{1}^G,...,g_{m_G}^G\right\}$    of   $m_{G}$    DNA   coding
  33 sequences, the  projection of each sequence according  to $\pi$, where
  34 $\pi$ maps each sequence into its gene (class) according to $\sim$. In
  35 other     words,      a     genome     $G$      is     mapped     into
  36 $\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$.    Note    that    a
  37 projected genome has no duplicated gene since it is a set.
  38
  39 Consequently, the core  genome (resp.  the pan genome)  of two genomes
  40 $G_{1}$  and $G_{2}$  is defined  as  the intersection  (resp. as  the
  41 union) of their projected  genomes.  We then consider the intersection
  42 of  all the  projected genomes,  which  is the  set of  all the  genes
  43 $\dot{x}$  such  that   each  genome  has  at  least   one  allele  in
  44 $\dot{x}$. The  pan genome is computed  similarly as the  union of all
  45 the projected  genomes. However  such approach suffers  from producing
  46 too small core genomes,  for any chosen similarity threshold, compared
  47 to   what  is   usually   expected  by   biologists  regarding   these
  48 chloroplasts. We are  then left with the following  questions: how can
  49 we improve the confidence put in  the produced core? Can we thus guess
  50 the evolution scenario of these genomes?