The first method, described below, considers NCBI annotations and uses
a distance-based similarity measure. We start with the following
preliminary definition:

\begin{definition}
\label{def1}
Let $A=\{A,T,C,G\}$ be the nucleotide alphabet, and let $A^\ast$ be the
set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let
$d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on
$A^{\ast}$, and let $T\in[0,1]$ be a given value called the threshold. For
all $x,y\in A^{\ast}$, we say that $x\sim_{d,T}y$ if
$d(x,y)\leqslant T$.
\end{definition}

%\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$.

Given a \emph{similarity} threshold $T$ and a distance $d$
(Needleman-Wunsch as released by EMBL, for instance),
the method begins by building an undirected graph
over all the DNA~sequences $g$ of the set of genomes as follows:
there is an edge between $g_{i}$ and $g_{j}$
if $g_i \sim_{d,T} g_j$ holds.
This graph is referred to below as the ``similarity'' graph.

We then consider that a pair of coding sequences
$(g_i,g_j)$ belongs to the relation $\mathcal{R}$ if both $g_i$ and
$g_j$ belong to the same
connected component (CC), \textit{i.e.}, if there is a path between $g_i$
and $g_j$ in the similarity graph. It is not hard to see that $\mathcal{R}$ is an
equivalence relation, whereas $\sim_{d,T}$ is not in general: it is reflexive
and symmetric, but it need not be transitive.
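
As an illustration of the difference between $\sim_{d,T}$ and
$\mathcal{R}$, consider the following purely hypothetical situation (the
sequences and the distance values are assumptions chosen only to satisfy
the triangle inequality, not measured data). Let $T=0.1$ and let
$g_1,g_2,g_3$ be three sequences such that
\[
d(g_1,g_2)=0.08, \qquad d(g_2,g_3)=0.07, \qquad d(g_1,g_3)=0.14 .
\]
Then $g_1\sim_{d,T}g_2$ and $g_2\sim_{d,T}g_3$, but $d(g_1,g_3)>T$, so
$g_1\sim_{d,T}g_3$ does not hold. The similarity graph nevertheless
contains the path $g_1$, $g_2$, $g_3$, hence the three sequences lie in
the same connected component and $(g_1,g_3)\in\mathcal{R}$.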

Any class of this relation is called a ``gene''
here, and its members
(DNA~sequences) are the ``alleles'' of this gene. Thus, for each
genome $G$, which is a set
$\left\{g_{1}^G,\ldots,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
sequences, this first method produces the projection of each sequence
according to $\pi$, where
$\pi$ maps each sequence to its gene (class) according to $\mathcal{R}$. In
other words, a genome $G$ is mapped to
$\left\{\pi(g_{1}^G),\ldots,\pi(g_{m_G}^G)\right\}$. Note that a
projected genome has no duplicated gene, since it is a set.
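
For instance, take a hypothetical genome $G=\{g_{1}^G,g_{2}^G,g_{3}^G\}$
(an assumption made only for illustration) in which $g_{1}^G$ and
$g_{2}^G$ lie in the same connected component of the similarity graph
while $g_{3}^G$ lies in another one. Then
\[
\pi(g_{1}^G)=\pi(g_{2}^G)\neq\pi(g_{3}^G),
\qquad\text{so}\qquad
\left\{\pi(g_{1}^G),\pi(g_{2}^G),\pi(g_{3}^G)\right\}
=\left\{\pi(g_{1}^G),\pi(g_{3}^G)\right\},
\]
and the projected genome contains two genes although $G$ contains three
sequences.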

Consequently, the core genome (resp. the pan genome) of two genomes
$G_{1}$ and $G_{2}$ is defined as the intersection (resp. the
union) of their projected genomes. More generally, the core genome of
the whole collection is the intersection
of all the projected genomes, that is, the set of all the genes
$\dot{x}$ such that each genome has at least one allele in
$\dot{x}$. The pan genome is computed similarly as the union of all
the projected genomes.
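
A small worked example may help to fix ideas. The three projected
genomes below are hypothetical, and the notation $P_k$ for the projected
genome of $G_k$ is introduced only for this illustration:
\[
P_1=\{\dot{x}_1,\dot{x}_2,\dot{x}_3\}, \qquad
P_2=\{\dot{x}_1,\dot{x}_2,\dot{x}_4\}, \qquad
P_3=\{\dot{x}_1,\dot{x}_3\}.
\]
For the pair $G_1$, $G_2$, the core genome is
$P_1\cap P_2=\{\dot{x}_1,\dot{x}_2\}$ and the pan genome is
$P_1\cup P_2=\{\dot{x}_1,\dot{x}_2,\dot{x}_3,\dot{x}_4\}$. For the three
genomes together,
\[
P_1\cap P_2\cap P_3=\{\dot{x}_1\}, \qquad
P_1\cup P_2\cup P_3=\{\dot{x}_1,\dot{x}_2,\dot{x}_3,\dot{x}_4\},
\]
so adding a genome can only shrink the core genome and enlarge the pan
genome.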

\begin{figure}
\begin{center}
\includegraphics[scale=0.4]{stats.png}
\end{center}
\caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan}
\end{figure}

The numbers of genes in the core genome and in the pan genome are
shown in Figure~\ref{Fig:sim:core:pan} as functions of the
threshold value.
First of all, the higher the threshold,
the smaller the connected components. In other words, the number
of alleles of a gene is small when the threshold is high.
Consequently, when the threshold is high, the number of genes, and thus
the size of the pan genome, is high too. However, because the core
genome is built as the intersection of all the projected genomes,
it contains few genes in such a situation.
This approach even produces
core genomes that are too small (of size 0 or 1), whatever the chosen
similarity threshold, compared
to what biologists usually expect for these
chloroplasts. We are then left with the following questions: how can
we improve the confidence placed in the produced core genome? Can we
then infer the evolutionary scenario of these genomes?