Paper2/classEquiv.tex

   1 \color{red}
   2 We still need to propose good methodology to predict good core genes that reflect natural biological relationships among species. Proposing new methods and compare them with previous ones can give us an indicator of which method can produce good functional genes. In this section, we will recall the definition with a fast revision of similarity based method (see \cite{Alkindy2014} for further details).
   3 This method, considers annotated genomes from NCBI and DOGMA and uses a distance-based similarity measure on genes' coding sequences. Such an approach requires annotated genomes, like the ones provided by the NCBI website.
   4
   5 \subsubsection{Theoretical presentation}
   6
   7 We start by fast revision of first method by the following preliminary definition~\cite{acgs13:onp,Alkindy2014}.\color{black}
   8
   9 \begin{definition}
  10 \label{def1}
  11 Let $A=\{A,T,C,G\}$  be the nucleotides alphabet, and  $A^\ast$ be the
  12 set  of finite  words on  $A$  (\emph{i.e.}, of  DNA sequences).   Let
  13 $d:A^{\ast}\times   A^{\ast}\rightarrow[0,1]$   be   a   function called
  14 similarity measure   on
  15 $A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
  16 all   $x,y\in  A^{\ast}$,   we   will  say   that  $x\sim_{d,T}y$   if
  17 $d(x,y)\leqslant T$.
  18 \end{definition}
  19
  20 %\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$.
  21
  22 Let be given a \emph{similarity} threshold $T$  and a \emph{similarity measure} $d$.
  23 The method begins by building  an undirected graph
  24 between all the DNA~sequences $g$ of the set  of genomes as follows:
  25 there is  an edge between $g_{i}$ and $g_{j}$
  26 if  $g_i \sim_{d,T} g_j$ is established.
  27 This graph is further denoted as the ``similarity'' graph.
  28 We thus say that two coding sequences
  29 $g_i$, $g_j$ are
  30 equivalent with respect to the relation $\mathcal{R}$ if both $g_i$ and
  31 $g_j$  belong in the same
  32 connected component (CC) of this similarity graph, \textit{i.e.}, if there is a path between $g_i$
  33 and $g_j$ in the graph.
  34
  35 It is not hard to see that this relation is an
  36 equivalence relation whereas $\sim$ is not.
  37 Any class for this relation   is  called  a ``gene''
  38 in this article,   where   its  representatives
  39 (DNA~sequences)  are the ``alleles''  of this  gene, such abuse of language
  40 being proposed to set our ideas down.  Thus  this first
  41 method   produces   for   each    genome   $G$,   which   is   a   set
  42 $\left\{g_{1}^G,...,g_{m_G}^G\right\}$    of   $m_{G}$    DNA   coding
  43 sequences, the  projection of each sequence according  to $\pi$, where
  44 $\pi$ maps each sequence into its gene (class) according to $\mathcal{R}$. In
  45 other     words,      a     genome     $G$      is     mapped     into
  46 $\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$.    Note    that    a
  47 projected genome has no duplicated gene since it is a set.
  48
  49 Consequently, the core  genome (resp.,  the pan genome)  of two genomes
  50 $G_{1}$  and $G_{2}$  is defined  as  the intersection  (resp., as  the
  51 union) of their projected  genomes.  We finally consider the intersection
  52 of  all the  projected genomes,  which  is the  set of  all the  genes
  53 $\dot{x}$  such  that   each  genome  has  at  least   one  allele  in
  54 $\dot{x}$. This set will constitute the core genome of the whole species
  55 under consideration. The  pan genome is computed  similarly as the  union of all
  56 the projected  genomes.
  57
  58 % \begin{figure}
  59 % \begin{center}
  60 % \includegraphics[scale=0.5]{stats.png}
  61 % \end{center}
  62 % \caption{Size of core and pan genomes w.r.t. the similarity threshold, first approach with NCBI annotations and Needleman-Wunsch similarity measure.}\label{Fig:sim:core:pan}
  63 % \end{figure}
  64 %
  65
  66
  67 % \begin{table}[h!]
  68 % \centering
  69 % \begin{tabular}{ccccccc}
  70 % \hline
  71 %           & \multicolumn{4}{c}{Method 1}                          & \multicolumn{2}{c}{Method 3}       \\ \hline
  72 %           & \multicolumn{2}{c}{NCBI}  & \multicolumn{2}{c}{DOGMA} & \multicolumn{2}{c}{NCBI and DOGMA} \\ \hline
  73 % Threshold & core       & pan          & core       & pan          & core            & pan              \\ \hline
  74 % 50        & 1          & 163          & 1          & 118          & \textbf{5}      & \textbf{245}     \\
  75 % 51        & 1          & 291          & 1          & 194          & -               & -                \\
  76 % 52        & 1          & 412          & 1          & 258          & -               & -                \\
  77 % 53        & 1          & 508          & 1          & 321          & -               & -                \\
  78 % 54        & 4          & 617          & 2          & 372          & -               & -                \\
  79 % 55        & \textbf{5} & \textbf{692} & 2          & 409          & -               & -                \\
  80 % 56        & \textbf{5} & \textbf{761} & \textbf{3} & \textbf{445} & -               & -                \\
  81 % 57        & 4          & 832          & \textbf{3} & \textbf{459} & -               & -                \\
  82 % 58        & 4          & 905          & 2          & 477          & -               & -                \\
  83 % 59        & 4          & 976          & 2          & 497          & -               & -                \\
  84 % 60        & 2          & 1032         & 2          & 519          & 4               & 242              \\
  85 % 61        & 2          & 1113         & 2          & 553          & -               & -                \\
  86 % 62        & 2          & 1186         & 2          & 580          & -               & -                \\
  87 % 63        & 2          & 1264         & 2          & 607          & -               & -                \\
  88 % 64        & 2          & 1352         & 2          & 644          & -               & -                \\
  89 % 65        & 1          & 1454         & 2          & 685          & -               & -                \\
  90 % 66        & 1          & 1544         & 1          & 756          & -               & -                \\
  91 % 67        & 0          & 1652         & 1          & 838          & -               & -                \\
  92 % 68        & 0          & 1775         & 1          & 912          & -               & -                \\
  93 % 69        & 0          & 1886         & 1          & 1007         & -               & -                \\
  94 % 70        & 0          & 2000         & 1          & 1116         & 3               & 242              \\
  95 % 80        & 0          & 3541         & 0          & 2730         & 1               & 242              \\
  96 % 90        & 0          & 5703         & 0          & 5181         & 0               & 241              \\ \hline
  97 % \end{tabular}
  98 % \caption{Size of core and pan genomes w.r.t. the similarity threshold, first and third approaches.}
  99 % \label{Fig:sim:core:pan}
 100 % \end{table}
 101 %
 102 \begin{table}[h!]
 103 \centering
 104 \begin{tabular}{cccccccc}
 105 \hline
 106           & \multicolumn{4}{c}{Method 1}                          & \multicolumn{2}{c}{Method 2}       \\
 107 \hline
 108           & \multicolumn{2}{c}{NCBI}  & \multicolumn{2}{c}{DOGMA} & \multicolumn{2}{c}{NCBI and DOGMA} \\ \hline
 109 Threshold & core       & pan          & core       & pan          & core            & pan              \\
 110 \hline
 111 50        & 1          & 163          & 1          & 118          & \textbf{5}      & \textbf{245}     \\
 112 51        & 1          & 291          & 1          & 194          & -               & -                \\
 113 52        & 1          & 412          & 1          & 258          & -               & -                \\
 114 53        & 1          & 508          & 1          & 321          & -               & -                \\
 115 54        & 4          & 617          & 2          & 372          & -               & -                \\
 116 55        & \textbf{5} & \textbf{692} & 2          & 409          & -               & -                \\
 117 56        & \textbf{5} & \textbf{761} & \textbf{3} & \textbf{445} & -               & -                \\
 118 57        & 4          & 832          & \textbf{3} & \textbf{459} & -               & -                \\
 119 58        & 4          & 905          & 2          & 477          & -               & -                \\
 120 59        & 4          & 976          & 2          & 497          & -               & -                \\
 121 60        & 2          & 1032         & 2          & 519          & 4               & 242              \\
 122 61        & 2          & 1113         & 2          & 553          & -               & -                \\
 123 62        & 2          & 1186         & 2          & 580          & -               & -                \\
 124 63        & 2          & 1264         & 2          & 607          & -               & -                \\
 125 64        & 2          & 1352         & 2          & 644          & -               & -                \\
 126 65        & 1          & 1454         & 2          & 685          & -               & -                \\
 127 66        & 1          & 1544         & 1          & 756          & -               & -                \\
 128 67        & 0          & 1652         & 1          & 838          & -               & -                \\
 129 68        & 0          & 1775         & 1          & 912          & -               & -                \\
 130 69        & 0          & 1886         & 1          & 1007         & -               & -                \\
 131 70        & 0          & 2000         & 1          & 1116         & 3               & 242              \\
 132 80        & 0          & 3541         & 0          & 2730         & 1               & 242              \\
 133 90        & 0          & 5703         & 0          & 5181         & 0               &241              \\
 134 \hline
 135 \end{tabular}
 136 \caption{Size of core and pan genomes w.r.t. the similarity threshold, first and second
 137 approache.}
 138 \label{Fig:sim:core:pan}
 139 \end{table}
 140
 141 \subsubsection{Case study}% using NCBI annotations}
 142
 143 Let us now consider the 99 chloroplastic genomes introduced earlier.
 144 We will use in this case study either the coding sequences
 145 downloaded from NCBI website or the sequences predicted by DOGMA.
 146 DOGMA, which stands for \textit{Dual  Organellar GenoMe Annotator}, has
 147 already been evoked in this article. This is a
 148  tool  developed in 2004 at  University of Texas    for annotating plant
 149 chloroplast and  animal mitochondrial genomes.  This tool
 150 translates  a  genome  in all  six  reading frames  and then
 151 queries  its own     amino     acid     sequence     database     using
 152 Blast  (blastx~\cite{altschul1990basic})  with  various ad hoc
 153 parameters. The choice of DOGMA is natural, as this annotation tool is
 154 reputed and specific to chloroplasts.
 155
 156 Each genome is thus constituted
 157 by a list of coding sequences. In this illustration study,
 158 we have evaluated the similarity between two sequences by
 159 using a global alignment. More precisely, the measure $d$
 160 introduced above is the similarity score provided after
 161 a Needleman-Wunch global alignment, as obtained by running
 162 the \emph{needle} command from the \emph{emboss} package
 163 released by EMBL~\cite{Rice2000}. Parameters of the \emph{needle}
 164 command are the default ones: 10.0 for gap open penalty and 0.5
 165 for  gap extension.
 166
 167 The number of genes in the core genome and in the pan genome,
 168 according to this first method using data and measure described above
 169 have been computed using the supercomputer facilities of the M\'esocentre
 170 de calcul de Franche-Comt\'e. Obtained results are
 171 represented in  Table~\ref{Fig:sim:core:pan} with respect to various
 172 threshold values on Needleman-Wunsch similarity scores.
 173 Remark that when the threshold is large,
 174 we obtain more connected components, but with small sizes (a large number
 175 of genes, with a few numbers of alleles for each of them). In other words,
 176 %the number
 177 %of alleles of one gene is naturally small if the threshold is large.
 178 %Similarly,
 179 when the threshold is large, the
 180 pan genome is large too. %However due to the construction method of the
 181 %core genome,  this set of genes has few elements in such a  situation.
 182 No matter the chosen annotation tool, this first approach  suffers from producing
 183 too small core genomes,  for any chosen similarity threshold, compared
 184 to   what  is   usually   expected  by   biologists.
 185 %  regarding   these
 186 %chloroplasts and their commonly accepted evolutionary scenario.
 187 For NCBI, it is certainly due to a wrong determination of start and stop
 188 codons in some annotated genomes, due to a large variety of annotation
 189 tools used during genomes submission on the NCBI server, some of them
 190 being old or deficient: such truncated
 191 genes will not produce a large similarity score with their orthologous genes
 192 present in other genomes. The case of DOGMA is more
 193 difficult to explain as, according to our experiments and to the
 194 state of the art, this gene prediction tool produces normally good
 195 results in average. The best explanation of such an under-performance
 196 is that a few genomes are very specific and far from the remainder ones, in terms
 197 of gene contents, which leads to a small number of genes in the global core
 198 genome. However this first approach cannot help us to determine
 199 which genomes must be removed from our set of data. To do so, we
 200 need to introduce a second approach based on gene names: from the
 201 problematic gene names, we will be able to trace back to the
 202 problematic genomes.
 203
 204 %
 205 % This first illustration emphasizes the importance to deal with
 206 % well-predicted coding sequences: gene prediction must be
 207 % achieved with the same tool on each genome, and this tool must
 208 % be well chosen among the state of the art ones. As we will use
 209 % good annotation tools, a natural idea is to take advantage to
 210 % the whole produced annotations (not only the predicted coding
 211 % sequence, but its name and location too for instance). The implementation
 212 % of this idea is the principle of the approaches for finding core
 213 % and pan genomes detailed below.
 214 % % We are  then left with the following  questions: how can
 215 % we improve the confidence put in the produced core? That is, how to obtain
 216 % such a core genome without considering coding sequences provided by the NCBI?