%\subsubsection{Using gene names provided by annotation tools}
%
%Instead of using the sequences predicted by annotation tools, we can
%try to use the names associated with these sequences, when available.
%The basic idea is thus to annotate all the sequences using a given
%software, and to consider as a core gene each sequence whose name can
%be found in all the genomes.
%Two annotation techniques will be used in the remainder of this article,
%namely DOGMA and NCBI.
%
%
%It is true that the NCBI annotations are of varying
%quality, and sometimes such annotations are totally erroneous. As stated before, this is due to the
%large variety of annotation tools that can be used during each
%sequence submission process. However, we also consider it in this
%article, as this database contains human-curated annotations. In other
%words, DOGMA automatic annotations are good on average, while
%NCBI contains very good human-based annotations together with very badly
%annotated genomes.
%Let us finally remark that DOGMA also predicts the locations of
%\textit{ribosomal RNA (rRNA)}, while they are not provided in
%gene features from NCBI. Thus core genomes constructed on NCBI
%data will not contain rRNA.

We now investigate the design of core and pan genomes
using each of the two annotation sources, NCBI and DOGMA, separately; this
constitutes the second approach detailed in this article. From now on, we
consider annotated genomes: either the ``gene features'' downloaded from the
NCBI or the output of DOGMA.

%\subsubsection{Names processing}
%
%As DOGMA is a deterministic annotation tool, when a given gene
%is detected in two genomes, the same name will be attached
%to the two coding sequences: DOGMA spells the two gene names in exactly
%the same manner. So each genome is replaced by a list of gene
%names, and finding the core genes common to two genomes simply
%consists in intersecting the two lists of genes. The sole problem
%we have detected using DOGMA on our 97 chloroplasts is the case
%of the RPS12 gene: some genomes contain RPS12\_3end
%or RPS12\_5end in the DOGMA result. We have manually
%considered that all these representatives belong to the same gene,
%namely RPS12.
%
%Dealing with NCBI names is more complicated, as various annotation
%tools have been used together with human annotations, and because there is
%no spelling rule for gene names. For instance, the NAD6 mitochondrial gene is
%sometimes written as ND6, while we can find RPOC1, RPOC1A, and RPOC1B in
%our chloroplasts. So if we simply consider NCBI data without
%treatment, intersecting two genomes provided as lists of gene names often
%leads to duplication of misspelled genes. Automatic name homogenization is thus required
%on NCBI annotations, the question being where to draw the line
%when correcting errors in the spelling of genes. In this second approach,
%we propose to automate only obvious modifications, like putting all names
%in capital letters and removing useless symbols such as ``\_'', ``('', and ``)''.
%Remark that such a simple renaming process cannot handle the situations of NAD6 or
%RPOC1 mentioned above. Going further in automatic corrections requires
%edit distances like the Levenshtein one; however, such a use will
%raise false positives (different genes with close names will be homogenized).
%To solve this problem, a compromise that reduces the number of false positives, by considering the similarity between DNA sequences of genes having similar names, will be detailed in the third approach.
%
%At this stage, we now consider that each genome is mapped to a list of gene
%names, where names have been homogenized in the NCBI case.
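%
%As a small illustration of the simple renaming described above (capital
%letters, removal of ``\_'', ``('', and ``)''), let $h$ denote this renaming.
%The raw spellings below are hypothetical and only serve as an example; they
%are not taken from the NCBI data. We would obtain
%\[
%h(\texttt{rpoc1})=h(\texttt{RPOC\_1})=h(\texttt{(RPOC1)})=\texttt{RPOC1},
%\qquad
%h(\texttt{rpoc1a})=\texttt{RPOC1A}\neq\texttt{RPOC1},
%\]
%so that the first three spellings are merged into a single name, whereas
%RPOC1A remains a distinct name, as expected from such a conservative rule.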
%\subsubsection{Core genes extraction}
%
%% The goal of  this stage is to extract maximum core genes from sets of
%% genes.  To find core genes, the following methodology is applied.
%%
%
%%\subsubsection{Intersection Core Matrix (\textit{ICM})}
%
%To extract core genes, we iteratively collect the maximum number of
%common genes between genomes; during this stage,
%an \textit{Intersection Core Matrix} (ICM) is therefore built. The ICM is a
%two-dimensional symmetric matrix where each row and each column corresponds
%to one genome. Hence, an element of the matrix stores
%the \textit{Intersection Score} (IS): the cardinality of the core
%gene set obtained by intersecting the two genomes.
%%Maximum  cardinality results in selecting the  two genomes having
%%the maximum score.
%Mathematically speaking, if we have $n$ genomes in the
%local database, the ICM is an $n \times n$ matrix whose elements
%satisfy:
%\begin{equation}
%score_{ij}=\vert g_i \cap g_j\vert
%\label{Eq1}
%\end{equation}
%\noindent where $1 \leq i \leq n$, $1 \leq j \leq n$, and $g_i, g_j$ are
%genomes. The generation of a new core genome obviously depends on the
%value of the intersection scores $score_{ij}$. More precisely, the
%idea is to consider a pair of genomes such that their score is the
%largest element in the ICM. These two genomes are then removed from the matrix
%and the resulting new core genome is added for the next iteration.
%The ICM is then updated to take into account the new core genome: new IS
%values are computed for it. This process is repeated until no new core
%genome can be obtained.
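%
%As a purely illustrative example of the intersection score
%$score_{ij}=\vert g_i \cap g_j\vert$ (the gene names below are hypothetical
%and are not taken from the chloroplast data), consider $n=3$ genomes reduced
%to sets of gene names:
%\[
%g_1=\{A,B,C,D\},\quad g_2=\{B,C,D,E\},\quad g_3=\{C,D,F\}.
%\]
%The intersection scores are then
%\[
%score_{12}=\vert\{B,C,D\}\vert=3,\quad
%score_{13}=\vert\{C,D\}\vert=2,\quad
%score_{23}=\vert\{C,D\}\vert=2.
%\]
%The largest element is $score_{12}$, so $g_1$ and $g_2$ are removed and
%replaced by the core genome $c_{12}=g_1\cap g_2=\{B,C,D\}$. At the next
%iteration, the only remaining score is
%$\vert c_{12}\cap g_3\vert=\vert\{C,D\}\vert=2$, which yields the final core
%genome $\{C,D\}$.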
%
%We can observe that the ICM is relatively large due to the number of
%species. As a consequence, the computation of the intersection scores is
%both time and memory consuming. However, since the ICM is a symmetric
%matrix, we can reduce the computation overhead by considering only its
%upper triangular part. The time complexity of this process %after
%%enhancement
%is thus $O(\frac{n(n-1)}{2})$, that is, $O(n^2)$ set intersections. Algorithm~\ref{Alg1:ICM}
%illustrates the construction of the ICM and the extraction of
%the core genomes, where \textit{GenomeList} represents the database
%storing all genome data. At each iteration, this algorithm computes the maximum
%core genome with its two parents (genomes).
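%
%To give an order of magnitude, and assuming that the initial ICM is built
%over the $n=97$ chloroplast genomes mentioned earlier, its upper triangular
%part contains
%\[
%\frac{n(n-1)}{2}=\frac{97\times 96}{2}=4656
%\]
%intersection scores, instead of the $n^2=9409$ entries of the full matrix.
%This count only covers the first iteration; the scores computed for each
%newly added core genome come in addition.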
%
%% ALGORITHM HAS BEEN REWRITTEN
%
%\begin{algorithm}[H]
%\caption{Extract Maximum Intersection Score}
%\label{Alg1:ICM}
%\begin{algorithmic}
%\REQUIRE $L \leftarrow \text{genomes sets}$
%\ENSURE $B1 \leftarrow \text{Max Core set}$
%\FOR{$i \leftarrow 1:len(L)-1$}
%    \STATE $score \leftarrow 0$
%    \STATE $core1 \leftarrow set(GenomeList[L[i]])$
%    \STATE $g1 \leftarrow L[i]$
%    \FOR{$j \leftarrow i+1:len(L)$}
%        \STATE $core2 \leftarrow set(GenomeList[L[j]])$
%        \STATE $core \leftarrow core1 \cap core2$
%        \IF{$len(core) > score$}
%            \STATE $score \leftarrow len(core)$
%            \STATE $g2 \leftarrow L[j]$
%        \ENDIF
%    \ENDFOR
%    \STATE $B1[score] \leftarrow (g1,g2)$
%\ENDFOR
%\RETURN $max(B1)$
%\end{algorithmic}
%\end{algorithm}
%
%For the complete core trees based either on NCBI names or on DOGMA ones, see \url{http://members.femto-st.fr/christophe-guyeux/}.
%%\color{red} The second approach is dependent on gene name spelling. When performing a simple homogenization of the names provided by NCBI, we miss core genes which have slightly different name formats. A good annotation tool is therefore highly required. \color{black}
%
%%\subsection{Features visualization}
%%The last stage of the proposed pipeline is naturally to take advantage
%%of the produced core and pan genomes for biological studies. As
%%this key stage is not directly related to the methodology for core
%%and pan genomes discovery, we will only outline a few tasks that
%%can be performed on the produced data.
%%
%%\begin{figure}
%%\centering
%%\includegraphics[scale=0.215]{tree}
%%\caption{Part of a core genomes evolutionary tree (NCBI gene names)}
%%\label{coreTree}
%%\end{figure}
%%
%%The obtained results may be visualized by building an evolutionary tree of core genomes.
%%% All core  genes generated  represent  an important information  in the  tree,
%%% because they  provide ancestor information of two  or more
%%% genomes.
%%Each node in this tree represents a chloroplast genome or
%%a predicted core, as depicted in Figure~\ref{coreTree}. In this
%%figure, node labels are of the form
%%\textit{(Genes number:Family name\_Scientific name\_Accession number)},
%%while an edge is labeled with the number of
%%genes lost when compared to its parent (a leaf genome or an intermediate
%%core genome). Such numbers can answer questions like:
%%how many genes differ between two species? Which functionality has
%%been lost between an ancestor and its children? For complete core trees
%%based either on NCBI names or on DOGMA ones, see supplementary data.
%%
%%
%%
%%A second application of such data is obviously to build accurate phylogenetic
%%trees, using tools like
%%PHYML~\cite{guindon2005phyml} or
%%RAxML~\cite{stamatakis2008raxml,stamatakis2005raxml}.
%%Given a set of species, the last common core genome in the core tree
%%contains all the genes shared by these species. These genes may be
%%multiple-aligned to serve as input of the phylogenetic tools mentioned above.
%%An example of such a phylogenetic tree on core 58 (NCBI cores tree, see
%%supplementary data) is provided in Appendix~\ref{philoTree}. Remark that, in
%%order to constitute a relevant outgroup, we have simply blasted each gene
%%of this core on a chosen \emph{Cyanobacteria}.
%%