Modifications made in section 4

[chloroplast13.git] / annotated.tex
diff --git a/annotated.tex b/annotated.tex

index 720adc5c3881ac5ed4560770f25e4ea178e9cd65..6bd0d552b050c368d00b966d3b1566925f4b47d9 100644 (file)
--- a/annotated.tex
+++ b/annotated.tex
@@ -1 +1,423 @@
- sdfdfadqdqaaaaaaaaaaaaaaaaaaa
+
+These  last years  the cost  of  sequencing genomes  has been  greatly
+reduced,  and thus  more and  more genomes  are  sequenced.  Therefore
+automatic annotation tools are required to deal with this continuously
+increasing amount of genomical data. Moreover, a reliable and accurate
+genome  annotation  process  is  needed  in order  to  provide  strong
+indicators for the study of life\cite{Eisen2007}.
+
+Various  annotation   tools  (\emph{i.e.},  cost-effective  sequencing
+methods\cite{Bakke2009}) producing genomic  annotations at many levels
+of detail  have been designed  by different annotation  centers. Among
+the major annotation  centers we can notice NCBI\cite{Sayers01012011},
+Dogma       \cite{RDogma},       cpBase      \cite{de2002comparative},
+CpGAVAS                   \cite{liu2012cpgavas},                   and
+CEGMA\cite{parra2007cegma}. Usually, previous  studies used one out of
+three methods  for finding  genes in annoted  genomes using  data from
+these  centers: \textit{alignment-based},  \textit{composition based},
+or a  combination of both~\cite{parra2007cegma}.   The alignment-based
+method  is used  when trying  to predict  a coding  gene (\emph{i.e.}.
+genes that produce proteins) by aligning a genomic DNA sequence with a
+cDNA  sequence  coding  an homologous  protein  \cite{parra2007cegma}.
+This approach is  also used in GeneWise\cite{birney2004genewise}.  The
+alternative   method,   the    composition-based   one   (also   known
+as  \textit{ab initio})  is based  on  a probabilistic  model of  gene
+structure  to  find genes  according  to  the  gene value  probability
+(GeneID \cite{parra2000geneid}).  Such  annotated genomic data will be
+used to overcome  the limitation of the first  method described in the
+previous section.   In fact, the  second method we propose  finds core
+genes  from  large  amount  of  chloroplast  genomes  through  genomic
+features extraction.
+
+Figure~\ref{Fig1} presents an overview  of the entire method pipeline.
+More    precisely,    the   second    method    consists   of    three
+stages:   \textit{Genome    annotation},   \textit{Core   extraction},
+and    \textit{Features    Visualization}    which   highlights    the
+relationships.  To  understand the  whole core extraction  process, we
+describe briefly each  stage below. More details will  be given in the
+coming subsections.   The method uses as starting  point some sequence
+database  chosen  among   the  many  international  databases  storing
+nucleotide sequences, like  the GenBank at NBCI \cite{Sayers01012011},
+the    \textit{EMBL-Bank}     \cite{apweiler1985swiss}    in    Europe
+or   \textit{DDBJ}   \cite{sugawara2008ddbj}   in  Japan.    Different
+biological tools can analyze  and annotate genomes by interacting with
+these databases to  align and extract sequences to  predict genes. The
+database in  our method must be  taken from any  confident data source
+that stores annotated and/or unannotated chloroplast genomes.  We have
+considered the GenBank-NCBI \cite{Sayers01012011} database as sequence
+database:  99~genomes of chloroplasts  were retrieved.   These genomes
+lie in  the eleven type  of chloroplast families and  Table \ref{Tab2}
+summarizes their distribution in our dataset.
+
+\begin{figure}[h]  
+  \centering
+    \includegraphics[width=0.75\textwidth]{generalView}
+\caption{A general overview of the annotation-based approach}\label{Fig1}
+\end{figure}
+
+Annotation,  which  is the  first  stage,  is  an important  task  for
+extracting gene features. Indeed, to extract good gene feature, a good
+annotation tool  is obviously  required. To obtain  relevant annotated
+genomes, two annotation  techniques from NCBI and Dogma  are used. The
+extraction of gene feature, the  next stage, can be anything like gene
+names,  gene  sequences, protein  sequences,  and  so  on. Our  method
+considers gene  names, gene counts,  and gene sequence  for extracting
+core  genes and  producing  chloroplast evolutionary  tree. The  final
+stage   allows  to   visualize  genomes   and/or  gene   evolution  in
+chloroplast.    Therefore   we   use  representations   like   tables,
+phylogenetic  trees,  graphs,  etc.   to  organize  and  show  genomes
+relationships,  and  thus  achieve   the  goal  of  representing  gene
+evolution.   In addition,  comparing these  representations  with ones
+issued from  another annotation tool dedicated to  large population of
+chloroplast genomes  give us biological perspectives to  the nature of
+chloroplasts evolution. Notice that  a local database linked with each
+pipe stage is  used to store all the  informations produced during the
+process.
+
+\input{population_Table}
+       
+\subsection{Genome annotation techniques}
+
+For  the first  stage, genome  annotation, many  techniques  have been
+developed  to annotate chloroplast  genomes.  These  techniques differ
+from  each others  in  the number  and  type of  predicted genes  (for
+example:  \textit{Transfer  RNA   (tRNA)}  and  \textit{Ribosomal  RNA
+(rRNA)}  genes). Two  annotation techniques  from NCBI  and  Dogma are
+considered to analyze chloroplast genomes.
+
+\subsubsection{Genome annotation from NCBI} 
+
+The objective  is to generate sets  of genes from each  genome so that
+genes are organized  without any duplication.  The input  is a list of
+chloroplast genomes  annotated from NCBI. More  precisely, all genomes
+are stored as \textit{.fasta} files  which consists in a collection of
+protein  coding genes\cite{parra2007cegma,RDogma}  (gene  that produce
+proteins) organized in coding sequences.   To be able build the set of
+core    genes,     we    need    to     preprocess    these    genomes
+using  \textit{BioPython}  package \cite{chapman2000biopython}.   This
+step  starts by  converting  each  genome from  FASTA  file format  to
+GenVision \cite{geneVision}  format from DNASTAR. Each  genome is thus
+converted in  a list of genes,  with gene names and  gene counts. Gene
+name duplications can be accumulated during the treatment of a genome.
+These  duplications  come   from  gene  fragments  (\emph{e.g.}   gene
+fragments treated  with NCBI) and  from chloroplast DNA  sequences. To
+ensure that  all the  duplications are removed,  each list of  gene is
+translated  into a  set of  genes.  Note that  NCBI genome  annotation
+produces genes except \textit{Ribosomal (rRNA)} genes.
+
+\subsubsection{Genome annotation from Dogma}
+
+Dogma stands for \textit{Dual  Organellar GenoMe Annotator}.  It is an
+annotation tool  developed at  University of Texas  in 2004  for plant
+chloroplast and  animal mitochondrial genomes.  This tool  has its own
+database  for translating  a  genome  in all  six  reading frames  and
+queries     the     amino     acid     sequence     database     using
+BLAST  \cite{altschul1990basic}  (\emph{i.e.}   Blastx)  with  various
+parameters.  Protein  coding genes are  identified in an  input genome
+using sequence similarity of genes  in Dogma database.  In addition in
+comparison   with   NCBI    annotation   tool,   Dogma   can   produce
+both \textit{Transfer RNAs (tRNA)} and \textit{Ribosomal RNAs (rRNA)},
+verify their start and end  positions. Another difference is also that
+there  is   no  gene  duplication   with  Dogma  after   solving  gene
+fragmentation. In  fact, genome annotation  with Dogma can be  the key
+difference when extracting core genes.
+
+The Dogma  annotation process  is divided into  two tasks.   First, we
+manually annotate chloroplast genomes using Dogma web tool. The output
+of this step is supposed to  be a collection of coding genes files for
+each genome, organized in GeneVision file. The second task is to solve
+the  gene   duplication  problem  and   therefore  we  have   use  two
+methods. The first method, based  on gene name, translates each genome
+into a set  of genes without duplicates. The  second method avoid gene
+duplication  through a  defragment  process. In  each iteration,  this
+process  starts by taking  a gene  from gene  list, searches  for gene
+duplication, if a duplication is found, it looks on the orientation of
+the  fragment sequence.   If it  is positive  it appends  directly the
+sequence to  gene files.  Otherwise reverse  complement operations are
+applied  on the sequence,  which is  then also  append to  gene files.
+Finally, a  check for missing start  and stop codons  is performed. At
+the  end  of  the  annotation  process,  all  the  genomes  are  fully
+annotated,  their   genes  are  defragmented,  and   gene  counts  are
+available.
+
+\subsection{Core genes extraction}
+
+The goal of  this stage is to extract maximum core  genes from sets of
+genes.  To find core genes, the following methodology is applied.
+
+\subsubsection{Preprocessing}
+
+In order  to extract  core genomes in  a suitable manner,  the genomic
+data are preprocessed with two methods: on the one hand a method based
+on gene  name and count,  and on  the other hand  a method based  on a
+sequence quality control test.
+
+In the first method, we extract  a list of genes from each chloroplast
+genome.  Then we store this list of genes in the database under genome
+nam and  genes counts can be  extracted by a  specific length command.
+The \textit{Intersection  Core Matrix}, described  in next subsection,
+is then  computed to  extract the core  genes.  The problem  with this
+method can be stated as follows: how can we ensure that the gene which
+is  predicted in  core genes  is the  same gene  in leaf  genomes? The
+answer  to this problem  is that  if the  sequences of  any gene  in a
+genome annotated  from Dogma  and NCBI are  similar with respect  to a
+given  threshold,  then   we  do  not  have  any   problem  with  this
+method. When the sequences are  not similar we have a problem, because
+we cannot decide which sequence belongs to a gene in core genes.
+
+The second method is based on  the underlying idea: we can predict the
+the best annotated  genome by merging the annotated  genomes from NCBI
+and Dogma according to a quality test on genes names and sequences. To
+obtain all  quality genes  of each genome,  we consider  the following
+hypothesis: any gene  will appear in the predicted  genome if and only
+if the  annotated genes  in NCBI and  Dogma pass a  specific threshold
+of  \textit{quality  control test}.   In  fact,  the Needle-man  Wunch
+algorithm  is applied  to compare  both  sequences with  respect to  a
+threshold. If  the alignment  score is above  the threshold,  then the
+gene will be  retained in the predicted genome,  otherwise the gene is
+ignored.   Once    the   prediction   of   all    genomes   is   done,
+the \textit{Intersection Core Matrix} is computed on these new genomes
+to extract core genes, as explained in Algorithm \ref{Alg3:thirdM}.
+
+\begin{algorithm}[H]
+\caption{Extract new genome based on gene quality test}
+\label{Alg3:thirdM}
+\begin{algorithmic} 
+\REQUIRE $Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 65$
+\ENSURE $geneList \leftarrow \text{Quality genes}$ 
+\STATE $dir(NCBI\_Genes) \leftarrow \text{NCBI genes of Gname}$
+\STATE $dir(Dogma\_Genes) \leftarrow \text{Dogma genes of Gname}$
+\STATE $geneList=\text{empty list}$
+\STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
+\FOR{$\text{gene in common}$}
+       \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$   
+       \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
+       \STATE $score \leftarrow geneChk(g1,g2)$
+       \IF {$score > Threshold$}
+               \STATE $geneList \leftarrow gene$
+       \ENDIF 
+\ENDFOR
+\RETURN $geneList$
+\end{algorithmic}
+\end{algorithm}
+
+\textbf{geneChk} is a subroutine used to find the best similarity score between 
+two gene sequences after applying operations like \textit{reverse}, {\it complement}, 
+and {\it reverse complement}. Algorithm~\ref{Alg3:genechk} gives the outline of 
+geneChk subroutine.
+
+\begin{algorithm}[H]
+\caption{Find the Maximum Similarity Score between two sequences}
+\label{Alg3:genechk}
+\begin{algorithmic} 
+\REQUIRE $g1,g2 \leftarrow \text{NCBI gene sequence, Dogma gene sequence}$
+\ENSURE $\text{Maximum similarity score}$
+\STATE $score1 \leftarrow needle(g1,g2)$
+\STATE $score2 \leftarrow needle(g1,Reverse(g2))$
+\STATE $score3 \leftarrow needle(g1,Complement(g2))$
+\STATE $score4 \leftarrow needle(g1,Reverse(Complement(g2)))$
+\RETURN $max(score1,score2,score3,score4)$
+\end{algorithmic}
+\end{algorithm}  
+
+\subsubsection{Intersection Core Matrix (\textit{ICM})}
+
+To extract  core genes, we  iteratively collect the maximum  number of
+common  genes   between  genomes  and  therefore   during  this  stage
+an \textit{Intersection  Core Matrix}  (ICM) is built.   ICM is  a two
+dimensional symmetric matrix where each row and each column correspond
+to   one   genome.   Hence,   an   element   of   the  matrix   stores
+the  \textit{Intersection Score}  (IS):  the cardinality  of the  core
+genes   set  obtained   by  intersecting   one  genome   with  another
+one. Maximum  cardinality results in selecting the  two genomes having
+the maximum score. Mathematically speaking, if we have $n$ genomes in
+local database, the ICM is an $n \times n$ matrix whose elements
+satisfy: 
+\begin{equation}
+score_{ij}=\vert g_i \cap g_j\vert
+\label{Eq1}
+\end{equation}
+\noindent where $1 \leq i \leq n$, $1 \leq j \leq n$, and $g_i, g_j$ are 
+genomes. The  generation of a new  core gene depends  obviously on the
+value  of the  intersection scores  $score_{ij}$. More  precisely, the
+idea is  to consider a  pair of genomes  such that their score  is the
+largest element in ICM. These two genomes are then removed from matrix
+and the  resulting new  core genome is  added for the  next iteration.
+The ICM is then updated to take into account the new core gene: new IS
+values are computed for it. This process is repeated until no new core
+gene can be obtained.
+
+We  can observe  that  the ICM  is very  large  due to  the amount  of
+data. As a consequence, the  computation of the intersection scores is
+both  time and  memory consuming.  However,  since ICM  is a  symetric
+matrix we can reduce the  computation overhead by considering only its
+triangular  upper part.  The  time complexity  for this  process after
+enhancement is thus $O(\frac{n.(n-1)}{2})$.  Algorithm ~\ref{Alg1:ICM}
+illustrates the construction  of the ICM matrix and  the extraction of
+the  core  genes, where  \textit{GenomeList}  represents the  database
+storing all genomes  data. At each iteration, it  computes the maximum
+core genes with its two genomes parents.
+
+% ALGORITHM HAS BEEN REWRITTEN
+
+\begin{algorithm}[H]
+\caption{Extract Maximum Intersection Score}
+\label{Alg1:ICM}
+\begin{algorithmic} 
+\REQUIRE $L \leftarrow \text{genomes sets}$
+\ENSURE $B1 \leftarrow \text{Max Core set}$ 
+\FOR{$i \leftarrow 1:len(L)-1$}
+        \STATE $score \leftarrow 0$
+       \STATE $core1 \leftarrow set(GenomeList[L[i]])$
+       \STATE $g1 \leftarrow L[i]$
+       \FOR{$j \leftarrow i+1:len(L)$}
+               \STATE $core2 \leftarrow set(GenomeList[L[j]])$
+               \STATE $core \leftarrow core1 \cap core2$
+               \IF{$len(core) > score$}
+                  \STATE $score \leftarrow len(core)$
+                 \STATE $g2 \leftarrow L[j]$
+                \ENDIF
+       \ENDFOR
+       \STATE $B1[score] \leftarrow (g1,g2)$
+\ENDFOR
+\RETURN $max(B1)$
+\end{algorithmic}
+\end{algorithm}
+
+\subsection{Features visualization}
+
+The goal is to visualize results  by building a tree of evolution. All
+core  genes generated  represent  an important information  in the  tree,
+because they  provide ancestor information of two  or more
+genomes. Each  node in the  tree represents one chloroplast  genome or
+one predicted core and labelled as \textit{(Genes count:Family name\_Scientific
+names\_Accession number)}. While an edge is labelled with the number of
+lost  genes from  a leaf  genome or  an intermediate  core  gene. Such
+numbers are  very interesting because  they give an  information about
+the evolution:  how many genes  were lost between two  species whether
+they  belong  to  the  same  family  or not.   By  the  principle  of
+classification, a  small number of genes lost  among species indicates
+that those species are close to  each other and belong to same family,
+while a  large lost  means that we  have an  evolutionary relationship
+between species  from different families. To depict  the links between
+species   clearly,  we   built   a  phylogenetic   tree  showing   the
+relationships based on the distances among genes sequences. Many tools
+are    available   to    obtain    a   such    tree,   for    example:
+PHYML\cite{guindon2005phyml},
+RAxML{\cite{stamatakis2008raxml,stamatakis2005raxml},    BioNJ,    and
+TNT\cite{goloboff2008tnt}}.    In   this  work,   we   chose  to   use
+RAxML\cite{stamatakis2008raxml,stamatakis2005raxml}   because   it  is
+fast, accurate,  and can build large  trees when dealing  with a large
+number of genomic sequences.
+
+The procedure used to built a phylogenetic tree is as follows:
+\begin{enumerate}
+\item For each gene in a core gene, extract its sequence and store it in the database.
+\item Use multiple alignment tools such as (****to be write after see christophe****) 
+to align these sequences with each others.
+\item Use an outer-group genome from cyanobacteria to calculate distances.
+\item Submit the resulting aligned sequences to RAxML program to compute 
+the distances and finally draw the phylogenetic tree.
+\end{enumerate} 
+
+\begin{figure}[H]
+  \centering \includegraphics[width=0.75\textwidth]{Whole_system} \caption{Overview
+  of the pipeline}\label{wholesystem}
+\end{figure}
+
+\section{Implementation}
+
+The different  algorithms have  been implemented using  Python version
+2.7,  on  a  laptop  running Ubuntu~12.04~LTS.   More  precisely,  the
+computer is a Dell Latitude laptop - model E6430 with 6~GiB memory and
+a  quad-core Intel  core~i5~processor with  an operating  frequency of
+2.5~GHz. Many python packages  such as os, Biopython, memory\_profile,
+re,  numpy, time,  shutil, and  xlsxwriter were  used to  extract core
+genes  from large  amount of  chloroplast  genomes.
+
+\begin{center}
+\begin{table}[b]
+\caption{Type of annotation, execution time, and core genes 
+for each method}\label{Etime}
+{\scriptsize
+\begin{tabular}{p{2cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.2cm}}
+\hline\hline
+ Method & \multicolumn{2}{c}{Annotation} & \multicolumn{2}{c}{Features} & \multicolumn{2}{c}{Exec. time (min.)} & \multicolumn{2}{c}{Core genes} & \multicolumn{2}{c}{Bad genomes} \\
+~ & N & D & Name & Seq & N & D & N & D & N & D \\
+\hline
+Gene prediction & $\surd$ & - & - & $\surd$ & ? & - & ? & - & 0 & -\\[0.5ex]
+Gene features & $\surd$ & $\surd$ & $\surd$ & - & 4.98 & 1.52 & 28 & 10 & 1 & 0\\[0.5ex]
+Gene quality & $\surd$ & $\surd$ & $\surd$ & $\surd$ & \multicolumn{2}{c}{$\simeq$3 days + 1.29} & \multicolumn{2}{c}{4} & \multicolumn{2}{c}{1}\\[1ex]
+\hline
+\end{tabular}
+}
+\end{table}
+\end{center} 
+
+\vspace{-1cm}
+
+Table~\ref{Etime}  presents  for  each  method  the  annotation  type,
+execution time,  and the  number of core  genes. We use  the following
+notations:  \textbf{N}  denotes NCBI,  while  \textbf{D} means  DOGMA,
+and \textbf{Seq}  is for sequence. The first  {\it Annotation} columns
+represent the algorithm used to annotate chloroplast genomes, the {\it
+Features} columns mean  the kind  of gene feature used to extract core
+genes: gene name, gene sequence, or  both of them. It can be seen that
+almost all methods need low {\it Execution time} to extract core genes
+from large chloroplast genome.   Only the gene quality method requires
+several days of computation (about 3-4 days) for sequence comparisons,
+once the quality genomes are  construced it takes just 1.29~minutes to
+extract core gene. Thanks to this low execution times we can use these
+methods to extract core genes  on a personal computer rather than main
+frames or parallel computers. The lowest execution time: 1.52~minutes,
+is obtained with the second method using Dogma annotations. The number
+of {\it  Core genes} represents the  amount of genes in  the last core
+genome. The main goal is to  find the maximum core genes that simulate
+biological background of chloroplasts. With  NCBI we have 28 genes for
+96   genomes,   instead   of    10   genes   for   97   genomes   with
+Dogma. Unfortunately, the biological distribution of genomes with NCBI
+in core tree do not  reflect good biological perspective, whereas with
+DOGMA the  distribution of genomes is biologically  relevant. {\it Bad
+genomes} gives  the number of genomes  that destroy core  genes due to
+low  number  of  gene  intersection.  \textit{NC\_012568.1  Micromonas
+pusilla} is the only genome which destroyed the core genome with NCBI
+annotations for both gene features and gene quality methods.
+
+The second important factor is the amount of memory being used by each
+methodology.   Table   \ref{mem}  shows  the  memory   usage  of  each
+method.  We  used  a  package from  PyPI~(\textit{the  Python  Package
+Index})     named     \textit{Memory\_profile}    (located     at~{\tt
+https://pypi.python.org/pypi})   to   extract   all  the   values   in
+table~\ref{mem}. In  this table, the values are  presented in megabyte
+unit and \textit{gV} means  genevision~file~format. We can notice that
+the level  of memory which is  used is relatively low  for all methods
+and is available  on any personal computer. The  different values also
+show that the gene features  method based on Dogma annotations has the
+more   reasonable   memory   usage,   except  when   extracting   core
+sequences. The third method gives the lowest values if we already have
+the   quality   genomes,   otherwise   it  will   consume   far   more
+memory. Moreover, the  amount of memory used by  the third method also
+depends on the size of each genome.
+
+\begin{center}
+\begin{table}[H]
+\caption{Memory usages in (MB) for each methodology}\label{mem}
+{\scriptsize
+\begin{tabular}{p{2.5cm}p{1.5cm}p{1cm}p{1cm}p{1cm}p{1cm}p{1cm}p{1cm}}
+\hline\hline
+Method& & Load Gen. & Conv. gV & Read gV & ICM & Core tree & Core Seq. \\
+\hline
+Gene prediction & ~ & ~ & ~ & ~ & ~ & ~ & ~\\
+\multirow{2}{*}{Gene Features} & NCBI & 15.4 & 18.9 & 17.5 & 18 & 18 & 28.1\\
+              & DOGMA& 15.3 & 15.3 & 16.8 & 17.8 & 17.9 & 31.2\\
+Gene Quality  & ~ & 15.3 & $\le$3G & 16.1 & 17 & 17.1 & 24.4\\ 
+\hline
+\end{tabular}
+}
+\end{table}
+\end{center}  
+
+
+
+
+