\begin{figure}[h]
  \centering
    \includegraphics[width=0.75\textwidth]{Whole_system}
\caption{A general overview of the annotation-based approach}\label{Fig1}
\end{figure}

%Figure~\ref{Fig1} presents a general overview of the entire proposed pipeline
%for core and pan genomes production and exploitation, which consists of three stages: \textit{Genomes annotation}, \textit{Core extraction}, and \textit{Features Visualization}.
% To understand the whole core extraction process, we
% describe briefly each stage below. More details will be given in the
% coming subsections.
\color{red}In previous work~\cite{Alkindy2014}, we proposed a pipeline for extracting the core genome. In this work, the pipeline is considered together with a quality test method for extracting core genes (see Figure~\ref{Fig1}). As a starting point, the annotation stage uses a DNA sequence database % chosen among the many international databases storing %nucleotide sequences,
such as NCBI's GenBank~\cite{Sayers01012011}, the European \textit{EMBL} database~\cite{apweiler1985swiss}, or the Japanese \textit{DDBJ} one~\cite{sugawara2008ddbj}.
\color{black}
Furthermore, it is possible to obtain annotated genomes (DNA coding sequences with gene
names and locations) by interacting with these databases, either by directly downloading
the annotated genomes delivered by these websites, or by launching an
annotation tool on complete downloaded genomes.
Obviously, this annotation stage must be of high quality if we want
to obtain acceptable core and pan genomes.
% These last years the cost of sequencing genomes has been greatly
% reduced, and thus more and more genomes are sequenced. Therefore
% automatic annotation tools are required to deal with this continuously
% increasing amount of genomics data. %Moreover, a reliable and accurate
% %genome annotation process is needed in order to provide strong
%indicators for the study of life\cite{Eisen2007}.
%Various cost-effective annotation tools~\cite{Bakke2009} producing genomic annotations at many levels of detail have been designed recently, some reputed ones being: % NCBI~\cite{Sayers01012011}, DOGMA~\cite{RDOGMA}, cpBase~\cite{de2002comparative}, CpGAVAS~\cite{liu2012cpgavas}, and CEGMA~\cite{parra2007cegma}. Such tools usually use one out of the three following methods for finding gene locations in large DNA sequences: \textit{alignment-based}, \textit{composition based}, or a combination of both~\cite{parra2007cegma}. The alignment-based method is used when trying to predict a protein coding sequence by aligning a genomic DNA sequence with a cDNA sequence coding an already known homologous protein~\cite{parra2007cegma}. This approach is used for instance in GeneWise~\cite{birney2004genewise}. The alternative method, the composition-based one (also known as \textit{ab initio}) is based on probabilistic models of genes structure~\cite{parra2000geneid}. % to find genes according to the gene value probability
%(GeneID).

Using such annotated genomes, we will detail two general approaches for extracting the core genome, which is the third stage of the pipeline: the first one uses similarities computed on predicted coding sequences, while the second one uses all the information provided during the annotation stage.

\color{red}Instead of considering only gene sequences taken from NCBI or DOGMA, a quality test process takes place, working with both gene names and sequences to produce quality genes. However, we will show that such a simple idea is not so easy to realize, and that it is not sufficient to consider only the gene names provided by such tools, even though doing so gave good results in previous work~\cite{Alkindy2014}. \color{black}
%
%
Annotation, which is the first stage, is an important task for extracting gene features. Indeed, to extract good gene features, a good annotation tool is obviously required.
Such annotations can be used in various manners (based on gene names, gene sequences, protein sequences, etc.) to extract the core and pan genomes.
We will subsequently propose methods that use gene names and sequences for extracting core genes and producing a chloroplast evolutionary tree.

%\input{population_Table}
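
The core and pan genomes mentioned above can be stated compactly. This is only a sketch under stated assumptions: the symbols $n$, $G_i$, $\mathcal{C}$, and $\mathcal{P}$ are introduced here for illustration, and each annotated genome is assumed to be reduced to the set of its genes. Writing $G_1, \dots, G_n$ for the gene sets of the $n$ annotated genomes, and using the terms in their usual sense, the core genome gathers the genes shared by all genomes, while the pan genome gathers the genes present in at least one of them:
\[
  \mathcal{C} = \bigcap_{i=1}^{n} G_i ,
  \qquad
  \mathcal{P} = \bigcup_{i=1}^{n} G_i .
\]
As a toy example (not taken from our data), $G_1 = \{\textit{psbA}, \textit{rbcL}, \textit{matK}\}$ and $G_2 = \{\textit{psbA}, \textit{rbcL}\}$ give $\mathcal{C} = \{\textit{psbA}, \textit{rbcL}\}$ and $\mathcal{P} = \{\textit{psbA}, \textit{rbcL}, \textit{matK}\}$. The practical difficulty lies in deciding when two annotated genes count as the same element, by name, by sequence similarity, or by both.
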
The final stage of our pipeline, only touched upon in this article, is to take advantage
of the information produced during the core and pan genomes search.
This features visualization stage encompasses phylogenetic tree construction (see~\cite{Alkindy2014} for more details)
using core genes, gene content evolution illustrated by core trees, functionality
investigations, and so on.
%
%    allows to visualize genomes and/or gene evolution in chloroplast. Therefore we use representations like tables, phylogenetic trees, graphs, etc. to organize and show genomes relationships, and thus achieve the goal of representing gene
% evolution. In addition, comparing these representations with ones issued from another annotation tool dedicated to large population of chloroplast genomes give us biological perspectives to the nature of chloroplasts evolution. %Notice that a local database linked with each pipe stage is used to store all the information produced during the process.

For illustration purposes, we have considered % GenBank-NCBI~\cite{Sayers01012011} as sequence
%database:
99~chloroplast genomes downloaded from the GenBank database~\cite{Sayers01012011}. These genomes
fall into eleven types of chloroplast families (see~\cite{Alkindy2014} for more details).%as described in Table~\ref{Tab2}.
Furthermore, two kinds of annotations will be considered in this document, namely the
ones provided by NCBI on the one hand, and the ones provided by DOGMA on the other hand.
%The
%database in our method must be taken from any confident data source
%that stores annotated and/or unannotated chloroplast genomes.
% As stated in the previous section, we have
% considered GenBank-NCBI~\cite{Sayers01012011} as sequence
% database.