First modifications in section 3

diff --git a/annotated.tex b/annotated.tex

index e1889730f1cbbd38d2464dba9abf353905155088..9810db82204ceb90017929cefef1ae44667bef72 100644
--- a/annotated.tex
+++ b/annotated.tex
@@ -1,19 +1,82 @@
-The field of genome annotation pays a lot of attentions where the ability to collect and analysis genomical data can provide strong indicators for the study of life\cite{Eisen2007}. Four of genome annotation centers (such as, \textit{NCBI\cite{Sayers01012011}, Dogma \cite{RDogma}, cpBase \cite{de2002comparative}, CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}}) present various types of annotation tools (\emph{i.e.} cost-effective sequencing methods\cite{Bakke2009}) on different annotation levels. Generally, previous studies used one of three methods for gene finding in annotated genome using these centers: \textit{alignment-based, composition based, or combination of both\cite{parra2007cegma}}. The alignment-based method is used when we try to predict a coding gene (\emph{i.e.}. genes that produce proteins) by aligning DNA sequence of gene to the protein of cDNA sequence of homology\cite{parra2007cegma}. This approach also is used in GeneWise\cite{birney2004genewise}. Composition-based method (known as \textit{ab initio}) is based on a probabilistic model of gene structure to find genes according to the gene value probability (GeneID\cite{parra2000geneid}). In this section, we consider a new method of finding core genes from large amount of chloroplast genomes, as a solution of the problem resulting from the method stated in section two. This method is based on extracting gene features. A general overview of the system is illustrated in Figure \ref{Fig1}.\\
  
-\begin{figure}[H]  
+In recent years, the cost of sequencing genomes has been greatly
+reduced, and thus more and more genomes are sequenced. Therefore,
+automatic annotation tools are required to deal with this continuously
+increasing amount of genomic data. Moreover, a reliable and accurate
+genome annotation process is needed in order to provide strong
+indicators for the study of life~\cite{Eisen2007}.
+
+Various annotation tools (\emph{i.e.}, cost-effective sequencing
+methods~\cite{Bakke2009}) producing genomic annotations at many levels
+of detail have been designed by different annotation centers. Among
+the major annotation centers we can mention NCBI~\cite{Sayers01012011},
+Dogma~\cite{RDogma}, cpBase~\cite{de2002comparative},
+CpGAVAS~\cite{liu2012cpgavas}, and
+CEGMA~\cite{parra2007cegma}. Usually, previous studies used one of
+three methods for finding genes in annotated genomes using data from
+these centers: \textit{alignment-based}, \textit{composition-based},
+or a combination of both~\cite{parra2007cegma}. The alignment-based
+method is used when trying to predict a coding gene (\emph{i.e.},
+a gene that produces a protein) by aligning a genomic DNA sequence with a
+cDNA sequence encoding a homologous protein~\cite{parra2007cegma}.
+This approach is also used in GeneWise~\cite{birney2004genewise}. The
+alternative method, the composition-based one (also known
+as \textit{ab initio}), is based on a probabilistic model of gene
+structure to find genes according to the gene value probability
+(GeneID~\cite{parra2000geneid}). Such annotated genomic data will be
+used to overcome the limitation of the first method described in the
+previous section. In fact, the second method we propose finds core
+genes from a large number of chloroplast genomes through genomic
+feature extraction.
+
+Figure~\ref{Fig1} presents an overview of the entire method pipeline.
+More precisely, the second method consists of three
+stages: \textit{Genome annotation}, \textit{Core extraction},
+and \textit{Features Visualization}, which highlights the
+relationships between genomes. To understand the whole core extraction process, we
+briefly describe each stage below. More details will be given in the
+coming subsections. The method uses as its starting point a sequence
+database chosen among the many international databases storing
+nucleotide sequences, like GenBank at NCBI~\cite{Sayers01012011},
+\textit{EMBL-Bank}~\cite{apweiler1985swiss} in Europe,
+or \textit{DDBJ}~\cite{sugawara2008ddbj} in Japan. Different
+biological tools can analyze and annotate genomes by interacting with
+these databases to align and extract sequences to predict genes. The
+database in our method must be taken from a trusted data source
+that stores annotated and/or unannotated chloroplast genomes. We have
+chosen the GenBank-NCBI~\cite{Sayers01012011} database as our sequence
+database: 99~chloroplast genomes were retrieved. These genomes
+belong to eleven types of chloroplast families, and Table~\ref{Tab2}
+summarizes their distribution in our dataset.
+
+\begin{figure}[h]  
    \centering
-    \includegraphics[width=0.7\textwidth]{generalView}
-\caption{A general overview of the system}\label{Fig1}
+    \includegraphics[width=0.75\textwidth]{generalView}
+\caption{A general overview of the annotation-based approach}\label{Fig1}
  \end{figure}
  
-In Figure 1, we illustrate the general overview of system pipeline: \textit{Database, Genomes annotation, Core extraction,} and \textit{relationships}. We will give a short discussion for each stage of the model in order to understand the whole core extraction process. This work starts with a gene Bank database; however, many international Banks for nucleotide sequence databases (such as, \textit{GenBank} \cite{Sayers01012011} in USA, \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe, and \textit{DDBJ} \cite{sugawara2008ddbj} in Japon) exist to store various genomes and DNA species. Different biological tools can analyse and annotate genomes by interacting with these databases to  align and extract sequences to predict genes. The database in this model must be taken from any confident data source that stores annotated and/or unannotated chloroplast genomes. We consider GenBank-NCBI \cite{Sayers01012011} database to be our nucleotide sequences database. Annotation (as the second stage) is considered to be the first important task for extract gene features. Good annotation tool leads us to extract good gene feature. In this paper, two annotation techniques from \textit{NCBI, and Dogma} are used to extract \textit{genes features}. Extracting gene feature (as a third stage) can be anything like (genes names, gene sequences, protein sequence,...etc). Our methodology in this paper consider gene names, genes counts, and gene sequence for extracting core genes and producing chloroplast evolutionary tree. \\
-In last stage, features visualization represents methods to visualize genomes and/or gene evolution in chloroplast. We use the forms of tables, phylogenetic trees, graphs,...,etc to organize and represent genomes relationships to achieve the goal of representing gene evolution. In addition, comparing these forms with another annotation tool forms dedicated to large population of chloroplast genomes give us biological perspectives to the nature of chloroplasts evolution. \\
-A local database attached with each pipe stage is used to store all the informations of extraction process. The output from each stage in our system will be an input to the second stage and so on.
-
-\subsection{Genomes Samples}
-In this research, we retrieve genomes of Chloroplasts from NCBI. Ninety nine genome of them are considered to work with. These genomes lies in the eleven type of chloroplast families. The distribution of genomes is illustrated in detail in Table \ref{Tab2}.
-
-\input{population_Table}       
+Annotation, which is the first stage, is an important task for
+extracting gene features. Indeed, to extract good gene features, a good
+annotation tool is obviously required. To obtain relevant annotated
+genomes, two annotation techniques from NCBI and Dogma are used. The
+gene features extracted in the next stage can be anything like gene
+names, gene sequences, protein sequences, and so on. Our method
+considers gene names, gene counts, and gene sequences for extracting
+core genes and producing a chloroplast evolutionary tree. The final
+stage makes it possible to visualize genome and/or gene evolution in
+chloroplasts. To this end, we use representations like tables,
+phylogenetic trees, graphs, etc., to organize and show genome
+relationships, and thus achieve the goal of representing gene
+evolution. In addition, comparing these representations with ones
+obtained from another annotation tool dedicated to large populations of
+chloroplast genomes gives us biological perspectives on the nature of
+chloroplast evolution. Note that a local database linked with each
+pipeline stage is used to store all the information produced during the
+process.
+
+\input{population_Table}
+       
+% MICHEL : TO BE CONTINUED FROM HERE
  
  \subsection{Genome Annotation Techniques}
  Genome annotation is the second stage in the model pipeline. Many techniques were developed to annotate chloroplast genomes but the problem is that they vary in the number and type of predicted genes (\emph{i.e.} the ability to predict genes and \textit{for example: Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques from NCBI and Dogma are considered to analyse chloroplast genomes to examine the accuracy of predicted coding genes.   
@@ -77,12 +140,15 @@ The second pre-processing method states: we can predict the best annotated genom
  
  \subsubsection{Intersection Core Matrix (\textit{ICM})}
  
-The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores the \textit{Intersection Scores(IS)}. IS is the cardinality number of a core genes which comes from intersecting one genome with other ones. Maximum cardinality results to select two genomes with their maximum core. Mathematically speaking, if we have an $n \times n$ matrix where $n \text{is the number of genomes in local database}$, then lets consider:\\
+The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two-dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores the \textit{Intersection Score (IS)}. IS is the cardinality of the core gene set obtained by intersecting one genome with another one. The maximum cardinality selects the two genomes that share the largest core. Mathematically speaking, if we have an $n \times n$ matrix where $n$
+is the number of genomes in the local database, then let us consider:
+
  \begin{equation}
  Score=\max_{i<j}\vert x_i \cap x_j\vert
  \label{Eq1}
  \end{equation}
-where $x_i, x_j$ are elements in the matrix. The generation of a new core genes is depending on the cardinality value of intersection scores, we call it \textit{Score}:
+
+\noindent where $x_i$ and $x_j$ are the gene sets of the genomes associated with row $i$ and column $j$ of the matrix. Whether a new core is generated depends on this maximum intersection cardinality, which we call \textit{Score}:
  $$\text{New Core} = \begin{cases} 
  \text{Ignored} & \text{if $\textit{Score}=0$;} \\
  \text{new Core id} & \text{if $\textit{Score}>0$.}