-The field of genome annotation pays a lot of attentions where the
-ability to collect and analysis genomical data can provide strong
-indicators for the study of life\cite{Eisen2007}. Four of genome
-annotation centers (such as, \textit{NCBI\cite{Sayers01012011},
+
+In recent years, the cost of sequencing genomes has been greatly
+reduced, and thus more and more genomes are being sequenced. Automatic
+annotation tools are therefore required to deal with this continuously
+increasing amount of genomic data. Moreover, a reliable and accurate
+genome annotation process is needed in order to provide strong
+indicators for the study of life\cite{Eisen2007}.
+
+Various annotation tools (\emph{i.e.}, cost-effective sequencing
+methods\cite{Bakke2009}) producing genomic annotations at many levels
+of detail have been designed by different annotation centers. Among
+the major annotation centers we can mention NCBI\cite{Sayers01012011},
Dogma \cite{RDogma}, cpBase \cite{de2002comparative},
-CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}})
-present various types of annotation tools (\emph{i.e.} cost-effective
-sequencing methods\cite{Bakke2009}) on different annotation
-levels. Generally, previous studies used one of three methods for gene
-finding in annotated genome using these
-centers: \textit{alignment-based, composition based, or combination of
-both\cite{parra2007cegma}}. The alignment-based method is used when we
-try to predict a coding gene (\emph{i.e.}. genes that produce
-proteins) by aligning DNA sequence of gene to the protein of cDNA
-sequence of homology\cite{parra2007cegma}. This approach also is used
-in GeneWise\cite{birney2004genewise}. Composition-based method (known
+CpGAVAS \cite{liu2012cpgavas}, and
+CEGMA\cite{parra2007cegma}. Previous studies usually used one of
+three methods for finding genes in annotated genomes using data from
+these centers: \textit{alignment-based}, \textit{composition-based},
+or a combination of both~\cite{parra2007cegma}. The alignment-based
+method is used when trying to predict a coding gene (\emph{i.e.},
+a gene that produces a protein) by aligning a genomic DNA sequence
+with a cDNA sequence encoding a homologous
+protein \cite{parra2007cegma}.
+This approach is also used in GeneWise\cite{birney2004genewise}. The
+alternative method, the composition-based one (also known
as \textit{ab initio}) is based on a probabilistic model of gene
structure to find genes according to the gene value probability
-(GeneID\cite{parra2000geneid}). In this section, we consider a new
-method of finding core genes from large amount of chloroplast genomes,
-as a solution of the problem resulting from the method stated in
-section two. This method is based on extracting gene features. A
-general overview of the system is illustrated in Figure \ref{Fig1}.\\
-
-\begin{figure}[H]
+(GeneID \cite{parra2000geneid}). Such annotated genomic data will be
+used to overcome the limitation of the first method described in the
+previous section. The second method we propose finds core genes in a
+large number of chloroplast genomes through genomic feature
+extraction.
+
+Figure~\ref{Fig1} presents an overview of the entire method pipeline.
+More precisely, the second method consists of three
+stages: \textit{Genome annotation}, \textit{Core extraction},
+and \textit{Features Visualization}, which highlights the
+relationships between genomes. To understand the whole core
+extraction process, we briefly describe each stage below. More
+details will be given in the coming subsections. The method uses as
+its starting point a sequence database chosen among the many
+international databases storing nucleotide sequences, such as GenBank
+at NCBI \cite{Sayers01012011},
+\textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe,
+or \textit{DDBJ} \cite{sugawara2008ddbj} in Japan. Different
+biological tools can analyze and annotate genomes by interacting with
+these databases to align and extract sequences to predict genes. The
+database in our method must be taken from a reliable data source
+that stores annotated and/or unannotated chloroplast genomes. We have
+considered the GenBank-NCBI \cite{Sayers01012011} database as our
+sequence database: 99~chloroplast genomes were retrieved. These
+genomes belong to eleven types of chloroplast families, and
+Table~\ref{Tab2} summarizes their distribution in our dataset.
+
+\begin{figure}[h]
\centering
- \includegraphics[width=0.7\textwidth]{generalView}
-\caption{A general overview of the system}\label{Fig1}
+ \includegraphics[width=0.75\textwidth]{generalView}
+\caption{A general overview of the annotation-based approach}\label{Fig1}
\end{figure}
-In Figure 1, we illustrate the general overview of system pipeline: \textit{Database, Genomes annotation, Core extraction,} and \textit{relationships}. We will give a short discussion for each stage of the model in order to understand the whole core extraction process. This work starts with a gene Bank database; however, many international Banks for nucleotide sequence databases (such as, \textit{GenBank} \cite{Sayers01012011} in USA, \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe, and \textit{DDBJ} \cite{sugawara2008ddbj} in Japon) exist to store various genomes and DNA species. Different biological tools can analyse and annotate genomes by interacting with these databases to align and extract sequences to predict genes. The database in this model must be taken from any confident data source that stores annotated and/or unannotated chloroplast genomes. We consider GenBank-NCBI \cite{Sayers01012011} database to be our nucleotide sequences database. Annotation (as the second stage) is considered to be the first important task for extract gene features. Good annotation tool leads us to extract good gene feature. In this paper, two annotation techniques from \textit{NCBI, and Dogma} are used to extract \textit{genes features}. Extracting gene feature (as a third stage) can be anything like (genes names, gene sequences, protein sequence,...etc). Our methodology in this paper consider gene names, genes counts, and gene sequence for extracting core genes and producing chloroplast evolutionary tree. \\
-In last stage, features visualization represents methods to visualize genomes and/or gene evolution in chloroplast. We use the forms of tables, phylogenetic trees, graphs,...,etc to organize and represent genomes relationships to achieve the goal of representing gene evolution. In addition, comparing these forms with another annotation tool forms dedicated to large population of chloroplast genomes give us biological perspectives to the nature of chloroplasts evolution. \\
-A local database attached with each pipe stage is used to store all the informations of extraction process. The output from each stage in our system will be an input to the second stage and so on.
+Annotation, which is the first stage, is an important task for
+extracting gene features. Indeed, extracting good gene features
+obviously requires a good annotation tool. To obtain relevant
+annotated genomes, two annotation techniques from NCBI and Dogma are
+used. The gene features extracted in the next stage can be gene
+names, gene sequences, protein sequences, and so on. Our method
+considers gene names, gene counts, and gene sequences for extracting
+core genes and producing a chloroplast evolutionary tree. The final
+stage makes it possible to visualize genome and/or gene evolution in
+chloroplasts. We therefore use representations such as tables,
+phylogenetic trees, and graphs to organize and show genome
+relationships, and thus achieve the goal of representing gene
+evolution. In addition, comparing these representations with those
+obtained from another annotation tool dedicated to large populations
+of chloroplast genomes gives us biological insight into the nature of
+chloroplast evolution. Note that a local database linked to each
+pipeline stage is used to store all the information produced during
+the process.
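+
+As a minimal formal sketch of the core extraction stage, assuming the
+usual definition of a core genome as the set of genes shared by all
+the considered genomes (the symbols $G_i$, $n$, and $\mathit{Core}$
+are introduced here for illustration only), let $G_i$ denote the set
+of gene names annotated in the $i$-th genome. The core genome could
+then be written as
+\[
+  \mathit{Core} = \bigcap_{i=1}^{n} G_i ,
+\]
+with $n = 99$ for the dataset considered here.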
-\subsection{Genomes Samples}
-In this research, we retrieve genomes of Chloroplasts from NCBI. Ninety nine genome of them are considered to work with. These genomes lies in the eleven type of chloroplast families. The distribution of genomes is illustrated in detail in Table \ref{Tab2}.
-
\input{population_Table}
+
+% MICHEL : TO BE CONTINUED FROM HERE
+
\subsection{Genome Annotation Techniques}
Genome annotation is the second stage in the model pipeline. Many techniques have been developed to annotate chloroplast genomes, but they vary in the number and type of predicted genes (\emph{e.g.}, in their ability to predict \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques, from NCBI and Dogma, are considered here to analyse chloroplast genomes and to examine the accuracy of the predicted coding genes.