-The field of genome annotation pays a lot of attentions where the ability to collect and analysis genomical data can provide strong indicators for the study of life\cite{Eisen2007}. Four of genome annotation centers (such as, \textit{NCBI\cite{Sayers01012011}, Dogma \cite{RDogma}, cpBase \cite{de2002comparative}, CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}}) present various types of annotation tools (\emph{i.e.} cost-effective sequencing methods\cite{Bakke2009}) on different annotation levels. Generally, previous studies used one of three methods for gene finding in annotated genome using these centers: \textit{alignment-based, composition based, or combination of both\cite{parra2007cegma}}. The alignment-based method is used when we try to predict a coding gene (\emph{i.e.}. genes that produce proteins) by aligning DNA sequence of gene to the protein of cDNA sequence of homology\cite{parra2007cegma}. This approach also is used in GeneWise\cite{birney2004genewise}. Composition-based method (known as \textit{ab initio}) is based on a probabilistic model of gene structure to find genes according to the gene value probability (GeneID\cite{parra2000geneid}). In this section, we consider a new method of finding core genes from large amount of chloroplast genomes, as a solution of the problem resulting from the method stated in section two. This method is based on extracting gene features. A general overview of the system is illustrated in Figure \ref{Fig1}.\\
-\begin{figure}[H]
+In recent years the cost of sequencing genomes has been greatly
+reduced, and thus more and more genomes are being sequenced. Automatic
+annotation tools are therefore required to deal with this continuously
+increasing amount of genomic data. Moreover, a reliable and accurate
+genome annotation process is needed in order to provide strong
+indicators for the study of life\cite{Eisen2007}.
+
+Various annotation tools (\emph{i.e.}, cost-effective sequencing
+methods\cite{Bakke2009}) producing genomic annotations at many levels
+of detail have been designed by different annotation centers. Among
+the major annotation centers we can mention NCBI\cite{Sayers01012011},
+Dogma \cite{RDogma}, cpBase \cite{de2002comparative},
+CpGAVAS \cite{liu2012cpgavas}, and
+CEGMA\cite{parra2007cegma}. Usually, previous studies used one of
+three methods for finding genes in annotated genomes using data from
+these centers: \textit{alignment-based}, \textit{composition-based},
+or a combination of both~\cite{parra2007cegma}. The alignment-based
+method is used when trying to predict a coding gene (\emph{i.e.},
+a gene that produces a protein) by aligning a genomic DNA sequence with a
+cDNA sequence encoding a homologous protein \cite{parra2007cegma}.
+This approach is also used in GeneWise\cite{birney2004genewise}. The
+alternative method, the composition-based one (also known
+as \textit{ab initio}), is based on a probabilistic model of gene
+structure to find genes according to the gene value probability
+(GeneID \cite{parra2000geneid}). Such annotated genomic data will be
+used to overcome the limitation of the first method described in the
+previous section. In fact, the second method we propose finds core
+genes from a large number of chloroplast genomes through genomic
+feature extraction.
+
+Figure~\ref{Fig1} presents an overview of the entire method pipeline.
+More precisely, the second method consists of three
+stages: \textit{Genome annotation}, \textit{Core extraction},
+and \textit{Features Visualization}, which highlights the
+relationships between genomes. To understand the whole core extraction
+process, we briefly describe each stage below. More details will be
+given in the coming subsections. The method uses as its starting point
+a sequence database chosen among the many international databases
+storing nucleotide sequences, such as GenBank at NCBI \cite{Sayers01012011},
+\textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe,
+or \textit{DDBJ} \cite{sugawara2008ddbj} in Japan. Different
+biological tools can analyze and annotate genomes by interacting with
+these databases to align and extract sequences to predict genes. The
+database in our method must be taken from a trusted data source
+that stores annotated and/or unannotated chloroplast genomes. We have
+used the GenBank-NCBI \cite{Sayers01012011} database as our sequence
+database: 99~chloroplast genomes were retrieved. These genomes
+fall into eleven types of chloroplast families, and Table~\ref{Tab2}
+summarizes their distribution in our dataset.
+
+\begin{figure}[h]
\centering
- \includegraphics[width=0.7\textwidth]{generalView}
-\caption{A general overview of the system}\label{Fig1}
+ \includegraphics[width=0.75\textwidth]{generalView}
+\caption{A general overview of the annotation-based approach}\label{Fig1}
\end{figure}
-In Figure 1, we illustrate the general overview of system pipeline: \textit{Database, Genomes annotation, Core extraction,} and \textit{relationships}. We will give a short discussion for each stage of the model in order to understand the whole core extraction process. This work starts with a gene Bank database; however, many international Banks for nucleotide sequence databases (such as, \textit{GenBank} \cite{Sayers01012011} in USA, \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe, and \textit{DDBJ} \cite{sugawara2008ddbj} in Japon) exist to store various genomes and DNA species. Different biological tools can analyse and annotate genomes by interacting with these databases to align and extract sequences to predict genes. The database in this model must be taken from any confident data source that stores annotated and/or unannotated chloroplast genomes. We consider GenBank-NCBI \cite{Sayers01012011} database to be our nucleotide sequences database. Annotation (as the second stage) is considered to be the first important task for extract gene features. Good annotation tool leads us to extract good gene feature. In this paper, two annotation techniques from \textit{NCBI, and Dogma} are used to extract \textit{genes features}. Extracting gene feature (as a third stage) can be anything like (genes names, gene sequences, protein sequence,...etc). Our methodology in this paper consider gene names, genes counts, and gene sequence for extracting core genes and producing chloroplast evolutionary tree. \\
-In last stage, features visualization represents methods to visualize genomes and/or gene evolution in chloroplast. We use the forms of tables, phylogenetic trees, graphs,...,etc to organize and represent genomes relationships to achieve the goal of representing gene evolution. In addition, comparing these forms with another annotation tool forms dedicated to large population of chloroplast genomes give us biological perspectives to the nature of chloroplasts evolution. \\
-A local database attached with each pipe stage is used to store all the informations of extraction process. The output from each stage in our system will be an input to the second stage and so on.
-
-\subsection{Genomes Samples}
-In this research, we retrieve genomes of Chloroplasts from NCBI. Ninety nine genome of them are considered to work with. These genomes lies in the eleven type of chloroplast families. The distribution of genomes is illustrated in detail in Table \ref{Tab2}.
-
-\input{population_Table}
+Annotation, which is the first stage, is an important task for
+extracting gene features. Indeed, extracting good gene features
+requires a good annotation tool. To obtain relevant annotated
+genomes, two annotation techniques from NCBI and Dogma are used. The
+next stage, gene feature extraction, can target features such as gene
+names, gene sequences, protein sequences, and so on. Our method
+considers gene names, gene counts, and gene sequences for extracting
+core genes and producing a chloroplast evolutionary tree. The final
+stage makes it possible to visualize genome and/or gene evolution in
+chloroplasts. We therefore use representations such as tables,
+phylogenetic trees, graphs, etc. to organize and show the
+relationships between genomes, and thus achieve the goal of representing
+gene evolution. In addition, comparing these representations with those
+obtained from another annotation tool dedicated to a large population of
+chloroplast genomes gives us biological perspectives on the nature of
+chloroplast evolution. Note that a local database linked to each
+pipeline stage is used to store all the information produced during the
+process.
+
+\input{population_Table}
+
+% MICHEL : TO BE CONTINUED FROM HERE
\subsection{Genome Annotation Techniques}
Genome annotation is the second stage in the model pipeline. Many techniques were developed to annotate chloroplast genomes but the problem is that they vary in the number and type of predicted genes (\emph{i.e.} the ability to predict genes and \textit{for example: Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques from NCBI and Dogma are considered to analyse chloroplast genomes to examine the accuracy of predicted coding genes.
\subsubsection{Intersection Core Matrix (\textit{ICM})}
-The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores the \textit{Intersection Scores(IS)}. IS is the cardinality number of a core genes which comes from intersecting one genome with other ones. Maximum cardinality results to select two genomes with their maximum core. Mathematically speaking, if we have an $n \times n$ matrix where $n \text{is the number of genomes in local database}$, then lets consider:\\
+The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. The ICM is a two-dimensional symmetric matrix where each row and each column represents one genome. Each position in the ICM stores an \textit{Intersection Score (IS)}, that is, the cardinality of the core gene set obtained by intersecting one genome with another. The maximum cardinality selects the two genomes that share the largest core. Mathematically speaking, if we have an $n \times n$ matrix where $n$
+is the number of genomes in the local database, then let us consider:\\
+
\begin{equation}
Score=\max_{i<j}\vert x_i \cap x_j\vert
\label{Eq1}
\end{equation}
-where $x_i, x_j$ are elements in the matrix. The generation of a new core genes is depending on the cardinality value of intersection scores, we call it \textit{Score}:
+
+\noindent where $x_i$ and $x_j$ are the gene sets of the genomes associated with row~$i$ and column~$j$ of the matrix. The generation of a new core depends on the value of this maximum intersection cardinality, which we call \textit{Score}:
$$\text{New Core} = \begin{cases}
\text{Ignored} & \text{if $\textit{Score}=0$;} \\
\text{new Core id} & \text{if $\textit{Score}>0$.}