+Annotation, the first stage, is essential for extracting gene
+features: good gene features require a good annotation tool. To
+obtain relevant annotated genomes, two annotation techniques, from
+NCBI and Dogma, are used. The next stage, gene feature extraction,
+can target gene names, gene sequences, protein sequences, and so on.
+Our method considers gene names, gene counts, and gene sequences for
+extracting core genes and producing the chloroplast evolutionary
+tree. The final stage allows us to visualize genome and/or gene
+evolution in chloroplasts. To this end, we use representations such
+as tables, phylogenetic trees, and graphs to organize and show
+genome relationships, and thus achieve the goal of representing gene
+evolution. In addition, comparing these representations with ones
+obtained from another annotation tool dedicated to large populations
+of chloroplast genomes gives us biological perspectives on the nature
+of chloroplast evolution. Note that a local database linked with each
+pipeline stage is used to store all the information produced during
+the process.
+
+\input{population_Table}
+
+\subsection{Genome annotation techniques}
+
+For the first stage, genome annotation, many techniques have been
+developed to annotate chloroplast genomes. These techniques differ
+from each other in the number and type of predicted genes (for
+example, \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA
+(rRNA)} genes). Two annotation techniques, from NCBI and Dogma, are
+considered to analyze chloroplast genomes.
+
+\subsubsection{Genome annotation from NCBI}
+
+The objective is to generate sets of genes from each genome so that
+genes are organized without any duplication. The input is a list of
+chloroplast genomes annotated from NCBI. More precisely, all genomes
+are stored as \textit{.fasta} files, each of which consists of a
+collection of protein coding genes~\cite{parra2007cegma,RDogma} (genes
+that produce proteins) organized in coding sequences. To be able to
+build the set of core genes, we need to preprocess these genomes
+using the \textit{BioPython} package~\cite{chapman2000biopython}. This
+step starts by converting each genome from the FASTA file format to
+the GenVision~\cite{geneVision} format from DNASTAR. Each genome is
+thus converted into a list of genes, with gene names and gene counts.
+Gene name duplications can accumulate during the treatment of a
+genome. These duplications come from gene fragments (\emph{e.g.} gene
+fragments treated with NCBI) and from chloroplast DNA sequences. To
+ensure that all the duplications are removed, each list of genes is
+translated into a set of genes. Note that the NCBI genome annotation
+produces all genes except \textit{Ribosomal RNA (rRNA)} genes.
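+
+As a small illustration of the list-to-set step, consider a
+hypothetical genome whose converted gene list contains a fragmented
+gene (the gene names below are only illustrative and are not taken
+from our data):
+\[
+[\mathit{rps12},\ \mathit{psbA},\ \mathit{rps12},\ \mathit{trnH}]
+\ \longrightarrow\
+\{\mathit{rps12},\ \mathit{psbA},\ \mathit{trnH}\}.
+\]
+The list holds four entries, whereas the resulting set holds three
+distinct gene names, so the duplicated name is kept only once.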
+
+\subsubsection{Genome annotation from Dogma}
+
+Dogma stands for \textit{Dual Organellar GenoMe Annotator}. It is an
+annotation tool developed at the University of Texas in 2004 for plant
+chloroplast and animal mitochondrial genomes. This tool has its own
+database: it translates a genome in all six reading frames and
+queries the amino acid sequence database using
+BLAST~\cite{altschul1990basic} (\emph{i.e.} Blastx) with various
+parameters. Protein coding genes are identified in an input genome
+using sequence similarity with genes in the Dogma database. In
+addition, in comparison with the NCBI annotation tool, Dogma can
+produce both \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA
+(rRNA)} genes and verify their start and end positions. Another
+difference is that there is no gene duplication with Dogma after
+solving gene fragmentation. In fact, genome annotation with Dogma can
+be the key difference when extracting core genes.
+
+The Dogma annotation process is divided into two tasks. First, we
+manually annotate chloroplast genomes using the Dogma web tool. The
+output of this step is expected to be a collection of coding gene
+files for each genome, organized in a GenVision file. The second task
+is to solve the gene duplication problem, for which we use two
+methods. The first method, based on gene names, translates each
+genome into a set of genes without duplicates. The second method
+avoids gene duplication through a defragmentation process. In each
+iteration, this process takes a gene from the gene list and searches
+for duplications of it. If a duplication is found, the orientation of
+the fragment sequence is checked: if it is positive, the sequence is
+appended directly to the gene file; otherwise, the reverse complement
+of the sequence is computed and then appended to the gene file.
+Finally, a check for missing start and stop codons is performed. At
+the end of the annotation process, all the genomes are fully
+annotated, their genes are defragmented, and gene counts are
+available.
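+
+A minimal sketch of this defragmentation loop, assuming that each gene
+name gives access to the list of its fragments, could read as follows.
+The helper names $fragments$, $orientation$, $RevComp$, and
+$checkCodons$ are hypothetical and used only for illustration.
+
+\begin{algorithmic}
+\FOR{$\text{gene in geneList}$}
+ \STATE $seq \leftarrow \text{empty sequence}$
+ \FOR{$\text{frag in } fragments(gene)$}
+  \IF{$orientation(frag) \text{ is positive}$}
+   \STATE $seq \leftarrow seq + frag$
+  \ELSE
+   \STATE $seq \leftarrow seq + RevComp(frag)$
+  \ENDIF
+ \ENDFOR
+ \STATE $checkCodons(seq)$ \COMMENT{report a missing start or stop codon}
+ \STATE $\text{write } seq \text{ to the gene file of gene}$
+\ENDFOR
+\end{algorithmic}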
+
+\subsection{Core genes extraction}
+
+The goal of this stage is to extract the maximum number of core genes
+from sets of genes. To find core genes, the following methodology is
+applied.
+
+\subsubsection{Preprocessing}
+
+In order to extract core genomes in a suitable manner, the genomic
+data are preprocessed with two methods: on the one hand a method based
+on gene name and count, and on the other hand a method based on a
+sequence quality control test.
+
+In the first method, we extract a list of genes from each chloroplast
+genome. We then store this list of genes in the database under the
+genome name; gene counts can be obtained with a length command.
+The \textit{Intersection Core Matrix}, described in the next
+subsection, is then computed to extract the core genes. The problem
+with this method can be stated as follows: how can we ensure that a
+gene predicted in the core genes is the same gene in the leaf
+genomes? If the sequences of a gene in a genome annotated from Dogma
+and from NCBI are similar with respect to a given threshold, then the
+method raises no problem. When the sequences are not similar,
+however, we cannot decide which sequence belongs to the gene in the
+core genes.
+
+The second method is based on the following idea: we can predict the
+best annotated genome by merging the annotated genomes from NCBI
+and Dogma according to a quality test on gene names and sequences. To
+obtain all quality genes of each genome, we consider the following
+hypothesis: a gene appears in the predicted genome if and only
+if its annotations from NCBI and Dogma pass a specific threshold
+of the \textit{quality control test}. More precisely, the
+Needleman-Wunsch algorithm is applied to compare both sequences with
+respect to a threshold. If the alignment score is above the
+threshold, then the gene is retained in the predicted genome;
+otherwise the gene is ignored, as outlined in
+Algorithm~\ref{Alg3:thirdM}. Once all genomes have been predicted,
+the \textit{Intersection Core Matrix} is computed on these new genomes
+to extract core genes.
+
+\begin{algorithm}[H]
+\caption{Extract new genome based on gene quality test}
+\label{Alg3:thirdM}
+\begin{algorithmic}
+\REQUIRE $Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 65$
+\ENSURE $geneList \leftarrow \text{Quality genes}$
+\STATE $dir(NCBI\_Genes) \leftarrow \text{NCBI genes of Gname}$
+\STATE $dir(Dogma\_Genes) \leftarrow \text{Dogma genes of Gname}$
+\STATE $geneList \leftarrow \text{empty list}$
+\STATE $common \leftarrow set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
+\FOR{$\text{gene in common}$}
+ \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
+ \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
+ \STATE $score \leftarrow geneChk(g1,g2)$
+ \IF {$score > Threshold$}
+  \STATE $geneList.append(gene)$
+ \ENDIF
+\ENDFOR
+\RETURN $geneList$
+\end{algorithmic}
+\end{algorithm}
+
+\textbf{geneChk} is a subroutine used to find the best similarity score between
+two gene sequences after applying operations like \textit{reverse}, \textit{complement},
+and \textit{reverse complement}. Algorithm~\ref{Alg3:genechk} gives the outline of
+the geneChk subroutine.
+
+\begin{algorithm}[H]
+\caption{Find the Maximum Similarity Score between two sequences}
+\label{Alg3:genechk}
+\begin{algorithmic}
+\REQUIRE $g1,g2 \leftarrow \text{NCBI gene sequence, Dogma gene sequence}$
+\ENSURE $\text{Maximum similarity score}$
+\STATE $score1 \leftarrow needle(g1,g2)$
+\STATE $score2 \leftarrow needle(g1,Reverse(g2))$
+\STATE $score3 \leftarrow needle(g1,Complement(g2))$
+\STATE $score4 \leftarrow needle(g1,Reverse(Complement(g2)))$
+\RETURN $max(score1,score2,score3,score4)$
+\end{algorithmic}
+\end{algorithm}
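+
+In set notation, and only as a compact restatement of
+Algorithms~\ref{Alg3:thirdM} and~\ref{Alg3:genechk}, the quality test
+can be written as follows. The symbols $N_G$, $D_G$, $P_G$, $s$, $T$,
+$R$, and $C$ are introduced here for illustration: $N_G$ and $D_G$
+are the sets of gene names of a genome $G$ annotated from NCBI and
+Dogma, $T$ is the threshold, and $g^{N}$, $g^{D}$ are the NCBI and
+Dogma sequences of a gene $g$. The predicted genome is then
+\[
+P_G = \{\, g \in N_G \cap D_G \;:\; s(g) > T \,\},
+\]
+where
+\[
+s(g) = \max \left\{
+\begin{array}{l}
+needle(g^{N}, g^{D}),\\
+needle(g^{N}, R(g^{D})),\\
+needle(g^{N}, C(g^{D})),\\
+needle(g^{N}, R(C(g^{D})))
+\end{array}
+\right\},
+\]
+with $R$ and $C$ denoting the reverse and complement operations.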
+
+\subsubsection{Intersection Core Matrix (\textit{ICM})}
+
+To extract core genes, we iteratively collect the maximum number of
+common genes between genomes; to this end, an \textit{Intersection
+Core Matrix} (ICM) is built during this stage. The ICM is a
+two-dimensional symmetric matrix where each row and each column
+corresponds to one genome. An element of the matrix stores
+the \textit{Intersection Score} (IS): the cardinality of the core
+gene set obtained by intersecting one genome with another one. The
+two genomes having the maximum score are selected. Mathematically
+speaking, if we have $n$ genomes in the local database, the ICM is
+an $n \times n$ matrix whose elements satisfy:
+\begin{equation}
+score_{ij}=\vert g_i \cap g_j\vert
+\label{Eq1}
+\end{equation}
+\noindent where $1 \leq i \leq n$, $1 \leq j \leq n$, and $g_i, g_j$ are
+genomes. The generation of a new core genome obviously depends on the
+value of the intersection scores $score_{ij}$. More precisely, the
+idea is to consider a pair of genomes whose score is the largest
+element in the ICM. These two genomes are then removed from the
+matrix and the resulting new core genome is added for the next
+iteration. The ICM is then updated to take the new core genome into
+account: new IS values are computed for it. This process is repeated
+until no new core genome can be obtained.
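+
+As a toy illustration (the three genomes below are hypothetical and
+their genes are simply denoted by letters), take
+$g_1=\{a,b,c,d\}$, $g_2=\{a,b,c,e\}$, and $g_3=\{a,b,f\}$. Leaving the
+diagonal aside, Equation~(\ref{Eq1}) gives
+\[
+score_{12}=\vert\{a,b,c\}\vert=3,\quad
+score_{13}=\vert\{a,b\}\vert=2,\quad
+score_{23}=\vert\{a,b\}\vert=2.
+\]
+The largest score is $score_{12}$, so $g_1$ and $g_2$ are removed and
+the core genome $c_{12}=g_1\cap g_2=\{a,b,c\}$ is added. The updated
+matrix has a single off-diagonal score,
+$\vert c_{12}\cap g_3\vert=\vert\{a,b\}\vert=2$, and the next
+iteration yields $\{a,b\}$. Only one genome then remains, so the
+process stops with the core genes $\{a,b\}$.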
+
+We can observe that the ICM is very large due to the amount of
+data. As a consequence, the computation of the intersection scores is
+both time and memory consuming. However, since the ICM is a symmetric
+matrix, we can reduce the computation overhead by considering only its
+upper triangular part. The time complexity of this process after
+enhancement is thus $O\left(\frac{n(n-1)}{2}\right)$.
+Algorithm~\ref{Alg1:ICM} illustrates the construction of the ICM and
+the extraction of the core genes, where \textit{GenomeList}
+represents the database storing all genome data. At each iteration,
+it computes the maximum core genes together with the two parent
+genomes.
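+
+The count behind the $\frac{n(n-1)}{2}$ bound above is elementary:
+row $i$ of the strictly upper triangular part holds the $n-i$ scores
+$score_{ij}$ with $j>i$, so that one construction of the ICM requires
+\[
+\sum_{i=1}^{n-1}(n-i)=\frac{n(n-1)}{2}
+\]
+intersections instead of $n^2$. For $n=4$ genomes, for example, this
+amounts to $6$ intersections rather than $16$.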
+
+% ALGORITHM HAS BEEN REWRITTEN