+\subsubsection{Genome annotation from NCBI}
+
+The objective is to generate a set of genes for each genome so that
+genes are organized without any duplication. The input is a list of
+chloroplast genomes annotated by NCBI. More precisely, all genomes
+are stored as \textit{.fasta} files, each of which consists of a collection of
+protein-coding genes\cite{parra2007cegma,RDogma} (genes that produce
+proteins) organized in coding sequences. To be able to build the set of
+core genes, we need to preprocess these genomes
+using the \textit{BioPython} package \cite{chapman2000biopython}. This
+step starts by converting each genome from the FASTA file format to the
+GenVision \cite{geneVision} format from DNASTAR. Each genome is thus
+converted into a list of genes, with gene names and gene counts. Gene
+name duplications can accumulate during the treatment of a genome.
+These duplications come from gene fragments (\emph{e.g.} gene
+fragments treated with NCBI) and from chloroplast DNA sequences. To
+ensure that all the duplications are removed, each list of genes is
+translated into a set of genes. Note that the NCBI genome annotation
+produces all genes except \textit{Ribosomal (rRNA)} genes.
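+
+As a small illustration (the gene names below are chosen only as an
+example and are not taken from a particular genome), a list with a
+repeated name collapses as follows:
+\[
+(\textit{psbA},\ \textit{rps12},\ \textit{rps12},\ \textit{ycf1})
+\;\longmapsto\; \{\textit{psbA},\ \textit{rps12},\ \textit{ycf1}\},
+\]
+so the gene count drops from four to three while every distinct gene
+name is kept.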
+
+\subsubsection{Genome annotation from Dogma}
+
+Dogma stands for \textit{Dual Organellar GenoMe Annotator}. It is an
+annotation tool developed at the University of Texas in 2004 for plant
+chloroplast and animal mitochondrial genomes. This tool has its own
+database: it translates a genome in all six reading frames and
+queries the amino acid sequence database using
+BLAST \cite{altschul1990basic} (\emph{i.e.} Blastx) with various
+parameters. Protein-coding genes are identified in an input genome
+using sequence similarity with genes in the Dogma database. In addition, in
+comparison with the NCBI annotation tool, Dogma can produce
+both \textit{Transfer RNAs (tRNA)} and \textit{Ribosomal RNAs (rRNA)} and
+verify their start and end positions. Another difference is that
+there is no gene duplication with Dogma after solving gene
+fragmentation. In fact, genome annotation with Dogma can be the key
+difference when extracting core genes.
+
+The Dogma annotation process is divided into two tasks. First, we
+manually annotate chloroplast genomes using the Dogma web tool. The output
+of this step is a collection of coding gene files for
+each genome, organized in a GenVision file. The second task is to solve
+the gene duplication problem, for which we have used two
+methods. The first method, based on gene names, translates each genome
+into a set of genes without duplicates. The second method avoids gene
+duplication through a defragmentation process. In each iteration, this
+process takes a gene from the gene list and searches for
+duplicates of it. If a duplicate is found, the process checks the orientation of
+the fragment sequence. If the orientation is positive, the sequence is appended
+directly to the gene files. Otherwise, the reverse complement
+of the sequence is computed and then appended to the gene files.
+Finally, a check for missing start and stop codons is performed. At
+the end of the annotation process, all the genomes are fully
+annotated, their genes are defragmented, and gene counts are
+available.
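+
+The defragmentation loop described above can be sketched as follows.
+This is only an illustrative outline, not the exact implementation:
+the names $fragments$, $strand$, $seq$, and $RevComp$ are our own
+notation, and fragments are assumed to be visited in their order of
+appearance in the annotation.
+
+\begin{algorithm}[H]
+\caption{Sketch of the gene defragmentation process}
+\label{Alg:defragSketch}
+\begin{algorithmic}
+\REQUIRE $geneList \leftarrow \text{annotated genes of one genome}$
+\ENSURE $geneFiles \leftarrow \text{one defragmented sequence per gene name}$
+\FOR{$\text{each distinct gene name } n \text{ in } geneList$}
+ \STATE $fragments \leftarrow \text{entries of } geneList \text{ named } n$
+ \FOR{$f \text{ in } fragments$}
+  \IF{$strand(f) \text{ is positive}$}
+   \STATE $geneFiles(n).append(seq(f))$
+  \ELSE
+   \STATE $geneFiles(n).append(RevComp(seq(f)))$
+  \ENDIF
+ \ENDFOR
+\ENDFOR
+\STATE $\text{check each entry of } geneFiles \text{ for missing start and stop codons}$
+\RETURN $geneFiles$
+\end{algorithmic}
+\end{algorithm}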
+
+\subsection{Core genes extraction}
+
+The goal of this stage is to extract the maximum number of core genes from the sets of
+genes. To find core genes, the following methodology is applied.
+
+\subsubsection{Preprocessing}
+
+In order to extract core genomes in a suitable manner, the genomic
+data are preprocessed with two methods: on the one hand a method based
+on gene name and count, and on the other hand a method based on a
+sequence quality control test.
+
+In the first method, we extract a list of genes from each chloroplast
+genome. We then store this list of genes in the database under the genome
+name, and gene counts can be extracted by a specific length command.
+The \textit{Intersection Core Matrix}, described in the next subsection,
+is then computed to extract the core genes. The problem with this
+method can be stated as follows: how can we ensure that a gene
+predicted to be a core gene is the same gene in the leaf genomes? The
+answer is that if the sequences of a gene in a
+genome annotated by Dogma and by NCBI are similar with respect to a
+given threshold, then this method raises no problem.
+When the sequences are not similar we do have a problem, because
+we cannot decide which sequence belongs to the gene in the core genes.
+
+The second method is based on the following idea: we can predict
+the best annotated genome by merging the annotated genomes from NCBI
+and Dogma according to a quality test on gene names and sequences. To
+obtain all quality genes of each genome, we consider the following
+hypothesis: a gene appears in the predicted genome if and only
+if its annotations in NCBI and Dogma pass a specific threshold
+of the \textit{quality control test}. More precisely, the Needleman-Wunsch
+algorithm is applied to compare both sequences with respect to a
+threshold. If the alignment score is above the threshold, then the
+gene is retained in the predicted genome; otherwise the gene is
+ignored. Once the prediction of all genomes is done,
+the \textit{Intersection Core Matrix} is computed on these new genomes
+to extract core genes, as explained in Algorithm \ref{Alg3:thirdM}.
+
+\begin{algorithm}[H]
+\caption{Extract new genome based on gene quality test}
+\label{Alg3:thirdM}
+\begin{algorithmic}
+\REQUIRE $Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 65$
+\ENSURE $geneList \leftarrow \text{Quality genes}$
+\STATE $dir(NCBI\_Genes) \leftarrow \text{NCBI genes of Gname}$
+\STATE $dir(Dogma\_Genes) \leftarrow \text{Dogma genes of Gname}$
+\STATE $geneList=\text{empty list}$
+\STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
+\FOR{$\text{gene in common}$}
+ \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
+ \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
+ \STATE $score \leftarrow geneChk(g1,g2)$
+ \IF {$score > Threshold$}
+ \STATE $geneList.append(gene)$
+ \ENDIF
+\ENDFOR
+\RETURN $geneList$
+\end{algorithmic}
+\end{algorithm}
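+
+Under the hypothesis above, the predicted genome can be written
+compactly. The notation here is ours and is introduced only for
+illustration: $N(G)$ and $D(G)$ denote the sets of gene names annotated
+for genome $G$ by NCBI and Dogma, $s_G(g)$ the score returned by geneChk
+for gene $g$, and $\tau$ the threshold (65 in
+Algorithm~\ref{Alg3:thirdM}):
+\[
+P(G) = \{\, g \in N(G) \cap D(G) \;:\; s_G(g) > \tau \,\}.
+\]
+A gene annotated by only one of the two tools is therefore never
+retained, whatever its sequence.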
+
+\textbf{geneChk} is a subroutine used to find the best similarity score between
+two gene sequences after applying operations such as \textit{reverse}, \textit{complement},
+and \textit{reverse complement}. Algorithm~\ref{Alg3:genechk} gives the outline of
+the geneChk subroutine.
+
+\begin{algorithm}[H]
+\caption{Find the Maximum Similarity Score between two sequences}
+\label{Alg3:genechk}
+\begin{algorithmic}
+\REQUIRE $g1,g2 \leftarrow \text{NCBI gene sequence, Dogma gene sequence}$
+\ENSURE $\text{Maximum similarity score}$
+\STATE $score1 \leftarrow needle(g1,g2)$
+\STATE $score2 \leftarrow needle(g1,Reverse(g2))$
+\STATE $score3 \leftarrow needle(g1,Complement(g2))$
+\STATE $score4 \leftarrow needle(g1,Reverse(Complement(g2)))$
+\RETURN $max(score1,score2,score3,score4)$
+\end{algorithmic}
+\end{algorithm}
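+
+Equivalently, assuming that $needle$ returns the Needleman-Wunsch
+alignment score, and writing it $\mathrm{NW}(a,b)$ with $R$ and $C$ for
+the reverse and complement operations (our notation, used only for
+illustration), the value returned by geneChk is
+\[
+s(g_1,g_2) = \max\bigl\{ \mathrm{NW}(g_1,g_2),\;
+\mathrm{NW}(g_1,R(g_2)),\; \mathrm{NW}(g_1,C(g_2)),\;
+\mathrm{NW}(g_1,R(C(g_2))) \bigr\}.
+\]
+Since the first term is the alignment without any transformation,
+$s(g_1,g_2) \geq \mathrm{NW}(g_1,g_2)$: trying the other orientations
+can only raise the score, never lower it.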
+
+\subsubsection{Intersection Core Matrix (\textit{ICM})}