+To extract core genomes reliably, the genomic data are preprocessed
+with two methods: one based on gene names and counts, and one based
+on a sequence quality control test.
+
+In the first method, we extract the list of genes from each chloroplast
+genome and store it in the database under the genome name; gene counts
+can then be obtained with a length command.
+The \textit{Intersection Core Matrix}, described in the next subsection,
+is then computed to extract the core genes. The problem with this
+method can be stated as follows: how can we ensure that a gene
+predicted to be in the core is the same gene in the leaf genomes? If
+the sequences of a gene annotated by Dogma and by NCBI for a given
+genome are similar with respect to a given threshold, the method
+raises no difficulty. When the sequences are not similar, however,
+we cannot decide which sequence belongs to the core gene.
+
+The second method rests on the following idea: we can predict the
+best annotated genome by merging the annotated genomes from NCBI
+and Dogma according to a quality test on gene names and sequences. To
+obtain all quality genes of each genome, we consider the following
+hypothesis: a gene appears in the predicted genome if and only
+if its annotations in NCBI and Dogma pass a specific threshold
+of the \textit{quality control test}. Specifically, the Needleman-Wunsch
+algorithm is applied to compare both sequences with respect to a
+threshold. If the alignment score is above the threshold, the
+gene is retained in the predicted genome; otherwise it is
+ignored. Once all genomes have been predicted,
+the \textit{Intersection Core Matrix} is computed on these new genomes
+to extract core genes, as explained in Algorithm \ref{Alg3:thirdM}.
+
+\begin{algorithm}[H]
+\caption{Extract new genome based on gene quality test}
+\label{Alg3:thirdM}
+\begin{algorithmic}
+\REQUIRE $Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 65$
+\ENSURE $geneList \leftarrow \text{Quality genes}$
+\STATE $dir(NCBI\_Genes) \leftarrow \text{NCBI genes of Gname}$
+\STATE $dir(Dogma\_Genes) \leftarrow \text{Dogma genes of Gname}$
+\STATE $geneList \leftarrow \text{empty list}$
+\STATE $common \leftarrow set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
+\FOR{$\text{gene in common}$}
+ \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
+ \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
+ \STATE $score \leftarrow geneChk(g1,g2)$
+ \IF {$score > Threshold$}
+ \STATE $geneList.append(gene)$
+ \ENDIF
+\ENDFOR
+\RETURN $geneList$
+\end{algorithmic}
+\end{algorithm}
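+
+The filtering loop of the algorithm above can be sketched in Python as
+follows. This is a minimal illustration under stated assumptions, not
+the implementation used in this work: the gene sequences are assumed
+to be already loaded into dictionaries keyed by gene name (the
+algorithm reads them from per-gene files instead), the names
+\texttt{quality\_genes} and \texttt{percent\_identity} are
+hypothetical, the example sequences are made up, and
+\texttt{percent\_identity} is only a toy stand-in for the alignment
+score.

```python
from typing import Callable, Dict, List


def percent_identity(seq1: str, seq2: str) -> float:
    """Toy stand-in for the similarity score (assumption, not a real alignment).

    Counts position-wise matches without gaps and reports them as a
    percentage of the longer sequence.
    """
    if not seq1 or not seq2:
        return 0.0
    matches = sum(a == b for a, b in zip(seq1, seq2))
    return 100.0 * matches / max(len(seq1), len(seq2))


def quality_genes(
    ncbi_genes: Dict[str, str],
    dogma_genes: Dict[str, str],
    score_fn: Callable[[str, str], float] = percent_identity,
    threshold: float = 65.0,
) -> List[str]:
    """Return names of genes found in both annotations whose score exceeds threshold."""
    gene_list = []
    # Only genes named in both annotations can be compared.
    common = sorted(set(ncbi_genes) & set(dogma_genes))
    for gene in common:
        score = score_fn(ncbi_genes[gene], dogma_genes[gene])
        if score > threshold:
            gene_list.append(gene)  # accumulate, do not overwrite
    return gene_list


# Made-up example data: only rbcL is present in both sets with similar sequences.
ncbi = {"rbcL": "ATGTCACCAC", "psbA": "ATGACTGCAA", "matK": "ATGGAGGAAT"}
dogma = {"rbcL": "ATGTCACCAC", "psbA": "TTTTTTTTTT", "ycf1": "ATGATTTTTA"}
kept = quality_genes(ncbi, dogma)
```

+Passing the scoring function as a parameter keeps the filter
+independent of the alignment tool, so a real global alignment score
+can replace the stand-in without changing the loop.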
+
+\textbf{geneChk} is a subroutine that finds the best similarity score between
+two gene sequences after applying the operations \textit{reverse}, \textit{complement},
+and \textit{reverse complement} to one of them. Algorithm~\ref{Alg3:genechk} outlines
+the geneChk subroutine.
+
+\begin{algorithm}[H]
+\caption{Find the Maximum Similarity Score between two sequences}
+\label{Alg3:genechk}
+\begin{algorithmic}
+\REQUIRE $g1,g2 \leftarrow \text{NCBI gene sequence, Dogma gene sequence}$
+\ENSURE $\text{Maximum similarity score}$
+\STATE $score1 \leftarrow needle(g1,g2)$
+\STATE $score2 \leftarrow needle(g1,Reverse(g2))$
+\STATE $score3 \leftarrow needle(g1,Complement(g2))$
+\STATE $score4 \leftarrow needle(g1,Reverse(Complement(g2)))$
+\RETURN $max(score1,score2,score3,score4)$
+\end{algorithmic}
+\end{algorithm}
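+
+The subroutine above can be sketched in Python as follows. This is an
+illustrative sketch under stated assumptions: the source does not give
+the scoring parameters of \texttt{needle}, so the match, mismatch and
+gap values below are hypothetical, and the raw scores they produce are
+not on the same scale as the threshold of 65. The function names
+\texttt{needleman\_wunsch\_score} and \texttt{gene\_chk} are also
+hypothetical.

```python
def needleman_wunsch_score(a: str, b: str, match: int = 1,
                           mismatch: int = -1, gap: int = -1) -> int:
    """Global alignment score by dynamic programming, one row at a time.

    The scoring values are assumptions chosen for illustration.
    """
    # Row 0: aligning the empty prefix of `a` against prefixes of `b`.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against empty prefix of `b`
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def reverse(seq: str) -> str:
    return seq[::-1]


def complement(seq: str) -> str:
    # Characters outside ACGT (for example N) are left unchanged.
    return seq.translate(_COMPLEMENT)


def gene_chk(g1: str, g2: str) -> int:
    """Best score of g1 against g2 in its four orientations."""
    candidates = (g2, reverse(g2), complement(g2), reverse(complement(g2)))
    return max(needleman_wunsch_score(g1, c) for c in candidates)


# "GGCCAT" is the reverse complement of "ATGGCC", so the fourth
# orientation aligns perfectly.
best = gene_chk("ATGGCC", "GGCCAT")
```

+Trying all four orientations makes the comparison robust to the two
+annotation tools reporting the same gene on opposite strands or in
+opposite reading directions.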