+\subsubsection{Pre-Processing}
+We apply two pre-processing methods to organize and prepare the genome data: the first is based on gene names and counts, and the second is based on a sequence quality control test.\\
+In the first method, preparing chloroplast genomes for core gene extraction based on gene names and counts starts after the annotation process, because the number and types of genes in a genome vary with the annotation method used. We then store each genome in the database under its genome name, together with its set of gene names. Gene counts can be obtained simply with a length command. The \textit{intersection core matrix} is then applied to extract the core genes. The problem with this method is how to guarantee that a gene predicted in the core genes is the same gene in the leaf genomes. If the sequences of a gene annotated by DOGMA and by NCBI are similar with respect to a threshold, the method raises no problem. Otherwise, we cannot decide which sequence should be assigned to that gene in the core genes.
+The second pre-processing method states that we can predict the best annotated genome by merging the NCBI and DOGMA annotations of a genome, based on a quality test of gene names and sequences. To generate the quality genes of each genome, the hypothesis states: a gene belongs to the predicted genome if and only if its NCBI and DOGMA annotations pass a specific threshold of the \textit{quality control test}. To carry out the quality test, we apply the Needleman-Wunsch algorithm to align the two gene sequences and compare the alignment score with a threshold. If the alignment score exceeds this threshold, the gene is added to the predicted genome; otherwise, the gene is ignored. After all genomes have been predicted, the \textit{intersection core matrix} is applied to these new genomes to extract the core genes. The construction of a predicted genome is shown in Algorithm \ref{Alg3:thirdM}.
+
+\begin{algorithm}[H]
+\caption{Extract new genome based on Gene Quality test}
+\label{Alg3:thirdM}
+\begin{algorithmic}
+\REQUIRE $Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 65$
+\ENSURE $geneList \leftarrow \text{Quality genes}$
+\STATE $dir(NCBI\_Genes) \leftarrow \text{NCBI genes of Gname}$
+\STATE $dir(Dogma\_Genes) \leftarrow \text{Dogma genes of Gname}$
+\STATE $geneList \leftarrow \text{empty list}$
+\STATE $common \leftarrow set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
+\FOR{$gene \in common$}
+ \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
+ \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
+ \STATE $score \leftarrow geneChk(g1,g2)$
+ \IF {$score > Threshold$}
+ \STATE $geneList.append(gene)$
+ \ENDIF
+\ENDFOR
+\RETURN $geneList$
+\end{algorithmic}
+\end{algorithm}
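+
+As a compact restatement of Algorithm \ref{Alg3:thirdM}, the predicted genome can be written in set notation. The symbols $G_{N}$, $G_{D}$, $s_{N}(g)$, $s_{D}(g)$, and $T$ are introduced here only for illustration: $G_{N}$ and $G_{D}$ denote the sets of gene names annotated by NCBI and by DOGMA for a given genome, $s_{N}(g)$ and $s_{D}(g)$ the corresponding sequences of gene $g$, and $T$ the threshold.
+\[
+geneList = \left\{\, g \in G_{N} \cap G_{D} \;:\; geneChk\bigl(s_{N}(g), s_{D}(g)\bigr) > T \,\right\}
+\]
+For instance, with $T=65$ as in the algorithm and hypothetical scores of $80$ for one common gene and $50$ for another, only the first gene would be kept in the predicted genome.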
+
+\textbf{geneChk} is a subroutine that finds the best similarity score between two gene sequences after applying the operations \textit{reverse, complement, and reverse complement} to one of them. It is given in Algorithm \ref{Alg3:genechk}.
+
+\begin{algorithm}[H]
+\caption{Find the Maximum similarity score between two sequences}
+\label{Alg3:genechk}
+\begin{algorithmic}
+\REQUIRE $gen1,gen2 \leftarrow \text{NCBI gene sequence, Dogma gene sequence}$
+\ENSURE $\text{Maximum similarity score}$
+\STATE $Score1 \leftarrow needle(gen1,gen2)$
+\STATE $Score2 \leftarrow needle(gen1,Reverse(gen2))$
+\STATE $Score3 \leftarrow needle(gen1,Complement(gen2))$
+\STATE $Score4 \leftarrow needle(gen1,Reverse(Complement(gen2)))$
+\RETURN $\max(Score1, Score2, Score3, Score4)$
+\end{algorithmic}
+\end{algorithm}
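+
+In formula form, the value returned by Algorithm \ref{Alg3:genechk} is the maximum of four alignment scores. The shorthand $R(\cdot)$ for \textit{reverse} and $C(\cdot)$ for \textit{complement} is introduced here only for illustration:
+\[
+\begin{aligned}
+geneChk(gen1, gen2) = \max\bigl\{\, & needle(gen1, gen2),\; needle(gen1, R(gen2)), \\
+& needle(gen1, C(gen2)),\; needle(gen1, R(C(gen2))) \,\bigr\}
+\end{aligned}
+\]
+As a small illustration of the three operations, assuming the usual DNA base pairing (A with T, C with G), the sequence \texttt{ATGC} gives $R(\texttt{ATGC})=\texttt{CGTA}$, $C(\texttt{ATGC})=\texttt{TACG}$, and $R(C(\texttt{ATGC}))=\texttt{GCAT}$.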
+
+\subsubsection{Intersection Core Matrix (\textit{ICM})}
+
+The idea behind extracting core genes is to iteratively collect the maximum number of genes common to two genomes. To do so, the system builds an \textit{intersection core matrix (ICM)}. The ICM is a two-dimensional symmetric matrix in which each row and each column represents one genome. Each position in the ICM stores an \textit{intersection score (IS)}: the cardinality of the core genes obtained by intersecting one genome with another. The maximum cardinality selects the two genomes with the largest core. Mathematically speaking, if we have an $n \times n$ matrix where $n$ is the number of genomes in the local database, then let us consider:
+
+\begin{equation}
+Score=\max_{i<j}\vert x_i \cap x_j\vert
+\label{Eq1}
+\end{equation}
+
+Here, $x_i$ and $x_j$ denote the gene sets of the genomes associated with row $i$ and column $j$ of the matrix. The generation of new core genes depends on the maximum intersection score, which we call \textit{Score}:
+\[\text{New Core} = \begin{cases}
+\text{Ignored} & \text{if $\textit{Score}=0$;} \\
+\text{new Core id} & \text{if $\textit{Score}>0$.}
+\end{cases}\]
+
+If $\textit{Score}=0$, the two genomes are in a \textit{disjoint relation} (i.e., they have no common genes). In this case, the system ignores the genome that would reduce the core genes to the empty set. Otherwise, the system removes the two genomes from the ICM and adds to it a new core genome, identified by a \textit{coreID}, to be used in the next iteration. This process reduces the size of the ICM and is repeated until all genomes have been treated (i.e., the ICM has no more genomes).
+We observe that the ICM is very large because of the amount of data it stores, which makes computing the intersection scores time- and memory-consuming. Since the ICM is symmetric, it is sufficient to compute only the scores of the upper triangle. The time complexity of this process after this enhancement, counted in set intersections, is thus $O\left(\frac{n(n-1)}{2}\right)$. Algorithm \ref{Alg1:ICM} illustrates the construction of the ICM and the extraction of the core genes, where \textit{GenomeList} represents the database in which all genome data are stored. At each iteration, it computes the maximum core genes together with its two parent genomes.
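+
+To make the procedure concrete, consider a small hypothetical example (the genomes $A$, $B$, $C$ and their gene sets are invented for illustration only and are not taken from our data). Let
+\[
+A=\{g_1,g_2,g_3,g_4\}, \qquad B=\{g_2,g_3,g_4,g_5\}, \qquad C=\{g_3,g_6\}.
+\]
+With $n=3$, the upper triangle of the ICM contains $\frac{n(n-1)}{2}=3$ intersection scores:
+\[
+\vert A \cap B\vert = 3, \qquad \vert A \cap C\vert = 1, \qquad \vert B \cap C\vert = 1.
+\]
+Hence $\textit{Score}=3$, reached by the pair $(A,B)$. These two genomes are removed from the ICM and replaced by a new core $K=A\cap B=\{g_2,g_3,g_4\}$. In the next iteration, the ICM contains only $K$ and $C$, with $\vert K \cap C\vert = 1$, so the final core is $\{g_3\}$.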