-In the first method, the idea is to iterativelly collect the maximum number of common genes. To do so, the system builds an \textit{Intersection core matrix (ICM)}. ICM is a two dimensional symmetric matrix where each row and column represent one genome. Each position in ICM stores the \textit{intersection scores (IS)}. The Intersection Score is the cardinality number of a core genes comes from intersecting one ????? with other ??????. Maximum cardinality results to select two genomes with their maximum core. Mathematically speaking, if we have an $n \times n$ matrix where $n=\text{number of genomes in local database}$, then lets consider:\\
-
-\begin{equation}
-Score=\max_{i<j}\vert x_i \cap x_j\vert
-\label{Eq1}
-\end{equation}\\
-
-Where $x_i, x_j$ are elements in the matrix. The generation of a new core genes is depending on the cardinality value of intersection elements, we call it $Score$:\\
-$$\text{New Core} = \begin{cases}
-\text{Ignored} & \text{if $Score=0$;} \\
-\text{new Core id} & \text{if $Score>0$.}
-\end{cases}$$\\
-
-if $Score=0$ then we have \textit{disjoint relation} (i.e no common genes between two genomes). In this case the system ignore the vector that smash the core genes. Otherwise, The system will remove these two vectors from ICM and add new core vector with a \textit{coreID} of them to ICM for the calculation in next iteration. The partial core vectors generated with its values will store in the local database for reused to draw the tree. This process repeat until all vectors treated.
-We observe that ICM will result to be very large because of the huge amount of data that it stores. In addition, this will results to be time and memory consuming for calculating the intersection scores by using just genes names. To increase the speed of calculations, we can calculate the upper triangle scores only and exclude diagonal scores. This will reduce whole processing time and memory to half. The time complexity for this process after enhancement changed from $O(n^2-n)$ to $O(\frac{(n-1).n}{2})$. The Algorithm of construction the vector matrix and extracting the vector of maximum core genes where illustrated in Algorithm \ref{Alg1:ICM}. The output from this step is the maximum core vector with its two vectors to draw it in a tree.\\
-
-\begin{algorithm}[H]
-\caption{Extract Maximum Intersection Score}
-\label{Alg1:ICM}
-\begin{algorithmic}
-\REQUIRE $L \leftarrow \text{genomes vectors}$
-\ENSURE $B1 \leftarrow Max core vector$
-\FOR{$i \leftarrow 0:len(L)-1$}
- \STATE $core1 \leftarrow set(GenomeList[L[i]])$
- \STATE $score1 \leftarrow 0$
- \STATE $g1,g2 \leftarrow$ " "
- \FOR{$j \leftarrow i+1:len(L)$}
- \STATE $core2 \leftarrow set(GenomeList[L[i]])$
- \IF{$i < j$}
- \STATE $Core \leftarrow core1 \cap core2$
- \IF{$len(Core) > score1$}
- \STATE $g1 \leftarrow L[i]$
- \STATE $g2 \leftarrow L[j]$
- \STATE $Score \leftarrow len(Core)$
- \ELSIF{$len(Core) == 0$}
- \STATE $g1 \leftarrow L[i]$
- \STATE $g2 \leftarrow L[j]$
- \STATE $Score \leftarrow -1$
- \ENDIF
- \ENDIF
- \ENDFOR
- \STATE $B1[score1] \leftarrow (g1,g2)$
-\ENDFOR
-\RETURN $max(B1)$
-\end{algorithmic}
-\end{algorithm}
-\textit{GenomeList} represents the local database.\\
-
-In second Method, due to the number of annotated genomes, annotate each genome can be very exhausted task specially with Dogma, because dogma offer a web tool for annotation, so that, each genome must annotate using this web tool. This operation need to do manually. We prefer to recover this problem by choosing one reference chloroplast and querying each reference gene by using \textit{Blastn} to examin its existance in remaining unannotated genomes in blast database. Collect all match genomes from each gene hits, to satisfy the hypothesis "the gene who exists in maximum number of genomes also exist in a core genes". In addition, we can also extract the maximum core genes by examine how many genes present with each genome?. Algorithm \ref{Alg2:secondM}, state the general algorithm for second method. \\
-
-\begin{algorithm}[H]
-\caption{Extract Maximum Core genes based on Blast}
-\label{Alg2:secondM}
-\begin{algorithmic}
-\REQUIRE $Ref\_Genome \leftarrow \text{Accession No}$
-\ENSURE $core \leftarrow \text{Genomes for each gene}$
-\FOR{$gene \leftarrow Ref\_Genome$}
- \STATE $G\_list= \text{empty list}$
- \STATE $File \leftarrow Blastn(gene)$
- \STATE $G\_list \leftarrow File[\text{Genomes names}]$
- \STATE $Core \leftarrow [Accession\_No:G\_list]$
-\ENDFOR
-\RETURN $Core$
-\end{algorithmic}
-\end{algorithm}
-
-The hypothesis in last method state: we can predict the best annotated genome by merge the annotated genomes from NCBI and dogma based on the quality of genes names and sequences. To generate all quality genes of each genome. the hypothesis state: Any gene will be in predicted genome if and only if the annotated genes between NCBI and Dogma pass a specific threshold of\textit{quality control test}. To accept the quality test, we applied Needle-man Wunch algorithm to compare two gene sequences with respect to pass a threshold. If the alignment score pass this threshold, then the gene will be in the predicted genome, else the gene will be ignored. After predicting all genomes, one of previous two methods can be applied to extract core genes. As shown in Algorithm \ref{Alg3:thirdM}.
+\subsubsection{Pre-Processing}
+We apply two pre-processing methods for organize and prepare genomes data, one method based on gene name and count, and the second method is based on sequence quality control test.\\
+In the first method, preparing chloroplasts genomes to extract core genes based on gene name and count starts after annotation process because genomes vary in genes counts and types according to the annotation used method. Then we store each genome in the database under genome name with the set of genes names. Genes counts can extracted simply by a specific length command. \textit{Intersection core matrix} will apply then to extract the core genes. The problem with this method is how we can quarantine that the gene predicted in core genes is the same gene in leaf genomes?. To answer this question, if the sequence of any gene in a genome annotated from dogma and NCBI are similar with respect to a threshold, we do not have any problem with this method. Otherwise, we have a problem, because we can not decide which sequence goes to a gene in core genes.
+The second pre-processing method state: we can predict the best annotated genome by merge the annotated genomes from NCBI and dogma based on the quality of genes names and sequences test. To generate all quality genes of each genome. the hypothesis state: Any gene will be in predicted genome if and only if the annotated genes between NCBI and Dogma pass a specific threshold of\textit{quality control test}. To accept the quality test, we applied Needle-man Wunch algorithm to compare two gene sequences with respect to pass a threshold. If the alignment score pass this threshold, then the gene will be in the predicted genome. Otherwise, the gene will be ignored. After predicting all genomes, \textit{Intersection core matrix} will apply on these new genomes to extract core genes. As shown in Algorithm \ref{Alg3:thirdM}.