+
+The main drawback of extracting core genes based on gene names and counts is that gene names alone are not reliable, for three reasons. First, a genome may not be completely named (as in early versions of NCBI genomes), so some sequences are lost. Second, two genes may share the same name while their sequences differ. Third, all the genomes must be annotated.
+
+\subsection{Extract Core Genes Based on Gene Sequences}
+The hypothesis behind the second method was discussed earlier. In this section, we implement it using the NCBI BLAST alignment tool. The implementation is divided into two parts: \textit{Core genes from NCBI Annotation} and \textit{Core Genes from Dogma Annotation}. In both parts, the choice of reference genome can be a key factor in which core genes are predicted. After a reference genome is chosen, a local BLAST database is created to store the remaining unannotated chloroplast genomes.
+
+We will present the algorithm in the following steps:
+
+\begin{enumerate}
+\item Select a reference genome: we need to select a good reference genome from our population. We can choose \textit{Lycopersicon esculentum cultivar LA3023 chloroplast NC\_007898.3} if we consider the version of the annotation, or \textit{Zea Mays NC\_001666.2} if we consider the largest number of coding genes based on the NCBI annotation. Since the aim is to extract the maximum number of core genes, we choose \textit{Zea Mays NC\_001666.2} as our reference genome.
+\item Build a BLAST database from the remaining unannotated genomes.
+\item Compare reference genes: each reference gene is queried against the genomes in the database using \textbf{Blastn}. The result for each gene, including its alignment scores, is stored in a separate file.
+\item Generate match table: in this table, each row represents a reference gene and each column represents a genome. To fill the table, a script opens the output file of each reference gene and extracts the number of genomes, and the list of genome names, in which the gene sequence has hits.
+\end{enumerate}
+
+The core genome can then be extracted from the table by selecting the largest possible set of genes that are present in the largest possible number of genomes.
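+
+As a minimal sketch of the match table and the core genome selection, the Python code below may help. It assumes (this is not stated above) that each Blastn result was saved in the tabular format (\texttt{-outfmt 6}), in which the second column identifies the matching genome. The function names and the \texttt{min\_fraction} cut-off are hypothetical, introduced only for illustration.

```python
def parse_hits(tabular_text):
    """Return the set of genome ids (column 2) found in BLAST tabular output."""
    genomes = set()
    for line in tabular_text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        fields = line.split("\t")
        if len(fields) >= 2:
            genomes.add(fields[1])
    return genomes


def build_match_table(outputs):
    """Map each reference gene to the set of genomes where it has hits.

    `outputs` maps a reference gene name to the text of its result file
    (assumed layout: one result file per reference gene).
    """
    return {gene: parse_hits(text) for gene, text in outputs.items()}


def core_genes(table, all_genomes, min_fraction=1.0):
    """Return genes with hits in at least `min_fraction` of the genomes.

    `min_fraction` is an assumed knob for the trade-off between the
    number of genes kept and the number of genomes covered.
    """
    genome_set = set(all_genomes)
    needed = min_fraction * len(genome_set)
    return sorted(
        gene for gene, hits in table.items()
        if len(hits & genome_set) >= needed
    )
```

+With \texttt{min\_fraction=1.0} a gene must have hits in every genome; lowering the value keeps more genes at the cost of covering fewer genomes.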
+
+\subsection{Extract Core Genes based on Gene Quality Control}
+The main idea of this method is to focus on gene quality in order to predict the maximum number of core genes. Comparing only gene names or gene sequences from a single annotation tool is not enough. The question is whether a gene predicted by NCBI is the same as the gene predicted by Dogma, in both name and sequence. If so, we can predict new quality genomes using a quality control test with a specific threshold; these predicted genomes result from merging the two annotation techniques. If not, we cannot rely on either NCBI or Dogma because of annotation errors. Core genes can be predicted using one of the previous methods.
+
+This method is summarized in the following steps:
+
+\begin{enumerate}
+\item Retrieve the annotation of all genomes from NCBI and Dogma: in this step, all chloroplast genomes in the database are annotated using both the NCBI annotation and the Dogma annotation tool.
+\item Predict quality genomes: for each genome, take its annotation from the two techniques, extract all common genes based on gene names, and then apply the Needleman-Wunsch algorithm to align the two sequences of each common gene. If the alignment score passes a specific threshold, the gene is removed from the competition and stored in the quality genome, by saving its name together with the longer of the two gene sequences with respect to the start and end codons. All quality genomes are stored in the GenVision file format.
+\item Extract core genes: the two steps above yield new genomes made of quality genes. Some genes are of course lost, because Dogma produces tRNA and rRNA genes that NCBI does not generate, and vice versa. Using the first method to extract core genes is sufficient here, because the sequences have already been checked.
+\item Display tree: an evolutionary tree is then displayed based on the intersections of the quality genomes.
+\end{enumerate}
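+
+The quality genome prediction step can be sketched in Python as follows. This is an illustration under stated assumptions, not the implementation used here: the scoring scheme (match $+1$, mismatch $-1$, gap $-1$), the normalization of the score by the longer sequence length, the default threshold and the function names are all hypothetical.

```python
def nw_score(a, b, match=1, mismatch=-1, gap=-1):
    """Needleman-Wunsch global alignment score with a linear gap penalty.

    Only the score is computed, keeping one row of the dynamic
    programming table at a time. Scoring values are assumptions.
    """
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


def quality_genome(ncbi_genes, dogma_genes, threshold=0.9):
    """Keep genes named by both annotations whose sequences align well.

    Both arguments map gene name -> sequence. A gene passes when its
    score divided by the longer sequence length reaches `threshold`
    (an assumed normalization); the longer sequence is then kept.
    """
    quality = {}
    for name in sorted(set(ncbi_genes) & set(dogma_genes)):
        a, b = ncbi_genes[name], dogma_genes[name]
        longest = max(len(a), len(b))
        if longest == 0:
            continue  # nothing to align
        if nw_score(a, b) / longest >= threshold:
            quality[name] = a if len(a) >= len(b) else b
    return quality
```

+Genes that appear in only one annotation, or whose two sequences disagree, are absent from the returned dictionary, which corresponds to the genes lost in the third step.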
+\pagebreak
\ No newline at end of file