-Dogma is an annotation tool developed in the university of Texas by \cite{RDogma} in 2004. Dogma is an abbreviation of \textit{Dual Organellar GenoMe Annotator}\cite{RDogma} for plant chloroplast and animal mitochondrial genomes.
-It has its own database for translated the genome in all six reading frames and query the amino acid sequence database using Blast\cite{altschul1990basic}(i.e Blastx) with various parameters, and to identify protein coding genes\cite{parra2007cegma,RDogma} in the input genome based on sequence similarity of genes in Dogma database. Further more, it can produce the \textit{Transfer RNAs (tRNA)}\cite{RDogma}, and the \textit{Ribosomal RNAs (rRNA)}\cite{RDogma} and verifying their start and end positions rather than NCBI annotation tool. There are no gene duplication with dogma after solving gene fragmentation. \\
-Genome Anntation with dogma can be the key difference of extracting core genes. In figure \ref{dog:Annotation}, The step of annotation divided into two tasks: First, It starts to annotate complete choloroplast genomes (i.e \textit{Unannotated genomes} from NCBI by using Dogma web tool. The whole annotation process was done manually. The output from dogma is considered to be collection of coding genes file for each genome in the form of GeneVision\cite{geneVision} file format.\\
-Where the second task is to solve gene fragments. Defragment process starts immediately after the first task to solve fragments of coding genes for each genome to avoid gene duplication. This process will looks on fragement orientation, if it is negative, then the process apply reverse complement operations on gene sequence. All genomes after this stage are fully annotated, their genes were de-fragmented, genes lists and counts were identified. These information stored in local database.\\
-\begin{figure}[H]
- \centering
- \includegraphics[width=0.7\textwidth]{Dogma_GeneName}
- \caption{Dogma Annotation for Chloroplast genomes}\label{dog:Annotation}
-\end{figure}
+Dogma (\textit{Dual Organellar GenoMe Annotator}) is an annotation tool for plant chloroplast and animal mitochondrial genomes, developed at the University of Texas in 2004.
+It translates the input genome in all six reading frames and queries its own amino acid sequence database using Blast~\cite{altschul1990basic} (i.e., Blastx) with various parameters. Furthermore, it identifies protein-coding genes in the input genome based on their sequence similarity to genes in the Dogma database. In addition, it can identify the \textit{Transfer RNAs (tRNA)} and the \textit{Ribosomal RNAs (rRNA)} and, in contrast to the NCBI annotation tool, verifies their start and end positions. Once gene fragmentation has been resolved, the Dogma annotation contains no duplicated genes. \\
+Genome annotation with Dogma can make the key difference when extracting core genes. The annotation step is divided into two tasks. In the first task, each complete chloroplast genome (i.e., the \textit{unannotated genome from NCBI}) is annotated using the Dogma web tool. This process was done manually. The output of Dogma is a collection of coding-gene files for each genome in the GeneVision file format.
+The second task is to resolve gene fragments. Two methods are used to handle gene duplication before extracting core genes. In the first method, based on gene names, all duplications are removed by converting each list of genes into a set of genes. In the second method, based on the gene quality test, a defragmentation process starts immediately to merge the fragments of coding genes in each genome and so avoid gene duplication. In each iteration, this process takes one gene from the gene list and searches for duplicates of it. If a duplicate exists, the orientation of the fragment sequence is checked: if it is positive, the fragment sequence is appended to the gene file; otherwise, the reverse complement of the fragment sequence is computed and appended to the gene file. An additional step checks the start and stop codons and, if either is missing, tries to find an appropriate one. After this stage all genomes are fully annotated, their genes are defragmented, and their gene lists and counts are known.\\
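+As a brief illustration of the reverse complement operation used during defragmentation (the notation here is introduced only for this sketch), let $s = s_1 s_2 \dots s_n$ be a fragment sequence and let $\overline{x}$ denote the complement of nucleotide $x$, with $\overline{A}=T$, $\overline{T}=A$, $\overline{C}=G$ and $\overline{G}=C$. The reverse complement reads the complemented sequence backwards:
+\[
+  \mathrm{rc}(s) = \overline{s_n}\,\overline{s_{n-1}}\,\dots\,\overline{s_1}.
+\]
+For example, $\mathrm{rc}(ATGC) = GCAT$. A fragment with negative orientation would thus be appended to the gene file as $\mathrm{rc}(s)$ rather than $s$.\\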
+
+\subsection{Core Genes Extraction}
+The goal of this step is to extract the maximum number of core genes from the sets of genes. The methodology for finding core genes is as follows: \\
+
+\subsubsection{Pre-Processing}
+We apply two pre-processing methods to organize and prepare the genome data: the first is based on gene names and counts, and the second is based on a sequence quality control test.\\
+In the first method, preparing the chloroplast genomes for core gene extraction based on gene names and counts starts after the annotation process, because genomes vary in gene counts and types depending on the annotation method used. We then store each genome in the database under its genome name, together with its set of gene names. Gene counts can be obtained simply by taking the length of each set. The \textit{Intersection core matrix} is then applied to extract the core genes. The problem with this method is: how can we guarantee that a gene predicted in the core genes is the same gene as in the leaf genomes? To answer this question: if the sequences of a gene annotated by Dogma and by NCBI in a genome are similar with respect to a threshold, this method poses no problem. Otherwise there is a problem, because we cannot decide which sequence should be assigned to the gene in the core genes.
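+The name-based idea can be sketched in set notation, under the assumption (ours, for illustration only) that $G_i$ denotes the set of gene names stored for genome $i$ and that a core gene is one shared by all $n$ genomes considered. The gene count of genome $i$ is then $|G_i|$, and the shared genes are
+\[
+  C = \bigcap_{i=1}^{n} G_i .
+\]
+For instance, with $G_1 = \{a, b, c\}$ and $G_2 = \{b, c, d\}$ we get $|G_1| = |G_2| = 3$ and $C = \{b, c\}$. This is only the plain set intersection; it compares names, not sequences, which is exactly why two genes with the same name but different sequences cannot be told apart by this method.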
+The second pre-processing method states that we can predict the best annotated genome by merging the genomes annotated by NCBI and by Dogma, based on a quality test of gene names and sequences, so as to generate all quality genes of each genome. The hypothesis states: a gene is in the predicted genome if and only if its NCBI and Dogma annotations pass a specific threshold of the \textit{quality control test}. To perform the quality test, we apply the Needleman-Wunsch algorithm to compare the two gene sequences against the threshold. If the alignment score passes this threshold, the gene is included in the predicted genome; otherwise, the gene is ignored. After all genomes have been predicted, the \textit{Intersection core matrix} is applied to these new genomes to extract the core genes, as shown in Algorithm \ref{Alg3:thirdM}.
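+
+The acceptance rule of the quality control test can be written compactly. The symbols below are introduced here only for illustration: $s^{N}_{g}$ and $s^{D}_{g}$ are the sequences of gene $g$ as annotated by NCBI and by Dogma, $\mathrm{NW}(\cdot,\cdot)$ is the Needleman-Wunsch alignment score, $\tau$ is the chosen threshold (its value is not fixed here), and $P$ is the predicted genome. Assuming that ``passing'' the threshold means reaching or exceeding it,
+\[
+  g \in P \iff \mathrm{NW}\left(s^{N}_{g}, s^{D}_{g}\right) \geq \tau .
+\]
+Whether the comparison is strict, and how the score is normalized for sequence length, are choices this sketch leaves open.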