-Dogma (\textit{Dual Organellar GenoMe Annotator}) is an annotation tool for plant chloroplast and animal mitochondrial genomes, developed at the University of Texas in 2004~\cite{RDogma}.
-Dogma has its own database. It translates the input genome in all six reading frames and queries its amino acid sequence database using Blast~\cite{altschul1990basic} (i.e., Blastx) with various parameters, in order to identify protein coding genes~\cite{parra2007cegma,RDogma} in the input genome based on sequence similarity to genes in the Dogma database. Furthermore, it can identify the \textit{transfer RNAs (tRNA)}~\cite{RDogma} and the \textit{ribosomal RNAs (rRNA)}~\cite{RDogma} and verify their start and end positions rather than relying on the NCBI annotation tool. There is no gene duplication with Dogma once gene fragmentation has been solved. \\
-Genome annotation with Dogma can be the key difference in extracting core genes. As shown in Figure~\ref{dog:Annotation}, the annotation step is divided into two tasks. The first task annotates complete chloroplast genomes (i.e., \textit{unannotated genomes} from NCBI) using the Dogma web tool. The whole annotation process was done manually. The output of Dogma is a file of coding genes for each genome, in the GeneVision~\cite{geneVision} file format.\\
-The second task is to solve gene fragments. The defragmentation process starts immediately after the first task and resolves the fragments of coding genes for each genome to avoid gene duplication. This process looks at the fragment orientation: if it is negative, the process applies a reverse complement operation to the gene sequence. After this stage all genomes are fully annotated, their genes are defragmented, and their gene lists and counts are identified. This information is stored in a local database.\\
-\begin{figure}[H]
- \centering
- \includegraphics[width=0.7\textwidth]{Dogma_GeneName}
- \caption{Dogma Annotation for Chloroplast genomes}\label{dog:Annotation}
-\end{figure}
-
-From these two tasks, we obtain exactly one copy of each coding gene. To ensure that the genes produced by the Dogma annotation process are the same as the genes in NCBI, we apply in parallel a quality checking process that aligns each gene from Dogma with its NCBI counterpart with respect to a specific threshold.\\
-
-\subsection{Extract Core Genes}
-The goal of this step is to extract the maximum core genes from the sets of genes (\textit{vectors}) in the local database. The methodology for finding core genes is divided into three methods: \\
-
-The hypothesis of the first method is that core genes can be extracted by finding the common genes among chloroplast genomes using gene features (i.e., gene names and gene counts). Genomes vary in gene counts according to the annotation method used, so the maximum core genes are extracted by constructing an Intersection Core Matrix (\textit{ICM}).\\
-The hypothesis of the second method is based on comparing the sequences of the reference genes of one annotated genome with the sequences of the other, unannotated genomes in a Blast database, using Blastn\cite{Sayers01012011} (the nucleotide sequence alignment tool from NCBI). The last method merges all genes from the NCBI and Dogma annotations and then applies a sequence similarity based method (quality control test) using the Needleman-Wunsch algorithm to predict new genomes. The predicted genomes are then used to extract core genes with the previous methods. Figure \ref{wholesystem} illustrates the operations of the whole system.
-
-\begin{figure}[H]
- \centering
- \includegraphics[width=0.7\textwidth]{Whole_system}
- \caption{Total overview of the system pipeline}\label{wholesystem}
-\end{figure}
-
-In the first method, the idea is to collect the maximum number of common genes in each iteration. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. The ICM is a two-dimensional symmetric matrix (considered as a vector space) in which each row and each column represents the vector of one genome. Each position in the ICM stores an \textit{intersection score}. The Intersection Score (IS) is the cardinality of the core genes obtained by intersecting one vector with another vector in the vector space. Taking the maximum cardinality of each row and then the maximum of these values selects the maximum cardinality in the vector space, which in turn selects the two genomes with the largest core. Mathematically speaking, if we have an $n \times m$ vector space matrix where $n=m=\text{number of vectors in local database}$, then let us consider:\\
-
-\begin{equation}
-Score=\max_{i<j}\vert x_i \cap x_j\vert
-\label{Eq1}
-\end{equation}\\
-
-where $x_i, x_j$ are vectors in the matrix. Whether a new core is generated depends on the value obtained by intersecting the two vectors, which we call $Score$:\\
-$$\text{New Core} = \begin{cases}
-\text{Ignored} & \text{if $Score=0$;} \\
-\text{new Core id} & \text{if $Score>0$.}
-\end{cases}$$\\
-
-If $Score=0$, then we have a \textit{disjoint relation} (i.e., no common genes between the two genomes). In this case the system ignores the vector that destroys the core genes. Otherwise, the system removes these two vectors from the ICM and adds a new core vector with a \textit{coreID} to the ICM for the calculation in the next iteration. Each partial core vector generated is stored with its values in the local database so that it can be reused to draw the tree. This process repeats until all vectors are treated.
-We observe that the ICM can become very large because of the amount of data it stores. In addition, calculating the intersection scores is time and memory consuming, even when using just gene names. To speed up the calculation, we can compute only the upper triangle of the matrix and exclude the diagonal. This reduces the processing time and memory by half: the number of intersection scores to compute drops from $n^2-n$ to $\frac{n(n-1)}{2}$, although the complexity remains $O(n^2)$. The construction of the vector matrix and the extraction of the vector of maximum core genes are illustrated in Algorithm \ref{Alg1:ICM}. The output of this step is the maximum core vector together with its two vectors, which are drawn in a tree.\\
-
-\begin{algorithm}[H]
-\caption{Extract Maximum Intersection Score}
-\label{Alg1:ICM}
-\begin{algorithmic}
-\REQUIRE $L \leftarrow \text{genomes vectors}$
-\ENSURE $B1 \leftarrow \text{max core vector}$
-\FOR{$i \leftarrow 0:len(L)-1$}
- \STATE $core1 \leftarrow set(GenomeList[L[i]])$
- \STATE $score1 \leftarrow 0$
- \STATE $g1,g2 \leftarrow$ " "
- \FOR{$j \leftarrow i+1:len(L)$}
- \STATE $core2 \leftarrow set(GenomeList[L[j]])$
- \IF{$i < j$}
- \STATE $Core \leftarrow core1 \cap core2$
- \IF{$len(Core) > score1$}
- \STATE $g1 \leftarrow L[i]$
- \STATE $g2 \leftarrow L[j]$
- \STATE $score1 \leftarrow len(Core)$
- \ELSIF{$len(Core) == 0$}
- \STATE $g1 \leftarrow L[i]$
- \STATE $g2 \leftarrow L[j]$
- \STATE $Score \leftarrow -1$
- \ENDIF
- \ENDIF
- \ENDFOR
- \STATE $B1[score1] \leftarrow (g1,g2)$
-\ENDFOR
-\RETURN $max(B1)$
-\end{algorithmic}
-\end{algorithm}
-\textit{GenomeList} represents the local database.\\
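The pairwise search of Algorithm \ref{Alg1:ICM} can be sketched in Python as follows. This is a minimal illustration rather than the system's implementation: the function name \texttt{max\_intersection}, the dictionary layout standing in for \textit{GenomeList}, and the toy gene sets are all assumptions made for the example. Only the upper triangle of the ICM is visited, and disjoint pairs (score 0) are never selected.

```python
from itertools import combinations

def max_intersection(genomes):
    """Return (score, g1, g2) for the genome pair sharing the most gene names.

    `genomes` maps a genome id to a set of gene names (a hypothetical
    stand-in for the local database). Pairs with no common gene are skipped.
    """
    best = (0, None, None)
    # combinations() yields each unordered pair once: the upper triangle
    # of the ICM without the diagonal, i.e. n(n-1)/2 intersections.
    for g1, g2 in combinations(sorted(genomes), 2):
        score = len(genomes[g1] & genomes[g2])
        if score > best[0]:
            best = (score, g1, g2)
    return best

# Toy data, invented for illustration only.
genomes = {
    "A": {"psbA", "rbcL", "atpB"},
    "B": {"psbA", "rbcL", "ndhF"},
    "C": {"matK"},
}
result = max_intersection(genomes)
```

In the iterative procedure described above, the two selected vectors would then be replaced by their intersection under a new \textit{coreID} before the next call.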
-
-In the second method, because of the number of genomes, annotating each genome can be an exhausting task, especially with Dogma: Dogma offers only a web tool for annotation, so each genome must be annotated manually through this tool. We overcome this problem by choosing one reference chloroplast genome and querying each reference gene with \textit{Blastn} to examine its existence in the remaining unannotated genomes in the Blast database. We then collect all matching genomes from the hits of each gene, to satisfy the hypothesis ``a gene that exists in the maximum number of genomes also exists in the core genes''. In addition, we can also extract the maximum core genes by examining how many genes are present in each genome. Algorithm \ref{Alg2:secondM} states the general algorithm for the second method. \\
+Dogma is an annotation tool developed at the University of Texas in 2004. Dogma is an abbreviation of \textit{Dual Organellar GenoMe Annotator}; it annotates plant chloroplast and animal mitochondrial genomes.
+It has its own database. It translates the genome in all six reading frames and queries the amino acid sequence database using Blast\cite{altschul1990basic} (i.e., Blastx) with various parameters. Furthermore, it identifies protein coding genes in the input genome based on sequence similarity to genes in the Dogma database. In addition, it can identify the \textit{transfer RNAs (tRNA)} and the \textit{ribosomal RNAs (rRNA)} and verify their start and end positions rather than relying on the NCBI annotation tool. There is no gene duplication with Dogma once gene fragmentation has been solved. \\
+Genome annotation with Dogma can be the key difference in extracting core genes. The annotation step is divided into two tasks. The first task annotates a complete chloroplast genome (i.e., an \textit{unannotated genome from NCBI}) using the Dogma web tool. This process was done manually. The output of Dogma is a file of coding genes for each genome, in the GeneVision file format.
+The second task is to solve gene fragments. Two methods are used to resolve gene duplication when extracting core genes. First, for the method based on gene names, all duplications are removed by translating each list of genes into a set of genes. Second, for the method based on the gene quality test, the defragmentation process starts immediately and resolves the fragments of coding genes for each genome to avoid gene duplication. In each iteration, this process takes one gene from the gene list and searches for duplicates of it. If a duplicate exists, the process looks at the orientation of the fragment sequence: if it is positive, the fragment sequence is appended to the gene file; otherwise, the process applies a reverse complement operation to the fragment sequence and then appends it to the gene file. An additional process checks the start and stop codons and tries to find an appropriate start or stop codon if one is missing. After this stage all genomes are fully annotated, their genes are defragmented, and their gene lists and counts are identified.\\
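The defragmentation step can be illustrated with a short Python sketch. It is only an approximation under stated assumptions: the \texttt{(name, orientation, sequence)} tuple layout, the \texttt{+}/\texttt{-} orientation codes, and the function names are hypothetical, and the start and stop codon check is left out.

```python
# Translation table for complementing DNA bases (upper and lower case).
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")

def reverse_complement(seq):
    """Complement every base, then reverse the sequence."""
    return seq.translate(_COMPLEMENT)[::-1]

def defragment(fragments):
    """Merge fragments that share a gene name into one sequence per gene.

    `fragments` is a list of (name, orientation, sequence) tuples, an
    assumed layout. Fragments on the negative strand are reverse
    complemented before being appended, in input order.
    """
    genes = {}
    for name, orientation, seq in fragments:
        if orientation == "-":
            seq = reverse_complement(seq)
        genes[name] = genes.get(name, "") + seq
    return genes

# Toy fragments, invented for illustration only.
fragments = [
    ("rps12", "+", "ATGC"),
    ("rps12", "-", "TTGA"),
    ("psbA", "+", "ATGA"),
]
merged = defragment(fragments)
```

Keying the result by gene name gives one entry per gene, which is what removes the duplication that fragmented genes would otherwise cause.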