-The goal of this step is to extract the maximum set of core genes from the gene sets of the genomes. Three methods are used to find core genes:\\
-
-The first method extracts core genes from shared gene features (i.e., gene names and gene counts). Because the number of genes per genome varies with the annotation method used, the core genes are extracted by constructing an Intersection Core Matrix (\textit{ICM}).\\
-The second method compares the sequences of the reference genes of one annotated genome with the sequences of the other, unannotated genomes in a Blast database, using Blastn\cite{Sayers01012011} (the nucleotide sequence alignment tool from NCBI). The last method merges all genes from the NCBI and Dogma annotations, then applies a sequence-similarity-based method (a quality control test) using the Needleman-Wunsch algorithm to predict new genomes. The predicted genomes are then used to extract core genes with the previous methods. Figure \ref{wholesystem} illustrates the operation of the whole system.
-
-\begin{figure}[H]
- \centering
- \includegraphics[width=0.7\textwidth]{Whole_system}
- \caption{Total overview of the system pipeline}\label{wholesystem}
-\end{figure}
-
-In the first method, the idea is to iteratively collect the maximum number of common genes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. The ICM is a two-dimensional symmetric matrix in which each row and each column represents one genome. Each position in the ICM stores an \textit{intersection score (IS)}: the cardinality of the core genes obtained by intersecting one genome with another. The maximum cardinality selects the two genomes with the largest core. Mathematically speaking, if we have an $n \times n$ matrix, where $n$ is the number of genomes in the local database, then let us consider:\\
-
-\begin{equation}
-Score=\max_{i<j}\vert x_i \cap x_j\vert
-\label{Eq1}
-\end{equation}\\
-
-Here, $x_i$ and $x_j$ are the genome vectors indexing row $i$ and column $j$ of the matrix. Whether a new core is generated depends on the cardinality of the intersection, which we call $Score$:\\
-$$\text{New Core} = \begin{cases}
-\text{Ignored} & \text{if $Score=0$;} \\
-\text{new Core id} & \text{if $Score>0$.}
-\end{cases}$$\\
-
-If $Score=0$, the two genomes are in a \textit{disjoint relation} (i.e., they share no common genes). In this case the system ignores the vector that would destroy the core genes. Otherwise, the system removes these two vectors from the ICM and adds a new core vector, identified by a \textit{coreID}, to the ICM for the calculation in the next iteration. The partial core vectors generated, together with their values, are stored in the local database so that they can be reused to draw the tree. This process repeats until all vectors have been treated.
-We observe that the ICM becomes very large because of the amount of data it stores. In addition, calculating the intersection scores is time and memory consuming, even when only gene names are used. To speed up the calculations, we compute only the scores of the upper triangle and exclude the diagonal. This halves the processing time and memory: the number of intersections computed drops from $n^2-n$ to $\frac{n(n-1)}{2}$, although the asymptotic complexity remains $O(n^2)$. The construction of the vector matrix and the extraction of the vector of maximum core genes are given in Algorithm \ref{Alg1:ICM}. The output of this step is the maximum core vector together with its two vectors, which are used to draw the tree.\\
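-
-As a small worked illustration (the three genomes and their gene names below are hypothetical, chosen only to make the arithmetic concrete), suppose the local database holds $n=3$ genomes with gene sets
-\begin{align*}
-x_1 &= \{\mathit{psbA}, \mathit{rbcL}, \mathit{atpB}, \mathit{matK}\}, \\
-x_2 &= \{\mathit{psbA}, \mathit{rbcL}, \mathit{atpB}\}, \\
-x_3 &= \{\mathit{psbA}, \mathit{ndhF}\}.
-\end{align*}
-Only the upper triangle of the ICM needs to be evaluated, that is, $\frac{n(n-1)}{2}=3$ intersections instead of $n^2-n=6$:
-\begin{align*}
-\vert x_1 \cap x_2\vert &= 3, & \vert x_1 \cap x_3\vert &= 1, & \vert x_2 \cap x_3\vert &= 1.
-\end{align*}
-Hence $Score=\max_{i<j}\vert x_i \cap x_j\vert = 3$, reached by the pair $(x_1,x_2)$. Under the procedure described above, these two vectors would be replaced by the core vector $x_1 \cap x_2 = \{\mathit{psbA}, \mathit{rbcL}, \mathit{atpB}\}$, and the next iteration would compare this core vector with $x_3$, giving a score of $1$ and the core $\{\mathit{psbA}\}$.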
-
-\begin{algorithm}[H]
-\caption{Extract Maximum Intersection Score}
-\label{Alg1:ICM}
-\begin{algorithmic}
-\REQUIRE $L \leftarrow \text{genome vectors}$
-\ENSURE $B1 \leftarrow \text{max core vector}$
-\FOR{$i \leftarrow 0:len(L)-1$}
- \STATE $core1 \leftarrow set(GenomeList[L[i]])$
- \STATE $score1 \leftarrow 0$
- \STATE $g1,g2 \leftarrow$ " "
- \FOR{$j \leftarrow i+1:len(L)$}
- \STATE $core2 \leftarrow set(GenomeList[L[j]])$
- \STATE $Core \leftarrow core1 \cap core2$ \COMMENT{an empty $Core$ (disjoint pair) is ignored}
- \IF{$len(Core) > score1$}
- \STATE $g1 \leftarrow L[i]$
- \STATE $g2 \leftarrow L[j]$
- \STATE $score1 \leftarrow len(Core)$
- \ENDIF
- \ENDFOR
- \STATE $B1[score1] \leftarrow (g1,g2)$
-\ENDFOR
-\RETURN $max(B1)$
-\end{algorithmic}
-\end{algorithm}
-\textit{GenomeList} represents the local database.\\
-
-In the second method, the large number of genomes makes annotating each one an exhausting task, especially with Dogma: Dogma offers annotation only through a web tool, so each genome must be annotated manually through it. We overcome this problem by choosing one reference chloroplast genome and querying each of its genes with \textit{Blastn} to test for its existence in the remaining unannotated genomes in the Blast database. All matching genomes are collected from the hits of each gene, following the hypothesis ``a gene that exists in the maximum number of genomes also exists in the core genes''. In addition, the maximum core genes can be extracted by examining how many genes are present in each genome. Algorithm \ref{Alg2:secondM} states the general algorithm for the second method.\\
-
-\begin{algorithm}[H]
-\caption{Extract Maximum Core genes based on Blast}
-\label{Alg2:secondM}
-\begin{algorithmic}
-\REQUIRE $Ref\_Genome \leftarrow \text{Accession No}$
-\ENSURE $Core \leftarrow \text{genomes for each gene}$
-\STATE $Core \leftarrow \text{empty map}$
-\FOR{$gene \in Ref\_Genome$}
- \STATE $G\_list \leftarrow \text{empty list}$
- \STATE $File \leftarrow Blastn(gene)$
- \STATE $G\_list \leftarrow File[\text{Genomes names}]$
- \STATE $Core[gene] \leftarrow G\_list$
-\ENDFOR
-\RETURN $Core$
-\end{algorithmic}
-\end{algorithm}
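-
-To illustrate how the output of Algorithm \ref{Alg2:secondM} could be turned into a core (a sketch only: the gene names, the genome labels $A$ to $D$ and the hit lists are hypothetical, not results), suppose the reference genome has three genes and the Blast database holds four genomes, with the genome list $G\_list$ returned for each gene being
-\begin{align*}
-G\_list(\mathit{psbA}) &= \{A, B, C, D\}, \\
-G\_list(\mathit{rbcL}) &= \{A, B, C, D\}, \\
-G\_list(\mathit{ndhF}) &= \{A, C\}.
-\end{align*}
-Counting genomes per gene gives $4$, $4$ and $2$, so under the stated hypothesis $\mathit{psbA}$ and $\mathit{rbcL}$, which occur in the maximum number of genomes, belong to the core genes, while $\mathit{ndhF}$ does not. Inverting the map gives the number of reference genes present in each genome: $3$ for $A$ and $C$, and $2$ for $B$ and $D$.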