-The field of genome annotation pays a lot of attentions where the ability to collect and analysis genomical data can provide strong indicators for the study of life\cite{Eisen2007}. Four of genome annotation centers (such as, \textit{NCBI\cite{Sayers01012011}, Dogma \cite{RDogma}, cpBase \cite{de2002comparative}, CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}}) present various types of annotation tools (\emph{i.e.} cost-effective sequencing methods\cite{Bakke2009}) on different annotation levels. Generally, previous studies used one of three methods for gene finding in annotated genome using these centers: \textit{alignment-based, composition based, or combination of both\cite{parra2007cegma}}. The alignment-based method is used when we try to predict a coding gene (\emph{i.e.}. genes that produce proteins) by aligning DNA sequence of gene to the protein of cDNA sequence of homology\cite{parra2007cegma}. This approach also is used in GeneWise\cite{birney2004genewise}. Composition-based method (known as \textit{ab initio}) is based on a probabilistic model of gene structure to find genes according to the gene value probability (GeneID\cite{parra2000geneid}). In this section, we consider a new method of finding core genes from large amount of chloroplast genomes, as a solution of the problem resulting from the method stated in section two. This method is based on extracting gene features. A general overview of the system is illustrated in Figure \ref{Fig1}.\\
+Genome annotation receives considerable attention, since the
+ability to collect and analyze genomic data can provide strong
+indicators for the study of life\cite{Eisen2007}. Several genome
+annotation centers (such as \textit{NCBI\cite{Sayers01012011},
+Dogma\cite{RDogma}, cpBase\cite{de2002comparative},
+CpGAVAS\cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}})
+provide various types of annotation tools (\emph{e.g.}, cost-effective
+sequencing methods\cite{Bakke2009}) at different annotation
+levels. Generally, previous studies used one of three methods for gene
+finding in annotated genomes using these
+centers: \textit{alignment-based, composition-based, or a combination of
+both\cite{parra2007cegma}}. The alignment-based method predicts a
+coding gene (\emph{i.e.}, a gene that produces a
+protein) by aligning the DNA sequence of the gene to a homologous
+protein or cDNA sequence\cite{parra2007cegma}. This approach is also used
+in GeneWise\cite{birney2004genewise}. The composition-based method (known
+as \textit{ab initio}) relies on a probabilistic model of gene
+structure to find genes according to the probability assigned to each
+candidate gene (GeneID\cite{parra2000geneid}). In this section, we consider a new
+method for finding core genes from a large number of chloroplast genomes,
+as a solution to the problem resulting from the method stated in
+section two. This method is based on extracting gene features. A
+general overview of the system is illustrated in Figure \ref{Fig1}.\\
\begin{figure}[H]
\centering
\subsection{Genomes Samples}
In this research, we retrieve chloroplast genomes from NCBI. Ninety-nine of these genomes are considered in this work. They fall into eleven types of chloroplast families. The distribution of genomes is detailed in Table \ref{Tab2}.
-
-\input{population_Table}
-
+
+\input{population_Table}
\subsection{Genome Annotation Techniques}
Genome annotation is the second stage in the model pipeline. Many techniques have been developed to annotate chloroplast genomes, but they vary in the number and type of predicted genes (\emph{e.g.}, in their ability to predict \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques, from NCBI and Dogma, are considered to analyse chloroplast genomes and to examine the accuracy of the predicted coding genes.
\subsubsection{Intersection Core Matrix (\textit{ICM})}
-The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores the \textit{Intersection Scores(IS)}. IS is the cardinality number of a core genes which comes from intersecting one genome with other ones. Maximum cardinality results to select two genomes with their maximum core. Mathematically speaking, if we have an $n \times n$ matrix where $n \text{is the number of genomes in local database}$, then lets consider:\\
+The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. The ICM is a two-dimensional symmetric matrix in which each row and each column represents one genome. Each position in the ICM stores an \textit{Intersection Score (IS)}, the cardinality of the core gene set obtained by intersecting one genome with another. The maximum cardinality selects the two genomes that share the largest core. Mathematically speaking, if we have an $n \times n$ matrix, where $n$
+is the number of genomes in the local database, then let us consider:\\
+
\begin{equation}
Score=\max_{i<j}\vert x_i \cap x_j\vert
\label{Eq1}
\end{equation}
-where $x_i, x_j$ are elements in the matrix. The generation of a new core genes is depending on the cardinality value of intersection scores, we call it \textit{Score}:
+
+\noindent where $x_i$ and $x_j$ denote the gene sets of the genomes indexed by row $i$ and column $j$ of the matrix. The generation of a new core depends on the cardinality of this intersection, which we call \textit{Score}:
$$\text{New Core} = \begin{cases}
\text{Ignored} & \text{if $\textit{Score}=0$;} \\
\text{new Core id} & \text{if $\textit{Score}>0$.}