-The objective of this step is to organize genes, resolve gene duplications, and generate a set of genes for each genome. The input to the system is our list of chloroplast genomes, annotated by NCBI\cite{Sayers01012011}. All genomes are stored as \textit{.fasta} files and include a collection of protein-coding genes\cite{parra2007cegma,RDogma} (genes that produce proteins) with their coding sequences.
-As a preparation step toward obtaining the set of core genes, we need to translate these genomes using the \textit{BioPython} package\cite{chapman2000biopython} and extract all the information needed to find the core genes. The process starts by converting each genome from fasta format to the GenVision\cite{geneVision} formats from DNASTAR, which is not an easy job. The output of this operation is, for each genome, a list of genes stored in a local database, together with the gene names and gene counts. At this stage, some gene duplications accumulate with each genome treated. In other words, duplicated gene names can come from gene fragments as well as from chloroplast DNA sequences. We define the \textit{Identical state} to be the state in which each gene is present only once in a genome (i.e., the gene has no copy), without considering gene position or orientation. This state can be reached by filtering redundant gene names out of the database. To do this, we proceed in two steps. First, we perform an orthography check, which is used to merge the fragments of a gene into one gene.
-Second, we convert the list of gene names for each genome in the database (i.e., after the orthography check) into a set of gene names. Mathematically speaking, if $G=\left[g_1,g_2,g_3,g_1,g_3,g_4\right]$ is a list of gene names, then by the definition of a set in mathematics we have $set(G)=\{g_1,g_2,g_3,g_4\}$ and $|set(G)|=4$, where $|set(G)|$ is the cardinality of the set $set(G)$, which represents the number of distinct genes.\\
-The whole process of extracting the core genome based on gene names and counts among genomes is illustrated in Figure \ref{NCBI:Annotation}.\\
-
-\begin{figure}[H]
- \centering
- \includegraphics[width=0.7\textwidth]{NCBI_GeneName}
- \caption{NCBI Annotation for Chloroplast genomes}
- \label{NCBI:Annotation}
-\end{figure}
+The objective of this step is to organize genes, resolve gene duplications, and generate a set of genes for each genome. The input to the system is our list of chloroplast genomes, annotated by NCBI. All genomes are stored as \textit{.fasta} files and include a collection of protein-coding genes\cite{parra2007cegma,RDogma} (genes that produce proteins) with their coding sequences.
+As a preparation step toward obtaining the set of core genes, we need to analyse these genomes (using the \textit{BioPython} package\cite{chapman2000biopython})
+to extract all the information needed to find the core genes. The process starts by converting each genome from fasta format to the GenVision\cite{geneVision} formats from DNASTAR. The outputs of this operation are, for each genome, a list of genes with their gene names and gene counts. At this stage, we accumulate some gene duplications for each treated genome. In other words, duplicated gene names can come from gene fragments as well as from chloroplast DNA sequences. We define the \textit{Identical state} to be the state in which each gene is present only once in a genome (i.e., the gene has no copy), without considering gene position or orientation. This state can be reached by filtering redundant gene names out of the database.
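As an illustration of the filtering step described above, the following is a minimal Python sketch that assumes the gene names of one genome are already available as a plain list of strings. The function name \texttt{gene\_name\_summary} and the example gene names are hypothetical and are not taken from the actual pipeline. The sketch only removes exact repeats of a name; it does not merge gene fragments.

```python
from collections import Counter


def gene_name_summary(gene_names):
    """Return the distinct gene names and the per-name counts for one genome.

    gene_names: list of gene-name strings, possibly containing repeats.
    """
    counts = Counter(gene_names)   # how many times each name occurs
    distinct = set(counts)         # each name kept exactly once
    return distinct, counts


# Hypothetical gene-name list for a single genome (illustrative only).
genes = ["rbcL", "psbA", "atpB", "rbcL", "atpB", "ndhF"]

distinct, counts = gene_name_summary(genes)

# Names that occurred more than once before filtering.
duplicated = sorted(name for name, n in counts.items() if n > 1)

print(len(distinct))
print(duplicated)
```

Keeping the counts alongside the set preserves the information about which names were duplicated, which would otherwise be lost once the list is reduced to a set.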