\includegraphics[width=0.75\textwidth]{Whole_system}
\caption{A general overview of the annotation-based approach}\label{Fig1}
%Figure~\ref{Fig1} presents a general overview of the entire proposed pipeline
%for core and pan genomes production and exploitation, which consists of three stages: \textit{Genomes annotation}, \textit{Core extraction}, and \textit{Features Visualization}.
% To understand the whole core extraction process, we
% describe briefly each stage below. More details will be given in the
\color{red}In previous work~\cite{Alkindy2014}, we proposed a pipeline for extracting the core genome. In this work, the pipeline is extended with a quality test applied when extracting core genes (see Figure~\ref{Fig1} for more details). As a starting point, annotation relies on a DNA sequence database % chosen among the many international databases storing %nucleotide sequences,
such as NCBI's GenBank~\cite{Sayers01012011}, the European \textit{EMBL} database~\cite{apweiler1985swiss}, or the Japanese \textit{DDBJ} database~\cite{sugawara2008ddbj}.
Furthermore, it is possible to obtain annotated genomes (DNA coding sequences with gene
names and locations) by interacting with these databases, either by directly downloading
the annotated genomes delivered by these websites, or by launching an
annotation tool on complete downloaded genomes.
Obviously, this annotation stage must be of high quality if we want
to obtain acceptable core and pan genomes.
% These last years the cost of sequencing genomes has been greatly
% reduced, and thus more and more genomes are sequenced. Therefore
% automatic annotation tools are required to deal with this continuously
% increasing amount of genomics data. %Moreover, a reliable and accurate
% %genome annotation process is needed in order to provide strong
%indicators for the study of life\cite{Eisen2007}.
%Various cost-effective annotation tools~\cite{Bakke2009} producing genomic annotations at many levels of detail have been designed recently, some reputed ones being: % NCBI~\cite{Sayers01012011}, DOGMA~\cite{RDOGMA}, cpBase~\cite{de2002comparative}, CpGAVAS~\cite{liu2012cpgavas}, and CEGMA~\cite{parra2007cegma}. Such tools usually use one out of the three following methods for finding gene locations in large DNA sequences: \textit{alignment-based}, \textit{composition based}, or a combination of both~\cite{parra2007cegma}. The alignment-based method is used when trying to predict a protein coding sequence by aligning a genomic DNA sequence with a cDNA sequence coding an already known homologous protein~\cite{parra2007cegma}. This approach is used for instance in GeneWise~\cite{birney2004genewise}. The alternative method, the composition-based one (also known as \textit{ab initio}) is based on probabilistic models of genes structure~\cite{parra2000geneid}. % to find genes according to the gene value probability
Using such annotated genomes, we will detail two general approaches for extracting the core genome, which is the third stage of the pipeline. The first approach uses similarities computed on predicted coding sequences, while the second uses all the information provided during the annotation stage.
\color{red}Instead of considering only the gene sequences taken from NCBI or DOGMA, a quality test is carried out on both gene names and sequences in order to produce quality genes. However, we will show that such a simple idea is not so easy to realize, and that considering only the gene names provided by such tools is not sufficient, even though this gave good results in previous work~\cite{Alkindy2014}. \color{black}
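
To make the idea of a test combining names and sequences concrete, the following Python sketch keeps a gene only when its name appears in both annotations and the two associated sequences are sufficiently similar. It is an illustration under stated assumptions, not the test used in this work: the similarity measure (the ratio of Python's \texttt{difflib.SequenceMatcher}), the threshold of $0.9$, the function names, and the toy sequences are all hypothetical.

\begin{verbatim}
from difflib import SequenceMatcher


def similarity(seq_a, seq_b):
    """Similarity ratio in [0, 1]; a stand-in for an alignment score."""
    return SequenceMatcher(None, seq_a, seq_b, autojunk=False).ratio()


def quality_genes(ncbi, dogma, threshold=0.9):
    """Keep genes named in both annotations with similar sequences.

    ncbi, dogma: dicts mapping gene name -> DNA sequence (hypothetical
    inputs). Returns a dict mapping each kept gene name to its score.
    """
    kept = {}
    for name in ncbi.keys() & dogma.keys():   # names found by both tools
        score = similarity(ncbi[name], dogma[name])
        if score >= threshold:                # assumed cut-off
            kept[name] = score
    return kept


# Toy data, invented for illustration only.
ncbi = {
    "rbcL": "ATGTCACCACAAACAGAGACT",
    "psbA": "ATGACTGCAATTTTAGAGAGA",
    "ycf1": "ATGATT",
}
dogma = {
    "rbcL": "ATGTCACCACAAACAGAGACT",
    "psbA": "TTTTTTTTTTCCCCCCCCCCG",
    "matK": "ATGGAG",
}
kept = quality_genes(ncbi, dogma)
\end{verbatim}

In this toy example only \texttt{rbcL} passes: \texttt{psbA} shares its name across the two annotations but not its sequence, while \texttt{ycf1} and \texttt{matK} are each reported by a single tool.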
Annotation, which is the first stage, is an important task for extracting gene features. Indeed, extracting good gene features obviously requires a good annotation tool.
Such annotations can then be used in various manners (based on gene names, gene sequences, protein sequences, etc.) to extract the core and pan genomes.
We will subsequently propose methods that use gene names and sequences for extracting core genes and producing a chloroplast evolutionary tree.
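
As a minimal illustration of how gene names alone can yield core and pan genomes, the Python sketch below takes the core genome as the intersection, and the pan genome as the union, of the gene-name sets of all genomes. The function name, the lower-casing of names, and the toy input are assumptions made for this example and are not taken from the pipeline itself.

\begin{verbatim}
def core_and_pan(genomes):
    """Return (core, pan) gene-name sets.

    genomes: dict mapping a genome identifier to an iterable of gene
    names. Names are lower-cased so that 'rbcL' and 'RBCL' match
    (an assumed normalization).
    """
    gene_sets = [{name.lower() for name in genes}
                 for genes in genomes.values()]
    if not gene_sets:
        return set(), set()
    core = set.intersection(*gene_sets)  # genes shared by every genome
    pan = set.union(*gene_sets)          # genes seen in any genome
    return core, pan


# Toy input, invented for illustration only.
genomes = {
    "genome_A": ["rbcL", "psbA", "atpB", "ycf1"],
    "genome_B": ["rbcL", "psbA", "atpB"],
    "genome_C": ["RBCL", "psbA", "matK"],
}
core, pan = core_and_pan(genomes)
\end{verbatim}

Here the core genome contains the two names common to all three genomes, and the pan genome contains the five distinct names. A name-only intersection of this kind is sensitive to naming inconsistencies between annotation tools, which is one reason to combine names with sequences.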
%\input{population_Table}
The final stage of our pipeline, which is only briefly mentioned in this article, takes advantage
of the information produced during the search for core and pan genomes.
This feature visualization stage encompasses phylogenetic tree construction
using core genes (see \cite{Alkindy2014} for more details), gene content evolution illustrated by core trees, functionality
investigations, and so on.
% allows to visualize genomes and/or gene evolution in chloroplast. Therefore we use representations like tables, phylogenetic trees, graphs, etc. to organize and show genomes relationships, and thus achieve the goal of representing gene
% evolution. In addition, comparing these representations with ones issued from another annotation tool dedicated to large population of chloroplast genomes give us biological perspectives to the nature of chloroplasts evolution. %Notice that a local database linked with each pipe stage is used to store all the information produced during the process.
For illustration purposes, we have considered % GenBank-NCBI~\cite{Sayers01012011} as sequence
99~chloroplast genomes downloaded from the GenBank database~\cite{Sayers01012011}. These genomes
belong to eleven types of chloroplast families (see \cite{Alkindy2014} for more details).%as described in Table~\ref{Tab2}.
Furthermore, two kinds of annotations will be considered in this document, namely the
ones provided by NCBI on the one hand, and the ones provided by DOGMA on the other hand.
%database in our method must be taken from any confident data source
%that stores annotated and/or unannotated chloroplast genomes.
% As stated in the previous section, we have
% considered GenBank-NCBI~\cite{Sayers01012011} as sequence