\color{red}Investigating the evolution of genomes has become a hard task, owing to the number of evolutionary techniques and the number of available genomes, which rises every day. The important questions here are: how can we cluster large numbers of chloroplast species, and which common genes play a role in the evolution of these species? Clustering a collection of species aims to find the common genes that share the same functional properties. In other words, clustering helps us find the core and pan genomes of species that share common properties such as gene name, gene sequence, family, etc. According to other studies, finding such core and/or pan genomes is not an easy task: it involves a large amount of computation and requires a rigorous methodology. \color{black}
%Due to the recent evolution of sequencing techniques, the number of
%available genomes is rising steadily, raising the problem to determine
%what to do with such large sets of DNA data. An interesting question
%is to understand what are the common functionality of a collection
%of species or, conversely, to determine what is specific to a given
%species when compared to other ones belonging in the same genus, family, etc.
%Investigating such a problem means to find both core and pan genomes
%of a collection of species, that is, genes in common to all the species
%vs. genes present at least once in the set of genomes. However, to obtain
%trustworthy core and pan genomes is not an easy task, leading to a large
%amount of computation, and requiring a rigorous methodology. Surprisingly,
%as far as we know, this methodology in finding core and pan genomes has not really been
%investigated in detail. This research work tries to fill this gap
%by focusing only on chloroplastic genomes, whose reasonable sizes allow a deep study.
%% DNA analysis techniques have received a lot of attention these last
%% years, because they play an important role in understanding genomes
%% evolution over time, and in phylogenetic and genetic analyses.
%% However systematic approaches to determine
%%models of genomes evolution are based on the analysis of DNA
%%sequences, SNPs, mutations, and so on.
To achieve this goal, a collection of 99 chloroplasts is
considered in this article. Two methodologies will be
investigated, based respectively on sequence similarities and
on information from annotation tools.
The results will finally be evaluated in terms of performance and
% Various genes prediction methods will be
% firstly compared, some of them being specific to chloroplastic genomes.
% Then clustering methods will be proposed and evaluated, in order
% to group these coding sequences by orthologous genes.
% % We have recently investigated
% the use of core (\emph{i.e.}, common genes) and pan genomes to infer
% evolutionary information on a collection of 99~chloroplasts. In
% particular, we have regarded methods to build a genes content
% evolutionary tree using distances to core genome. However, the
% production of reliable core and pan genomes is not an easy task, due
% to error annotations. The aim of this methodology article is to
% % investigate various ways to .
% We will first compare different approaches to
% construct such a tree using fully annotated genomes provided by NCBI and
% DOGMA, followed by a gene quality control among the common genes. Then
% we will explain how, by comparing sequences from DOGMA with NCBI
% contents, we achieved to identify the genes that play a key role in
% the dynamics of genomes evolution.
\textbf{Keywords:} Core genome, Methodology, Pan genome, Gene prediction, Coding sequence clustering, Chloroplasts, Gene quality test.