In this research work, we studied two %three
methodologies for extracting core genes from a large set of chloroplast genomes, and we developed
Python programs to evaluate them in practice.
%Extracted core genomes
%depend on both gene names and sequences.
% Furthermore, that extract these core genes with the three methodologies.

We first considered extracting core genomes by comparing, through global
alignment, DNA sequences downloaded from the NCBI database.
However, this method failed to produce biologically
relevant core genomes, whatever the chosen similarity threshold, probably
due to annotation errors. We then considered using the DOGMA annotation tool
to improve the gene prediction process. The second method consisted in extracting
gene names either from NCBI gene features or from DOGMA results. A first
``intersection core matrix (ICM)'' was built, in which each coefficient
stores the cardinality of the intersection of the two genomes labeling
its row and column. New ICMs are
then constructed by selecting the maximum intersection score (IS) in this matrix,
removing the two genomes having this score, and adding the corresponding
core genome to the next ICM.
%Finally, in the third method, a genes quality test has been added before the ICMs computation, to ensure that the genes obtained in the NCBI annotation files are the same %(\emph{i.e.}, gene name and sequence) than the ones produced by DOGMA.
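
The ICM reduction summarized above can be illustrated by the following minimal Python sketch. It is not the program used in this work: the function names \texttt{intersection\_core\_matrix} and \texttt{reduce\_to\_core}, the toy gene sets, the labels given to intermediate cores, the tie-breaking rule, and the choice to iterate until a single core remains are all assumptions made for illustration.

```python
from itertools import combinations


def intersection_core_matrix(genomes):
    """Map each unordered pair of genomes to the size of their gene-name intersection."""
    return {
        (a, b): len(genomes[a] & genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }


def reduce_to_core(genomes):
    """Repeatedly replace the best-scoring pair of genomes by their common core.

    Assumptions (not stated in the text): ties on the maximum intersection
    score (IS) are broken by lexicographic order of the genome labels, and
    the reduction stops when a single core genome remains.
    """
    if not genomes:
        raise ValueError("at least one genome is required")
    # Work on a copy so the caller's dictionary is left untouched.
    genomes = {name: set(genes) for name, genes in genomes.items()}
    history = []
    while len(genomes) > 1:
        icm = intersection_core_matrix(genomes)
        # Select the pair with the maximum intersection score.
        (a, b), score = max(icm.items(), key=lambda item: item[1])
        # Remove both genomes and add their core to the next ICM.
        core = genomes.pop(a) & genomes.pop(b)
        core_name = f"core({a},{b})"  # hypothetical labeling scheme
        genomes[core_name] = core
        history.append((core_name, score))
    (final_core,) = genomes.values()
    return final_core, history


# Toy gene-name sets, invented for this example only.
toy_genomes = {
    "A": {"rbcL", "psbA", "atpB", "matK"},
    "B": {"rbcL", "psbA", "atpB"},
    "C": {"rbcL", "psbA", "ndhF"},
}
final_core, history = reduce_to_core(toy_genomes)
```

On these toy sets, genomes A and B are merged first because they share three gene names, and the resulting core is then intersected with C. Recomputing the whole matrix at each step keeps the sketch short; an actual implementation would more likely update only the row and column of the new core.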
% A genes quality test has then been introduced to construct new ICMs
% only constituted by the genes that successfully passed
% a specific similarity threshold of 65\% on their sequences.
% % then will take place to extract the core genes.

Core trees have finally been generated for each method, to investigate
the distribution of chloroplasts and core genomes. The tree obtained with the second
method, based on DOGMA, revealed the best distribution of
chloroplasts with respect to their evolutionary history. In particular, it appears
that each endosymbiosis event is well branched in the DOGMA core tree.

In future work, we intend to deepen the evaluation of the methodology by considering
new gene prediction tools and various similarity measures on both
gene names and sequences. Additionally, we will investigate new clustering
methods for the first approach, to improve the quality of the results of this promising way to
obtain core genes. Finally, the results produced with DOGMA will be
further investigated from a biological point of view: the gene content of each core
will be studied while phylogenetic relations between all these species