
In this work, we studied two %three
methodologies for extracting core genes from a large set of chloroplast genomes, and we developed
Python programs to evaluate them in practice.
%Extracted core genomes
%depend on both gene names and sequences.
% Furthermore, that extract these core genes with the three methodologies.

We first considered extracting core genomes by comparing
(through global alignment) DNA sequences downloaded from the NCBI database.
However, this method failed to produce biologically
relevant core genomes, whatever the chosen similarity threshold, probably
because of annotation errors. We then used the DOGMA annotation tool
to improve the gene prediction process. The second method consisted in extracting
gene names either from NCBI gene features or from DOGMA results. A first
``intersection core matrix (ICM)'' was built, in which each coefficient
stores the cardinality of the intersection of the gene sets of the two genomes indexing
its row and column. New ICMs were
then constructed iteratively by selecting the maximum intersection score (IS) in the current matrix,
removing the two genomes having this score, and adding the corresponding
core genome to the next ICM. %Finally, in the third method, a genes quality test has been added before the ICMs computation, to ensure that the genes obtained in the NCBI annotation files are the same %(\emph{i.e.}, gene name and sequence) than the ones produced by DOGMA.
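
The ICM iteration described above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions, not the programs developed for this work: genomes are reduced to sets of gene names, the function names \texttt{build\_icm} and \texttt{extract\_core} are hypothetical, the toy gene sets are invented for the example, ties on the maximum IS are broken by name order, and the loop is assumed to run until a single core genome remains.

```python
from itertools import combinations


def build_icm(genomes):
    """Map each unordered pair of genomes to its intersection score (IS),
    i.e. the number of gene names the two genomes share."""
    return {
        (a, b): len(genomes[a] & genomes[b])
        for a, b in combinations(sorted(genomes), 2)
    }


def extract_core(genomes):
    """Repeatedly merge the pair with the maximum IS into its core genome.

    Assumption: iteration stops when one genome is left; the text does not
    state the stopping rule, so this is an illustrative choice.
    """
    current = {name: set(genes) for name, genes in genomes.items()}
    history = []  # (genome a, genome b, IS) for each merge, in order
    while len(current) > 1:
        icm = build_icm(current)
        # Ties are resolved by the (sorted) pair order: an assumption.
        (a, b), score = max(icm.items(), key=lambda item: item[1])
        core = current.pop(a) & current.pop(b)
        current[f"core({a},{b})"] = core
        history.append((a, b, score))
    (final_core,) = current.values()
    return final_core, history


# Toy input (hypothetical gene sets, for illustration only).
genomes = {
    "A": {"rbcL", "psbA", "atpB", "matK"},
    "B": {"rbcL", "psbA", "atpB"},
    "C": {"rbcL", "psbA", "ndhF"},
}
final_core, history = extract_core(genomes)
```

On this toy input, genomes A and B are merged first (IS of 3), and their core is then intersected with C. The recorded merge history also gives the branching order from which a core tree could be drawn.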
% A genes quality test has then been introduced  to construct new ICMs
% on genomes
% only constituted by the genes that successfully passed
% a specific similarity threshold of 65\% on their sequences.
% % , ICM
% % then will take place to extract the core genes.
%

Core trees were finally generated for each method, to investigate
the distribution of chloroplasts and core genomes. The tree from the second
method, based on DOGMA, gave the distribution of
chloroplasts most consistent with their evolutionary history. In particular, each
endosymbiosis event appears to be well branched in the DOGMA core tree.

In future work, we intend to deepen the evaluation of these methodologies by considering
new gene prediction tools and various similarity measures on both
gene names and sequences. Additionally, we will investigate new clustering
methods for the first approach, to improve the quality of the results obtained with this promising way of
extracting core genes. Finally, the results produced with DOGMA will be
further investigated from a biological standpoint: the gene content of each core
will be studied, and the phylogenetic relations between all these species
will be examined.