\color{red}Identifying core genes is important for understanding the shared functionality of a given set of species.
%Identifying core genes may be of importance to understand shared functionality and specificity of a given set of species, or to construct their phylogeny using curated sequences.
In previous work~\cite{Alkindy2014}, we introduced two methods for discovering core and pan genes: one based on sequence similarity and one based on alignment.
In this work, to determine both the core and pan genomes of a large set of DNA sequences, we compare the sequence-similarity clustering algorithm proposed in that previous work with a new method that improves the alignment-based approach by adding a sequence quality control test.
More precisely, using a collection of 99~chloroplasts as an illustrative example, we focus on
the following questions: how can we identify the best core genome (that is, an artificially designed set of
coding sequences as close as possible to the real biological one), and
how can we deduce scenarios regarding gene loss?

\color{black}Chloroplasts found in eukaryotes have
an endosymbiotic origin, meaning
that they derive from the incorporation of a photosynthetic bacterium (a cyanobacterium) within a eukaryotic cell. They are key elements in
the history of living organisms, as they are the organelles responsible for
photosynthesis. The latter is the main way to produce organic matter
from mineral matter using solar energy. Consequently, photosynthetic
organisms are at the base of most ecosystem trophic chains. Indeed,
photosynthesis in eukaryotes has allowed extensive speciation in the lineage,
leading to great biodiversity. From an ecological point of view,
photosynthetic organisms are at the origin of the presence of dioxygen
in the atmosphere (allowing extant life) and are the main source of mid-
to long-term carbon storage, which is fundamental regarding current
climate change. However, the evolutionary history of chloroplasts is not
fully understood, at least at a large scale, and their phylogeny requires
further investigation.

A key idea in phylogenetic classification is that a given DNA mutation shared
by at least two taxa is more likely to have been inherited from a common
ancestor than to have occurred independently. Shared changes in genomes
thus make it possible to establish relationships between species. In the case of chloroplasts,
an important category of genome change is the loss of functional genes,
either because they become ineffective or because they are transferred to the nucleus.
A small number of gene losses among species indicates
that these species are close to each other and belong to a similar lineage,
while a large number of losses suggests that they belong to more distant lineages. %that we have an evolutionary relationship

%Phylogenetic relationships are mainly built by comparison of sets of coding and non-coding sequences.
Phylogenies of photosynthetic plants are important to assess the origin
of chloroplasts and the modes of gene loss among lineages.
These phylogenies are usually built using a few chloroplast genes,
some of which are not conserved in all taxa.
%As phylogenetic relationships inferred from data matrices complete for each species included and with the same evolution history are better assumptions,
This is why selecting core genes may be of interest for a new investigation
of the phylogeny of photosynthetic plants.
%To depict the links between species clearly, we here intend to built a phylogenetic tree showing the relationships based on the distances among gene sequences of a core genome.
However, circumscribing the core chloroplast genome for a given set of photosynthetic organisms requires bioinformatics investigations using sequence annotation and comparison tools, and involves various choices.

\color{red}Our aim in this work is to investigate the impact of these choices on the determination of core and pan genomes. A general presentation of the approaches detailed in this document is provided in the next section. In Section~\ref{sec:simil}, we then study the use of annotated genomes from the NCBI website~\cite{Sayers01012011} together with a coding-sequence clustering method based on Needleman-Wunsch similarity scores~\cite{Rice2000}. %We will show that such an approach based on sequences similarity cannot lead to satisfactory results, biologically speaking.
%We will thus investigate name-sequence-based approaches in Section~\ref{sec:annot}, by using successively the gene names provided by NCBI and DOGMA~\cite{RDOGMA} annotations, where DOGMA is a recent annotation tool specific to chloroplasts.
The second method, proposed in Section~\ref{sec:mixed}, is designed to use both gene names and sequence comparisons. \color{black}
%Ways to take advantage of the produced core genomes are introduced in Section~\ref{sec:features},
Information regarding computation time and memory usage is provided in Section~\ref{sec:implem}.
Finally, a discussion of the biological aspects of the evolutionary history of the considered genomes
concludes our investigations and leads to our proposed methodology for discovering the core and pan genomes
of chloroplasts. %(Section~\ref{sec:discuss}).
This research work ends with a conclusion section, in which our investigations are summarized and future work is outlined.

% Other possible scientific questions to consider for introduction improvement:
% Which bioinformatics tools are necessary for genes comparison in selected complete chloroplast genomes? Which bioinformatics tools are necessary to build a phylogeny of numerous genes and species, etc?