% Paper2/intro.tex

\color{red}Identifying core genes is important for understanding the functionality shared by a given set of species.
%Identifying  core genes  may be of importance  to understand shared functionality and specificity of a given set of species, or to construct their phylogeny using curated sequences.
In previous work~\cite{Alkindy2014}, we introduced two methods for discovering core and pan genes: one based on sequence similarity and one based on alignment.
In this work, to determine both the core and pan genomes of a large set of DNA sequences, we compare the sequence-similarity clustering algorithm of that previous work with a new method that improves the alignment-based approach by adding a sequence quality control test. More precisely, we focus on
the following questions, using a collection of 99~chloroplasts as an illustrative example: how
can we identify the best core genome (that is, an artificially designed set of
coding sequences as close as possible to the real biological one), and
how can we deduce scenarios regarding their gene loss?
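To fix ideas, the two notions can be stated in set terms. The following is only a sketch under an assumed notation, which is not taken from the methods described later: let $G_1,\dots,G_n$ denote the sets of genes (or of clusters of similar coding sequences) found in the $n$ genomes under study. The core genome and the pan genome are then
\[
  \mathrm{Core} = \bigcap_{i=1}^{n} G_i ,
  \qquad
  \mathrm{Pan} = \bigcup_{i=1}^{n} G_i ,
\]
so that $\mathrm{Core} \subseteq G_i \subseteq \mathrm{Pan}$ for every $i$. For instance, with the placeholder gene labels $a,b,c,d$ and $G_1=\{a,b,c\}$, $G_2=\{a,b,d\}$, $G_3=\{a,b,c,d\}$, one obtains $\mathrm{Core}=\{a,b\}$ and $\mathrm{Pan}=\{a,b,c,d\}$.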

\color{black}Chloroplasts found in eucaryotes have
an endosymbiotic origin, meaning
that they come from the incorporation of a photosynthetic bacterium (a cyanobacterium) within a eucaryotic cell. They are key elements in
the history of living organisms, as they are the organelles responsible for
photosynthesis. The latter is the main way to produce organic matter
from mineral matter using solar energy. Consequently, photosynthetic
organisms are at the basis of most ecosystem trophic chains. Indeed,
photosynthesis in eucaryotes has allowed a great deal of speciation in the lineage,
leading to great biodiversity. From an ecological point of view,
photosynthetic organisms are at the origin of the presence of dioxygen
in the atmosphere (allowing extant life) and are the main source of mid-
to long-term carbon storage, which is fundamental regarding current
climate change. However, the evolutionary history of chloroplasts is not yet
well understood, at least at a large scale, and their phylogeny needs
to be further investigated.

A key idea in phylogenetic classification is that a given DNA mutation shared
by at least two taxa is more likely to have been inherited from a common
ancestor than to have occurred independently. Thus, shared changes in genomes
make it possible to build relationships between species. In the case of chloroplasts,
an important category of genome changes is the loss of functional genes,
either because they become ineffective or because they are transferred to the nucleus.
Therefore,
%we hypothesize that
a small number of gene losses among species indicates
that these species are close to each other and belong to a similar lineage,
while a large number of losses points to %that we  have an  evolutionary relationship
%between species  from
%much more
distant lineages.
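As a rough illustration of this argument, and only under an assumed formalization that is not the method developed in this work, write $G_A$ and $G_B$ for the sets of functional genes of two species $A$ and $B$. Their difference in gene content can be counted as
\[
  d(A,B) = |G_A \setminus G_B| + |G_B \setminus G_A| ,
\]
that is, the number of genes present in one genome and absent from the other. With the placeholder contents $G_A=\{a,b,c\}$ and $G_B=\{a,b,d\}$, one gets $d(A,B)=2$, whereas identical gene contents give $d(A,B)=0$.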
%Phylogenetic relationships are mainly built by comparison of sets of coding and non-coding sequences.
Phylogenies of photosynthetic plants are important to assess the origin
of chloroplasts and the modes of gene loss among lineages.
These phylogenies are usually built using a few chloroplastic genes,
some of which are not conserved in all taxa.
%As phylogenetic relationships inferred from data matrices complete for each species included and with the same evolution history are better assumptions,
%we argue that
This is why selecting core genes may be of interest for a new investigation
of the phylogeny of photosynthetic plants.
%To depict the links between species clearly, we here intend to built a phylogenetic tree showing the relationships based on the distances among gene sequences of a core genome.
However, the circumscription of the core chloroplast genome for a given set of photosynthetic organisms requires bioinformatics investigations using sequence annotation and comparison tools, and various choices
%of tools
are available.

\color{red}Our intention in this work is to investigate the impact of these choices on the determination of core and pan genomes. A general presentation of the approaches detailed in this document is provided in the next section. We then study, in Section~\ref{sec:simil}, the use of annotated genomes from the NCBI website~\cite{Sayers01012011} with a coding sequence clustering method based on Needleman-Wunsch similarity scores~\cite{Rice2000}. %We will show that such an approach based on sequences similarity cannot lead to satisfactory results, biologically speaking.
%We will thus investigate name-sequence-based approaches in Section~\ref{sec:annot}, by using successively the gene names provided by NCBI and DOGMA~\cite{RDOGMA} annotations, where DOGMA is a recent annotation tool specific to chloroplasts.
The second method, which uses both gene names and sequence comparisons, is proposed in Section~\ref{sec:mixed}. \color{black}
%Ways to take advantage of the produced core genomes are introduced in Section~\ref{sec:features},
Information regarding computation time and memory usage is provided in Section~\ref{sec:implem}.
Finally, a discussion of biological aspects regarding the evolutionary history of the considered genomes
concludes our investigations, leading to our proposed methodology for core and pan genome
discovery in chloroplasts. %(Section~\ref{sec:discuss}).
This work ends with a conclusion section, in which our investigations are summarized and future work is outlined.


% Other possible scientific questions to consider for introduction improvement:
% Which bioinformatics tools are necessary for genes comparison in selected complete chloroplast genomes? Which bioinformatics tools are necessary to build a phylogeny of numerous genes and species, etc?
%