The first method, described below, considers NCBI annotations and uses
a distance-based similarity measure. We start with the following
preliminary definition.
\begin{definition}
\label{def1}
Let $A=\{A,T,C,G\}$ be the nucleotide alphabet, and $A^\ast$ be the
set of finite words over $A$ (\emph{i.e.}, of DNA sequences). Let
$d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on
$A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if
$d(x,y)\leq T$.
\end{definition}
\noindent $\sim_{d,T}$ is obviously an equivalence relation and, when
$d=1-\Delta$, where $\Delta$ is the similarity scoring function
embedded in the EMBOSS package (Needleman-Wunsch, released by EMBL),
we will simply denote $\sim_{d,0.1}$ by $\sim$.

The method begins by building an undirected graph based on the similarity
rates $r_{ij}$ between DNA~sequences $g_{i}$ and $g_{j}$ (\emph{i.e.},
$r_{ij}=\Delta\left(g_{i},g_{j}\right)$). In this graph, the nodes
are all the coding sequences of the set of genomes
under consideration, and there is an edge between $g_{i}$ and $g_{j}$
if the similarity rate $r_{ij}$ is greater than a given similarity
threshold. The Connected Components (CC) of this ``similarity'' graph
are then computed.

This process also results in an equivalence relation between sequences
in the same CC based on Definition~\ref{def1}. Each class of this
relation is called a ``gene'' here, and its representatives
(DNA~sequences) are the ``alleles'' of this gene. Thus this first
method produces, for each genome $G$, which is a set
$\left\{g_{1}^G,\ldots,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
sequences, the projection of each sequence according to $\pi$, where
$\pi$ maps each sequence to its gene (class) according to $\sim$. In
other words, a genome $G$ is mapped to
$\left\{\pi(g_{1}^G),\ldots,\pi(g_{m_G}^G)\right\}$. Note that a
projected genome has no duplicated gene since it is a set.
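
As a toy illustration of the construction above (the sequences and
similarity rates are hypothetical, chosen only for this example), take
two genomes $G_{1}=\{g_{1},g_{2}\}$ and $G_{2}=\{g_{3},g_{4},g_{5}\}$
and a similarity threshold of $0.9$, which corresponds to $T=0.1$ for
$d=1-\Delta$. Suppose that the only similarity rates above this
threshold are
\[
r_{13}=0.95,\qquad r_{34}=0.92,\qquad r_{25}=0.97 .
\]
The similarity graph then has the edges $\{g_{1},g_{3}\}$,
$\{g_{3},g_{4}\}$ and $\{g_{2},g_{5}\}$, and its connected components
are the two classes
\[
\dot{x}_{1}=\{g_{1},g_{3},g_{4}\},\qquad \dot{x}_{2}=\{g_{2},g_{5}\}.
\]
Here $g_{1}$ and $g_{4}$ fall in the same class although $r_{14}$ is
below the threshold, because they are linked through $g_{3}$. The
projection gives $\pi(g_{1})=\pi(g_{3})=\pi(g_{4})=\dot{x}_{1}$ and
$\pi(g_{2})=\pi(g_{5})=\dot{x}_{2}$, so that
\[
G_{1}\mapsto\{\dot{x}_{1},\dot{x}_{2}\},\qquad
G_{2}\mapsto\{\dot{x}_{1},\dot{x}_{2}\},
\]
where the two alleles $g_{3}$ and $g_{4}$ of $G_{2}$ collapse into the
single gene $\dot{x}_{1}$.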

Consequently, the core genome (resp.\ the pan genome) of two genomes
$G_{1}$ and $G_{2}$ is defined as the intersection (resp.\ the
union) of their projected genomes. We then consider the intersection
of all the projected genomes, that is, the set of all the genes
$\dot{x}$ such that each genome has at least one allele in
$\dot{x}$. The pan genome is computed similarly, as the union of all
the projected genomes. However, this approach produces core genomes
that are too small, for any chosen similarity threshold, compared
to what biologists usually expect for these
chloroplasts. We are then left with the following questions: how can
we improve the confidence placed in the produced core? Can we then infer
the evolutionary scenario of these genomes?
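
To make the definitions of the core and pan genomes concrete, consider
three hypothetical projected genomes, where the shorthand $\pi(G)$ for
the projected genome of $G$ and the genes $a,b,c,d$ are introduced
only for this example and are not taken from the chloroplast data:
\[
\pi(G_{1})=\{a,b,c\},\qquad \pi(G_{2})=\{a,b,d\},\qquad
\pi(G_{3})=\{a,c,d\}.
\]
The core genome is then
\[
\pi(G_{1})\cap\pi(G_{2})\cap\pi(G_{3})=\{a\},
\]
while the pan genome is
\[
\pi(G_{1})\cup\pi(G_{2})\cup\pi(G_{3})=\{a,b,c,d\}.
\]
Each of $b$, $c$ and $d$ is missing from exactly one genome, which is
enough to exclude it from the core.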