The first method, described below, considers NCBI annotations and uses
a distance-based similarity measure. We start with the following
preliminary definition:

Let $A=\{A,T,C,G\}$ be the nucleotide alphabet, and $A^\ast$ be the
set of finite words over $A$ (\emph{i.e.}, of DNA sequences). Let
$d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on
$A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if
$d(x,y)\leq T$.
%\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package, we will simply denote $\sim_{d,0.1}$ by $\sim$.

Let a \emph{similarity} threshold $T$ and a distance $d$ be given
(based on the Needleman-Wunsch implementation released by EMBL, for instance).
The method begins by building an undirected graph
over all the DNA~sequences $g$ of the set of genomes as follows:
there is an edge between $g_{i}$ and $g_{j}$
if $g_i \sim_{d,T} g_j$ holds.
This graph is hereafter called the ``similarity'' graph.
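
The construction of the similarity graph can be sketched as follows. This is only a minimal illustration under stated assumptions, not the implementation used for the experiments: a normalized Levenshtein distance stands in for the Needleman-Wunsch based measure, an edge is assumed to be created when $d(g_i,g_j)\leq T$, and the names \texttt{edit\_distance}, \texttt{normalized\_distance}, \texttt{similarity\_graph}, as well as the toy sequences, are hypothetical.

```python
from itertools import combinations


def edit_distance(x: str, y: str) -> int:
    """Levenshtein distance by dynamic programming.

    Assumption: used here only as a simple stand-in for an
    alignment-based score such as Needleman-Wunsch.
    """
    prev = list(range(len(y) + 1))
    for i, cx in enumerate(x, start=1):
        curr = [i]
        for j, cy in enumerate(y, start=1):
            curr.append(min(
                prev[j] + 1,               # deletion
                curr[j - 1] + 1,           # insertion
                prev[j - 1] + (cx != cy),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def normalized_distance(x: str, y: str) -> float:
    """Distance in [0, 1]: edit distance divided by the longer length."""
    longest = max(len(x), len(y))
    return edit_distance(x, y) / longest if longest else 0.0


def similarity_graph(sequences, threshold):
    """Return the edges (i, j), i < j, such that d(g_i, g_j) <= threshold."""
    return {
        (i, j)
        for (i, gi), (j, gj) in combinations(enumerate(sequences), 2)
        if normalized_distance(gi, gj) <= threshold
    }


# Hypothetical toy sequences, not taken from any real genome.
sequences = ["ACGTACGT", "ACGTACGA", "ACGTTCGA", "TTTTGGGG"]
edges = similarity_graph(sequences, threshold=0.2)
```

In this toy example the first and third sequences are not adjacent, although each of them is adjacent to the second one. The all-pairs comparison is quadratic in the number of sequences, which is the main cost of this step.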

We thus consider that a pair of coding sequences
$(g_i,g_j)$ belongs to the relation $\mathcal{R}$ if both $g_i$ and
$g_j$ belong to the same
connected component (CC), \textit{i.e.}, if there is a path between $g_i$
and $g_j$ in the similarity graph. It is not hard to see that this relation is an
equivalence relation, whereas $\sim_{d,T}$ is not in general, since it may fail to be transitive.
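
The classes of $\mathcal{R}$ are the connected components of the similarity graph, which can be computed for instance with a union-find structure. The sketch below is an illustration only; the function name \texttt{connected\_components} and the edge list are hypothetical. It also shows why the thresholded relation alone is not transitive: sequences 0 and 2 are not adjacent, yet they end up in the same class through sequence 1.

```python
def connected_components(n, edges):
    """Group the vertices 0..n-1 into connected components (union-find)."""
    parent = list(range(n))

    def find(i):
        # Follow parent links up to the root, halving the path on the way.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i, j in edges:
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[rj] = ri  # merge the two components

    classes = {}
    for i in range(n):
        classes.setdefault(find(i), []).append(i)
    return sorted(classes.values())


# Hypothetical similarity graph on 4 sequences: 0 -- 1 -- 2, and 3 isolated.
edges = [(0, 1), (1, 2)]
components = connected_components(4, edges)
```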

Any class of this relation is called a ``gene''
here, and its representatives
(DNA~sequences) are the ``alleles'' of this gene. Thus this first
method produces, for each genome $G$, which is a set
$\left\{g_{1}^G,\ldots,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
sequences, the projection of each sequence according to $\pi$, where
$\pi$ maps each sequence to its gene (class) according to $\mathcal{R}$. In
other words, a genome $G$ is mapped to
$\left\{\pi(g_{1}^G),\ldots,\pi(g_{m_G}^G)\right\}$. Note that a
projected genome has no duplicated gene since it is a set.

Consequently, the core genome (resp. the pan genome) of two genomes
$G_{1}$ and $G_{2}$ is defined as the intersection (resp. as the
union) of their projected genomes. More generally, the core genome of
the whole collection is the intersection
of all the projected genomes, that is, the set of all the genes
$\dot{x}$ such that each genome has at least one allele in
$\dot{x}$. The pan genome is computed similarly as the union of all
the projected genomes.
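
The projection $\pi$ and the resulting core and pan genomes can be sketched with plain set operations. The example below is a toy illustration under stated assumptions: the sequence identifiers, the genome names, and the mapping \texttt{gene\_of} (which plays the role of $\pi$ and would in practice come from the connected components) are all hypothetical.

```python
# Hypothetical mapping pi: sequence identifier -> gene (class) label.
gene_of = {
    "a1": 0, "a2": 0, "a3": 1,  # sequences of genome G1
    "b1": 0, "b2": 2,           # sequences of genome G2
    "c1": 0, "c2": 2,           # sequences of genome G3
}

# Hypothetical genomes, each given as a list of coding sequences.
genomes = {
    "G1": ["a1", "a2", "a3"],
    "G2": ["b1", "b2"],
    "G3": ["c1", "c2"],
}


def project(genome, gene_of):
    """Map a genome to the set of genes of its sequences (no duplicates)."""
    return {gene_of[seq] for seq in genome}


projected = {name: project(seqs, gene_of) for name, seqs in genomes.items()}

# Core genome: genes having at least one allele in every genome.
core = set.intersection(*projected.values())
# Pan genome: genes having at least one allele in some genome.
pan = set.union(*projected.values())
```

Here the two alleles \texttt{a1} and \texttt{a2} collapse into a single gene in the projection of \texttt{G1}, as expected since a projected genome is a set.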

\begin{figure}
\centering
\includegraphics[scale=0.5]{stats.png}
\caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan}
\end{figure}

The numbers of genes in the core genome and in the pan genome are
represented in Figure~\ref{Fig:sim:core:pan} with respect to the
similarity threshold.
First of all, the higher the threshold,
the smaller the connected components. In other words, the number
of alleles of one gene is small if the threshold is high.
When the threshold is high, the number of genes, and thus the size of
the pan genome, is high too. However, due to the construction method of the
core genome, this set of genes has few elements in such a situation.
This approach even suffers from producing
core genomes that are too small (of size 0 or 1), for any chosen similarity threshold, compared
to what is usually expected by biologists regarding these
chloroplasts. We are then left with the following questions: how can
we improve the confidence placed in the produced core? Can we thus infer
the evolutionary scenario of these genomes?