2 We still need to propose good methodology to predict good core genes that reflect natural biological relationships among species. Proposing new methods and compare them with previous ones can give us an indicator of which method can produce good functional genes. In this section, we will recall the definition with a fast revision of similarity based method (see \cite{Alkindy2014} for further details).
3 This method, considers annotated genomes from NCBI and DOGMA and uses a distance-based similarity measure on genes' coding sequences. Such an approach requires annotated genomes, like the ones provided by the NCBI website.
5 \subsubsection{Theoretical presentation}
7 We start by fast revision of first method by the following preliminary definition~\cite{acgs13:onp,Alkindy2014}.\color{black}
11 Let $A=\{A,T,C,G\}$ be the nucleotides alphabet, and $A^\ast$ be the
12 set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let
13 $d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a function called
15 $A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
16 all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if
20 %\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$.
22 Let be given a \emph{similarity} threshold $T$ and a \emph{similarity measure} $d$.
23 The method begins by building an undirected graph
24 between all the DNA~sequences $g$ of the set of genomes as follows:
25 there is an edge between $g_{i}$ and $g_{j}$
26 if $g_i \sim_{d,T} g_j$ is established.
27 This graph is further denoted as the ``similarity'' graph.
28 We thus say that two coding sequences
30 equivalent with respect to the relation $\mathcal{R}$ if both $g_i$ and
31 $g_j$ belong in the same
32 connected component (CC) of this similarity graph, \textit{i.e.}, if there is a path between $g_i$
33 and $g_j$ in the graph.
35 It is not hard to see that this relation is an
36 equivalence relation whereas $\sim$ is not.
37 Any class for this relation is called a ``gene''
38 in this article, where its representatives
39 (DNA~sequences) are the ``alleles'' of this gene, such abuse of language
40 being proposed to set our ideas down. Thus this first
41 method produces for each genome $G$, which is a set
42 $\left\{g_{1}^G,...,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
43 sequences, the projection of each sequence according to $\pi$, where
44 $\pi$ maps each sequence into its gene (class) according to $\mathcal{R}$. In
45 other words, a genome $G$ is mapped into
46 $\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$. Note that a
47 projected genome has no duplicated gene since it is a set.
49 Consequently, the core genome (resp., the pan genome) of two genomes
50 $G_{1}$ and $G_{2}$ is defined as the intersection (resp., as the
51 union) of their projected genomes. We finally consider the intersection
52 of all the projected genomes, which is the set of all the genes
53 $\dot{x}$ such that each genome has at least one allele in
54 $\dot{x}$. This set will constitute the core genome of the whole species
55 under consideration. The pan genome is computed similarly as the union of all
56 the projected genomes.
60 % \includegraphics[scale=0.5]{stats.png}
62 % \caption{Size of core and pan genomes w.r.t. the similarity threshold, first approach with NCBI annotations and Needleman-Wunsch similarity measure.}\label{Fig:sim:core:pan}
69 % \begin{tabular}{ccccccc}
71 % & \multicolumn{4}{c}{Method 1} & \multicolumn{2}{c}{Method 3} \\ \hline
72 % & \multicolumn{2}{c}{NCBI} & \multicolumn{2}{c}{DOGMA} & \multicolumn{2}{c}{NCBI and DOGMA} \\ \hline
73 % Threshold & core & pan & core & pan & core & pan \\ \hline
74 % 50 & 1 & 163 & 1 & 118 & \textbf{5} & \textbf{245} \\
75 % 51 & 1 & 291 & 1 & 194 & - & - \\
76 % 52 & 1 & 412 & 1 & 258 & - & - \\
77 % 53 & 1 & 508 & 1 & 321 & - & - \\
78 % 54 & 4 & 617 & 2 & 372 & - & - \\
79 % 55 & \textbf{5} & \textbf{692} & 2 & 409 & - & - \\
80 % 56 & \textbf{5} & \textbf{761} & \textbf{3} & \textbf{445} & - & - \\
81 % 57 & 4 & 832 & \textbf{3} & \textbf{459} & - & - \\
82 % 58 & 4 & 905 & 2 & 477 & - & - \\
83 % 59 & 4 & 976 & 2 & 497 & - & - \\
84 % 60 & 2 & 1032 & 2 & 519 & 4 & 242 \\
85 % 61 & 2 & 1113 & 2 & 553 & - & - \\
86 % 62 & 2 & 1186 & 2 & 580 & - & - \\
87 % 63 & 2 & 1264 & 2 & 607 & - & - \\
88 % 64 & 2 & 1352 & 2 & 644 & - & - \\
89 % 65 & 1 & 1454 & 2 & 685 & - & - \\
90 % 66 & 1 & 1544 & 1 & 756 & - & - \\
91 % 67 & 0 & 1652 & 1 & 838 & - & - \\
92 % 68 & 0 & 1775 & 1 & 912 & - & - \\
93 % 69 & 0 & 1886 & 1 & 1007 & - & - \\
94 % 70 & 0 & 2000 & 1 & 1116 & 3 & 242 \\
95 % 80 & 0 & 3541 & 0 & 2730 & 1 & 242 \\
96 % 90 & 0 & 5703 & 0 & 5181 & 0 & 241 \\ \hline
98 % \caption{Size of core and pan genomes w.r.t. the similarity threshold, first and third approaches.}
99 % \label{Fig:sim:core:pan}
104 \begin{tabular}{cccccccc}
106 & \multicolumn{4}{c}{Method 1} & \multicolumn{2}{c}{Method 2} \\
108 & \multicolumn{2}{c}{NCBI} & \multicolumn{2}{c}{DOGMA} & \multicolumn{2}{c}{NCBI and DOGMA} \\ \hline
109 Threshold & core & pan & core & pan & core & pan \\
111 50 & 1 & 163 & 1 & 118 & \textbf{5} & \textbf{245} \\
112 51 & 1 & 291 & 1 & 194 & - & - \\
113 52 & 1 & 412 & 1 & 258 & - & - \\
114 53 & 1 & 508 & 1 & 321 & - & - \\
115 54 & 4 & 617 & 2 & 372 & - & - \\
116 55 & \textbf{5} & \textbf{692} & 2 & 409 & - & - \\
117 56 & \textbf{5} & \textbf{761} & \textbf{3} & \textbf{445} & - & - \\
118 57 & 4 & 832 & \textbf{3} & \textbf{459} & - & - \\
119 58 & 4 & 905 & 2 & 477 & - & - \\
120 59 & 4 & 976 & 2 & 497 & - & - \\
121 60 & 2 & 1032 & 2 & 519 & 4 & 242 \\
122 61 & 2 & 1113 & 2 & 553 & - & - \\
123 62 & 2 & 1186 & 2 & 580 & - & - \\
124 63 & 2 & 1264 & 2 & 607 & - & - \\
125 64 & 2 & 1352 & 2 & 644 & - & - \\
126 65 & 1 & 1454 & 2 & 685 & - & - \\
127 66 & 1 & 1544 & 1 & 756 & - & - \\
128 67 & 0 & 1652 & 1 & 838 & - & - \\
129 68 & 0 & 1775 & 1 & 912 & - & - \\
130 69 & 0 & 1886 & 1 & 1007 & - & - \\
131 70 & 0 & 2000 & 1 & 1116 & 3 & 242 \\
132 80 & 0 & 3541 & 0 & 2730 & 1 & 242 \\
133 90 & 0 & 5703 & 0 & 5181 & 0 &241 \\
136 \caption{Size of core and pan genomes w.r.t. the similarity threshold, first and second
138 \label{Fig:sim:core:pan}
141 \subsubsection{Case study}% using NCBI annotations}
143 Let us now consider the 99 chloroplastic genomes introduced earlier.
144 We will use in this case study either the coding sequences
145 downloaded from NCBI website or the sequences predicted by DOGMA.
146 DOGMA, which stands for \textit{Dual Organellar GenoMe Annotator}, has
147 already been evoked in this article. This is a
148 tool developed in 2004 at University of Texas for annotating plant
149 chloroplast and animal mitochondrial genomes. This tool
150 translates a genome in all six reading frames and then
151 queries its own amino acid sequence database using
152 Blast (blastx~\cite{altschul1990basic}) with various ad hoc
153 parameters. The choice of DOGMA is natural, as this annotation tool is
154 reputed and specific to chloroplasts.
156 Each genome is thus constituted
157 by a list of coding sequences. In this illustration study,
158 we have evaluated the similarity between two sequences by
159 using a global alignment. More precisely, the measure $d$
160 introduced above is the similarity score provided after
161 a Needleman-Wunch global alignment, as obtained by running
162 the \emph{needle} command from the \emph{emboss} package
163 released by EMBL~\cite{Rice2000}. Parameters of the \emph{needle}
164 command are the default ones: 10.0 for gap open penalty and 0.5
167 The number of genes in the core genome and in the pan genome,
168 according to this first method using data and measure described above
169 have been computed using the supercomputer facilities of the M\'esocentre
170 de calcul de Franche-Comt\'e. Obtained results are
171 represented in Table~\ref{Fig:sim:core:pan} with respect to various
172 threshold values on Needleman-Wunsch similarity scores.
173 Remark that when the threshold is large,
174 we obtain more connected components, but with small sizes (a large number
175 of genes, with a few numbers of alleles for each of them). In other words,
177 %of alleles of one gene is naturally small if the threshold is large.
179 when the threshold is large, the
180 pan genome is large too. %However due to the construction method of the
181 %core genome, this set of genes has few elements in such a situation.
182 No matter the chosen annotation tool, this first approach suffers from producing
183 too small core genomes, for any chosen similarity threshold, compared
184 to what is usually expected by biologists.
186 %chloroplasts and their commonly accepted evolutionary scenario.
187 For NCBI, it is certainly due to a wrong determination of start and stop
188 codons in some annotated genomes, due to a large variety of annotation
189 tools used during genomes submission on the NCBI server, some of them
190 being old or deficient: such truncated
191 genes will not produce a large similarity score with their orthologous genes
192 present in other genomes. The case of DOGMA is more
193 difficult to explain as, according to our experiments and to the
194 state of the art, this gene prediction tool produces normally good
195 results in average. The best explanation of such an under-performance
196 is that a few genomes are very specific and far from the remainder ones, in terms
197 of gene contents, which leads to a small number of genes in the global core
198 genome. However this first approach cannot help us to determine
199 which genomes must be removed from our set of data. To do so, we
200 need to introduce a second approach based on gene names: from the
201 problematic gene names, we will be able to trace back to the
205 % This first illustration emphasizes the importance to deal with
206 % well-predicted coding sequences: gene prediction must be
207 % achieved with the same tool on each genome, and this tool must
208 % be well chosen among the state of the art ones. As we will use
209 % good annotation tools, a natural idea is to take advantage to
210 % the whole produced annotations (not only the predicted coding
211 % sequence, but its name and location too for instance). The implementation
212 % of this idea is the principle of the approaches for finding core
213 % and pan genomes detailed below.
214 % % We are then left with the following questions: how can
215 % we improve the confidence put in the produced core? That is, how to obtain
216 % such a core genome without considering coding sequences provided by the NCBI?