-
The first method, described below, considers NCBI annotations and uses
a distance-based similarity measure. We start with the following
-preliminary Definition:
+preliminary definition:
\begin{definition}
\label{def1}
$d(x,y)\leqslant T$.
\end{definition}
-\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package (Needleman-Wunch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$.
+%\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$.
+
+Let be given a \emph{similarity} threshold $T$ and a distance $d$
+(Needleman-Wunch released by EMBL for instance).
+The method begins by building an undirected graph
+between all the DNA~sequences $g$ of the set of genomes as follows:
+there is an edge between $g_{i}$ and $g_{j}$
+if $g_i \sim_{d,T} g_j$ is established.
+This graph is further denoted as the ``similarity'' graph.
+
+We thus consider that the pair of two coding sequences
+$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ and
+$g_j$ belong in the same
+connected component (CC), \textit{i.e.} if there is a path between $g_i$
+and $g_j$ in the similarity graph. It is not hard to see that this relation is an
+equivalence relation whereas $\sim$ is not.
-The method begins by building an undirected graph based on similarity
-rates $r_{ij}$ between DNA~sequences $g_{i}$ and $g_{j}$ (\emph{i.e.},
-$r_{ij}=\Delta\left(g_{i},g_{j}\right)$). In this latter graph, nodes
-are constituted by all the coding sequences of the set of genomes
-under consideration, and there is an edge between $g_{i}$ and $g_{j}$
-if the similarity rate $r_{ij}$ is greater than a given similarity
-threshold. The Connected Components (CC) of the ``similarity'' graph
-are thus computed.
-This process also results in an equivalence relation between sequences
-in the same CC based on Definition~\ref{def1}. Any class for this
-relation is called ``gene'' here, where its representatives
+Any class for this relation is called ``gene''
+here, where its representatives
(DNA~sequences) are the ``alleles'' of this gene. Thus this first
method produces for each genome $G$, which is a set
$\left\{g_{1}^G,...,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
sequences, the projection of each sequence according to $\pi$, where
-$\pi$ maps each sequence into its gene (class) according to $\sim$. In
+$\pi$ maps each sequence into its gene (class) according to $\mathcal{R}$. In
other words, a genome $G$ is mapped into
$\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$. Note that a
projected genome has no duplicated gene since it is a set.
of all the projected genomes, which is the set of all the genes
$\dot{x}$ such that each genome has at least one allele in
$\dot{x}$. The pan genome is computed similarly as the union of all
-the projected genomes. However such approach suffers from producing
-too small core genomes, for any chosen similarity threshold, compared
+the projected genomes.
+
+\begin{figure}
+\begin{center}
+\includegraphics[scale=0.5]{stats.png}
+\end{center}
+\caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan}
+\end{figure}
+
+The number of genes in the core genome and in the pan genome are
+represented in Figure~\ref{Fig:sim:core:pan} with respect to the
+threshold value.
+First of all, the higher is the threshold,
+the smaller the connected components are. In other words, the number
+of alleles of one gene is small if the threshold is high.
+When the threshold is high, the number of genes and the size of
+pan genome is high too. However due to the construction method of the
+core genome, this set of genes has few elements in such a situation.
+This approach even suffers from producing
+too small core genomes (of size 0 or 1), for any chosen similarity threshold, compared
to what is usually expected by biologists regarding these
chloroplasts. We are then left with the following questions: how can
we improve the confidence put in the produced core? Can we thus guess