-This step considers as input the set
-$\{((g_1,g_2),r_{12}), (g_1,g_3),r_{13}), (g_{n-1},g{n}),r_{n-1.n})\}$ of
-$\frac{n(n-1)}{2}$ elements.
-Each one $(g_i,g_j),r_{ij})$ where $i < j$,
-is a pair that gives the similarity rate $r_{ij}$ between the two genes
-$g_{i}$ and $g_{j}$.
-
-The first step of this stage consists in building the following non-oriented
-graph further denoted as to \emph{similarity graph}.
-In this one, the vertices are the genes. There is an edge between
-$g_{i}$ and $g_{j}$ if the rate $r_{ij}$ is greater than a given similarity
-threshold $t$.
-
-We then define the relation $\sim$ such that
-$ x \sim y$ if $x$ and $y$ belong in the same connected component.
-Mathematically speaking, it is obvious that this
-defines an equivalence relation.
-Let $\dot{x}= \{y | x \sim y\}$
-denotes the equivalence class to which $x$ belongs.
-All the genes which are equivalent to each other
-are also elements of the same equivalence class.
-Let us then consider the set of all equivalence classes of the set of genes
-by $\sim$, denoted $X/\sim = \{\dot{x} | x \textrm{ is a gene}\}$.
-defined by $\pi(x) = \dot{x}$
-which maps each gene into it respective equivalence class by $\sim$.
-
-
-
-
-For each genome $[g_l,\ldots,g{l+m}]$, the second step computes
-the projection of each gene according to $\pi$.
-The resulting genome which is
-$$
-[\pi(g_l),\ldots,\pi(g{l+m})]
-$$
-is again of size $m$.
-
-Intuitively speaking, for two genes $g_i$ and $g_j$
-in the same equivalence class, there is path from $g_i$ and $g_j$.
-It signifies that each evolution step
-(represented by an edge in the similarity graph)
-has produced a gene s.t. the similarity with the previous one
-is greater than $t$.
-Genes $g_i$ and $g_j$ may thus have a common ancestor.
-
-
-We compute the core genome as follow.
-Each genome is projected according to $\pi$. We then consider the
-intersection of all the projected genomes which are considered as sets of genes
-and not as sequences of genes.
-This results as the set of all the class $\dot{x}$
-such that each genome has an gene $x$ in $\dot{x}$.
-The pan genome is computed similarly: the union of all the
-projected genomes in computed here.
+The first method, described below, considers NCBI annotations and uses
+a distance-based similarity measure. We start with the following
+preliminary definition:
+\begin{definition}
+\label{def1}
+Let $A=\{A,T,C,G\}$ be the nucleotides alphabet, and $A^\ast$ be the
+set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let
+$d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on
+$A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
+all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if
+$d(x,y)\leqslant T$.
+\end{definition}
+
+%\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$.
+
+Let be given a \emph{similarity} threshold $T$ and a distance $d$
+(Needleman-Wunch released by EMBL for instance).
+The method begins by building an undirected graph
+between all the DNA~sequences $g$ of the set of genomes as follows:
+there is an edge between $g_{i}$ and $g_{j}$
+if $g_i \sim_{d,T} g_j$ is established.
+This graph is further denoted as the ``similarity'' graph.
+
+We thus consider that the pair of two coding sequences
+$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ and
+$g_j$ belong in the same
+connected component (CC), \textit{i.e.} if there is a path between $g_i$
+and $g_j$ in the similarity graph. It is not hard to see that this relation is an
+equivalence relation whereas $\sim$ is not.
+
+
+Any class for this relation is called ``gene''
+here, where its representatives
+(DNA~sequences) are the ``alleles'' of this gene. Thus this first
+method produces for each genome $G$, which is a set
+$\left\{g_{1}^G,...,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
+sequences, the projection of each sequence according to $\pi$, where
+$\pi$ maps each sequence into its gene (class) according to $\mathcal{R}$. In
+other words, a genome $G$ is mapped into
+$\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$. Note that a
+projected genome has no duplicated gene since it is a set.
+
+Consequently, the core genome (resp. the pan genome) of two genomes
+$G_{1}$ and $G_{2}$ is defined as the intersection (resp. as the
+union) of their projected genomes. We then consider the intersection
+of all the projected genomes, which is the set of all the genes
+$\dot{x}$ such that each genome has at least one allele in
+$\dot{x}$. The pan genome is computed similarly as the union of all
+the projected genomes.
+
+\begin{figure}
+\begin{center}
+\includegraphics[scale=0.5]{stats.png}
+\end{center}
+\caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan}
+\end{figure}
+
+The number of genes in the core genome and in the pan genome are
+represented in Figure~\ref{Fig:sim:core:pan} with respect to the
+threshold value.
+First of all, the higher is the threshold,
+the smaller the connected components are. In other words, the number
+of alleles of one gene is small if the threshold is high.
+When the threshold is high, the number of genes and the size of
+pan genome is high too. However due to the construction method of the
+core genome, this set of genes has few elements in such a situation.
+This approach even suffers from producing
+too small core genomes (of size 0 or 1), for any chosen similarity threshold, compared
+to what is usually expected by biologists regarding these
+chloroplasts. We are then left with the following questions: how can
+we improve the confidence put in the produced core? Can we thus guess
+the evolution scenario of these genomes?