X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/chloroplast13.git/blobdiff_plain/0f2f8faba6ce0dbfd8f3e8df9d31f67756cff6c1..449a44e17a928aa5b2905e62860896ccf83c0ceb:/classEquiv.tex?ds=sidebyside diff --git a/classEquiv.tex b/classEquiv.tex index 00b227d..f3d3ed1 100644 --- a/classEquiv.tex +++ b/classEquiv.tex @@ -1,7 +1,6 @@ - The first method, described below, considers NCBI annotations and uses a distance-based similarity measure. We start with the following -preliminary Definition: +preliminary definition: \begin{definition} \label{def1} @@ -13,25 +12,31 @@ all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if $d(x,y)\leqslant T$. \end{definition} -\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package (Needleman-Wunch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$. +%\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$. + +Let be given a \emph{similarity} threshold $T$ and a distance $d$ +(Needleman-Wunch released by EMBL for instance). +The method begins by building an undirected graph +between all the DNA~sequences $g$ of the set of genomes as follows: +there is an edge between $g_{i}$ and $g_{j}$ +if $g_i \sim_{d,T} g_j$ is established. +This graph is further denoted as the ``similarity'' graph. + +We thus consider that the pair of two coding sequences +$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ and +$g_j$ belong in the same +connected component (CC), \textit{i.e.} if there is a path between $g_i$ +and $g_j$ in the similarity graph. It is not hard to see that this relation is an +equivalence relation whereas $\sim$ is not. -The method begins by building an undirected graph based on similarity -rates $r_{ij}$ between DNA~sequences $g_{i}$ and $g_{j}$ (\emph{i.e.}, -$r_{ij}=\Delta\left(g_{i},g_{j}\right)$). In this latter graph, nodes -are constituted by all the coding sequences of the set of genomes -under consideration, and there is an edge between $g_{i}$ and $g_{j}$ -if the similarity rate $r_{ij}$ is greater than a given similarity -threshold. The Connected Components (CC) of the ``similarity'' graph -are thus computed. -This process also results in an equivalence relation between sequences -in the same CC based on Definition~\ref{def1}. Any class for this -relation is called ``gene'' here, where its representatives +Any class for this relation is called ``gene'' +here, where its representatives (DNA~sequences) are the ``alleles'' of this gene. Thus this first method produces for each genome $G$, which is a set $\left\{g_{1}^G,...,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding sequences, the projection of each sequence according to $\pi$, where -$\pi$ maps each sequence into its gene (class) according to $\sim$. In +$\pi$ maps each sequence into its gene (class) according to $\mathcal{R}$. In other words, a genome $G$ is mapped into $\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$. Note that a projected genome has no duplicated gene since it is a set. @@ -42,8 +47,26 @@ union) of their projected genomes. We then consider the intersection of all the projected genomes, which is the set of all the genes $\dot{x}$ such that each genome has at least one allele in $\dot{x}$. The pan genome is computed similarly as the union of all -the projected genomes. However such approach suffers from producing -too small core genomes, for any chosen similarity threshold, compared +the projected genomes. + +\begin{figure} +\begin{center} +\includegraphics[scale=0.5]{stats.png} +\end{center} +\caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan} +\end{figure} + +The number of genes in the core genome and in the pan genome are +represented in Figure~\ref{Fig:sim:core:pan} with respect to the +threshold value. +First of all, the higher is the threshold, +the smaller the connected components are. In other words, the number +of alleles of one gene is small if the threshold is high. +When the threshold is high, the number of genes and the size of +pan genome is high too. However due to the construction method of the +core genome, this set of genes has few elements in such a situation. +This approach even suffers from producing +too small core genomes (of size 0 or 1), for any chosen similarity threshold, compared to what is usually expected by biologists regarding these chloroplasts. We are then left with the following questions: how can we improve the confidence put in the produced core? Can we thus guess