X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/chloroplast13.git/blobdiff_plain/fb06fc26916e64c8e6c452ee7424b6433e255f0f..181fe86127b57e8c7df3e97082134e7a5b7b8618:/classEquiv.tex?ds=inline diff --git a/classEquiv.tex b/classEquiv.tex index b77c51a..f3d3ed1 100644 --- a/classEquiv.tex +++ b/classEquiv.tex @@ -1,55 +1,73 @@ -This step considers as input the set -$\{((g_1,g_2),r_{12}), (g_1,g_3),r_{13}), (g_{n-1},g{n}),r_{n-1.n})\}$ of -$\frac{n(n-1)}{2}$ elements. -Each one $(g_i,g_j),r_{ij})$ where $i < j$, -is a pair that gives the similarity rate $r_{ij}$ between the two genes -$g_{i}$ and $g_{j}$. - -The first step of this stage consists in building the following non-oriented -graph further denoted as to \emph{similarity graph}. -In this one, the vertices are the genes. There is an edge between -$g_{i}$ and $g_{j}$ if the rate $r_{ij}$ is greater than a given similarity -threshold $t$. - -We then define the relation $\sim$ such that -$ x \sim y$ if $x$ and $y$ belong in the same connected component. -Mathematically speaking, it is obvious that this -defines an equivalence relation. -Let $\dot{x}= \{y | x \sim y\}$ -denotes the equivalence class to which $x$ belongs. -All the genes which are equivalent to each other -are also elements of the same equivalence class. -Let us then consider the set of all equivalence classes of the set of genes -by $\sim$, denoted $X/\sim = \{\dot{x} | x \textrm{ is a gene}\}$. -defined by $\pi(x) = \dot{x}$ -which maps each gene into it respective equivalence class by $\sim$. - - - - -For each genome $[g_l,\ldots,g{l+m}]$, the second step computes -the projection of each gene according to $\pi$. -The resulting genome which is -$$ -[\pi(g_l),\ldots,\pi(g{l+m})] -$$ -is again of size $m$. - -Intuitively speaking, for two genes $g_i$ and $g_j$ -in the same equivalence class, there is path from $g_i$ and $g_j$. -It signifies that each evolution step -(represented by an edge in the similarity graph) -has produced a gene s.t. the similarity with the previous one -is greater than $t$. -Genes $g_i$ and $g_j$ may thus have a common ancestor. - - -We compute the core genome as follow. -Each genome is projected according to $\pi$. We then consider the -intersection of all the projected genomes which are considered as sets of genes -and not as sequences of genes. -This results as the set of all the class $\dot{x}$ -such that each genome has an gene $x$ in $\dot{x}$. -The pan genome is computed similarly: the union of all the -projected genomes in computed here. +The first method, described below, considers NCBI annotations and uses +a distance-based similarity measure. We start with the following +preliminary definition: +\begin{definition} +\label{def1} +Let $A=\{A,T,C,G\}$ be the nucleotides alphabet, and $A^\ast$ be the +set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let +$d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on +$A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For +all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if +$d(x,y)\leqslant T$. +\end{definition} + +%\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$. + +Let be given a \emph{similarity} threshold $T$ and a distance $d$ +(Needleman-Wunch released by EMBL for instance). +The method begins by building an undirected graph +between all the DNA~sequences $g$ of the set of genomes as follows: +there is an edge between $g_{i}$ and $g_{j}$ +if $g_i \sim_{d,T} g_j$ is established. +This graph is further denoted as the ``similarity'' graph. + +We thus consider that the pair of two coding sequences +$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ and +$g_j$ belong in the same +connected component (CC), \textit{i.e.} if there is a path between $g_i$ +and $g_j$ in the similarity graph. It is not hard to see that this relation is an +equivalence relation whereas $\sim$ is not. + + +Any class for this relation is called ``gene'' +here, where its representatives +(DNA~sequences) are the ``alleles'' of this gene. Thus this first +method produces for each genome $G$, which is a set +$\left\{g_{1}^G,...,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding +sequences, the projection of each sequence according to $\pi$, where +$\pi$ maps each sequence into its gene (class) according to $\mathcal{R}$. In +other words, a genome $G$ is mapped into +$\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$. Note that a +projected genome has no duplicated gene since it is a set. + +Consequently, the core genome (resp. the pan genome) of two genomes +$G_{1}$ and $G_{2}$ is defined as the intersection (resp. as the +union) of their projected genomes. We then consider the intersection +of all the projected genomes, which is the set of all the genes +$\dot{x}$ such that each genome has at least one allele in +$\dot{x}$. The pan genome is computed similarly as the union of all +the projected genomes. + +\begin{figure} +\begin{center} +\includegraphics[scale=0.5]{stats.png} +\end{center} +\caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan} +\end{figure} + +The number of genes in the core genome and in the pan genome are +represented in Figure~\ref{Fig:sim:core:pan} with respect to the +threshold value. +First of all, the higher is the threshold, +the smaller the connected components are. In other words, the number +of alleles of one gene is small if the threshold is high. +When the threshold is high, the number of genes and the size of +pan genome is high too. However due to the construction method of the +core genome, this set of genes has few elements in such a situation. +This approach even suffers from producing +too small core genomes (of size 0 or 1), for any chosen similarity threshold, compared +to what is usually expected by biologists regarding these +chloroplasts. We are then left with the following questions: how can +we improve the confidence put in the produced core? Can we thus guess +the evolution scenario of these genomes?