X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/chloroplast13.git/blobdiff_plain/f8689d1f199221556dc52e80c5237c45addb0218..0f2f8faba6ce0dbfd8f3e8df9d31f67756cff6c1:/classEquiv.tex?ds=sidebyside

diff --git a/classEquiv.tex b/classEquiv.tex
index 41abda4..00b227d 100644
--- a/classEquiv.tex
+++ b/classEquiv.tex
@@ -1,28 +1,50 @@
-Identifying core genes is important to understand evolutionary and
-functional phylogenies. Therefore, in this work we present two methods
-to build a genes content evolutionary tree. More precisely, we focus
-on the following questions considering a collection of
-99~chloroplasts annotated from NCBI \cite{Sayers01012011} and Dogma
-\cite{RDogma} : how can we identify the best core genome and what
-is the evolutionary scenario of these chloroplasts.
-Two methods are considered here. The first one is based on NCBI annotation, it is explained below.
-We start by the following definition.
+
+The first method, described below, considers NCBI annotations and uses
+a distance-based similarity measure. We start with the following
+preliminary Definition:
+
 \begin{definition}
 \label{def1}
-Let $A=\{A,T,C,G\}$ be the nucleotides alphabet, and $A^\ast$ be the set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let $d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on $A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if $d(x,y)\leqslant T$.
+Let $A=\{A,T,C,G\}$ be the nucleotides alphabet, and $A^\ast$ be the
+set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let
+$d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on
+$A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For
+all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if
+$d(x,y)\leqslant T$.
 \end{definition}
-\noindent$\sim_{d,T}$ is obviously an equivalence relation. When $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package (Needleman-Wunch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$. The method starts by building an undirected graph based on
-the similarity rates $r_{ij}$ between sequences $g_{i}$ and $g_{j}$ (\emph{i.e.}, $r_{ij}=\Delta(g_{i},g_{j})$).
-In this latter, nodes are constituted by all the coding sequences of the set of genomes under consideration, and there is an edge between $g_{i}$ and $g_{j}$ if the
-similarity rate $r_{ij}$ is
-greater than the given similarity threshold. The Connected Components
-(CC) of the ``similarity'' graph are thus computed.
-This produces an equivalence
-relation between sequences in the same CC based on Definition~\ref{def1}.
-Any class for this relation is called ``gene'' here, where its representatives (DNA sequences) are the ``alleles'' of this gene. Thus this first method produces for each genome $G$, which is a set $\{g_{1}^G,...,g_{m_G}^G\}$ of $m_{G}$ DNA coding sequences, the projection of each sequence according to $\pi$, where $\pi$ maps each sequence
-into its gene (class) according to $\sim$. In other words, $G$ is mapped into $\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\}$.
-Remark that a projected genome has no duplicated gene, as it is a set. The core genome (resp. the pan genome) of $G_{1}$ and $G_{2}$ is defined thus as the intersection (resp. as the union) of these projected genomes.\\
-We then consider the intersection of all the projected genomes, which is the set of all the genes $\dot{x}$
-such that each genome has at least one allele in $\dot{x}$. The pan genome is computed similarly as the union of all the projected genomes. However such approach suffers from producing too small core genomes,
-for any chosen similarity threshold, compared to what is usually waited by biologists regarding these chloroplasts. We are then left with the following questions: how can we improve the confidence put in the produced core? Can we thus guess the evolution scenario of these genomes?
\ No newline at end of file
+\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package (Needleman-Wunch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$.
+
+The method begins by building an undirected graph based on similarity
+rates $r_{ij}$ between DNA~sequences $g_{i}$ and $g_{j}$ (\emph{i.e.},
+$r_{ij}=\Delta\left(g_{i},g_{j}\right)$). In this latter graph, nodes
+are constituted by all the coding sequences of the set of genomes
+under consideration, and there is an edge between $g_{i}$ and $g_{j}$
+if the similarity rate $r_{ij}$ is greater than a given similarity
+threshold. The Connected Components (CC) of the ``similarity'' graph
+are thus computed.
+
+This process also results in an equivalence relation between sequences
+in the same CC based on Definition~\ref{def1}. Any class for this
+relation is called ``gene'' here, where its representatives
+(DNA~sequences) are the ``alleles'' of this gene. Thus this first
+method produces for each genome $G$, which is a set
+$\left\{g_{1}^G,...,g_{m_G}^G\right\}$ of $m_{G}$ DNA coding
+sequences, the projection of each sequence according to $\pi$, where
+$\pi$ maps each sequence into its gene (class) according to $\sim$. In
+other words, a genome $G$ is mapped into
+$\left\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\right\}$. Note that a
+projected genome has no duplicated gene since it is a set.
+
+Consequently, the core genome (resp. the pan genome) of two genomes
+$G_{1}$ and $G_{2}$ is defined as the intersection (resp. as the
+union) of their projected genomes. We then consider the intersection
+of all the projected genomes, which is the set of all the genes
+$\dot{x}$ such that each genome has at least one allele in
+$\dot{x}$. The pan genome is computed similarly as the union of all
+the projected genomes. However such approach suffers from producing
+too small core genomes, for any chosen similarity threshold, compared
+to what is usually expected by biologists regarding these
+chloroplasts. We are then left with the following questions: how can
+we improve the confidence put in the produced core? Can we thus guess
+the evolution scenario of these genomes?
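
The pipeline described in the patched text (similarity graph, connected components, projection $\pi$, then core and pan genomes) can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: `toy_similarity` is a hypothetical stand-in for $\Delta$ (it is not the EMBOSS Needleman-Wunsch score), and the names `gene_classes` and `core_and_pan` are invented for this example. The default threshold of 0.9 on the similarity corresponds to $T=0.1$ on the distance $d=1-\Delta$. Taking connected components amounts to taking the transitive closure of the thresholded relation, so two sequences can land in the same class without being directly similar.

```python
from itertools import combinations


def toy_similarity(x, y):
    """Hypothetical stand-in for Delta: fraction of matching positions.

    This is NOT the EMBOSS Needleman-Wunsch score used in the paper.
    """
    if not x and not y:
        return 1.0
    matches = sum(a == b for a, b in zip(x, y))
    return matches / max(len(x), len(y))


def gene_classes(sequences, similarity, threshold):
    """Map each sequence to a representative of its connected component.

    An edge joins two sequences when their similarity is strictly
    greater than the threshold; components are found with union-find.
    """
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the root, compressing the path.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    for x, y in combinations(parent, 2):
        if similarity(x, y) > threshold:
            parent[find(x)] = find(y)
    return {s: find(s) for s in parent}


def core_and_pan(genomes, similarity=toy_similarity, threshold=0.9):
    """Project each genome onto gene classes, then intersect and unite.

    `genomes` maps a genome name to an iterable of coding sequences.
    """
    all_seqs = sorted({s for g in genomes.values() for s in g})
    pi = gene_classes(all_seqs, similarity, threshold)
    # A projected genome is a set, so it has no duplicated gene.
    projected = {name: {pi[s] for s in g} for name, g in genomes.items()}
    core = set.intersection(*projected.values())
    pan = set.union(*projected.values())
    return projected, core, pan
```

For instance, calling `core_and_pan({"G1": ["ACGTACGTAC", "TTTTTTTTTT"], "G2": ["ACGTACGTAA", "GGGGGGGGGG"]}, threshold=0.8)` groups the two `ACGT...` sequences into one gene, which is then the only member of the core, while the pan genome contains three genes. The all-pairs comparison is quadratic in the number of sequences, which is acceptable for a sketch but would need care on a full collection of chloroplast genomes.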