From 9d33c5f06454db8d752fa7bbe08f9bdaa3977b1a Mon Sep 17 00:00:00 2001 From: Michel Salomon Date: Tue, 19 Nov 2013 18:37:04 +0100 Subject: [PATCH] First modifications in section 3 --- abstract.tex | 2 +- annotated.tex | 102 ++++++++++++++++++++++++++++++------------- classEquiv.tex | 8 ++-- intro.tex | 2 +- main.tex | 2 +- population_Table.tex | 1 - 6 files changed, 79 insertions(+), 38 deletions(-) diff --git a/abstract.tex b/abstract.tex index 3c789e3..510b8ae 100644 --- a/abstract.tex +++ b/abstract.tex @@ -5,7 +5,7 @@ evolution over time, and in phylogenetic and genetic analyses. Various models of genomes evolution are based on the analysis of DNA sequences, SNPs, mutations, and so on. We have recently investigated the use of core (\emph{i.e.}, common genes) and pan genomes to infer -evolutionary information on a collection of 107 chloroplasts. In +evolutionary information on a collection of 99~chloroplasts. In particular, we have regarded methods to build a genes content evolutionary tree using distances to core genome. However, the production of reliable core and pan genomes is not an easy task, due diff --git a/annotated.tex b/annotated.tex index 3a7fbd1..9810db8 100644 --- a/annotated.tex +++ b/annotated.tex @@ -1,41 +1,83 @@ -The field of genome annotation pays a lot of attentions where the -ability to collect and analysis genomical data can provide strong -indicators for the study of life\cite{Eisen2007}. Four of genome -annotation centers (such as, \textit{NCBI\cite{Sayers01012011}, + +In recent years, the cost of sequencing genomes has been greatly +reduced, and thus more and more genomes are sequenced. Therefore +automatic annotation tools are required to deal with this continuously +increasing amount of genomic data. Moreover, a reliable and accurate +genome annotation process is needed in order to provide strong +indicators for the study of life\cite{Eisen2007}.
+ +Various annotation tools (\emph{i.e.}, cost-effective sequencing +methods\cite{Bakke2009}) producing genomic annotations at many levels +of detail have been designed by different annotation centers. Among +the major annotation centers we can mention NCBI\cite{Sayers01012011}, Dogma \cite{RDogma}, cpBase \cite{de2002comparative}, -CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}}) -present various types of annotation tools (\emph{i.e.} cost-effective -sequencing methods\cite{Bakke2009}) on different annotation -levels. Generally, previous studies used one of three methods for gene -finding in annotated genome using these -centers: \textit{alignment-based, composition based, or combination of -both\cite{parra2007cegma}}. The alignment-based method is used when we -try to predict a coding gene (\emph{i.e.}. genes that produce -proteins) by aligning DNA sequence of gene to the protein of cDNA -sequence of homology\cite{parra2007cegma}. This approach also is used -in GeneWise\cite{birney2004genewise}. Composition-based method (known +CpGAVAS \cite{liu2012cpgavas}, and +CEGMA\cite{parra2007cegma}. Usually, previous studies used one of +three methods for finding genes in annotated genomes using data from +these centers: \textit{alignment-based}, \textit{composition-based}, +or a combination of both~\cite{parra2007cegma}. The alignment-based +method is used when trying to predict a coding gene (\emph{i.e.}, +genes that produce proteins) by aligning a genomic DNA sequence with a +cDNA sequence coding for a homologous protein \cite{parra2007cegma}. +This approach is also used in GeneWise\cite{birney2004genewise}. The +alternative method, the composition-based one (also known as \textit{ab initio}) is based on a probabilistic model of gene structure to find genes according to the gene value probability -(GeneID\cite{parra2000geneid}).
In this section, we consider a new -method of finding core genes from large amount of chloroplast genomes, -as a solution of the problem resulting from the method stated in -section two. This method is based on extracting gene features. A -general overview of the system is illustrated in Figure \ref{Fig1}.\\ - -\begin{figure}[H] +(GeneID \cite{parra2000geneid}). Such annotated genomic data will be +used to overcome the limitation of the first method described in the +previous section. In fact, the second method we propose finds core +genes from a large number of chloroplast genomes through genomic +feature extraction. + +Figure~\ref{Fig1} presents an overview of the entire method pipeline. +More precisely, the second method consists of three +stages: \textit{Genome annotation}, \textit{Core extraction}, +and \textit{Features Visualization} which highlights the +relationships. To understand the whole core extraction process, we +briefly describe each stage below. More details will be given in the +coming subsections. The method uses as its starting point a sequence +database chosen among the many international databases storing +nucleotide sequences, like GenBank at NCBI \cite{Sayers01012011}, +the \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe +or \textit{DDBJ} \cite{sugawara2008ddbj} in Japan. Different +biological tools can analyze and annotate genomes by interacting with +these databases to align and extract sequences to predict genes. The +database in our method must be taken from a reliable data source +that stores annotated and/or unannotated chloroplast genomes. We have +considered the GenBank-NCBI \cite{Sayers01012011} database as our sequence +database: 99~genomes of chloroplasts were retrieved. These genomes +lie in eleven types of chloroplast families and Table \ref{Tab2} +summarizes their distribution in our dataset.
+ +\begin{figure}[h] \centering - \includegraphics[width=0.7\textwidth]{generalView} -\caption{A general overview of the system}\label{Fig1} + \includegraphics[width=0.75\textwidth]{generalView} +\caption{A general overview of the annotation-based approach}\label{Fig1} \end{figure} -In Figure 1, we illustrate the general overview of system pipeline: \textit{Database, Genomes annotation, Core extraction,} and \textit{relationships}. We will give a short discussion for each stage of the model in order to understand the whole core extraction process. This work starts with a gene Bank database; however, many international Banks for nucleotide sequence databases (such as, \textit{GenBank} \cite{Sayers01012011} in USA, \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe, and \textit{DDBJ} \cite{sugawara2008ddbj} in Japon) exist to store various genomes and DNA species. Different biological tools can analyse and annotate genomes by interacting with these databases to align and extract sequences to predict genes. The database in this model must be taken from any confident data source that stores annotated and/or unannotated chloroplast genomes. We consider GenBank-NCBI \cite{Sayers01012011} database to be our nucleotide sequences database. Annotation (as the second stage) is considered to be the first important task for extract gene features. Good annotation tool leads us to extract good gene feature. In this paper, two annotation techniques from \textit{NCBI, and Dogma} are used to extract \textit{genes features}. Extracting gene feature (as a third stage) can be anything like (genes names, gene sequences, protein sequence,...etc). Our methodology in this paper consider gene names, genes counts, and gene sequence for extracting core genes and producing chloroplast evolutionary tree. \\ -In last stage, features visualization represents methods to visualize genomes and/or gene evolution in chloroplast. 
We use the forms of tables, phylogenetic trees, graphs,...,etc to organize and represent genomes relationships to achieve the goal of representing gene evolution. In addition, comparing these forms with another annotation tool forms dedicated to large population of chloroplast genomes give us biological perspectives to the nature of chloroplasts evolution. \\ -A local database attached with each pipe stage is used to store all the informations of extraction process. The output from each stage in our system will be an input to the second stage and so on. +Annotation, which is the first stage, is an important task for +extracting gene features. Indeed, to extract good gene features, a good +annotation tool is required. To obtain relevant annotated +genomes, two annotation techniques from NCBI and Dogma are used. The +next stage, gene feature extraction, can target features such as gene +names, gene sequences, protein sequences, and so on. Our method +considers gene names, gene counts, and gene sequences for extracting +core genes and producing a chloroplast evolutionary tree. The final +stage visualizes genome and/or gene evolution in +chloroplasts. To this end, we use representations like tables, +phylogenetic trees, graphs, etc. to organize and show genome +relationships, and thus achieve the goal of representing gene +evolution. In addition, comparing these representations with those +obtained from another annotation tool dedicated to large populations of +chloroplast genomes gives us biological perspectives on the nature of +chloroplast evolution. Note that a local database linked to each +pipeline stage is used to store all the information produced during the +process. -\subsection{Genomes Samples} -In this research, we retrieve genomes of Chloroplasts from NCBI. Ninety nine genome of them are considered to work with. These genomes lies in the eleven type of chloroplast families. The distribution of genomes is illustrated in detail in Table \ref{Tab2}.
- \input{population_Table} + +% MICHEL : TO BE CONTINUED FROM HERE + \subsection{Genome Annotation Techniques} Genome annotation is the second stage in the model pipeline. Many techniques were developed to annotate chloroplast genomes but the problem is that they vary in the number and type of predicted genes (\emph{i.e.} the ability to predict genes and \textit{for example: Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques from NCBI and Dogma are considered to analyse chloroplast genomes to examine the accuracy of predicted coding genes. diff --git a/classEquiv.tex b/classEquiv.tex index 829a8b4..f3d3ed1 100644 --- a/classEquiv.tex +++ b/classEquiv.tex @@ -14,7 +14,7 @@ $d(x,y)\leqslant T$. %\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$. -Let be given a \emph{similarity} threshold $T$ and a distance $d$, +Let be given a \emph{similarity} threshold $T$ and a distance $d$ (Needleman-Wunsch released by EMBL for instance). The method begins by building an undirected graph between all the DNA~sequences $g$ of the set of genomes as follows: @@ -23,10 +23,10 @@ if $g_i \sim_{d,T} g_j$ is established. This graph is further denoted as the ``similarity'' graph. We thus consider that the pair of two coding sequences -$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ an,d +$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ and $g_j$ belong in the same connected component (CC), \textit{i.e.} if there is a path between $g_i$ -and $g_j$ in the similarity graph. It is not hard to see this relation is an +and $g_j$ in the similarity graph. It is not hard to see that this relation is an equivalence relation whereas $\sim$ is not. @@ -51,7 +51,7 @@ the projected genomes.
\begin{figure} \begin{center} -\includegraphics[scale=0.4]{stats.png} +\includegraphics[scale=0.5]{stats.png} \end{center} \caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan} \end{figure} diff --git a/intro.tex b/intro.tex index 84c9ed8..a0a6050 100644 --- a/intro.tex +++ b/intro.tex @@ -1,7 +1,7 @@ Identifying core genes is important to understand evolutionary and functional phylogenies. Therefore, in this work we present methods to build a genes content evolutionary tree. More precisely, we focus on -the following questions considering a collection of 107~chloroplasts +the following questions considering a collection of 99~chloroplasts annotated from NCBI \cite{Sayers01012011} and Dogma \cite{RDogma}: how can we identify the best core genome and what is the evolutionary scenario of these chloroplasts. diff --git a/main.tex b/main.tex index 65e70d4..ae1e9d7 100755 --- a/main.tex +++ b/main.tex @@ -46,7 +46,7 @@ University of Franche-Comt\'{e}, France \\ % Main author : jfc \input{classEquiv} -\section{Annotations-based approaches} +\section{Annotation-based approaches} % Main author : bassam \input{annotated} diff --git a/population_Table.tex b/population_Table.tex index b184f60..6a65f00 100644 --- a/population_Table.tex +++ b/population_Table.tex @@ -1,5 +1,4 @@ \begin{center} - \begin{table} \tiny \caption[NCBI Genomes Families]{List of family groups of Chloroplast Genomes from NCBI\label{Tab2}} -- 2.39.5