X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/chloroplast13.git/blobdiff_plain/df455ffa41784599b14547931a127560bd5be108..b6d9122987057b1c8c45d6ca3ff752ea12ec9484:/annotated.tex diff --git a/annotated.tex b/annotated.tex index 720adc5..6bd0d55 100644 --- a/annotated.tex +++ b/annotated.tex @@ -1 +1,423 @@ - sdfdfadqdqaaaaaaaaaaaaaaaaaaa + +These last years the cost of sequencing genomes has been greatly +reduced, and thus more and more genomes are sequenced. Therefore +automatic annotation tools are required to deal with this continuously +increasing amount of genomical data. Moreover, a reliable and accurate +genome annotation process is needed in order to provide strong +indicators for the study of life\cite{Eisen2007}. + +Various annotation tools (\emph{i.e.}, cost-effective sequencing +methods\cite{Bakke2009}) producing genomic annotations at many levels +of detail have been designed by different annotation centers. Among +the major annotation centers we can notice NCBI\cite{Sayers01012011}, +Dogma \cite{RDogma}, cpBase \cite{de2002comparative}, +CpGAVAS \cite{liu2012cpgavas}, and +CEGMA\cite{parra2007cegma}. Usually, previous studies used one out of +three methods for finding genes in annoted genomes using data from +these centers: \textit{alignment-based}, \textit{composition based}, +or a combination of both~\cite{parra2007cegma}. The alignment-based +method is used when trying to predict a coding gene (\emph{i.e.}. +genes that produce proteins) by aligning a genomic DNA sequence with a +cDNA sequence coding an homologous protein \cite{parra2007cegma}. +This approach is also used in GeneWise\cite{birney2004genewise}. The +alternative method, the composition-based one (also known +as \textit{ab initio}) is based on a probabilistic model of gene +structure to find genes according to the gene value probability +(GeneID \cite{parra2000geneid}). Such annotated genomic data will be +used to overcome the limitation of the first method described in the +previous section. In fact, the second method we propose finds core +genes from large amount of chloroplast genomes through genomic +features extraction. + +Figure~\ref{Fig1} presents an overview of the entire method pipeline. +More precisely, the second method consists of three +stages: \textit{Genome annotation}, \textit{Core extraction}, +and \textit{Features Visualization} which highlights the +relationships. To understand the whole core extraction process, we +describe briefly each stage below. More details will be given in the +coming subsections. The method uses as starting point some sequence +database chosen among the many international databases storing +nucleotide sequences, like the GenBank at NBCI \cite{Sayers01012011}, +the \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe +or \textit{DDBJ} \cite{sugawara2008ddbj} in Japan. Different +biological tools can analyze and annotate genomes by interacting with +these databases to align and extract sequences to predict genes. The +database in our method must be taken from any confident data source +that stores annotated and/or unannotated chloroplast genomes. We have +considered the GenBank-NCBI \cite{Sayers01012011} database as sequence +database: 99~genomes of chloroplasts were retrieved. These genomes +lie in the eleven type of chloroplast families and Table \ref{Tab2} +summarizes their distribution in our dataset. + +\begin{figure}[h] + \centering + \includegraphics[width=0.75\textwidth]{generalView} +\caption{A general overview of the annotation-based approach}\label{Fig1} +\end{figure} + +Annotation, which is the first stage, is an important task for +extracting gene features. Indeed, to extract good gene feature, a good +annotation tool is obviously required. To obtain relevant annotated +genomes, two annotation techniques from NCBI and Dogma are used. The +extraction of gene feature, the next stage, can be anything like gene +names, gene sequences, protein sequences, and so on. Our method +considers gene names, gene counts, and gene sequence for extracting +core genes and producing chloroplast evolutionary tree. The final +stage allows to visualize genomes and/or gene evolution in +chloroplast. Therefore we use representations like tables, +phylogenetic trees, graphs, etc. to organize and show genomes +relationships, and thus achieve the goal of representing gene +evolution. In addition, comparing these representations with ones +issued from another annotation tool dedicated to large population of +chloroplast genomes give us biological perspectives to the nature of +chloroplasts evolution. Notice that a local database linked with each +pipe stage is used to store all the informations produced during the +process. + +\input{population_Table} + +\subsection{Genome annotation techniques} + +For the first stage, genome annotation, many techniques have been +developed to annotate chloroplast genomes. These techniques differ +from each others in the number and type of predicted genes (for +example: \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA +(rRNA)} genes). Two annotation techniques from NCBI and Dogma are +considered to analyze chloroplast genomes. + +\subsubsection{Genome annotation from NCBI} + +The objective is to generate sets of genes from each genome so that +genes are organized without any duplication. The input is a list of +chloroplast genomes annotated from NCBI. More precisely, all genomes +are stored as \textit{.fasta} files which consists in a collection of +protein coding genes\cite{parra2007cegma,RDogma} (gene that produce +proteins) organized in coding sequences. To be able build the set of +core genes, we need to preprocess these genomes +using \textit{BioPython} package \cite{chapman2000biopython}. This +step starts by converting each genome from FASTA file format to +GenVision \cite{geneVision} format from DNASTAR. Each genome is thus +converted in a list of genes, with gene names and gene counts. Gene +name duplications can be accumulated during the treatment of a genome. +These duplications come from gene fragments (\emph{e.g.} gene +fragments treated with NCBI) and from chloroplast DNA sequences. To +ensure that all the duplications are removed, each list of gene is +translated into a set of genes. Note that NCBI genome annotation +produces genes except \textit{Ribosomal (rRNA)} genes. + +\subsubsection{Genome annotation from Dogma} + +Dogma stands for \textit{Dual Organellar GenoMe Annotator}. It is an +annotation tool developed at University of Texas in 2004 for plant +chloroplast and animal mitochondrial genomes. This tool has its own +database for translating a genome in all six reading frames and +queries the amino acid sequence database using +BLAST \cite{altschul1990basic} (\emph{i.e.} Blastx) with various +parameters. Protein coding genes are identified in an input genome +using sequence similarity of genes in Dogma database. In addition in +comparison with NCBI annotation tool, Dogma can produce +both \textit{Transfer RNAs (tRNA)} and \textit{Ribosomal RNAs (rRNA)}, +verify their start and end positions. Another difference is also that +there is no gene duplication with Dogma after solving gene +fragmentation. In fact, genome annotation with Dogma can be the key +difference when extracting core genes. + +The Dogma annotation process is divided into two tasks. First, we +manually annotate chloroplast genomes using Dogma web tool. The output +of this step is supposed to be a collection of coding genes files for +each genome, organized in GeneVision file. The second task is to solve +the gene duplication problem and therefore we have use two +methods. The first method, based on gene name, translates each genome +into a set of genes without duplicates. The second method avoid gene +duplication through a defragment process. In each iteration, this +process starts by taking a gene from gene list, searches for gene +duplication, if a duplication is found, it looks on the orientation of +the fragment sequence. If it is positive it appends directly the +sequence to gene files. Otherwise reverse complement operations are +applied on the sequence, which is then also append to gene files. +Finally, a check for missing start and stop codons is performed. At +the end of the annotation process, all the genomes are fully +annotated, their genes are defragmented, and gene counts are +available. + +\subsection{Core genes extraction} + +The goal of this stage is to extract maximum core genes from sets of +genes. To find core genes, the following methodology is applied. + +\subsubsection{Preprocessing} + +In order to extract core genomes in a suitable manner, the genomic +data are preprocessed with two methods: on the one hand a method based +on gene name and count, and on the other hand a method based on a +sequence quality control test. + +In the first method, we extract a list of genes from each chloroplast +genome. Then we store this list of genes in the database under genome +nam and genes counts can be extracted by a specific length command. +The \textit{Intersection Core Matrix}, described in next subsection, +is then computed to extract the core genes. The problem with this +method can be stated as follows: how can we ensure that the gene which +is predicted in core genes is the same gene in leaf genomes? The +answer to this problem is that if the sequences of any gene in a +genome annotated from Dogma and NCBI are similar with respect to a +given threshold, then we do not have any problem with this +method. When the sequences are not similar we have a problem, because +we cannot decide which sequence belongs to a gene in core genes. + +The second method is based on the underlying idea: we can predict the +the best annotated genome by merging the annotated genomes from NCBI +and Dogma according to a quality test on genes names and sequences. To +obtain all quality genes of each genome, we consider the following +hypothesis: any gene will appear in the predicted genome if and only +if the annotated genes in NCBI and Dogma pass a specific threshold +of \textit{quality control test}. In fact, the Needle-man Wunch +algorithm is applied to compare both sequences with respect to a +threshold. If the alignment score is above the threshold, then the +gene will be retained in the predicted genome, otherwise the gene is +ignored. Once the prediction of all genomes is done, +the \textit{Intersection Core Matrix} is computed on these new genomes +to extract core genes, as explained in Algorithm \ref{Alg3:thirdM}. + +\begin{algorithm}[H] +\caption{Extract new genome based on gene quality test} +\label{Alg3:thirdM} +\begin{algorithmic} +\REQUIRE $Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 65$ +\ENSURE $geneList \leftarrow \text{Quality genes}$ +\STATE $dir(NCBI\_Genes) \leftarrow \text{NCBI genes of Gname}$ +\STATE $dir(Dogma\_Genes) \leftarrow \text{Dogma genes of Gname}$ +\STATE $geneList=\text{empty list}$ +\STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$ +\FOR{$\text{gene in common}$} + \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$ + \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$ + \STATE $score \leftarrow geneChk(g1,g2)$ + \IF {$score > Threshold$} + \STATE $geneList \leftarrow gene$ + \ENDIF +\ENDFOR +\RETURN $geneList$ +\end{algorithmic} +\end{algorithm} + +\textbf{geneChk} is a subroutine used to find the best similarity score between +two gene sequences after applying operations like \textit{reverse}, {\it complement}, +and {\it reverse complement}. Algorithm~\ref{Alg3:genechk} gives the outline of +geneChk subroutine. + +\begin{algorithm}[H] +\caption{Find the Maximum Similarity Score between two sequences} +\label{Alg3:genechk} +\begin{algorithmic} +\REQUIRE $g1,g2 \leftarrow \text{NCBI gene sequence, Dogma gene sequence}$ +\ENSURE $\text{Maximum similarity score}$ +\STATE $score1 \leftarrow needle(g1,g2)$ +\STATE $score2 \leftarrow needle(g1,Reverse(g2))$ +\STATE $score3 \leftarrow needle(g1,Complement(g2))$ +\STATE $score4 \leftarrow needle(g1,Reverse(Complement(g2)))$ +\RETURN $max(score1,score2,score3,score4)$ +\end{algorithmic} +\end{algorithm} + +\subsubsection{Intersection Core Matrix (\textit{ICM})} + +To extract core genes, we iteratively collect the maximum number of +common genes between genomes and therefore during this stage +an \textit{Intersection Core Matrix} (ICM) is built. ICM is a two +dimensional symmetric matrix where each row and each column correspond +to one genome. Hence, an element of the matrix stores +the \textit{Intersection Score} (IS): the cardinality of the core +genes set obtained by intersecting one genome with another +one. Maximum cardinality results in selecting the two genomes having +the maximum score. Mathematically speaking, if we have $n$ genomes in +local database, the ICM is an $n \times n$ matrix whose elements +satisfy: +\begin{equation} +score_{ij}=\vert g_i \cap g_j\vert +\label{Eq1} +\end{equation} +\noindent where $1 \leq i \leq n$, $1 \leq j \leq n$, and $g_i, g_j$ are +genomes. The generation of a new core gene depends obviously on the +value of the intersection scores $score_{ij}$. More precisely, the +idea is to consider a pair of genomes such that their score is the +largest element in ICM. These two genomes are then removed from matrix +and the resulting new core genome is added for the next iteration. +The ICM is then updated to take into account the new core gene: new IS +values are computed for it. This process is repeated until no new core +gene can be obtained. + +We can observe that the ICM is very large due to the amount of +data. As a consequence, the computation of the intersection scores is +both time and memory consuming. However, since ICM is a symetric +matrix we can reduce the computation overhead by considering only its +triangular upper part. The time complexity for this process after +enhancement is thus $O(\frac{n.(n-1)}{2})$. Algorithm ~\ref{Alg1:ICM} +illustrates the construction of the ICM matrix and the extraction of +the core genes, where \textit{GenomeList} represents the database +storing all genomes data. At each iteration, it computes the maximum +core genes with its two genomes parents. + +% ALGORITHM HAS BEEN REWRITTEN + +\begin{algorithm}[H] +\caption{Extract Maximum Intersection Score} +\label{Alg1:ICM} +\begin{algorithmic} +\REQUIRE $L \leftarrow \text{genomes sets}$ +\ENSURE $B1 \leftarrow \text{Max Core set}$ +\FOR{$i \leftarrow 1:len(L)-1$} + \STATE $score \leftarrow 0$ + \STATE $core1 \leftarrow set(GenomeList[L[i]])$ + \STATE $g1 \leftarrow L[i]$ + \FOR{$j \leftarrow i+1:len(L)$} + \STATE $core2 \leftarrow set(GenomeList[L[j]])$ + \STATE $core \leftarrow core1 \cap core2$ + \IF{$len(core) > score$} + \STATE $score \leftarrow len(core)$ + \STATE $g2 \leftarrow L[j]$ + \ENDIF + \ENDFOR + \STATE $B1[score] \leftarrow (g1,g2)$ +\ENDFOR +\RETURN $max(B1)$ +\end{algorithmic} +\end{algorithm} + +\subsection{Features visualization} + +The goal is to visualize results by building a tree of evolution. All +core genes generated represent an important information in the tree, +because they provide ancestor information of two or more +genomes. Each node in the tree represents one chloroplast genome or +one predicted core and labelled as \textit{(Genes count:Family name\_Scientific +names\_Accession number)}. While an edge is labelled with the number of +lost genes from a leaf genome or an intermediate core gene. Such +numbers are very interesting because they give an information about +the evolution: how many genes were lost between two species whether +they belong to the same family or not. By the principle of +classification, a small number of genes lost among species indicates +that those species are close to each other and belong to same family, +while a large lost means that we have an evolutionary relationship +between species from different families. To depict the links between +species clearly, we built a phylogenetic tree showing the +relationships based on the distances among genes sequences. Many tools +are available to obtain a such tree, for example: +PHYML\cite{guindon2005phyml}, +RAxML{\cite{stamatakis2008raxml,stamatakis2005raxml}, BioNJ, and +TNT\cite{goloboff2008tnt}}. In this work, we chose to use +RAxML\cite{stamatakis2008raxml,stamatakis2005raxml} because it is +fast, accurate, and can build large trees when dealing with a large +number of genomic sequences. + +The procedure used to built a phylogenetic tree is as follows: +\begin{enumerate} +\item For each gene in a core gene, extract its sequence and store it in the database. +\item Use multiple alignment tools such as (****to be write after see christophe****) +to align these sequences with each others. +\item Use an outer-group genome from cyanobacteria to calculate distances. +\item Submit the resulting aligned sequences to RAxML program to compute +the distances and finally draw the phylogenetic tree. +\end{enumerate} + +\begin{figure}[H] + \centering \includegraphics[width=0.75\textwidth]{Whole_system} \caption{Overview + of the pipeline}\label{wholesystem} +\end{figure} + +\section{Implementation} + +The different algorithms have been implemented using Python version +2.7, on a laptop running Ubuntu~12.04~LTS. More precisely, the +computer is a Dell Latitude laptop - model E6430 with 6~GiB memory and +a quad-core Intel core~i5~processor with an operating frequency of +2.5~GHz. Many python packages such as os, Biopython, memory\_profile, +re, numpy, time, shutil, and xlsxwriter were used to extract core +genes from large amount of chloroplast genomes. + +\begin{center} +\begin{table}[b] +\caption{Type of annotation, execution time, and core genes +for each method}\label{Etime} +{\scriptsize +\begin{tabular}{p{2cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.2cm}} +\hline\hline + Method & \multicolumn{2}{c}{Annotation} & \multicolumn{2}{c}{Features} & \multicolumn{2}{c}{Exec. time (min.)} & \multicolumn{2}{c}{Core genes} & \multicolumn{2}{c}{Bad genomes} \\ +~ & N & D & Name & Seq & N & D & N & D & N & D \\ +\hline +Gene prediction & $\surd$ & - & - & $\surd$ & ? & - & ? & - & 0 & -\\[0.5ex] +Gene features & $\surd$ & $\surd$ & $\surd$ & - & 4.98 & 1.52 & 28 & 10 & 1 & 0\\[0.5ex] +Gene quality & $\surd$ & $\surd$ & $\surd$ & $\surd$ & \multicolumn{2}{c}{$\simeq$3 days + 1.29} & \multicolumn{2}{c}{4} & \multicolumn{2}{c}{1}\\[1ex] +\hline +\end{tabular} +} +\end{table} +\end{center} + +\vspace{-1cm} + +Table~\ref{Etime} presents for each method the annotation type, +execution time, and the number of core genes. We use the following +notations: \textbf{N} denotes NCBI, while \textbf{D} means DOGMA, +and \textbf{Seq} is for sequence. The first {\it Annotation} columns +represent the algorithm used to annotate chloroplast genomes, the {\it +Features} columns mean the kind of gene feature used to extract core +genes: gene name, gene sequence, or both of them. It can be seen that +almost all methods need low {\it Execution time} to extract core genes +from large chloroplast genome. Only the gene quality method requires +several days of computation (about 3-4 days) for sequence comparisons, +once the quality genomes are construced it takes just 1.29~minutes to +extract core gene. Thanks to this low execution times we can use these +methods to extract core genes on a personal computer rather than main +frames or parallel computers. The lowest execution time: 1.52~minutes, +is obtained with the second method using Dogma annotations. The number +of {\it Core genes} represents the amount of genes in the last core +genome. The main goal is to find the maximum core genes that simulate +biological background of chloroplasts. With NCBI we have 28 genes for +96 genomes, instead of 10 genes for 97 genomes with +Dogma. Unfortunately, the biological distribution of genomes with NCBI +in core tree do not reflect good biological perspective, whereas with +DOGMA the distribution of genomes is biologically relevant. {\it Bad +genomes} gives the number of genomes that destroy core genes due to +low number of gene intersection. \textit{NC\_012568.1 Micromonas +pusilla} is the only genome which destroyed the core genome with NCBI +annotations for both gene features and gene quality methods. + +The second important factor is the amount of memory being used by each +methodology. Table \ref{mem} shows the memory usage of each +method. We used a package from PyPI~(\textit{the Python Package +Index}) named \textit{Memory\_profile} (located at~{\tt +https://pypi.python.org/pypi}) to extract all the values in +table~\ref{mem}. In this table, the values are presented in megabyte +unit and \textit{gV} means genevision~file~format. We can notice that +the level of memory which is used is relatively low for all methods +and is available on any personal computer. The different values also +show that the gene features method based on Dogma annotations has the +more reasonable memory usage, except when extracting core +sequences. The third method gives the lowest values if we already have +the quality genomes, otherwise it will consume far more +memory. Moreover, the amount of memory used by the third method also +depends on the size of each genome. + +\begin{center} +\begin{table}[H] +\caption{Memory usages in (MB) for each methodology}\label{mem} +{\scriptsize +\begin{tabular}{p{2.5cm}p{1.5cm}p{1cm}p{1cm}p{1cm}p{1cm}p{1cm}p{1cm}} +\hline\hline +Method& & Load Gen. & Conv. gV & Read gV & ICM & Core tree & Core Seq. \\ +\hline +Gene prediction & ~ & ~ & ~ & ~ & ~ & ~ & ~\\ +\multirow{2}{*}{Gene Features} & NCBI & 15.4 & 18.9 & 17.5 & 18 & 18 & 28.1\\ + & DOGMA& 15.3 & 15.3 & 16.8 & 17.8 & 17.9 & 31.2\\ +Gene Quality & ~ & 15.3 & $\le$3G & 16.1 & 17 & 17.1 & 24.4\\ +\hline +\end{tabular} +} +\end{table} +\end{center} + + + + +