From: bassam al-kindy
Date: Tue, 12 Nov 2013 16:14:31 +0000 (+0100)
Subject: finish implementation section
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/chloroplast13.git/commitdiff_plain/e9e910d9fdd598e7ddc3f4a4ce4b9e98b69a34ac

finish implementation section
---

diff --git a/annotated.tex b/annotated.tex
index b1f4254..2729939 100644
--- a/annotated.tex
+++ b/annotated.tex
@@ -132,7 +132,13 @@ We observe that ICM is very large because of the amount of data that it stores.
 \end{algorithm}
 \subsection{Features Visualization}
-The goal here is to visualize the results by building a tree of evolution. All Core genes generated with their genes are very important information in the tree, because they can be viewed as an ancestor information for two genomes or more. Further more, each node represents a genome or core as \textit{(Genes count:Family name, Scientific names, Accession number)}, Edges represent numbers of lost genes from genomes-core or core-core relationship. The number of lost genes here can represent an important factor for evolution, it represents how much lost of genes for the species in same or different families. By the principle of classification, small number of gene lost among species indicate that those species are close to each other and belong to same family, while big genes lost means that we have an evolutionary relationship between species from different families. To see the picture clearly, Phylogenetic tree is an evolutionary tree generated also by the system. Generating this tree is based on the distances among genes sequences. There are many resources to build such tree (for example: PHYML\cite{guindon2005phyml}, RAxML{\cite{stamatakis2008raxml,stamatakis2005raxml}, BioNJ , and TNT\cite{goloboff2008tnt}}. We consider to use RAxML\cite{stamatakis2008raxml,stamatakis2005raxml} because it is very fast for build large trees even for hundered sequences, it is also accurate by calculating bootstrap.
+The goal here is to visualize the results by building an evolutionary tree. The generated core genes, together with their genes, are important information in the tree, because each core can be viewed as ancestor information for two or more genomes. Furthermore, each node represents a genome or core as \textit{(Genes count:Family name, Scientific names, Accession number)}, and edges represent the number of genes lost in a genome-core or core-core relationship. The number of lost genes can be an important factor for evolution: it shows how many genes were lost among species in the same or different families. By the principle of classification, a small number of lost genes among species indicates that those species are close to each other and belong to the same family, while a large number of lost genes indicates an evolutionary relationship between species from different families. To see the picture clearly, the system also generates a phylogenetic tree, which is an evolutionary tree based on the distances among gene sequences. There are many tools to build such a tree (for example, PHYML\cite{guindon2005phyml}, RAxML\cite{stamatakis2008raxml,stamatakis2005raxml}, BioNJ, and TNT\cite{goloboff2008tnt}). We consider using RAxML\cite{stamatakis2008raxml,stamatakis2005raxml} because it is fast and accurate when building large trees for a large number of genome sequences. The procedure for constructing the phylogenetic tree consists of the following steps:
+
+\begin{enumerate}
+\item Extract the sequence of every gene in all core genes and store it in the database.
+\item Use a multiple alignment tool such as (****to be write after see christophe****) to align these sequences with each other.
+\item Submit the aligned genome sequences to the RAxML program to compute the distances and draw the phylogenetic tree.
+\end{enumerate}
 \begin{figure}[H]
 \centering
@@ -147,15 +153,21 @@ We implemented four algorithms to extract maximum core genes from large amount o
 \subsubsection{Core Genes based on NCBI Annotation}
 The first idea for constructing the core genome is based on extracting gene names (as gene presence or absence). At this stage, neither sequence comparison nor new annotation is performed: we just extract all genes with their counts from each chloroplast genome, then find the intersecting core genes based on gene names. \\
-The pipeline of extracting core genes can summarize in the following steps:\\
-First, we apply the genome annotation method using NCBI annotation tool. Genome quality check can be used in this step to ensure that genomes pass some quality condition. Then, the system lunch annotation process using NCBI to extract code genes (i.e \textit{exons}) and solve gene fragments. From NCBI, we did not observe any problem with genes fragments, but there are a problem of genes orthography (e.g two different genes sequences with same gene name). After we obtain all annotated genomes from NCBI to the local database, the code will then automatically will generate GenVision\cite{geneVision} file format to lunch the second step to extract coding genes names and counts. The competition will start by building intersection matrix to intersect genomes vectors in the local database with the others. New core vector for two leaf vectors will generate and a specific \textit{CoreId} will assign to it. an evolutionary tree will take place by using all data generated from step 1 and 2. The tree will also display the amount of genes lost from each intersection iteration. A specific excel file will be generated that store all the data in local database. The whole operation illstrate in Figure \ref{NCBI:geneextraction}.
+The pipeline for extracting core genes can be summarized in the following steps, according to the pre-processing method used:\\
+
+\begin{enumerate}
+\item Download the already annotated chloroplast genomes in the form of FASTA coding genes (i.e., \textit{exons}).
+\item Extract gene names and resolve gene duplication using the first method.
+\item Convert the FASTA file format to the geneVision file format to generate the ICM.
+\item Calculate the ICM to find the maximum core \textit{Score}. A new set of core genes is generated for the two genomes and a specific \textit{CoreId} is assigned to it. This process continues until no elements remain in the matrix.
+\item Build the evolutionary tree using all data generated from steps 1 and 4. The tree also displays the number of genes lost at each intersection iteration. A specific Excel file is generated that stores all the data in the local database.
+\end{enumerate}
+The main drawback of this method is gene orthography (e.g., two different gene sequences with the same gene name). In this case, gene loss is handled by resolving gene duplication with the first method.
 \subsubsection{Core Genes based on Dogma Annotation}
 The main goal is to obtain as many core genes as possible from the coding gene names. Given the NCBI annotation problems reported in \cite{Bakke2009}, an annotation method such as Dogma can give more reliable coding genes than NCBI, because NCBI annotations can carry annotation and gene identification errors. A general overview of the whole extraction process is illustrated in Figure \ref{wholesystem}.
-\subsubsection{extracting core genes based on genes names and counts}
- extracting core genes based on genes names and counts summarized in the following steps:\\
 \begin{enumerate}
 \item We apply the genome annotation manually using the Dogma annotation tool.
@@ -163,33 +175,21 @@ extracting core genes based on genes names and counts summarized in the followin
 \item Generate the ICM to calculate the maximum core genes.
 \item Draw the evolutionary tree by extracting all gene sequences from each core, then applying a multiple alignment process to the sequences to calculate the distances among cores and draw a phylogenetic tree.
-
-
-
-
-
-
-
- the code will lunch genes de-fragments process to avoid genes duplications. little problems of genes orthography (e.g two different genes sequences with same gene name) where exists. After we obtain all annotated genomes from dogma, we store it in the local database. The code will then automatically lunch the second step to extract coding genes names and counts. The competition will start by building intersection matrix to intersect genomes vectors in the local database with the others. New core vector for two leaf vectors will generate and a specific \textit{CoreId} will assign to it. an evolutionary tree will take place by using all data generated from step 1 and 2. The tree will also display the amount of genes lost from each intersection iteration. A specific excel file will be generated that store all the data in local database. The whole operation illustrate in Figure \ref{dogma:geneextraction}.
 \end{enumerate}
-\begin{figure}[H]
- \centering
- \includegraphics[width=0.7\textwidth]{Dogma_geneextraction}
- \caption{Extract core genes based on Dogma gene names and counts}\label{dogma:geneextraction}
-\end{figure}
 The main drawback of extracting core genes based on gene names and counts is that we cannot depend on gene names alone, for three reasons. First, a genome may not be fully named (this can be found in early versions of NCBI genomes), so some sequences will be lost. Second, two genes may share the same name while their sequences differ. Third, we need to annotate all the genomes.
 \subsection{Extract Core Genes based on Gene Quality Control}
-The main idea from this method is to focus on genes quality to predict maximum core genes. By comparing only genes names or genes sequences from one annotation tool is not enough. The question here, does the predicted gene from NCBI is the same gene predicted by Dogma based on gene name and gene sequence?. If yes, then we can predict new quality genomes based on quality control test with a specific threshold. Predicted Genomes comes from merging two annotation techniques. While if no, we can not depending neither on NCBI nor Dogma because of annotation error. Core genes can by predicted by using one of the previous methods.
+The main idea of this method is to focus on gene quality to predict the maximum core genes. Comparing only gene names from one annotation tool is not enough. The question is: is the gene predicted by NCBI the same as the gene predicted by Dogma, based on both gene name and gene sequence? If yes, we can predict new quality genomes based on a quality control test with a specific threshold; the predicted genomes come from merging the two annotation techniques. If no, we cannot depend on either NCBI or Dogma because of annotation errors. Core genes can be predicted by using one of the previous methods.
+\subsubsection{Core genes based on NCBI and Dogma Annotation}
 This method is summarized in the following steps:\\
 \begin{enumerate}
 \item Retrieve the annotation of all genomes from NCBI and Dogma: in this step, we annotate all chloroplast genomes in the database using both the NCBI annotation and the Dogma annotation tool.
-\item Predict quality genomes: the process is to pick a genome annotation from two techniques, extracting all common genes based on genes names, then applying Needle-man wunch algorithm to align the two sequences based on a specific threshold.
If the alignment score pass the threshold, then this gene will removed from the competition and store it in quality genome by saving its name with the largest gene sequence with respect to start and end codons. All quality genomes will store in the form of GenVision file format.
-\item Extract Core genes: from the above two steps, we will have new genomes with quality genes, ofcourse, we have some genes lost here, because dogma produced tRNA and rRNA genes while NCBI did not generate them and vise-versa. Build ICM to extract core genes will be sufficient because we already check their sequences.
+\item Convert the NCBI genomes to the GeneVision file format, then apply the second gene defragmentation method to the NCBI and Dogma genomes.
+\item Predict quality genomes: pick the annotations of a genome from the two sources, extract all common genes based on gene names, then apply the Needleman-Wunsch algorithm to align the two sequences with a threshold of 65\%. If the alignment score passes the threshold, the gene is removed from the competition and stored in the quality genome under its name, keeping the longest gene sequence with respect to start and end codons. All quality genomes are stored in the GenVision file format.
+\item Extract core genes: after the above steps, we have new genomes with quality genes. Of course, some genes are lost here, because Dogma produces tRNA and rRNA genes while NCBI does not generate rRNA genes, and vice versa. Building the ICM to extract core genes is then sufficient because we have already checked the gene sequences.
 \item Display tree: an evolutionary tree is then displayed based on the intersections of the quality genomes.
 \end{enumerate}
-\pagebreak
\ No newline at end of file
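
The "Predict quality genomes" step in the list above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names needleman_wunsch_identity and predict_quality_genome, the scoring values (match 1, mismatch -1, gap -1), and the reading of the 65% threshold as percent identity over the global alignment are all hypothetical choices made for the example. The start and end codon check mentioned in the text is not modeled; the longer of the two sequences is simply kept.

```python
def needleman_wunsch_identity(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align a and b; return identical columns / alignment length.

    The scoring scheme is an assumption; the source does not state one.
    """
    n, m = len(a), len(b)
    if n == 0 and m == 0:
        return 0.0  # nothing to compare
    # Dynamic-programming score matrix with linear gap penalties.
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + sub,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back one optimal alignment, counting identical columns.
    i, j, identical, length = n, m, 0, 0
    while i > 0 or j > 0:
        sub = match if i > 0 and j > 0 and a[i - 1] == b[j - 1] else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub:
            identical += a[i - 1] == b[j - 1]
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            i -= 1
        else:
            j -= 1
        length += 1
    return identical / length


def predict_quality_genome(ncbi_genes, dogma_genes, threshold=0.65):
    """Keep genes named in both annotations whose sequences align well.

    ncbi_genes and dogma_genes map gene name -> sequence for one genome.
    """
    quality = {}
    for name in sorted(ncbi_genes.keys() & dogma_genes.keys()):
        s1, s2 = ncbi_genes[name], dogma_genes[name]
        if needleman_wunsch_identity(s1, s2) >= threshold:
            quality[name] = max(s1, s2, key=len)  # keep the longer sequence
    return quality


# Toy sequences, invented for illustration only.
ncbi = {"rbcL": "ATGTCACCACAA", "psbA": "ATGACTGCAATT", "rps12": "ATGCCAACT"}
dogma = {"rbcL": "ATGTCACCACAAACA", "psbA": "GGGCCCGGGCCC", "trnH": "GCGGATGTAGCC"}
print(predict_quality_genome(ncbi, dogma))  # {'rbcL': 'ATGTCACCACAAACA'}
```

In this toy input only rbcL survives: it is named in both annotations and its two sequences reach 80% identity, while psbA falls below the threshold and the remaining genes appear in only one annotation. A real pipeline would more likely call an established aligner than this quadratic pure-Python version.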