From fa40469ad69c4706a4a32f67e7f846fd73e1a597 Mon Sep 17 00:00:00 2001 From: bassam al-kindy Date: Mon, 4 Nov 2013 11:53:17 +0100 Subject: [PATCH] Add algorithm to method 3 --- annotated.tex | 61 +++++++++++++++++++++++++++++++++++++++++++++------ 1 file changed, 54 insertions(+), 7 deletions(-) diff --git a/annotated.tex b/annotated.tex index 450002c..79dadbd 100644 --- a/annotated.tex +++ b/annotated.tex @@ -141,10 +141,10 @@ In second Method, due to the number of annotated genomes, annotate each genome c \label{Alg2:secondM} \begin{algorithmic} \REQUIRE $Ref\_Genome \leftarrow \text{Accession No}$ -\ENSURE $Core \leftarrow \text{Genes in each genome}$ -\FOR{$i \leftarrow Ref\_Genome$} - \STATE $G\_list=[ ]$ - \STATE $File \leftarrow Blastn(i)$ +\ENSURE $core \leftarrow \text{Genomes for each gene}$ +\FOR{$gene \leftarrow Ref\_Genome$} + \STATE $G\_list= \text{empty list}$ + \STATE $File \leftarrow Blastn(gene)$ \STATE $G\_list \leftarrow File[\text{Genomes names}]$ \STATE $Core \leftarrow [Accession\_No:G\_list]$ \ENDFOR @@ -152,8 +152,54 @@ In second Method, due to the number of annotated genomes, annotate each genome c \end{algorithmic} \end{algorithm} -The hypothesis in last method state: we can predict the best annotated genome by merge the annotated genomes from NCBI and dogma based on the quality of genes names and sequences. To generate all quality genes of each genome. the hypothesis state: Any gene will be in predicted genome if and only if the annotated genes between NCBI and Dogma pass a specific threshold of\textit{quality control test}. To accept the quality test, we applied Needle-man Wunch algorithm to compare two gene sequences with respect to pass a threshold. If the alignment score pass this threshold, then the gene will be in the predicted genome, else the gene will be ignored. After predicting all genomes, one of previous two methods can be applied to extract core genes. 
- +The hypothesis of the last method states that we can predict the best annotated genome by merging the annotated genomes from NCBI and Dogma, based on the quality of gene names and sequences. To generate all quality genes of each genome, the hypothesis states: any gene will be in the predicted genome if and only if the annotations of that gene from NCBI and Dogma pass a specific threshold of a \textit{quality control test}. To apply the quality test, we used the Needleman-Wunsch algorithm to compare the two gene sequences against a threshold. If the alignment score exceeds this threshold, the gene is included in the predicted genome; otherwise the gene is ignored. After predicting all genomes, one of the previous two methods can be applied to extract core genes. The quality test is shown in Algorithm \ref{Alg3:thirdM}. + +\begin{algorithm}[H] +\caption{Extract new genome based on Gene Quality test} +\label{Alg3:thirdM} +\begin{algorithmic} +\REQUIRE $Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 65$ +\ENSURE $geneList \leftarrow \text{Quality genes}$ +\STATE $dir(NCBI\_Genes) \leftarrow \text{NCBI genes of Gname}$ +\STATE $dir(Dogma\_Genes) \leftarrow \text{Dogma genes of Gname}$ +\STATE $geneList=\text{empty list}$ +\STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$ +\FOR{$\text{gene in common}$} + \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$ + \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$ + \STATE $score \leftarrow geneChk(g1,g2)$ + \IF {$score > Threshold$} + \STATE $geneList.append(gene)$ + \ENDIF +\ENDFOR +\RETURN $geneList$ +\end{algorithmic} +\end{algorithm} + +Here, geneChk is a Python subroutine that finds the best similarity score between two gene sequences after applying operations such as \textit{reverse, complement, and reverse complement}. The geneChk procedure is given in Algorithm \ref{Alg3:genechk}. 
+ +\begin{algorithm}[H] +\caption{Find the Maximum similarity score between two sequences} +\label{Alg3:genechk} +\begin{algorithmic} +\REQUIRE $gen1,gen2 \leftarrow \text{NCBI gene sequence, Dogma gene sequence}$ +\ENSURE $\text{Maximum similarity score}$ +\STATE $Score1 \leftarrow needle(gen1,gen2)$ +\STATE $Score2 \leftarrow needle(gen1,Reverse(gen2))$ +\STATE $Score3 \leftarrow needle(gen1,Complement(gen2))$ +\STATE $Score4 \leftarrow needle(gen1,Reverse(Complement(gen2)))$ +\IF {$max(Score1, Score2, Score3, Score4)==Score1$} + \RETURN $Score1$ +\ELSIF {$max(Score1, Score2, Score3, Score4)==Score2$} + \RETURN $Score2$ +\ELSIF {$max(Score1, Score2, Score3, Score4)==Score3$} + \RETURN $Score3$ +\ELSIF {$max(Score1, Score2, Score3, Score4)==Score4$} + \RETURN $Score4$ +\ENDIF +\end{algorithmic} +\end{algorithm} + \subsection{Visualizing Relationships} The goal here is to visualizing the results by build a tree of evolution. The system can produce this tree automatically by using Dot graphs package\cite{gansner2002drawing} from Graphviz library and all information available in a database. Core genes generated with their genes can be very important information in the tree, because they can viewed as an ancestor information for two genomes or more. Further more, each node represents a genome or core as \textit{(Genes count:Family name, Scientific names, Accession number)}, Edges represent numbers of lost genes from genomes-core or core-core relationship. The number of lost genes here can represent an important factor for evolution, it represents how much lost of genes for the species in same or different families. By the principle of classification, small number of gene lost among species indicate that those species are close to each other and belong to same family, while big genes lost means that species is far to be in the same family. To see the picture clearly, Phylogenetic tree is an evolutionary tree generated also by the system. 
Generating this tree is based on the distances among genes sequences. There are many resources to build such tree (for example: PHYML\cite{guindon2005phyml}, RAxML{\cite{stamatakis2008raxml,stamatakis2005raxml}, BioNJ , and TNT\cite{goloboff2008tnt}}. We consider to use RAxML\cite{stamatakis2008raxml,stamatakis2005raxml} to generate this tree. @@ -209,4 +255,5 @@ This method summarized in the following steps:\\ \item Predict quality genomes: the process is to pick a genome annotation from two techniques, extracting all common genes based on genes names, then applying Needle-man wunch algorithm to align the two sequences based on a specific threshold. If the alignment score pass the threshold, then this gene will removed from the competition and store it in quality genome by saving its name with the largest gene sequence with respect to start and end codons. All quality genomes will store in the form of GenVision file format. \item Extract Core genes: from the above two steps, we will have new genomes with quality genes, ofcourse, we have some genes lost here, because dogma produced tRNA and rRNA genes while NCBI did not generate them and vise-versa. Using first method to extract core genes will be sufficient because we already check their sequences. \item Display tree: An evolution tree then will be display based on the intersections of quality genomes. -\end{enumerate} \ No newline at end of file +\end{enumerate} +\pagebreak \ No newline at end of file -- 2.39.5
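The geneChk routine added by this patch (Algorithm \ref{Alg3:genechk}) could be sketched in Python roughly as follows. This is only an illustration under stated assumptions, not the author's implementation: the patch does not say which `needle` implementation or scoring scheme is used, so the function `needleman_wunsch_score` and its match, mismatch, and gap values are hypothetical stand-ins. The raw score it returns is therefore not directly comparable to the threshold of 65 in Algorithm \ref{Alg3:thirdM}.

```python
def needleman_wunsch_score(a, b, match=1, mismatch=-1, gap=-1):
    """Global alignment score with a linear gap penalty.

    The scoring values are illustrative assumptions, not taken from the patch.
    """
    # prev[j] holds the best score for aligning the processed prefix of a with b[:j].
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]
        for j in range(1, len(b) + 1):
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            curr.append(max(diag, prev[j] + gap, curr[j - 1] + gap))
        prev = curr
    return prev[-1]


# Translation table for DNA bases; other characters are left unchanged.
_COMPLEMENT = str.maketrans("ACGTacgt", "TGCAtgca")


def complement(seq):
    """Return the base-wise complement of a DNA sequence."""
    return seq.translate(_COMPLEMENT)


def gene_chk(gen1, gen2, score=needleman_wunsch_score):
    """Best score of gen1 against gen2 in the four orientations of Algorithm geneChk."""
    candidates = (
        gen2,                    # as given
        gen2[::-1],              # reverse
        complement(gen2),        # complement
        complement(gen2)[::-1],  # reverse complement
    )
    return max(score(gen1, candidate) for candidate in candidates)
```

Returning `max(...)` directly is equivalent to the IF/ELSIF chain in the algorithm, which returns whichever of the four scores equals the maximum. The `score` parameter is there so that a different aligner can be substituted for the stand-in.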