X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/chloroplast13.git/blobdiff_plain/023abe68272c9371d78a52331610cfd4c3602c5c..181fe86127b57e8c7df3e97082134e7a5b7b8618:/annotated.tex?ds=inline

diff --git a/annotated.tex b/annotated.tex
index 52ce278..b43666b 100644
--- a/annotated.tex
+++ b/annotated.tex
@@ -266,15 +266,15 @@ core genes with its two genomes parents.
 \begin{algorithmic}
 \REQUIRE $L \leftarrow \text{genomes sets}$
 \ENSURE $B1 \leftarrow \text{Max Core set}$
-\FOR{$i \leftarrow 0:len(L)-1$}
+\FOR{$i \leftarrow 1:len(L)-1$}
 \STATE $score \leftarrow 0$
 \STATE $core1 \leftarrow set(GenomeList[L[i]])$
 \STATE $g1 \leftarrow L[i]$
 \FOR{$j \leftarrow i+1:len(L)$}
 	\STATE $core2 \leftarrow set(GenomeList[L[j]])$
-	\STATE $Core \leftarrow core1 \cap core2$
-	\IF{$len(Core) > score$}
-	\STATE $score \leftarrow len(Core)$
+	\STATE $core \leftarrow core1 \cap core2$
+	\IF{$len(core) > score$}
+	\STATE $score \leftarrow len(core)$
 	\STATE $g2 \leftarrow L[j]$
 	\ENDIF
 \ENDFOR
@@ -295,11 +295,7 @@ names\_Accession number)}. While an edge is labelled with the number of
 lost genes from a leaf genome or an intermediate core gene. Such
 numbers are very interesting because they give information about
 the evolution: how many genes were lost between two species, whether
-they belong to the same family or not. By the principle of
-classification, a small number of genes lost among species indicates
-that those species are close to each other and belong to same family,
-while a large lost means that we have an evolutionary relationship
-between species from different families. To depict the links between
+they belong to the same lineage or not. Phylogenetic relationships are mainly built by comparison of sets of coding and non-coding sequences. Phylogenies of photosynthetic plants are important to assess the origin of chloroplasts (REF) and the modalities of gene loss among lineages. These phylogenies are usually built using fewer than ten chloroplastic genes (REF), and some of them may not have been conserved through evolution in every taxon. Since phylogenetic relationships are more reliably inferred from data matrices that are complete for every included species and that share the same evolutionary history, we selected core genomes for a new investigation of the phylogeny of photosynthetic plants. To depict the links between
 species clearly, we built a phylogenetic tree showing the
 relationships based on the distances among gene sequences. Many
 tools are available to obtain such a tree, for example:
@@ -315,59 +311,108 @@ The procedure used to built a phylogenetic tree is as follows:
 \item For each gene in the core genome, extract its sequence and store it in the database.
 \item Use multiple alignment tools such as (****to be write after see christophe****) to align these sequences with each other.
-\item we use an outer-group genome from cyanobacteria to calculate distances.
-\item Submit the resulting aligned sequences to RAxML program to compute the distances and finally draw the phylogenetic tree.
+\item Use an outgroup genome from cyanobacteria to calculate distances.
+\item Submit the resulting aligned sequences to the RAxML program to compute
+the distances and finally draw the phylogenetic tree.
 \end{enumerate}
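+
+As an illustration, the pairwise intersection at the heart of the
+algorithm given above can be sketched in Python (a minimal sketch,
+assuming a dictionary \texttt{genome\_list} that maps each accession
+number to the set of gene names annotated in that genome; it is not
+the exact code of our pipeline):
+
+\begin{verbatim}
+def max_core_pair(genome_list):
+    """Return the two genomes sharing the largest core, and that core."""
+    accessions = list(genome_list)
+    best_score, best_pair, best_core = 0, None, set()
+    for i, g1 in enumerate(accessions):
+        core1 = set(genome_list[g1])
+        for g2 in accessions[i + 1:]:
+            # Core of two genomes = intersection of their gene sets.
+            core = core1 & set(genome_list[g2])
+            if len(core) > best_score:
+                best_score, best_pair, best_core = len(core), (g1, g2), core
+    return best_pair, best_core
+\end{verbatim}
+
+Such a helper could then be applied repeatedly, replacing the two
+selected genomes by their intersection, to build the successive core
+genomes of the core tree.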
 \begin{figure}[H]
-  \centering \includegraphics[width=0.75\textwidth]{Whole_system}
-  \caption{Overview of the pipeline}\label{wholesystem}
+  \centering \includegraphics[width=0.75\textwidth]{Whole_system} \caption{Overview
+    of the pipeline}\label{wholesystem}
 \end{figure}
 \section{Implementation}
-We implemented the three algorithms using dell laptop model latitude E6430 with 4 GB of memory and Intel core i5 processor of 2.6 Ghz and 3 MB of cash. We built the code using python version 2.7 under ubuntu 12.04 LTS. We also used python packages such as os, Biopython, memory\_profile, re, numpy, time, shutil, and xlsxwriter to extract core genes from large amount of chloroplast genomes. Table \ref{Etime}, show the annotation type, execution time, and the number of core genes for each method:
+
+The different algorithms have been implemented in Python version
+2.7, on a laptop running Ubuntu~12.04~LTS. More precisely, the
+computer is a Dell Latitude E6430 laptop with 6~GiB of memory and
+a quad-core Intel Core~i5 processor with an operating frequency of
+2.5~GHz. Several Python packages such as os, Biopython, memory\_profiler,
+re, numpy, time, shutil, and xlsxwriter were used to extract core
+genes from a large number of chloroplast genomes.
 \begin{center}
-\begin{tiny}
-\begin{table}[H]
-\caption{Type of Annotation, Execution Time, and core genes for each method}\label{Etime}
-\begin{tabular}{p{2.5cm}p{0.5cm}p{0.5cm}p{0.5cm}p{0.5cm}p{0.5cm}p{0.5cm}p{0.5cm}p{0.5cm}p{0.5cm}p{0.2cm}}
+\begin{table}[b]
+\caption{Type of annotation, execution time, and core genes
+for each method}\label{Etime}
+{\scriptsize
+\begin{tabular}{p{2cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.2cm}}
 \hline\hline
- & \multicolumn{2}{c}{Annotation} & \multicolumn{2}{c}{Features} & \multicolumn{2}{c}{E. Time} & \multicolumn{2}{c}{C. genes} & \multicolumn{2}{c}{Bad Gen.} \\
+ Method & \multicolumn{2}{c}{Annotation} & \multicolumn{2}{c}{Features} & \multicolumn{2}{c}{Exec. time (min.)} & \multicolumn{2}{c}{Core genes} & \multicolumn{2}{c}{Bad genomes} \\
 ~ & N & D & Name & Seq & N & D & N & D & N & D \\
 \hline
-Gene prediction & $\surd$ & - & - & $\surd$ & ? & - & ? & - & 0 & -\\[0.5ex]
+Gene prediction & $\surd$ & - & - & $\surd$ & 1.7 & - & ? & - & 0 & -\\[0.5ex]
 Gene Features & $\surd$ & $\surd$ & $\surd$ & - & 4.98 & 1.52 & 28 & 10 & 1 & 0\\[0.5ex]
 Gene Quality & $\surd$ & $\surd$ & $\surd$ & $\surd$ & \multicolumn{2}{c}{$\simeq$3 days + 1.29} & \multicolumn{2}{c}{4} & \multicolumn{2}{c}{1}\\[1ex]
 \hline
 \end{tabular}
+}
 \end{table}
-\end{tiny}
 \end{center}
-In table \ref{Etime}, we show that all methods need low execution time to finish extracting core genes from large chloroplast genomes except in gene quality method where we need about 3-4 days for sequence comparisons to construct quality genomes then it takes just 1.29 minute to extract core genes. This low execution time give us a privilage to use these methods to extract core genes on a personal comuters rather than main frames or parallel computers. In the table, \textbf{N} means NCBI, \textbf{D} means DOGMA, and \textbf{Seq} means Sequence. Annotation is represent the type of algorithm used to annotate chloroplast genome. We can see that the two last methods used the same annotation sources. The execution time is represented the whole time needed to extract core genes in minutes. We can see in the table that the second method specially with DOGMA annotation has the lowest execution time of 1.52 minute. In last method We needs approxemetly three days (this period is depend on the amount of genomes) to finish the operation of extracting quality genomes only, while the execution time will be 1.29 minute if we have quality genomes. The number of core genes is represents the amount of genes in the last core genome. The main goal is to find the maximum core genes that simulate biological background of chloroplasts. With NCBI we have 28 genes for 96 genomes instead of 10 genes with DOGMA for 97 genomes. But the biological distribution of genomes with NCBI in core tree did not reflect good biological perspective. While in the core tree with DOGMA, the distribution of genomes are biologically good. Bad genomes are the number of genomes that destroy core genes because of the low number of gene intersection. \textit{NC\_012568.1 Micromonas pusilla}, is the only genome that observed to destroy the core genome with NCBI based on the method of gene features and in the third method of gene quality. \\
-
-The second important factor is the amount of memory usage in each methodology. Table \ref{mem} show the amounts of memory consumption by each method.
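+
+As an illustration of how gene names can be collected from the NCBI
+annotations with Biopython, a minimal sketch is given below (the file
+names are hypothetical and the code only shows the idea, not the exact
+implementation of our pipeline):
+
+\begin{verbatim}
+from Bio import SeqIO
+
+def gene_names(genbank_path):
+    """Set of gene names annotated in one chloroplast GenBank record."""
+    record = SeqIO.read(genbank_path, "genbank")
+    return set(feature.qualifiers["gene"][0]
+               for feature in record.features
+               if feature.type == "gene" and "gene" in feature.qualifiers)
+
+# Hypothetical usage: the core of two genomes is the intersection
+# of their gene name sets.
+core = gene_names("genome_A.gbk") & gene_names("genome_B.gbk")
+\end{verbatim}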
+\vspace{-1cm}
+
+Table~\ref{Etime} presents for each method the annotation type, the
+execution time, and the number of core genes. We use the following
+notations: \textbf{N} denotes NCBI, \textbf{D} means DOGMA,
+and \textbf{Seq} stands for sequence. The {\it Annotation} columns
+indicate the algorithm used to annotate the chloroplast genomes, while
+the {\it Features} columns give the kind of gene feature used to extract
+core genes: gene name, gene sequence, or both of them. It can be seen
+that almost all methods need a low {\it execution time} to extract core
+genes from this large set of chloroplast genomes. Only the gene quality
+method requires several days of computation (about 3-4 days) for the
+sequence comparisons; once the quality genomes are constructed, it takes
+just 1.29~minutes to extract the core genes. Thanks to these low
+execution times, the methods can be run on a personal computer rather
+than on mainframes or parallel computers. The lowest execution time,
+1.52~minutes, is obtained with the second method using DOGMA
+annotations. The {\it Core genes} columns give the number of genes in
+the final core genome. The main goal is to find the maximum number of
+core genes that reflects the biological background of chloroplasts. With
+NCBI we obtain 28 genes for 96 genomes, instead of 10 genes for 97
+genomes with DOGMA. Unfortunately, the distribution of genomes in the
+core tree obtained with NCBI does not reflect a good biological
+perspective, whereas with DOGMA the distribution of genomes is
+biologically relevant. Finally, {\it Bad genomes} gives the number of
+genomes that destroy core genes because of a too low number of genes in
+the intersection. \textit{NC\_012568.1 Micromonas pusilla} is the only
+genome which destroyed the core genome with NCBI annotations, for both
+the gene features and the gene quality methods.
+
+The second important factor is the amount of memory used by each
+methodology. Table~\ref{mem} shows the memory usage of each
+method. We used a package from PyPI~(\textit{the Python Package
+Index}, located at~{\tt https://pypi.python.org/pypi}) named
+\textit{memory\_profiler} to extract all the values reported in
+Table~\ref{mem}.
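+
+To give an idea of how these values were collected, the following
+minimal sketch shows how \textit{memory\_profiler} can sample the memory
+consumed by one step of the pipeline (the function and file names are
+hypothetical; this is not the exact measurement script we used):
+
+\begin{verbatim}
+from memory_profiler import memory_usage
+
+def load_genomes(paths):
+    """Toy stand-in for one pipeline step (hypothetical)."""
+    return [open(path).read() for path in paths]
+
+# memory_usage samples the memory of the running process (in MiB)
+# while the given call executes; we keep the peak value.
+samples = memory_usage((load_genomes, (["genome_A.gbk", "genome_B.gbk"],), {}))
+print("peak memory: %.1f MiB" % max(samples))
+\end{verbatim}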
+In this table, the values are presented in megabytes (MB)
+and \textit{gV} denotes the genevision file format. We can notice that
+the amount of memory used is relatively low for all methods and fits
+within what is available on any personal computer. The different values
+also show that the gene features method based on DOGMA annotations has
+the most reasonable memory usage, except when extracting core
+sequences. The third method gives the lowest values if we already have
+the quality genomes, otherwise it consumes far more
+memory. Moreover, the amount of memory used by the third method also
+depends on the size of each genome.
 \begin{center}
-\begin{tiny}
 \begin{table}[H]
 \caption{Memory usage (in MB) for each methodology}\label{mem}
+{\scriptsize
 \begin{tabular}{p{2.5cm}p{1.5cm}p{1cm}p{1cm}p{1cm}p{1cm}p{1cm}p{1cm}}
 \hline\hline
-Method& & Load Gen. & Conv. gV & Read gV & ICM & Gen. tree & Core Seq. \\
+Method& & Load Gen. & Conv. gV & Read gV & ICM & Core tree & Core Seq. \\
 \hline
-Gene prediction & ~ & ~ & ~ & ~ & ~ & ~ & ~\\
+Gene prediction & NCBI & 108 & - & - & - & - & -\\
 \multirow{2}{*}{Gene Features} & NCBI & 15.4 & 18.9 & 17.5 & 18 & 18 & 28.1\\
  & DOGMA& 15.3 & 15.3 & 16.8 & 17.8 & 17.9 & 31.2\\
-Gene Quality & ~ & 15.3 & $\le$200 & 16.1 & 17 & 17.1 & 24.4\\
+Gene Quality & ~ & 15.3 & $\le$3G & 16.1 & 17 & 17.1 & 24.4\\
 \hline
 \end{tabular}
+}
 \end{table}
-\end{tiny}
 \end{center}
-We used a package from PyPI~(\textit{the Python Package Index}) where located at~ (https://pypi.python.org/pypi) named \textit{Memory\_profile} to extract all the values in table \ref{mem}. In this table, all the values are presented in mega bytes and \textit{gV} means genevision file format. We see that all memory levels in all methods are reletively low and can be available in any personal computer. All memory values shows that the method of gene features based on DOGMA annotation have the more resonable memory values to extract core genome from loading genomes until extracting core sequences. The third method, gives us the lowest values if we already have the quality genomes, but it will consume high memory locations if we do not have them. Also, the amount of memory locations in the third method vary according to the size of each genome.\\