% Paper2/implementation.tex

\color{red}
All algorithms were implemented in Python and run on a personal computer
with Ubuntu~12.04, 6~GiB of memory, and a quad-core Intel Core~i5
processor clocked at 2.5~GHz. %All the programs can be downloaded at \url{http://......} .
%genes  from large  amount of  chloroplast  genomes.

\begin{center}
\begin{table}[H]
\caption{Type of annotation, execution time, and core genes.}\label{Etime}
{\scriptsize
\begin{tabular}{p{2cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.2cm}}
\hline\hline
 Method & \multicolumn{2}{c}{Annotation} & \multicolumn{2}{c}{Features} & \multicolumn{2}{c}{Exec. time (min.)} & \multicolumn{2}{c}{Core genes} & \multicolumn{2}{c}{Bad genomes} \\
~ & N & D & Name & Seq & N & D & N & D & N & D \\
\hline
Gene prediction & $\surd$ & $\surd$ & - & $\surd$ & 1.7 & - & ? & - & 0 & -\\[0.5ex]
%Gene Features & $\surd$ & $\surd$ & $\surd$ & - & 4.98 & 1.52 & 28 & 10 & 1 & 0\\[0.5ex]
Gene Quality & $\surd$ & $\surd$ & $\surd$ & $\surd$ & \multicolumn{2}{c}{$\simeq$3 days + 1.29} & \multicolumn{2}{c}{4} & \multicolumn{2}{c}{1}\\[1ex]
\hline
\end{tabular}
}
\end{table}
\end{center}

\vspace{-1cm}

Table~\ref{Etime} presents, for each method, the annotation type, the
execution time, and the number of core genes. We use the following
notations: \textbf{N} denotes NCBI, \textbf{D} denotes DOGMA,
and \textbf{Seq} stands for sequence. The first two columns ({\it Annotation})
indicate the algorithm used to annotate the chloroplast genomes. The next two
columns ({\it Features}) indicate the kind of gene feature used to extract core
genes: gene name, gene sequence, or both. It can be seen that
almost all methods need a low {\it Execution time}, expressed in minutes, to extract core genes
from the large set of chloroplast genomes. Only the gene quality method requires
several days of computation (about 3--4 days) for sequence comparisons. However,
once the quality genomes are constructed, it takes only 1.29~minutes to
extract the core genes. These low execution times allowed us to run these
methods on a personal computer rather than on mainframes
or parallel computers. The lowest execution time, 1.52~minutes,
is obtained with the second method using DOGMA annotations. The number
of {\it Core genes} is the number of genes in the last core
genome. The main goal is to find the largest set of core genes that reflects the
biological background of chloroplasts. With NCBI we obtain 28 genes for
96 genomes, compared with 10 genes for 97 genomes with
DOGMA. Unfortunately, the distribution of genomes in the core tree obtained with NCBI
annotations does not reflect a sound biological perspective, whereas with
DOGMA the distribution of genomes is biologically relevant. A few genomes may destroy the core genome because they have a
low number of genes in the intersection. More precisely, \textit{NC\_012568.1 Micromonas pusilla} is the only genome that destroys the core genome with NCBI
annotations, for both the gene features and gene quality methods.

The second important factor is the amount of memory necessary for each
method. Table~\ref{mem} shows the memory usage of each method.
In this table, the values are given in megabytes
and \textit{gV} stands for the genevision~file~format. We can notice that
the amount of memory used is relatively low for all methods
and is available on any personal computer. The different values also
show that the gene features method based on DOGMA annotations has the
most reasonable memory usage, except when extracting core
sequences. The third method gives the lowest values if the
quality genomes are already available; otherwise it consumes far more
memory. Moreover, the amount of memory used by the third method also
depends on the size of each genome.


\begin{table}[H]
\centering
\caption{Memory usage (in MB) of each method}\label{mem}
\tabcolsep=0.11cm
{\scriptsize
\begin{tabular}{p{2.5cm}@{\hskip 0.1mm}p{1.5cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}}
\hline\hline
Method& & Load Gen. & Conv. gV & Read gV & ICM & Core tree & Core Seq. \\
\hline
Gene prediction & NCBI & 108 & - & - & - & - & -\\
%\multirow{2}{*}{Gene Features} & NCBI & 15.4 & 18.9 & 17.5 & 18 & 18 & 28.1\\
              %& DOGMA& 15.3 & 15.3 & 16.8 & 17.8 & 17.9 & 31.2\\
Gene Quality  & ~ & 15.3 & $\le$3~GB & 16.1 & 17 & 17.1 & 24.4\\
\hline
\end{tabular}
}
\end{table}
\color{black}
%\begin{figure}[H]
%  \centering \includegraphics[width=0.75\textwidth]{Whole_system} \caption{Overview
%  of the pipeline, third approach}\label{wholesystem}
%\end{figure}

\begin{figure}[!ht]
    \subfloat[Sizes of core genome\label{subfig-1:core}]{%
      \includegraphics[width=0.5\textwidth]{coregenome}
    }
    \hfill
    \subfloat[Sizes of pan genome\label{subfig-2:pan}]{%
      \includegraphics[width=0.5\textwidth]{pangenome}
    }
    \caption{Sizes of core and pan genomes for the first and second methods.}
    \label{fig:sizes of core and pan}
  \end{figure}
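As a reading aid for Figure~\ref{fig:sizes of core and pan}, the following sketch uses the usual set-based definitions of core and pan genomes. The symbols $G_i$, $C$, and $P$ are notation introduced here for illustration and are not taken from the implementation, whose construction of the core genome may differ in its details. If $G_i$ denotes the set of genes of genome $i$ among $n$ genomes, then
\[
C = \bigcap_{i=1}^{n} G_i, \qquad
P = \bigcup_{i=1}^{n} G_i, \qquad
|C| \le \min_{1 \le i \le n} |G_i| \le \max_{1 \le i \le n} |G_i| \le |P| .
\]
Under this reading, a single genome that shares few genes with the others is enough to shrink $C$ sharply, whereas adding genomes can only leave $P$ unchanged or enlarge it.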


\begin{figure}[!ht]
    \subfloat[Gene coverage of NCBI genomes\label{Cover:NCBI}]{%
      \includegraphics[width=0.5\textwidth]{cover_ncbi}
    }
    \hfill
    \subfloat[Gene coverage of DOGMA genomes\label{cover:dogma}]{%
      \includegraphics[width=0.5\textwidth]{cover_dogma}
    }
    \caption{Gene coverage comparison for NCBI and DOGMA annotations (second method).}
    \label{fig:gene coverage}
  \end{figure}

\color{red}
Figure~\ref{fig:sizes of core and pan} shows the sizes of the core and pan genomes produced by the two methods. Figure~\ref{subfig-1:core} gives the number of predicted core genes; note that a larger core genome does not necessarily mean better genes, since we are looking for genes that are biologically meaningful. The core genes produced by the first method, especially with DOGMA annotations, reflect this biological meaning; the reason is explained later in the discussion section. In Figure~\ref{subfig-2:pan}, the size of the pan genome obtained with the second method remains steady across the different thresholds, whereas with the first method the number of pan genes increases with the threshold.

Furthermore, we computed the correlation coefficient for the second method: it is $0.97$ with the DOGMA annotations and $0.69$ with the NCBI annotations.

\color{black}
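As an illustration, and assuming that the correlation coefficient mentioned above is the Pearson coefficient (the exact formula is not specified here), it can be written as follows, where $x_k$ and $y_k$ are the $m$ paired values being compared (notation introduced for this sketch only):
\[
r = \frac{\sum_{k=1}^{m} (x_k - \bar{x})(y_k - \bar{y})}
         {\sqrt{\sum_{k=1}^{m} (x_k - \bar{x})^2}\,\sqrt{\sum_{k=1}^{m} (y_k - \bar{y})^2}},
\qquad
\bar{x} = \frac{1}{m}\sum_{k=1}^{m} x_k, \quad
\bar{y} = \frac{1}{m}\sum_{k=1}^{m} y_k .
\]
With this definition $r$ lies in $[-1, 1]$, so the value $0.97$ reported for DOGMA would indicate a nearly linear relationship, while $0.69$ for NCBI would indicate a noticeably weaker one.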