All algorithms were implemented in Python and run on a personal computer with Ubuntu~12.04, 6~GiB of memory, and
a quad-core Intel Core~i5 processor clocked at
2.5~GHz. %All the programs can be downloaded at \url{http://......} .
%genes from large amount of chloroplast genomes.
\begin{table}
\centering
\caption{Type of annotation, execution time, and core genes.}\label{Etime}
\begin{tabular}{p{2cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.25cm}p{0.5cm}p{0.2cm}}
\hline
Method & \multicolumn{2}{c}{Annotation} & \multicolumn{2}{c}{Features} & \multicolumn{2}{c}{Exec. time (min.)} & \multicolumn{2}{c}{Core genes} & \multicolumn{2}{c}{Bad genomes} \\
~ & N & D & Name & Seq & N & D & N & D & N & D \\
\hline
Gene prediction & $\surd$ & $\surd$ & - & $\surd$ & 1.7 & - & ? & - & 0 & -\\[0.5ex]
%Gene Features & $\surd$ & $\surd$ & $\surd$ & - & 4.98 & 1.52 & 28 & 10 & 1 & 0\\[0.5ex]
Gene Quality & $\surd$ & $\surd$ & $\surd$ & $\surd$ & \multicolumn{2}{c}{$\simeq$3 days + 1.29} & \multicolumn{2}{c}{4} & \multicolumn{2}{c}{1}\\[1ex]
\hline
\end{tabular}
\end{table}
Table~\ref{Etime} presents, for each method, the annotation type, the
execution time, and the number of core genes. We use the following
notation: \textbf{N} denotes NCBI, \textbf{D} denotes DOGMA,
and \textbf{Seq} stands for sequence. The first two columns, {\it Annotation},
indicate the algorithm used to annotate the chloroplast genomes. The next two
columns, {\it Features}, indicate the kind of gene feature used to extract core
genes: gene name, gene sequence, or both. Almost all methods need only
a few minutes of {\it Execution time} to extract core genes
from the large set of chloroplast genomes. Only the gene quality method requires
several days of computation (about 3--4 days) for sequence comparisons. However,
once the quality genomes have been constructed, it takes only 1.29~minutes to
extract the core genes. These low execution times allowed us to run the
methods on a personal computer rather than on mainframes
or parallel computers. The lowest execution time, 1.52~minutes,
is obtained with the second method using DOGMA annotations. The number
of {\it Core genes} is the number of genes in the last core
genome. The main goal is to find the largest set of core genes that is
consistent with the biological background of chloroplasts. With NCBI we obtain 28 genes for
96 genomes, compared with 10 genes for 97 genomes with
DOGMA. Unfortunately, the distribution of genomes in the core tree obtained
with NCBI annotations is not biologically meaningful, whereas the
distribution obtained with DOGMA is biologically relevant. A few genomes may destroy the core genome because
they share too few genes with the others. More precisely, \textit{NC\_012568.1 Micromonas pusilla} is the only genome that destroys the core genome with NCBI
annotations, for both the gene features and the gene quality methods.
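
The effect of a single genome on the core genome can be illustrated with a minimal Python sketch. It assumes that the core genome is the plain intersection of the gene sets of all genomes; the function names \texttt{core\_genes} and \texttt{most\_disruptive\_genome}, the genome identifiers, and the toy gene sets are hypothetical and are not taken from our data or programs.

```python
def core_genes(genomes):
    """Return the set of gene names shared by every genome."""
    gene_sets = list(genomes.values())
    if not gene_sets:
        return set()
    return set.intersection(*gene_sets)


def most_disruptive_genome(genomes):
    """Find the genome whose removal enlarges the core genome the most.

    Returns (None, current_core) if no single removal enlarges the core.
    """
    best_name, best_core = None, core_genes(genomes)
    for name in genomes:
        rest = {k: v for k, v in genomes.items() if k != name}
        candidate = core_genes(rest)
        if len(candidate) > len(best_core):
            best_name, best_core = name, candidate
    return best_name, best_core


# Toy input: hypothetical genomes with illustrative gene names only.
genomes = {
    "genome_A": {"rbcL", "psbA", "atpB", "rpoB"},
    "genome_B": {"rbcL", "psbA", "atpB"},
    "genome_C": {"rbcL", "psbA", "atpB", "matK"},
    "genome_D": {"rbcL"},  # shares very few genes with the others
}

print(sorted(core_genes(genomes)))
name, core = most_disruptive_genome(genomes)
print(name, sorted(core))
```

In this toy input, one genome with a very small gene set reduces the core to a single gene, and removing it restores a core of three genes.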
The second important factor is the amount of memory required by each
method. Table~\ref{mem} shows the memory usage of each method.
In this table, the values are given in megabytes
and \textit{gV} denotes the genevision~file~format. The
memory used is relatively low for all methods
and is available on any personal computer. The values also
show that the gene features method based on DOGMA annotations has the
most reasonable memory usage, except when extracting core
sequences. The third method gives the lowest values if the quality
genomes are already available; otherwise it consumes far more
memory. Moreover, the amount of memory used by the third method also
depends on the size of each genome.
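
One possible way to obtain per-step memory figures of this kind in Python is sketched below. It is only an illustration: it uses the standard \texttt{tracemalloc} module, which tracks memory allocated by the Python interpreter rather than the total memory of the process, and it is not necessarily the measurement procedure behind Table~\ref{mem}. The helper \texttt{peak\_memory\_mb} and the stand-in loading step \texttt{load\_genomes} are hypothetical.

```python
import tracemalloc


def peak_memory_mb(func, *args, **kwargs):
    """Run func and return (result, peak traced memory in MB)."""
    tracemalloc.start()
    try:
        result = func(*args, **kwargs)
        _, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return result, peak / 1e6  # bytes -> megabytes


def load_genomes(n_genomes, genes_per_genome):
    """Hypothetical stand-in for a loading step: builds toy gene lists."""
    return [
        [f"gene_{g}" for g in range(genes_per_genome)]
        for _ in range(n_genomes)
    ]


toy_genomes, peak = peak_memory_mb(load_genomes, 10, 100)
print(f"{len(toy_genomes)} toy genomes, peak about {peak:.3f} MB")
```

Wrapping each pipeline step in such a helper gives one value per step, which is the layout used in the table.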
\begin{table}
\centering
\caption{Memory usage (in MB) of each method.}\label{mem}
\begin{tabular}{p{2.5cm}@{\hskip 0.1mm}p{1.5cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}@{\hskip 0.1mm}p{1cm}}
\hline
Method& & Load Gen. & Conv. gV & Read gV & ICM & Core tree & Core Seq. \\
\hline
Gene prediction & NCBI & 108 & - & - & - & - & -\\
%\multirow{2}{*}{Gene Features} & NCBI & 15.4 & 18.9 & 17.5 & 18 & 18 & 28.1\\
%& DOGMA& 15.3 & 15.3 & 16.8 & 17.8 & 17.9 & 31.2\\
Gene Quality & ~ & 15.3 & $\le$3~GB & 16.1 & 17 & 17.1 & 24.4\\
\hline
\end{tabular}
\end{table}
% \centering \includegraphics[width=0.75\textwidth]{Whole_system} \caption{Overview
% of the pipeline, third approach}\label{wholesystem}
\begin{figure}
\centering
\subfloat[Sizes of core genome\label{subfig-1:core}]{%
\includegraphics[width=0.5\textwidth]{coregenome}
}
\subfloat[Sizes of pan genome\label{subfig-2:pan}]{%
\includegraphics[width=0.5\textwidth]{pangenome}
}
\caption{Sizes of core and pan genomes for the first and second methods.}
\label{fig:sizes of core and pan}
\end{figure}
\begin{figure}
\centering
\subfloat[Gene coverage of NCBI genomes\label{Cover:NCBI}]{%
\includegraphics[width=0.5\textwidth]{cover_ncbi}
}
\subfloat[Gene coverage of DOGMA genomes\label{cover:dogma}]{%
\includegraphics[width=0.5\textwidth]{cover_dogma}
}
\caption{Gene coverage comparison between NCBI and DOGMA, second method.}
\label{fig:gene coverage}
\end{figure}
Figure~\ref{fig:sizes of core and pan} shows the sizes of the core and pan genomes produced by the two methods. Figure~\ref{subfig-1:core} shows the sizes of the predicted core genomes; note that a larger core genome does not necessarily mean a better one, since we are looking for genes that are biologically meaningful. The core genes produced by the first method, especially with DOGMA, reflect this biological meaning, for reasons explained later in the discussion section. In Figure~\ref{subfig-2:pan}, the size of the pan genome obtained with the second method remains stable across thresholds, while with the first method the number of pan genes increases as the threshold increases.
Furthermore, we computed the correlation coefficient for the second method: it is $0.97$ for the DOGMA annotations and $0.69$ for the NCBI annotations.
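
For readers who wish to reproduce this kind of computation, the following minimal Python sketch computes a correlation coefficient between two series. It assumes the Pearson product-moment coefficient, which may or may not be the exact formula used above; the function name \texttt{pearson} and the toy series are hypothetical and unrelated to the reported values.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sum of co-deviations and sums of squared deviations.
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Toy series (hypothetical): a perfectly linear relation and its reverse.
increasing = pearson([1, 2, 3, 4], [2, 4, 6, 8])
decreasing = pearson([1, 2, 3, 4], [8, 6, 4, 2])
print(increasing, decreasing)
```

A value close to $1$ indicates a strong positive linear relation between the two series, and a value close to $-1$ a strong negative one.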