3 % \subsubsection{Processing annotated genomes from NCBI}
5 % The objective here is to generate sets of genes from each genome so that
6 % genes are organized without any duplication. The input is a list of
7 % chloroplasts annotated genomes downloaded from the NCBI website. All genomes,
8 % which consist in collections of
9 % protein coding sequences~\cite{parra2007cegma,RDOGMA},
10 % are stored as \textit{fasta} files. To be able to build the set of
11 % core genes, we need to preprocess these genomes
12 % using \textit{BioPython} package \cite{chapman2000biopython}.
14 % First of all, we starts by converting each genome from \textit{fasta} file to
15 % GenVision \cite{geneVision} format from DNASTAR. Each genome is thus
16 % converted in a list of genes, containing both gene names and occurrences.
18 % name duplication can be accumulated during the treatment of a genome.
20 % These duplication come from gene fragments (\emph{e.g.} gene
21 % fragments treated with NCBI) and from chloroplast DNA sequences. To
22 % ensure that all the duplication are removed, each list of gene is
23 % translated into a set of genes.
24 % % Note that NCBI genome annotation
25 % produces genes except \textit{Ribosomal (rRNA)} genes.
28 %\subsubsection{Processing genomes annotated by DOGMA}
30 % Protein coding genes are identified in an input genome
31 %using sequence similarity of genes in DOGMA database. In addition in
32 %comparison with NCBI annotation tool,
34 % both \textit{transfer RNAs (tRNA)} and \textit{ribosomal RNAs (rRNA)}.
36 % The DOGMA annotation process is divided into two tasks. First, we
37 % manually annotate chloroplast genomes using DOGMA web tool. The output
38 % of this step is supposed to be a collection of coding genes files for
39 % each genome, organized in GeneVision file. The second task is to solve
40 % the gene duplication problem and therefore we have used two
45 \includegraphics[scale=0.3]{gensim}
46 \caption{Part of the implementation of the second method, compaire the common genes from NCBI and DOGMA.}
50 \color{red}The second approach in this paper is an enhasement of the ICM\textit{Intersection Core Matrix} proposed in \cite{Alkindy2014} by considering gene names to find core genome. Based on gene names spelling, When they realizing simple homogenization of names provided by NCBI, they miss core genes which have slightly different name formats. \color{black} To enlarge the size of the core genome, to be as close as possible to the true natural one, we propose to integrate a similarity distance on gene names. Each similarity will be computed between a name from DOGMA, which operates as a reference here, and a name from NCBI as shown in figure~\ref{Meth2:gensim}.
52 The proposed distance is the Levenshtein one, which is close to the Needleman-Wunsch, except that gap opening and extension penalties are equal. The same name is then set to sequences whose NCBI names are close according to this edit distance.
54 The risk, by doing so, is to merge genes that are different but whose names are similar (for instance, ND4 and ND4L are two different mitochondrial genes but with similar names). The solution is thus to compare, in a second stage, the similarity of DNA sequences too (with a Needleman-Wunsch global alignment), and to simply ignore the gene if this similarity is below a given threshold.
56 By doing so, the second approach is designed, which takes the fundamental idea contained in the annotation-based approaches in the previous work. Remark that this approach is simply a deeper processing of the naming stage in the second approach in \cite{Alkindy2014}, the other stages being identical.
58 The DNA similarity computation raises another problem in the case of DOGMA:
59 contrary to what happens with gene features in NCBI, genes predicted by
60 DOGMA may be fragmented in several parts. Such genes are signaled
61 in the GeneVision file produced by DOGMA, as each fragment is in this file
62 and with the same gene name. A gene whose name is present at least twice
63 in the file is thus either a duplicated gene or a fragmented one.
64 Obviously, fragmented genes must be defragmented before the DNA similarity computation stage (remark that such a defragmentation has already been realized on NCBI website). As the orientation of each fragment is given in the GeneVision output, this defragmentation consists in concatenating all the possible permutations, and only keeping the permutation with the best similarity score to other sequences having the same gene name, if this score is larger than the given threshold.
66 To put it in a nutshell, the genomes list of gene names are firstly updated in this third approach, following the process detailed in
67 Algorithm~\ref{Alg3:thirdM}, while Algorithm~\ref{Alg3:genechk} outlines
68 the \emph{geneChk} subroutine. These updated genomes are secondly sent to
69 Algorithm~\cite{Alkindy2014}, which will produce the desired core genomes, see
70 Figure~\ref{wholesystem} for an updated pipeline.
72 %The first method, based on gene name, translates each genome
73 % %into a set of genes without duplicates.
74 % The second method avoid gene
75 % duplication through a defragment process. In each iteration, this
76 % process starts by taking a gene from gene list, searches for gene
77 % duplication, if a duplication is found, it looks on the orientation of
78 % the fragment sequence. If it is positive it appends directly the
79 % sequence to gene files. Otherwise reverse complement operations are
80 % applied on the sequence, which is then also append to gene files.
81 % Finally, a check for missing start and stop codons is performed. At
82 % the end of the annotation process, all the genomes are fully
83 % annotated, their genes are defragmented, and gene counts are
85 % Remark that there is no gene duplication with gene annotations
86 % from DOGMA after applying gene de-fragmentation process. In
87 % fact, genome annotation with DOGMA can be the key difference when extracting core genes.
92 % \subsubsection{Preprocessing}
94 % In order to extract core genomes in a suitable manner, the genomic
95 % data are preprocessed with two methods: on the one hand a method based
96 % on gene name and count, and on the other hand a method based on a
97 % sequence quality control test.
99 % In the first method, we extract a list of genes from each chloroplast
100 % genome. Then we store this list of genes in the database under genome
101 % nam and genes counts can be extracted by a specific length command.
102 % The \textit{Intersection Core Matrix}, described in next subsection,
103 % is then computed to extract the core genes. The problem with this
104 % method can be stated as follows: how can we ensure that the gene which
105 % is predicted in core genes is the same gene in leaf genomes? The
106 % answer to this problem is that if the sequences of any gene in a
107 % genome annotated from DOGMA and NCBI are similar with respect to a
108 % given threshold, the method is operational when the sequences are not similar. The problem of attribution of a sequence to a gene in the core genome come to light.
110 % The second method is based on the underlying idea that it is possible to predict the the best annotated genome by merging the annotated genomes from NCBI
111 % and DOGMA according to a quality test on genes names and sequences. To
112 % obtain all quality genes of each genome, we consider the following
113 % hypothesis: any gene will appear in the predicted genome if and only
114 % if the annotated genes in NCBI and DOGMA pass a specific threshold
115 % of \textit{quality control test}. In fact, the Needle-man Wunch
116 % algorithm is applied to compare both sequences with respect to a
117 % threshold. If the alignment score is above the threshold, then the
118 % gene will be retained in the predicted genome, otherwise the gene is
119 % ignored. Once the prediction of all genomes is done,
120 % the \textit{Intersection Core Matrix} is computed on these new genomes
121 % to extract core genes, as explained in Algorithm \ref{Alg3:thirdM}.
125 \caption{Maximum similarity score between two sequences(geneChk)}
128 \REQUIRE $g1,g2 \leftarrow \text{NCBI gene sequence, DOGMA gene sequence}$
129 \ENSURE $\text{Maximum similarity score}$
130 \STATE $score1 \leftarrow needle(g1,g2)$
131 \STATE $score2 \leftarrow needle(g1,Reverse(Complement(g2)))$
132 \RETURN $max(score1,score2)$
139 \caption{Extract new genome based on gene quality test}\label{Alg3:thirdM}
141 \REQUIRE {$Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 60, RNGenes \leftarrow \text{[ ]}, RDGenes \leftarrow \text{[ ]},PNGenes \leftarrow \text{[ ]}, PDGenes \leftarrow \text{[ ]}$}
142 \ENSURE $geneList \leftarrow \text{Quality genes}$
143 \FOR{$\text{gene in NCBI genes of Gname}$}
144 \IF {$\text{gene in RNGenes}$}
145 \STATE $dir(NCBI\_Genes) \leftarrow savePermutation(gene)$
146 \STATE $PNGenes \leftarrow gene$
148 \STATE $RNGenes \leftarrow gene$
151 \FOR{$\text{gene in Dogma genes of Gname}$}
152 \IF {$\text{gene in RDGenes}$}
153 \STATE $dir(Dogma\_Genes) \leftarrow savePermutation(gene)$
154 \STATE $PDGenes \leftarrow gene$
156 \STATE $RDGenes \leftarrow gene$
159 \STATE $geneList=\text{empty list}$
160 \STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
161 \FOR{$\text{gene in common}$}
162 \STATE $\text{scores} \leftarrow \text{[ ]}$
163 \IF {$\text{gene NOT in PNGenes AND gene NOT in PDGenes}$}
165 \STATE $scores \leftarrow geneChk(g1,g2)$
166 \ELSIF {$\text{gene in PNGenes AND NOT gene in PDGenes}$}
167 \STATE $PGene \leftarrow loadPermutations('N',gene)$
169 \FOR {$\text{X in PGene}$}
171 \STATE $scores \leftarrow geneChk(g1,g2)$
173 \ELSIF {$\text{gene in PDGenes} AND \text{gene NOT in PNGenes}$}
174 \STATE $PGene \leftarrow loadPermutations('D',gene)$
176 \FOR {$\text{X in PGene}$}
178 \STATE $scores \leftarrow geneChk(g1,g2)$
180 \ELSIF {$\text{gene in PDGenes} AND \text{gene in PNGenes}$}
181 \FOR {$\text{X in loadPermutations('N',gene)}$}
182 \FOR {$\text{Y in loadPermutations('D',gene)}$}
184 \STATE $scores \leftarrow geneChk(g1,g2)$
187 \STATE $score \leftarrow max(scores)$
188 \IF {$score > Threshold$}
189 \STATE $geneList \leftarrow gene$
197 %\begin{algorithm}[H]
199 %\caption{Extract new genome based on gene quality test}\label{Alg3:thirdM}
201 %\REQUIRE {$Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 60, RNGenes \leftarrow \text{[ ]}, RDGenes \leftarrow \text{[ ]},PNGenes \leftarrow \text{[ ]}, PDGenes \leftarrow \text{[ ]}$}
202 %\ENSURE $geneList \leftarrow \text{Quality genes}$
203 %\FOR{$\text{gene in NCBI genes of Gname}$}
204 % \IF {$\text{gene in RNGenes}$}
205 % \STATE $dir(NCBI\_Genes) \leftarrow savePermutation(gene)$
206 % \STATE $PNGenes \leftarrow gene$
208 % \STATE $RNGenes \leftarrow gene$
211 %\FOR{$\text{gene in Dogma genes of Gname}$}
212 % \IF {$\text{gene in RDGenes}$}
213 % \STATE $dir(Dogma\_Genes) \leftarrow savePermutation(gene)$
214 % \STATE $PDGenes \leftarrow gene$
216 % \STATE $RDGenes \leftarrow gene$
219 %\STATE $geneList=\text{empty list}$
220 %\STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
221 %\FOR{$\text{gene in common}$}
222 % \STATE $\text{scores} \leftarrow \text{[ ]}$
223 % \IF {$\text{gene NOT in PNGenes AND gene NOT in PDGenes}$}
224 % \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
225 % \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
226 % \STATE $score \leftarrow geneChk(g1,g2)$
227 % \IF {$score > Threshold$}
228 % \STATE $geneList \leftarrow gene$
230 % \ELSIF {$\text{gene in PNGenes AND NOT gene in PDGenes}$}
231 % \STATE $PGene \leftarrow loadPermutations('N',gene)$
232 % \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
233 % \FOR {$\text{X in PGene}$}
234 % \STATE $g1 \leftarrow open(NCBI\_Genes(X)).read()$
235 % \STATE $scores \leftarrow geneChk(g1,g2)$
237 % \STATE $score \leftarrow max(scores)$
238 % \IF {$score > Threshold$}
239 % \STATE $geneList \leftarrow gene$
241 % \ELSIF {$\text{gene in PDGenes} AND \text{gene NOT in PNGenes}$}
242 % \STATE $PGene \leftarrow loadPermutations('D',gene)$
243 % \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
244 % \FOR {$\text{X in PGene}$}
245 % \STATE $g2 \leftarrow open(Dogma\_Genes(X)).read()$
246 % \STATE $scores \leftarrow geneChk(g1,g2)$
247 % \STATE $score \leftarrow max(scores)$
249 % \IF {$score > Threshold$}
250 % \STATE $geneList \leftarrow gene$
252 % \ELSIF {$\text{gene in PDGenes} AND \text{gene in PNGenes}$}
253 % \FOR {$\text{X in loadPermutations('N',gene)}$}
254 % \FOR {$\text{Y in loadPermutations('D',gene)}$}
255 % \STATE $g1 \leftarrow open(NCBI\_Genes(X)).read()$
256 % \STATE $g2 \leftarrow open(Dogma\_Genes(Y)).read()$
257 % \STATE $scores \leftarrow geneChk(g1,g2)$
259 % \STATE $score \leftarrow max(scores)$
260 % \IF {$score > Threshold$}
261 % \STATE $geneList \leftarrow gene$
274 % \textbf{geneChk} is a subroutine used to find the best similarity score between
275 % two gene sequences after applying operations like \textit{reverse}, {\it complement},
276 % and {\it reverse complement}.