Paper2/Mixed.tex

   1
   2 %
   3 % \subsubsection{Processing annotated genomes from NCBI}
   4 %
   5 % The objective here is to generate sets  of genes from each  genome so that
   6 % genes are organized  without any duplication.  The input  is a list of
   7 % chloroplasts annotated genomes downloaded from the NCBI website. All genomes,
   8 % which consist in collections of
   9 % protein  coding sequences~\cite{parra2007cegma,RDOGMA},
  10 % are stored as \textit{fasta} files.   To be able to build the set of
  11 % core    genes,     we    need    to     preprocess    these    genomes
  12 % using  \textit{BioPython}  package \cite{chapman2000biopython}.
  13 % %
  14 % First of all, we starts by  converting  each  genome from  \textit{fasta}  file  to
  15 % GenVision \cite{geneVision}  format from DNASTAR. Each  genome is thus
  16 % converted in a list of genes, containing both gene names and occurrences.
  17 % Gene
  18 % name duplication can be accumulated during the treatment of a genome.
  19 % %
  20 % These  duplication  come   from  gene  fragments  (\emph{e.g.}   gene
  21 % fragments treated  with NCBI) and  from chloroplast DNA  sequences. To
  22 % ensure that  all the  duplication are removed,  each list of  gene is
  23 % translated  into a  set of  genes.
  24 % % Note that  NCBI genome  annotation
  25 % produces genes except \textit{Ribosomal (rRNA)} genes.
  26 %
  27
  28 %\subsubsection{Processing genomes annotated by DOGMA}
  29
  30 % Protein  coding genes are  identified in an  input genome
  31 %using sequence similarity of genes  in DOGMA database.  In addition in
  32 %comparison   with   NCBI    annotation   tool,
  33 % It   can   detect
  34 % both \textit{transfer RNAs (tRNA)} and \textit{ribosomal RNAs (rRNA)}.
  35 %
  36 % The DOGMA  annotation process  is divided into  two tasks.   First, we
  37 % manually annotate chloroplast genomes using DOGMA web tool. The output
  38 % of this step is supposed to  be a collection of coding genes files for
  39 % each genome, organized in GeneVision file. The second task is to solve
  40 % the  gene   duplication  problem  and   therefore  we  have   used  two
  41 % methods.
  42
  43 \begin{figure}[H]
  44 \centering
  45 \includegraphics[scale=0.3]{gensim}
  46 \caption{Part of the implementation of the second method, compaire the common genes from NCBI and DOGMA.}
  47 \label{Meth2:gensim}
  48 \end{figure}
  49
  50 \color{red}The second approach in this paper is an enhasement of the ICM\textit{Intersection Core Matrix} proposed in \cite{Alkindy2014} by considering gene names to find core genome. Based on gene names spelling, When they realizing simple homogenization of names provided by NCBI, they miss core genes which have slightly different name formats. \color{black} To enlarge the size of the core genome, to be as close as possible to the true natural one, we propose to integrate a similarity distance on gene names. Each similarity will be computed between a name from DOGMA, which operates as a reference here, and a name from NCBI as shown in figure~\ref{Meth2:gensim}.
  51
  52 The proposed distance  is the Levenshtein one, which is close to the Needleman-Wunsch, except that gap opening and extension penalties are equal. The same name is then set to sequences whose NCBI names are close according to this edit distance.
  53
  54 The risk, by doing so, is to merge genes that are different but whose names are similar (for instance, ND4 and ND4L are two different mitochondrial genes but with similar names). The solution is thus to compare, in a second stage, the similarity of DNA sequences too (with a Needleman-Wunsch global alignment), and to simply ignore the gene if this similarity is below a given threshold.
  55
  56 By doing so, the second approach is designed, which takes the fundamental idea contained in the annotation-based approaches in the previous work. Remark that this approach is simply a deeper processing of the naming stage in the second approach in \cite{Alkindy2014}, the other stages being identical.
  57
  58 The DNA similarity computation raises another problem in the case of DOGMA:
  59 contrary to what happens with gene features in NCBI, genes predicted by
  60 DOGMA may be fragmented in several parts. Such genes are signaled
  61 in the GeneVision file produced by DOGMA, as each fragment is in this file
  62 and with the same gene name. A gene whose name is present at least twice
  63 in the file is thus either a duplicated gene or a fragmented one.
  64 Obviously, fragmented genes must be defragmented before  the DNA similarity computation stage (remark that such a defragmentation has already been realized on NCBI website). As the orientation of each fragment is given in the GeneVision output, this defragmentation consists in concatenating all the possible permutations, and only keeping the permutation with the best similarity score to other sequences having the same gene name, if this score is larger than the given threshold.
  65
  66 To put it in a nutshell, the genomes list of gene names are firstly updated in this third approach, following the process detailed in
  67 Algorithm~\ref{Alg3:thirdM}, while Algorithm~\ref{Alg3:genechk} outlines
  68 the \emph{geneChk} subroutine. These updated genomes are secondly sent to
  69 Algorithm~\cite{Alkindy2014}, which will produce the desired core genomes, see
  70 Figure~\ref{wholesystem} for an updated pipeline.
  71
  72 %The first method, based  on gene name, translates each genome
  73 % %into a set  of genes without duplicates.
  74 % The  second method avoid gene
  75 % duplication  through a  defragment  process. In  each iteration,  this
  76 % process  starts by taking  a gene  from gene  list, searches  for gene
  77 % duplication, if a duplication is found, it looks on the orientation of
  78 % the  fragment sequence.   If it  is positive  it appends  directly the
  79 % sequence to  gene files.  Otherwise reverse  complement operations are
  80 % applied  on the sequence,  which is  then also  append to  gene files.
  81 % Finally, a  check for missing start  and stop codons  is performed. At
  82 % the  end  of  the  annotation  process,  all  the  genomes  are  fully
  83 % annotated,  their   genes  are  defragmented,  and   gene  counts  are
  84 % available.
  85 % Remark that there is no gene duplication with gene annotations
  86 % from DOGMA after applying gene de-fragmentation process. In
  87 % fact, genome annotation with DOGMA can be the key difference when extracting core genes.
  88 %
  89 %
  90 %
  91 %
  92 % \subsubsection{Preprocessing}
  93 %
  94 % In order  to extract  core genomes in  a suitable manner,  the genomic
  95 % data are preprocessed with two methods: on the one hand a method based
  96 % on gene  name and count,  and on  the other hand  a method based  on a
  97 % sequence quality control test.
  98 %
  99 % In the first method, we extract  a list of genes from each chloroplast
 100 % genome.  Then we store this list of genes in the database under genome
 101 % nam and  genes counts can be  extracted by a  specific length command.
 102 % The \textit{Intersection  Core Matrix}, described  in next subsection,
 103 % is then  computed to  extract the core  genes.  The problem  with this
 104 % method can be stated as follows: how can we ensure that the gene which
 105 % is  predicted in  core genes  is the  same gene  in leaf  genomes? The
 106 % answer  to this problem  is that  if the  sequences of  any gene  in a
 107 % genome annotated  from DOGMA  and NCBI are  similar with respect  to a
 108 % given  threshold,  the method is operational when the sequences are not similar. The problem of attribution of a sequence to a gene in the core genome come to light.
 109
 110 % The second method is based on  the underlying idea that it is possible to predict the the best annotated  genome by merging the annotated  genomes from NCBI
 111 % and DOGMA according to a quality test on genes names and sequences. To
 112 % obtain all  quality genes  of each genome,  we consider  the following
 113 % hypothesis: any gene  will appear in the predicted  genome if and only
 114 % if the  annotated genes  in NCBI and  DOGMA pass a  specific threshold
 115 % of  \textit{quality  control test}.   In  fact,  the Needle-man  Wunch
 116 % algorithm  is applied  to compare  both  sequences with  respect to  a
 117 % threshold. If  the alignment  score is above  the threshold,  then the
 118 % gene will be  retained in the predicted genome,  otherwise the gene is
 119 % ignored.   Once    the   prediction   of   all    genomes   is   done,
 120 % the \textit{Intersection Core Matrix} is computed on these new genomes
 121 % to extract core genes, as explained in Algorithm \ref{Alg3:thirdM}.
 122 %
 123
 124 \begin{algorithm}[H]
 125 \caption{Maximum similarity score between two sequences(geneChk)}
 126 \label{Alg3:genechk}
 127 \begin{algorithmic}
 128 \REQUIRE $g1,g2 \leftarrow \text{NCBI gene sequence, DOGMA gene sequence}$
 129 \ENSURE $\text{Maximum similarity score}$
 130 \STATE $score1 \leftarrow needle(g1,g2)$
 131 \STATE $score2 \leftarrow needle(g1,Reverse(Complement(g2)))$
 132 \RETURN $max(score1,score2)$
 133 \end{algorithmic}
 134 \end{algorithm}
 135
 136
 137 \begin{algorithm}[H]
 138 \tiny
 139 \caption{Extract new genome based on gene quality test}\label{Alg3:thirdM}
 140 \begin{algorithmic}
 141 \REQUIRE {$Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 60, RNGenes \leftarrow \text{[ ]}, RDGenes \leftarrow \text{[ ]},PNGenes \leftarrow \text{[ ]}, PDGenes \leftarrow \text{[ ]}$}
 142 \ENSURE $geneList \leftarrow \text{Quality genes}$
 143 \FOR{$\text{gene in NCBI genes of Gname}$}
 144     \IF {$\text{gene in RNGenes}$}
 145         \STATE $dir(NCBI\_Genes) \leftarrow savePermutation(gene)$
 146         \STATE $PNGenes \leftarrow gene$
 147     \ELSE
 148         \STATE $RNGenes \leftarrow gene$
 149     \ENDIF
 150 \ENDFOR
 151 \FOR{$\text{gene in Dogma genes of Gname}$}
 152     \IF {$\text{gene in RDGenes}$}
 153         \STATE $dir(Dogma\_Genes) \leftarrow savePermutation(gene)$
 154         \STATE $PDGenes \leftarrow gene$
 155     \ELSE
 156         \STATE $RDGenes \leftarrow gene$
 157     \ENDIF
 158 \ENDFOR
 159 \STATE $geneList=\text{empty list}$
 160 \STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
 161 \FOR{$\text{gene in common}$}
 162     \STATE $\text{scores} \leftarrow \text{[ ]}$
 163     \IF {$\text{gene NOT in PNGenes AND gene NOT in PDGenes}$}
 164                 \STATE \dots
 165         \STATE $scores \leftarrow geneChk(g1,g2)$
 166     \ELSIF {$\text{gene in PNGenes AND NOT gene in PDGenes}$}
 167         \STATE $PGene \leftarrow loadPermutations('N',gene)$
 168         \dots
 169         \FOR {$\text{X in PGene}$}
 170                 \STATE \dots
 171                 \STATE $scores \leftarrow geneChk(g1,g2)$
 172             \ENDFOR
 173          \ELSIF {$\text{gene in PDGenes} AND \text{gene NOT in PNGenes}$}
 174             \STATE $PGene \leftarrow loadPermutations('D',gene)$
 175             \STATE \dots
 176             \FOR {$\text{X in PGene}$}
 177                 \STATE \dots
 178                 \STATE $scores \leftarrow geneChk(g1,g2)$
 179             \ENDFOR
 180     \ELSIF {$\text{gene in PDGenes} AND \text{gene in PNGenes}$}
 181             \FOR {$\text{X in loadPermutations('N',gene)}$}
 182                 \FOR {$\text{Y in loadPermutations('D',gene)}$}
 183                         \STATE \dots
 184                         \STATE $scores \leftarrow geneChk(g1,g2)$
 185                 \ENDFOR
 186             \ENDFOR
 187 \STATE $score \leftarrow max(scores)$
 188 \IF {$score > Threshold$}
 189     \STATE $geneList \leftarrow gene$
 190 \ENDIF
 191 \ENDIF
 192 \ENDFOR
 193 \RETURN $geneList$
 194 \end{algorithmic}
 195 \end{algorithm}
 196
 197 %\begin{algorithm}[H]
 198 %\tiny
 199 %\caption{Extract new genome based on gene quality test}\label{Alg3:thirdM}
 200 %\begin{algorithmic}
 201 %\REQUIRE {$Gname \leftarrow \text{Genome Name}, Threshold \leftarrow 60, RNGenes \leftarrow \text{[ ]}, RDGenes \leftarrow \text{[ ]},PNGenes \leftarrow \text{[ ]}, PDGenes \leftarrow \text{[ ]}$}
 202 %\ENSURE $geneList \leftarrow \text{Quality genes}$
 203 %\FOR{$\text{gene in NCBI genes of Gname}$}
 204 %    \IF {$\text{gene in RNGenes}$}
 205 %        \STATE $dir(NCBI\_Genes) \leftarrow savePermutation(gene)$
 206 %        \STATE $PNGenes \leftarrow gene$
 207 %    \ELSE
 208 %       \STATE $RNGenes \leftarrow gene$
 209 %    \ENDIF
 210 %\ENDFOR
 211 %\FOR{$\text{gene in Dogma genes of Gname}$}
 212 %    \IF {$\text{gene in RDGenes}$}
 213 %        \STATE $dir(Dogma\_Genes) \leftarrow savePermutation(gene)$
 214 %        \STATE $PDGenes \leftarrow gene$
 215 %    \ELSE
 216 %       \STATE $RDGenes \leftarrow gene$
 217 %    \ENDIF
 218 %\ENDFOR
 219 %\STATE $geneList=\text{empty list}$
 220 %\STATE $common=set(dir(NCBI\_Genes)) \cap set(dir(Dogma\_Genes))$
 221 %\FOR{$\text{gene in common}$}
 222 %    \STATE $\text{scores} \leftarrow \text{[ ]}$
 223 %    \IF {$\text{gene NOT in PNGenes AND gene NOT in PDGenes}$}
 224 %        \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
 225 %        \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
 226 %        \STATE $score \leftarrow geneChk(g1,g2)$
 227 %        \IF {$score > Threshold$}
 228 %            \STATE $geneList \leftarrow gene$
 229 %        \ENDIF
 230 %    \ELSIF {$\text{gene in PNGenes AND NOT gene in PDGenes}$}
 231 %        \STATE $PGene \leftarrow loadPermutations('N',gene)$
 232 %        \STATE $g2 \leftarrow open(Dogma\_Genes(gene)).read()$
 233 %        \FOR {$\text{X in PGene}$}
 234 %               \STATE $g1 \leftarrow open(NCBI\_Genes(X)).read()$
 235 %               \STATE $scores \leftarrow geneChk(g1,g2)$
 236 %           \ENDFOR
 237 %           \STATE $score \leftarrow max(scores)$
 238 %           \IF {$score > Threshold$}
 239 %                   \STATE $geneList \leftarrow gene$
 240 %           \ENDIF
 241 %       \ELSIF {$\text{gene in PDGenes} AND \text{gene NOT in PNGenes}$}
 242 %           \STATE $PGene \leftarrow loadPermutations('D',gene)$
 243 %           \STATE $g1 \leftarrow open(NCBI\_Genes(gene)).read()$
 244 %           \FOR {$\text{X in PGene}$}
 245 %               \STATE $g2 \leftarrow open(Dogma\_Genes(X)).read()$
 246 %               \STATE $scores \leftarrow geneChk(g1,g2)$
 247 %               \STATE $score \leftarrow max(scores)$
 248 %           \ENDFOR
 249 %           \IF {$score > Threshold$}
 250 %               \STATE $geneList \leftarrow gene$
 251 %           \ENDIF
 252 %    \ELSIF {$\text{gene in PDGenes} AND \text{gene in PNGenes}$}
 253 %           \FOR {$\text{X in loadPermutations('N',gene)}$}
 254 %               \FOR {$\text{Y in loadPermutations('D',gene)}$}
 255 %                       \STATE $g1 \leftarrow open(NCBI\_Genes(X)).read()$
 256 %                   \STATE $g2 \leftarrow open(Dogma\_Genes(Y)).read()$
 257 %                       \STATE $scores \leftarrow geneChk(g1,g2)$
 258 %               \ENDFOR
 259 %           \STATE $score \leftarrow max(scores)$
 260 %           \IF {$score > Threshold$}
 261 %               \STATE $geneList \leftarrow gene$
 262 %           \ENDIF
 263 %           \ENDFOR
 264 %       \ENDIF
 265 %\ENDFOR
 266 %\RETURN $geneList$
 267 %\end{algorithmic}
 268 %\end{algorithm}
 269
 270
 271
 272
 273 %
 274 % \textbf{geneChk} is a subroutine used to find the best similarity score between
 275 % two gene sequences after applying operations like \textit{reverse}, {\it complement},
 276 % and {\it reverse complement}.
 277
 278
 279