-The field of Genome annotation pay a lot of attentions where the ability to collect and analysis genomical data can provide strong indicator for the study of life\cite{Eisen2007}. A lot of genome annotation centers present various types of annotations tools (i.e cost-effective sequencing methods\cite{Bakke2009}) on different annotation levels. Two method of gene finding in annotated genome can be categorized as: Alignment-based, composition based, or combination of both\cite{parra2007cegma}. The Alignment-based method is used when we try to predict a coding gene (i.e. Genes that produce proteins) by aligning DNA sequence of gene to the protein of cDNA sequence of homology\cite{parra2007cegma}. This approache also used in GeneWise\cite{birney2004genewise} with known splicing signals. Composition-based mothod (known as \textit{ab initio} is based on a probabilistic model of gene structure to find genes and/or new genes according to the probability gene value, this method like GeneID\cite{parra2000geneid}. In this section, we will consider a new method of finding core genes from large amount of chloroplast genomes, as a solution of the previous method where stated in section two. This method is based on extracting gene features. The question now is how can we have good annotation genome? To answer this question, we need to focusing on studying the annotation accuracy\cite{Bakke2009} of the genome. A general overview of the system is illustrated in Figure \ref{Fig1}.\\
+Genome annotation receives a great deal of attention because the ability to collect and analyse genomic data can provide strong indicators for the study of life\cite{Eisen2007}. Several genome annotation centers and tools (such as \textit{NCBI\cite{Sayers01012011}, Dogma \cite{RDogma}, cpBase \cite{de2002comparative}, CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}}) provide various types of annotation tools (e.g. cost-effective sequencing methods\cite{Bakke2009}) at different annotation levels. Generally, the gene-finding methods used by these centers fall into one of three categories: \textit{alignment-based, composition-based, or a combination of both}\cite{parra2007cegma}. The alignment-based method predicts a coding gene (i.e. a gene that produces a protein) by aligning the DNA sequence of the gene to the protein or cDNA sequence of a homolog\cite{parra2007cegma}. This approach is also used in GeneWise\cite{birney2004genewise}. The composition-based method (known as \textit{ab initio}) relies on a probabilistic model of gene structure and finds genes, including new genes, according to their probability value (e.g. GeneID\cite{parra2000geneid}). In this section, we consider a new method for finding core genes in a large number of chloroplast genomes, as a solution to the problem resulting from the method described in section two. This method is based on extracting gene features. A general overview of the system is illustrated in Figure \ref{Fig1}.\\
\begin{figure}[H]
\centering
\caption{A general overview of the system}\label{Fig1}
\end{figure}
-In Figure 1, we illustrate the general overview of system pipeline: \textit{Database, Genomes annotation, Core extraction, } and \textit{relationships}. We will give a short discussion for each stage in the model in order to understand all core extraction process. Good database (as a first stage) will produce good results, however, many international Banks for nucleotide sequence databases like (GenBank in USA, EMBL-Bank in Europe, and DDBJ in Japon) where exists to store various genomes and DNA species. A lot of Biological tool interact with these databases for (Genome Annotation, Gene extraction, alignments, ... , etc). The database in this model must be taken from any confident data source that store annotated and/or unannotated chloroplast genomes. We will consider GenBank- NCBI database to be our nucleotide sequences database. Annotation (as the second stage) is consider to be the first important task for Extract Gene Features. Thanks to good annotation tool that lead us to extract good gene features. In this paper, two annotation techniques from \textit{NCBI, and Dogma} will be used to extract \textit{genes features}. Extracting Gene feature (as a third stage) can be anything like (genes names, gene sequences, protein sequence,...etc). Our methodologies in this paper will consider gene names, gene counts, and gene sequences for extracting core genes and producing chloroplast evolutionary tree. \\
-In last stage, for achieving our goals with what the biological expert needs, we used the form of (tables, phylogenetic trees, graphs,...,etc) to organize and represent genomes relationships and gene evolution. In addition, comparing these forms with the results from another annotation tool like Dogma\cite{RDogma} for large population of chloroplast genomes that give us biological perspective to the nature of chloroplast evolution. \\
-A Local database attached with each pipe stage used to store all information of extraction process. The output from each stage in our system will be an input to the second stage and so on.
+Figure \ref{Fig1} illustrates the general overview of the system pipeline: \textit{Database, Genomes annotation, Core extraction,} and \textit{relationships}. We briefly discuss each stage of the model in order to explain the whole core extraction process. This work starts with a gene bank database. Many international nucleotide sequence databases (such as \textit{GenBank} \citep{Sayers01012011} in the USA, \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe, and \textit{DDBJ} \cite{sugawara2008ddbj} in Japan) exist to store various genomes and DNA species. Different biological tools analyse and annotate genomes by interacting with these databases to align and extract sequences and to predict genes. The database in this model must be taken from a reliable data source that stores annotated and/or unannotated chloroplast genomes. We use the GenBank-NCBI \citep{Sayers01012011} database as our nucleotide sequence database. Annotation (the second stage) is the first important task for extracting gene features: a good annotation tool leads to good gene features. In this paper, two annotation techniques, from \textit{NCBI and Dogma}, are used to extract \textit{gene features}. A gene feature (extracted in the third stage) can be, for example, a gene name, a gene sequence, or a protein sequence. Our methodology considers gene names, gene counts, and gene sequences for extracting core genes and producing a chloroplast evolutionary tree. \\
+
+In the last stage, to present gene evolution in the form that biological experts need, we use tables, phylogenetic trees, graphs, etc. to organize and represent the relationships between genomes. In addition, comparing these forms with those obtained from another annotation tool for a large population of chloroplast genomes gives us a biological perspective on the nature of chloroplast evolution. \\
+A local database attached to each pipeline stage is used to store all the information of the extraction process. The output of each stage in our system is the input of the next stage.
\subsection{Genomes Samples}
-In this research, we retrieved 107 genomes of Chloroplasts from NCBI. Ninety nine genomes of them were considered to work with. These genomes lies in the 11 type of chloroplast families, as shown in Table \ref{Tab1}. The distribution of genomes illustrated in detail in Table \ref{Tab2}.
-
-\begin{table}[H]
-\caption{distribution on Chloroplast Families}\label{Tab1}
-\centering
-\begin{tabular}{c c}
-\hline\hline
-Family & Genome Counts \\ [0.5ex]
-\hline
-Brown Algae & 11 \\
-Red Algae & 03 \\
-Green Algae & 17 \\
-Angiosperms & 46 \\
-Brypoytes & 03 \\
-Dinoflagellates & 02 \\
-Euglena & 02 \\
-Fern & 05 \\
-Gymnosperms & 07 \\
-Lycopodiophyta & 02 \\
-Haptophytes & 01 \\ [1ex]
-\hline
-\end{tabular}
-\end{table}
+In this research, we retrieved chloroplast genomes from NCBI. Ninety-nine of them were retained for this work. These genomes belong to eleven types of chloroplast families, as shown in Table \ref{Tab1}. The distribution of genomes is given in detail in Table \ref{Tab2}.
\input{population_Table}
\subsection{Genome Annotation Techniques}
-Genome annotation is considered as the second stage in the model pipeline. Many annotation techniques were developed for annotate chloroplast genomes but the problem is that they vary in the number and type of predicting genes (i.e the ability to predict genes and \textit{for example: Transfere RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques from NCBI and Dogma are considered to analyse chloroplast genomes to examine the accuracy of predicted coding genes. Figure \ref{NCBI_annotation}, illstrate two annotation technique.\\
-
-\begin{figure}[H]
-\centering
-\includegraphics[width=0.7\textwidth]{NCBI_annotation}
-\caption{Genome annotation using either NCBI or Dogma}\label{NCBI_annotation}
-\end{figure}
-
-With each annotation model, we provide a quality check class for the flow of chloroplast genomes, as illustrated in figure \ref{NCBI:Annotation}. This class has a direct access to NCBI taxonomy database based on genome accession number to retrieve information for the genome. These information contains \textit{[Scientific name, lineage, Division, taxonomy ID, parentID, and Accession No]}. Examining each genome with this class (i.e based on some parameters), can ignore some genomes from this competition that not match a specific control condition.
+Genome annotation is the second stage in the model pipeline. Many techniques have been developed to annotate chloroplast genomes, but they vary in the number and type of predicted genes (for example, in their ability to predict \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques, from NCBI and Dogma, are considered to analyse chloroplast genomes and to examine the accuracy of the predicted coding genes.
\subsubsection{genome annotation from NCBI}
-The objective from this step is to organize genes, solve genes duplications, and generate sets of genes from each genome. The input to the system is our list of chloroplast genomes, annotated from NCBI\cite{Sayers01012011}. All genomes stored as \textit{.fasta} files include collection of Protein coding genes\cite{parra2007cegma,RDogma}(gene that produce proteins) with its coding sequences.
-As a preparation step to achieve the set of core genes, we need to translate these genomes using \textit{BioPython} package\cite{chapman2000biopython}, and extracting all information needed to find the core genes. A process starts by converting each genome in fasta format to GenVision\cite{geneVision} formats from DNASTAR, and this is not an easy job. The output from this operation is a lists of genes stored in a local database for each genome, their genes names and gene counts. In this stage, we will accumulate some Gene duplications with each genome treated. In other words, duplication in gene name can comes from genes fragments as long as chloroplast DNA sequences. We defines \textit{Identical state} to be the state that each gene present only one time in a genome (i.e Gene has no copy) without considering the position or gene orientation. This state can be reached by filtering the database from redundant gene name. To do this, we have two solutions: first, we made an orthography checking. Orthographe checking is used to merge fragments of a gene to form one gene.
-Second, we convert the list of genes names for each genome (i.e. after orthography check) in the database to be a set of genes names. Mathematically speaking, if $G=\left[g_1,g_2,g_3,g_1,g_3,g_4\right]$ is a list of genes names, by using the definition of a set in mathematics, we will have $set(G)=\{g_1,g_2,g_3,g_4\}$, and $|G|=4$ where $|G|$ is the cardinality number of the set $G$ which represent the number of genes in the set.\\
-The whole process of extracting core genome based on genes names and counts among genomes is illustrate in Figure \ref{NCBI:Annotation}.\\
-
-\begin{figure}[H]
- \centering
- \includegraphics[width=0.7\textwidth]{NCBI_GeneName}
- \caption{NCBI Annotation for Chloroplast genomes}
- \label{NCBI:Annotation}
-\end{figure}
+The objective of this step is to organize genes, resolve gene duplications, and generate a set of genes for each genome. The input to the system is our list of chloroplast genomes annotated by NCBI. All genomes are stored as \textit{.fasta} files, which include the collection of protein coding genes\cite{parra2007cegma,RDogma} (genes that produce proteins) with their coding sequences.
+As a preparation step towards the set of core genes, we analyse these genomes (using the \textit{BioPython} package\cite{chapman2000biopython}) to extract all the information needed to find the core genes. The process starts by converting each genome from fasta format to the GenVision\cite{geneVision} format from DNASTAR. The outputs of this operation are, for each genome, the list of its genes with their names and counts. At this stage, some gene duplications accumulate for each treated genome. In other words, duplicated gene names can come from gene fragments as well as from the chloroplast DNA sequences. We define the \textit{identical state} as the state in which each gene is present only once in a genome (i.e. the gene has no copy), regardless of its position or orientation. This state can be reached by filtering redundant gene names out of the database.
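+
+As a small illustration of the identical state (the gene labels $g_1,\dots,g_4$ are placeholders and not genes taken from our data), suppose that the annotation of one genome yields the list of gene names $G=\left[g_1,g_2,g_3,g_1,g_3,g_4\right]$. Removing the redundant names gives
+$$set(G)=\{g_1,g_2,g_3,g_4\}, \qquad |set(G)|=4,$$
+so this genome would be represented by four distinct genes although its list holds six entries.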
\subsubsection{Genome annotation from Dogma}
-Dogma is an annotation tool developed in the university of Texas by \cite{RDogma} in 2004. Dogma is an abbreviation of \textit{Dual Organellar GenoMe Annotator}\cite{RDogma} for plant chloroplast and animal mitochondrial genomes.
-It has its own database for translated the genome in all six reading frames and query the amino acid sequence database using Blast\cite{altschul1990basic}(i.e Blastx) with various parameters, and to identify protein coding genes\cite{parra2007cegma,RDogma} in the input genome based on sequence similarity of genes in Dogma database. Further more, it can produce the \textit{Transfer RNAs (tRNA)}\cite{RDogma}, and the \textit{Ribosomal RNAs (rRNA)}\cite{RDogma} and verifying their start and end positions rather than NCBI annotation tool. There are no gene duplication with dogma after solving gene fragmentation. \\
+Dogma \cite{RDogma} is an annotation tool developed at the University of Texas in 2004. Dogma stands for \textit{Dual Organellar GenoMe Annotator}; it annotates plant chloroplast and animal mitochondrial genomes.
+It has its own database: it translates the genome in all six reading frames and queries the amino acid sequence database using Blast\cite{altschul1990basic} (i.e. Blastx) with various parameters. Furthermore, it identifies protein coding genes\cite{parra2007cegma,RDogma} in the input genome based on their sequence similarity to genes in the Dogma database. In addition, it can produce the \textit{Transfer RNAs (tRNA)} and the \textit{Ribosomal RNAs (rRNA)} and verifies their start and end positions, unlike the NCBI annotation tool. There is no gene duplication with Dogma once gene fragmentation has been resolved. \\
Genome Anntation with dogma can be the key difference of extracting core genes. In figure \ref{dog:Annotation}, The step of annotation divided into two tasks: First, It starts to annotate complete choloroplast genomes (i.e \textit{Unannotated genomes} from NCBI by using Dogma web tool. The whole annotation process was done manually. The output from dogma is considered to be collection of coding genes file for each genome in the form of GeneVision\cite{geneVision} file format.\\
-Where the second task is to solve gene fragments. Defragment process starts immediately after the first task to solve fragments of coding genes for each genome to avoid gene duplication. This process will looks on fragement orientation, if it is negative, then the process apply reverse complement operations on gene sequence. All genomes after this stage are fully annotated, their genes were de-fragmented, genes lists and counts were identified. These information stored in local database.\\
+The second task is to resolve gene fragments. The defragmentation process starts immediately after the first task and resolves the fragments of coding genes of each genome to avoid gene duplication. This process checks the fragment orientation: if it is negative, the process applies the reverse complement operation to the gene sequence. After this stage, all genomes are fully annotated, their genes are defragmented, and their gene lists and counts are identified.\\
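+As an illustration of the reverse complement operation (the six-base sequence is a made-up example and is not taken from our genomes), a fragment with negative orientation read as $5'$-\texttt{ATGCCG}-$3'$ is first reversed to \texttt{GCCGTA}, and each base is then replaced by its complement (A$\leftrightarrow$T, C$\leftrightarrow$G), which gives $5'$-\texttt{CGGCAT}-$3'$.\\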
+
\begin{figure}[H]
\centering
\includegraphics[width=0.7\textwidth]{Dogma_GeneName}
\caption{Dogma Annotation for Chloroplast genomes}\label{dog:Annotation}
\end{figure}
-From these two tasks, we can obtain clearly one copy of coding genes. To ensure that genes produced from dogma annotation process is same as the genes in NCBI. We apply in parrallel a quality checking process that align each gene from dogma and NCBI with respect to a specific threshold.\\
-
-\subsection{Extract Core Genes}
-The goal of this step is trying to extract maximum core genes from sets of genes (\textit{Vectors}) in the local database. The methodology of finding core genes is dividing to three methods: \\
+\subsection{Core Genes Extraction}
+The goal of this step is to extract the maximum number of core genes from the sets of genes. The methodology for finding core genes is divided into three methods: \\
-The hypothesis in first method is based on extracting core genes by finding common genes among chloroplast genomes based on extracting gene feature (i.e Gene names, genes counts). Genomes vary in genes counts according to the method of annotation used, so that extracting maximum core genes can be done by constructing Intersection Core Matrix (\textit{ICM}).\\
-While the hypothesis of second method is based on comparing the sequence of reference genes of one annotated genome with other unannotated genomes sequences in Blast database, by using Blastn\cite{Sayers01012011} (nucleotide sequence alignment tool from NCBI). The last method, is based on merge all genes from NCBI and Dogma annotation, then apply a sequence similarity base method (Quality Control test) using Needle-man Wunch algorithm to predict a new genomes. Using predicted genomes to extract core genes using previous methods. Figure \ref{wholesystem}, illustrate the whole system operations.
+The first method extracts core genes by finding common gene features (i.e. gene names and gene counts). Genomes vary in gene counts according to the annotation method used, so core genes are extracted by constructing an Intersection Core Matrix (\textit{ICM}).\\
+The second method compares the sequences of the reference genes of one annotated genome with the sequences of other, unannotated genomes in a Blast database, using Blastn\cite{Sayers01012011} (the nucleotide sequence alignment tool from NCBI). The last method merges all genes from the NCBI and Dogma annotations, then applies a sequence similarity based method (quality control test) using the Needleman-Wunsch algorithm to predict new genomes. The predicted genomes are then used to extract core genes with the previous methods. Figure \ref{wholesystem} illustrates the whole system operation.
\begin{figure}[H]
\centering
\caption{Total overview of the system pipeline}\label{wholesystem}
\end{figure}
-In the first method, the idea is to collect from each iteration the maximum number of common genes. To do so, the system build an \textit{Intersection core matrix(ICM)}. ICM here is a two dimensional symmetric matrix (considered as a vector space) where each row and column represent a vector for one genome. Each position in ICM stores the \textit{intersection scores}. Intersection Score(IS) is the cardinality number of a core genes comes from intersecting one vector with other vectors in vector space. Taking maximum cardinality from each row and then take the maximum of them will result to select the maximum cardinality in the vector space. Maximum cardinality results to select two genomes with their maximum core. Mathematically speaking, if we have an $n \times m$ vector space matrix where $n=m=\text{number of vectors in local database}$, then lets consider:\\
+In the first method, the idea is to iteratively collect the maximum number of common genes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. The ICM is a two-dimensional symmetric matrix in which each row and each column represents one genome. Each position in the ICM stores an \textit{intersection score (IS)}: the cardinality of the core genes obtained by intersecting the gene set of one genome with the gene set of another. The maximum cardinality selects the two genomes with the largest core. Mathematically speaking, if we have an $n \times n$ matrix where $n=\text{number of genomes in local database}$, then let us consider:\\
\begin{equation}
Score=\max_{i<j}\vert x_i \cap x_j\vert
\label{Eq1}
\end{equation}\\
-Where $x_i, x_j$ are vectors in the matrix. Generate new core genes is depending on the value of intersecting two vectors, we call it $Score$:\\
+where $x_i$ and $x_j$ are the gene sets of genomes $i$ and $j$ in the matrix. The generation of a new core depends on the cardinality of this intersection, which we call $Score$:\\
$$\text{New Core} = \begin{cases}
\text{Ignored} & \text{if $Score=0$;} \\
\text{new Core id} & \text{if $Score>0$.}
DOI={10.1089/cmb.2010.0092}
}
+@article{Eisen2007,
+ author = {Eisen, Jonathan A},
+ journal = {PLoS Biol},
+ publisher = {Public Library of Science},
+ title = {Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes},
+ year = {2007},
+ month = {03},
+ volume = {5},
+ url = {http://dx.doi.org/10.1371%2Fjournal.pbio.0050082},
+ pages = {e82},
+  abstract = {Environmental shotgun sequencing promises to reveal novel and fundamental insights into the hidden world of microbes, but the complexity of analysis required to realize this potential poses unique interdisciplinary challenges.},
+ number = {3},
+ doi = {10.1371/journal.pbio.0050082}
+}
+
+@article{Sayers01012011,
+author = {Sayers, Eric W. and Barrett, Tanya and Benson, Dennis A. and Bolton, Evan and Bryant, Stephen H. and Canese, Kathi and Chetvernin, Vyacheslav and Church, Deanna M. and DiCuccio, Michael and Federhen, Scott and Feolo, Michael and Fingerman, Ian M. and Geer, Lewis Y. and Helmberg, Wolfgang and Kapustin, Yuri and Landsman, David and Lipman, David J. and Lu, Zhiyong and Madden, Thomas L. and Madej, Tom and Maglott, Donna R. and Marchler-Bauer, Aron and Miller, Vadim and Mizrachi, Ilene and Ostell, James and Panchenko, Anna and Phan, Lon and Pruitt, Kim D. and Schuler, Gregory D. and Sequeira, Edwin and Sherry, Stephen T. and Shumway, Martin and Sirotkin, Karl and Slotta, Douglas and Souvorov, Alexandre and Starchenko, Grigory and Tatusova, Tatiana A. and Wagner, Lukas and Wang, Yanli and Wilbur, W. John and Yaschenko, Eugene and Ye, Jian},
+title = {Database resources of the National Center for Biotechnology Information},
+volume = {39},
+number = {suppl 1},
+pages = {D38-D51},
+year = {2011},
+doi = {10.1093/nar/gkq1172},
+URL = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.abstract},
+eprint = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.full.pdf+html},
+journal = {Nucleic Acids Research}
+}
+
+@Article{RDogma,
+AUTHOR = {Wyman, Stacia K. and Jansen, Robert K. and Boore, Jeffrey L.},
+TITLE = {Automatic annotation of organellar genomes with DOGMA},
+JOURNAL = {Bioinformatics},
+PUBLISHER = {Oxford University Press},
+VOLUME = {20},
+YEAR = {2004},
+NUMBER = {17},
+PAGES = {3252-3255},
+URL={http://www.biosci.utexas.edu/ib/faculty/jansen/pubs/Wyman%20et%20al.%202004.pdf},
+}
+
+@article{de2002comparative,
+ title={Comparative analysis of chloroplast genomes: functional annotation, genome-based phylogeny, and deduced evolutionary patterns},
+ author={De Las Rivas, Javier and Lozano, Juan Jose and Ortiz, Angel R},
+ journal={Genome research},
+ volume={12},
+ number={4},
+ pages={567--583},
+ year={2002},
+ publisher={Cold Spring Harbor Lab}
+}
+
+@article{liu2012cpgavas,
+ title={CpGAVAS, an integrated web server for the annotation, visualization, analysis, and GenBank submission of completely sequenced chloroplast genome sequences},
+ author={Liu, Chang and Shi, Linchun and Zhu, Yingjie and Chen, Haimei and Zhang, Jianhui and Lin, Xiaohan and Guan, Xiaojun},
+ journal={BMC genomics},
+ volume={13},
+ number={1},
+ pages={715},
+ year={2012},
+ publisher={BioMed Central Ltd}
+}
+
@article{parra2007cegma,
title={CEGMA: a pipeline to accurately annotate core genes in eukaryotic genomes},
author={Parra, Genis and Bradnam, Keith and Korf, Ian},
publisher={Cold Spring Harbor Lab}
}
+@article{apweiler1985swiss,
+ title={SWISS-PROT AND ITS COMPUTER-ANNOTATED SUPPLEMENT TREMBL: HOW TO PRODUCE HIGH QUALITY AUTOMATIC ANNOTATION},
+ author={Apweiler, Rolf and O’Donovan, Claire and Martin, Maria Jesus and Fleischmann, Wolfgang and Hermjakob, Henning and Moeller, Steffen and Contrino, Sergio and Junker, Vivien},
+ journal={EUR. J. BIOCHEM},
+ volume={147},
+ pages={9--15},
+ year={1985},
+ url={http://www.ebi.ac.uk/ena/}
+}
+
+@article{sugawara2008ddbj,
+ title={DDBJ with new system and face},
+ author={Sugawara, Hideaki and Ogasawara, Osamu and Okubo, Kousaku and Gojobori, Takashi and Tateno, Yoshio},
+ journal={Nucleic acids research},
+ volume={36},
+ number={suppl 1},
+ pages={D22--D24},
+ year={2008},
+ publisher={Oxford Univ Press}
+}
+
@article{chapman2000biopython,
title={Biopython: Python tools for computational biology},
author={Chapman, Brad and Chang, Jeffrey},
volume = {7},
url = {http://dx.doi.org/10.1371%2Fjournal.pone.0052841},
pages = {e52841},
- abstract = {<p>Molecular and phylogeographic studies have led to the definition within the <italic>Mycobacterium tuberculosis</italic> complex (MTBC) of a number of geotypes and ecotypes showing a preferential geographic location or host preference. The MTBC is thought to have emerged in Africa, most likely the Horn of Africa, and to have spread worldwide with human migrations. Under this assumption, there is a possibility that unknown deep branching lineages are present in this region. We genotyped by spoligotyping and multiple locus variable number of tandem repeats (VNTR) analysis (MLVA) 435 MTBC isolates recovered from patients. Four hundred and eleven isolates were collected in the Republic of Djibouti over a 12 year period, with the other 24 isolates originating from neighbouring countries. All major <italic>M. tuberculosis</italic> lineages were identified, with only two <italic>M. africanum</italic> and one <italic>M. bovis</italic> isolates. Upon comparison with typing data of worldwide origin we observed that several isolates showed clustering characteristics compatible with new deep branching. Whole genome sequencing (WGS) of seven isolates and comparison with available WGS data from 38 genomes distributed in the different lineages confirms the identification of ancestral nodes for several clades and most importantly of one new lineage, here referred to as lineage 7. Investigation of specific deletions confirms the novelty of this lineage, and analysis of its precise phylogenetic position indicates that the other three superlineages constituting the MTBC emerged independently but within a relatively short timeframe from the Horn of Africa. The availability of such strains compared to the predominant lineages and sharing very ancient ancestry will open new avenues for identifying some of the genetic factors responsible for the success of the modern lineages. 
Additional deep branching lineages may be readily and efficiently identified by large-scale MLVA screening of isolates from sub-Saharan African countries followed by WGS analysis of a few selected isolates.</p>},
number = {12},
doi = {10.1371/journal.pone.0052841}
}
-@article{Sayers01012011,
-author = {Sayers, Eric W. and Barrett, Tanya and Benson, Dennis A. and Bolton, Evan and Bryant, Stephen H. and Canese, Kathi and Chetvernin, Vyacheslav and Church, Deanna M. and DiCuccio, Michael and Federhen, Scott and Feolo, Michael and Fingerman, Ian M. and Geer, Lewis Y. and Helmberg, Wolfgang and Kapustin, Yuri and Landsman, David and Lipman, David J. and Lu, Zhiyong and Madden, Thomas L. and Madej, Tom and Maglott, Donna R. and Marchler-Bauer, Aron and Miller, Vadim and Mizrachi, Ilene and Ostell, James and Panchenko, Anna and Phan, Lon and Pruitt, Kim D. and Schuler, Gregory D. and Sequeira, Edwin and Sherry, Stephen T. and Shumway, Martin and Sirotkin, Karl and Slotta, Douglas and Souvorov, Alexandre and Starchenko, Grigory and Tatusova, Tatiana A. and Wagner, Lukas and Wang, Yanli and Wilbur, W. John and Yaschenko, Eugene and Ye, Jian},
-title = {Database resources of the National Center for Biotechnology Information},
-volume = {39},
-number = {suppl 1},
-pages = {D38-D51},
-year = {2011},
-doi = {10.1093/nar/gkq1172},
-abstract ={In addition to maintaining the GenBank® nucleic acid sequence database, the National Center for Biotechnology Information (NCBI) provides analysis and retrieval resources for the data in GenBank and other biological data made available through the NCBI Web site. NCBI resources include Entrez, the Entrez Programming Utilities, MyNCBI, PubMed, PubMed Central (PMC), Entrez Gene, the NCBI Taxonomy Browser, BLAST, BLAST Link (BLink), Primer-BLAST, COBALT, Electronic PCR, OrfFinder, Splign, ProSplign, RefSeq, UniGene, HomoloGene, ProtEST, dbMHC, dbSNP, dbVar, Epigenomics, Cancer Chromosomes, Entrez Genomes and related tools, the Map Viewer, Model Maker, Evidence Viewer, Trace Archive, Sequence Read Archive, Retroviral Genotyping Tools, HIV-1/Human Protein Interaction Database, Gene Expression Omnibus (GEO), Entrez Probe, GENSAT, Online Mendelian Inheritance in Man (OMIM), Online Mendelian Inheritance in Animals (OMIA), the Molecular Modeling Database (MMDB), the Conserved Domain Database (CDD), the Conserved Domain Architecture Retrieval Tool (CDART), IBIS, Biosystems, Peptidome, OMSSA, Protein Clusters and the PubChem suite of small molecule databases. Augmenting many of the Web applications are custom implementations of the BLAST program optimized to search specialized data sets. All of these resources can be accessed through the NCBI home page at www.ncbi.nlm.nih.gov.},
-URL = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.abstract},
-eprint = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.full.pdf+html},
-journal = {Nucleic Acids Research}
-}
+
@article{zafar2002coregenes,
title={CoreGenes: A computational tool for identifying and cataloging},
PubMedID = {17623808},
ISSN = {1088-9051}
}
-@article{Eisen2007,
- author = {Eisen, Jonathan A},
- journal = {PLoS Biol},
- publisher = {Public Library of Science},
- title = {Environmental Shotgun Sequencing: Its Potential and Challenges for Studying the Hidden World of Microbes},
- year = {2007},
- month = {03},
- volume = {5},
- url = {http://dx.doi.org/10.1371%2Fjournal.pbio.0050082},
- pages = {e82},
- abstract = {
- <p>Environmental shotgun sequencing promises to reveal novel and fundamental insights into the hidden world of microbes, but the complexity of analysis required to realize this potential poses unique interdisciplinary challenges.</p>
- },
- number = {3},
- doi = {10.1371/journal.pbio.0050082}
-}
-@Article{RDogma,
-AUTHOR = {Stacia K. Wyman, Robert K. Jansen and Jeffrey L. Boore},
-TITLE = {Automatic annotation of organellar genomes
-with DOGMA},
-JOURNAL = {BIOINFORMATICS, oxford Press},
-VOLUME = {20},
-YEAR = {2004},
-NUMBER = {172004},
-PAGES = {3252-3255},
-URL={http://www.biosci.utexas.edu/ib/faculty/jansen/pubs/Wyman%20et%20al.%202004.pdf},
-}
+
@article{guindon2005phyml,
title={PHYML Online—a web server for fast maximum likelihood-based phylogenetic inference},
author={Guindon, Stephane and Lethiec, Franck and Duroux, Patrice and Gascuel, Olivier},
volume = {4},
url = {http://dx.doi.org/10.1371%2Fjournal.pone.0006291},
pages = {e6291},
- abstract = {
-<p>Genome annotations are accumulating rapidly and depend heavily on automated annotation systems. Many genome centers offer annotation systems but no one has compared their output in a systematic way to determine accuracy and inherent errors. Errors in the annotations are routinely deposited in databases such as NCBI and used to validate subsequent annotation errors. We submitted the genome sequence of halophilic archaeon <italic>Halorhabdus utahensis</italic> to be analyzed by three genome annotation services. We have examined the output from each service in a variety of ways in order to compare the methodology and effectiveness of the annotations, as well as to explore the genes, pathways, and physiology of the previously unannotated genome. The annotation services differ considerably in gene calls, features, and ease of use. We had to manually identify the origin of replication and the species-specific consensus ribosome-binding site. Additionally, we conducted laboratory experiments to test <italic>H. utahensis</italic> growth and enzyme activity. Current annotation practices need to improve in order to more accurately reflect a genome's biological potential. We make specific recommendations that could improve the quality of microbial annotation projects.</p>
-},
number = {7},
doi = {10.1371/journal.pone.0006291}
}