X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/chloroplast13.git/blobdiff_plain/9164a34ac78d5cc31c7b460d9a3c264c855910d8..1e61bcbdaedf0f72f05c3dd365fc11d3a2071330:/annotated.tex?ds=inline
diff --git a/annotated.tex b/annotated.tex
index e188973..3a7fbd1 100644
--- a/annotated.tex
+++ b/annotated.tex
@@ -1,4 +1,26 @@
-The field of genome annotation receives a lot of attention, because the ability to collect and analyse genomic data can provide strong indicators for the study of life\cite{Eisen2007}. Several genome annotation centers (such as \textit{NCBI\cite{Sayers01012011}, Dogma \cite{RDogma}, cpBase \cite{de2002comparative}, CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}}) present various types of annotation tools (\emph{i.e.} cost-effective sequencing methods\cite{Bakke2009}) on different annotation levels. Generally, previous studies used one of three methods for gene finding in annotated genomes using these centers: \textit{alignment-based, composition-based, or a combination of both\cite{parra2007cegma}}. The alignment-based method is used when we try to predict a coding gene (\emph{i.e.}, a gene that produces proteins) by aligning the DNA sequence of the gene to the protein or cDNA sequence of a homolog\cite{parra2007cegma}. This approach is also used in GeneWise\cite{birney2004genewise}. The composition-based method (known as \textit{ab initio}) is based on a probabilistic model of gene structure to find genes according to the gene value probability (GeneID\cite{parra2000geneid}). In this section, we consider a new method of finding core genes from a large number of chloroplast genomes, as a solution to the problem resulting from the method stated in section two. This method is based on extracting gene features.
A general overview of the system is illustrated in Figure \ref{Fig1}.\\
+The field of genome annotation receives a lot of attention, because the
+ability to collect and analyse genomic data can provide strong
+indicators for the study of life\cite{Eisen2007}. Several genome
+annotation centers (such as \textit{NCBI\cite{Sayers01012011},
+Dogma \cite{RDogma}, cpBase \cite{de2002comparative},
+CpGAVAS \cite{liu2012cpgavas}, and CEGMA\cite{parra2007cegma}})
+present various types of annotation tools (\emph{i.e.} cost-effective
+sequencing methods\cite{Bakke2009}) on different annotation
+levels. Generally, previous studies used one of three methods for gene
+finding in annotated genomes using these
+centers: \textit{alignment-based, composition-based, or a combination of
+both\cite{parra2007cegma}}. The alignment-based method is used when we
+try to predict a coding gene (\emph{i.e.}, a gene that produces
+proteins) by aligning the DNA sequence of the gene to the protein or cDNA
+sequence of a homolog\cite{parra2007cegma}. This approach is also used
+in GeneWise\cite{birney2004genewise}. The composition-based method (known
+as \textit{ab initio}) is based on a probabilistic model of gene
+structure to find genes according to the gene value probability
+(GeneID\cite{parra2000geneid}). In this section, we consider a new
+method of finding core genes from a large number of chloroplast genomes,
+as a solution to the problem resulting from the method stated in
+section two. This method is based on extracting gene features. A
+general overview of the system is illustrated in Figure \ref{Fig1}.\\
\begin{figure}[H]
\centering
@@ -12,9 +34,8 @@ A local database attached with each pipe stage is used to store all the informat
\subsection{Genomes Samples}
In this research, we retrieve chloroplast genomes from NCBI. Ninety-nine of these genomes are considered. These genomes belong to eleven types of chloroplast families.
The distribution of genomes is illustrated in detail in Table \ref{Tab2}.
-
-\input{population_Table}
-
+
+\input{population_Table}
\subsection{Genome Annotation Techniques}
Genome annotation is the second stage in the model pipeline. Many techniques were developed to annotate chloroplast genomes, but they vary in the number and type of predicted genes (\emph{i.e.}, the ability to predict genes such as \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques, from NCBI and Dogma, are considered to analyse chloroplast genomes and to examine the accuracy of the predicted coding genes.
@@ -77,12 +98,15 @@ The second pre-processing method states: we can predict the best annotated genom
\subsubsection{Intersection Core Matrix (\textit{ICM})}
-The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two-dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores an \textit{Intersection Score (IS)}. IS is the cardinality of the set of core genes obtained by intersecting one genome with another. The maximum cardinality selects the two genomes with the largest common core. Mathematically speaking, if we have an $n \times n$ matrix where $n \text{is the number of genomes in the local database}$, then let us consider:\\
+The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two-dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores an \textit{Intersection Score (IS)}. IS is the cardinality of the set of core genes obtained by intersecting one genome with another.
The maximum cardinality selects the two genomes with the largest common core. Mathematically speaking, if we have an $n \times n$ matrix where $n$
+is the number of genomes in the local database, then let us consider:\\
+
\begin{equation}
Score=\max_{i0$.}