From 9d33c5f06454db8d752fa7bbe08f9bdaa3977b1a Mon Sep 17 00:00:00 2001
From: Michel Salomon <salomon@caseb.iut-bm.univ-fcomte.fr>
Date: Tue, 19 Nov 2013 18:37:04 +0100
Subject: [PATCH] First modifications in section 3

---
 abstract.tex         |   2 +-
 annotated.tex        | 102 ++++++++++++++++++++++++++++++-------------
 classEquiv.tex       |   8 ++--
 intro.tex            |   2 +-
 main.tex             |   2 +-
 population_Table.tex |   1 -
 6 files changed, 79 insertions(+), 38 deletions(-)

diff --git a/abstract.tex b/abstract.tex
index 3c789e3..510b8ae 100644
--- a/abstract.tex
+++ b/abstract.tex
@@ -5,7 +5,7 @@ evolution over time, and in phylogenetic and genetic analyses. Various
 models  of  genomes  evolution  are  based  on  the  analysis  of  DNA
 sequences, SNPs,  mutations, and so on. We  have recently investigated
 the use of  core (\emph{i.e.}, common genes) and  pan genomes to infer
-evolutionary  information on  a  collection of  107 chloroplasts.   In
+evolutionary  information  on  a  collection of  99~chloroplasts.   In
 particular,  we  have  regarded  methods  to  build  a  genes  content
 evolutionary  tree  using  distances  to core  genome.   However,  the
 production of reliable  core and pan genomes is not  an easy task, due
diff --git a/annotated.tex b/annotated.tex
index 3a7fbd1..9810db8 100644
--- a/annotated.tex
+++ b/annotated.tex
@@ -1,41 +1,83 @@
-The  field of genome  annotation pays  a lot  of attentions  where the
-ability  to collect  and analysis  genomical data  can  provide strong
-indicators  for  the study  of  life\cite{Eisen2007}.  Four of  genome
-annotation   centers   (such  as,   \textit{NCBI\cite{Sayers01012011},
+
+In recent  years the  cost of  sequencing genomes  has  been  greatly
+reduced,  and thus  more and  more genomes  are sequenced.  Therefore,
+automatic annotation tools are required to deal with this continuously
+increasing amount  of genomic data. Moreover, a reliable  and accurate
+genome  annotation  process  is  needed  in order  to  provide  strong
+indicators for the study of life\cite{Eisen2007}.
+
+Various  annotation   tools  (\emph{i.e.},  cost-effective  sequencing
+methods\cite{Bakke2009}) producing genomic  annotations at many levels
+of detail  have been designed  by different annotation  centers. Among
+the major annotation centers we can mention NCBI\cite{Sayers01012011},
 Dogma       \cite{RDogma},       cpBase      \cite{de2002comparative},
-CpGAVAS    \cite{liu2012cpgavas},   and   CEGMA\cite{parra2007cegma}})
-present various types  of annotation tools (\emph{i.e.} cost-effective
-sequencing    methods\cite{Bakke2009})    on   different    annotation
-levels. Generally, previous studies used one of three methods for gene
-finding       in        annotated       genome       using       these
-centers: \textit{alignment-based, composition based, or combination of
-both\cite{parra2007cegma}}. The alignment-based method is used when we
-try  to  predict  a  coding  gene  (\emph{i.e.}.  genes  that  produce
-proteins)  by aligning DNA  sequence of  gene to  the protein  of cDNA
-sequence of homology\cite{parra2007cegma}.  This approach also is used
-in GeneWise\cite{birney2004genewise}.  Composition-based method (known
+CpGAVAS                   \cite{liu2012cpgavas},                   and
+CEGMA\cite{parra2007cegma}.  Usually,  previous studies  used  one  of
+three methods for finding  genes in annotated genomes using  data from
+these  centers: \textit{alignment-based},  \textit{composition-based},
+or a  combination of both~\cite{parra2007cegma}.   The alignment-based
+method  is used  when trying  to predict  a coding  gene (\emph{i.e.},
+a gene that produces a protein) by aligning a genomic DNA sequence with
+a cDNA  sequence  coding  a homologous  protein  \cite{parra2007cegma}.
+This approach is  also used in GeneWise\cite{birney2004genewise}.  The
+alternative   method,   the    composition-based   one   (also   known
 as  \textit{ab initio})  is based  on  a probabilistic  model of  gene
 structure  to  find genes  according  to  the  gene value  probability
-(GeneID\cite{parra2000geneid}).  In this  section, we  consider  a new
-method of finding core genes from large amount of chloroplast genomes,
-as  a solution  of the  problem resulting  from the  method  stated in
-section  two. This  method is  based  on extracting  gene features.  A
-general overview of the system is illustrated in Figure \ref{Fig1}.\\
-
-\begin{figure}[H]  
+(GeneID \cite{parra2000geneid}).  Such  annotated genomic data will be
+used to overcome  the limitation of the first  method described in the
+previous section.   In fact, the  second method we propose  finds core
+genes from  a large number  of  chloroplast  genomes  through  genomic
+feature extraction.
+
+Figure~\ref{Fig1} presents an overview  of the entire method pipeline.
+More    precisely,    the   second    method    consists   of    three
+stages:   \textit{Genome    annotation},   \textit{Core   extraction},
+and   \textit{Features   visualization},    which   highlights    the
+relationships.  To  understand the  whole core extraction  process, we
+briefly describe each  stage below. More details will  be given in the
+coming subsections.  The method uses as its starting  point a sequence
+database  chosen  among   the  many  international  databases  storing
+nucleotide sequences,  like GenBank  at NCBI \cite{Sayers01012011},
+the    \textit{EMBL-Bank}     \cite{apweiler1985swiss}    in    Europe
+or   \textit{DDBJ}   \cite{sugawara2008ddbj}   in  Japan.    Different
+biological tools can analyze  and annotate genomes by interacting with
+these databases to  align and extract sequences to  predict genes. The
+database in  our method must be  taken from any  reliable data  source
+that stores annotated and/or unannotated chloroplast genomes.  We have
+chosen the  GenBank-NCBI \cite{Sayers01012011} database as our sequence
+database:  99~chloroplast  genomes   were  retrieved.   These  genomes
+belong to eleven  types of chloroplast families,  and Table~\ref{Tab2}
+summarizes their distribution in our dataset.
+
+\begin{figure}[h]  
   \centering
-    \includegraphics[width=0.7\textwidth]{generalView}
-\caption{A general overview of the system}\label{Fig1}
+    \includegraphics[width=0.75\textwidth]{generalView}
+\caption{A general overview of the annotation-based approach}\label{Fig1}
 \end{figure}
 
-In Figure 1, we illustrate the general overview of system pipeline: \textit{Database, Genomes annotation, Core extraction,} and \textit{relationships}. We will give a short discussion for each stage of the model in order to understand the whole core extraction process. This work starts with a gene Bank database; however, many international Banks for nucleotide sequence databases (such as, \textit{GenBank} \cite{Sayers01012011} in USA, \textit{EMBL-Bank} \cite{apweiler1985swiss} in Europe, and \textit{DDBJ} \cite{sugawara2008ddbj} in Japon) exist to store various genomes and DNA species. Different biological tools can analyse and annotate genomes by interacting with these databases to  align and extract sequences to predict genes. The database in this model must be taken from any confident data source that stores annotated and/or unannotated chloroplast genomes. We consider GenBank-NCBI \cite{Sayers01012011} database to be our nucleotide sequences database. Annotation (as the second stage) is considered to be the first important task for extract gene features. Good annotation tool leads us to extract good gene feature. In this paper, two annotation techniques from \textit{NCBI, and Dogma} are used to extract \textit{genes features}. Extracting gene feature (as a third stage) can be anything like (genes names, gene sequences, protein sequence,...etc). Our methodology in this paper consider gene names, genes counts, and gene sequence for extracting core genes and producing chloroplast evolutionary tree. \\
-In last stage, features visualization represents methods to visualize genomes and/or gene evolution in chloroplast. We use the forms of tables, phylogenetic trees, graphs,...,etc to organize and represent genomes relationships to achieve the goal of representing gene evolution. In addition, comparing these forms with another annotation tool forms dedicated to large population of chloroplast genomes give us biological perspectives to the nature of chloroplasts evolution. \\
-A local database attached with each pipe stage is used to store all the informations of extraction process. The output from each stage in our system will be an input to the second stage and so on.
+Annotation,  which  is the  first  stage,  is  an important  task  for
+extracting  gene features.  Indeed,  extracting  good  gene  features
+requires  a good  annotation tool.  To  obtain  relevant  annotated
+genomes, two annotation  techniques from NCBI and Dogma  are used. The
+features extracted  in the next  stage can be anything  like gene
+names,  gene  sequences, protein  sequences,  and  so  on. Our  method
+considers gene  names, gene counts, and  gene sequences for extracting
+core genes  and producing  a chloroplast  evolutionary tree. The final
+stage  makes  it  possible to  visualize  genome  and/or  gene
+evolution in chloroplasts.  To that end, we use representations like
+tables, phylogenetic trees, and graphs to organize and show genome
+relationships,  and  thus  achieve   the  goal  of  representing  gene
+evolution.   In addition,  comparing these  representations  with ones
+obtained from another annotation tool dedicated to large populations of
+chloroplast genomes  gives us biological perspectives on the nature of
+chloroplast evolution.  Note that  a local  database linked  with each
+pipeline stage is used to store all the information produced during the
+process.
 
-\subsection{Genomes Samples}
-In this research, we retrieve genomes of Chloroplasts from NCBI. Ninety nine genome of them are considered to work with. These genomes lies in the eleven type of chloroplast families. The distribution of genomes is illustrated in detail in Table \ref{Tab2}.
-	
 \input{population_Table}
+	
+% MICHEL : TO BE CONTINUED FROM HERE
+
 \subsection{Genome Annotation Techniques}
 Genome annotation is the second stage in the model pipeline. Many techniques were developed to annotate chloroplast genomes but the problem is that they vary in the number and type of predicted genes (\emph{i.e.} the ability to predict genes and \textit{for example: Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques from NCBI and Dogma are considered to analyse chloroplast genomes to examine the accuracy of predicted coding genes.   
 
diff --git a/classEquiv.tex b/classEquiv.tex
index 829a8b4..f3d3ed1 100644
--- a/classEquiv.tex
+++ b/classEquiv.tex
@@ -14,7 +14,7 @@ $d(x,y)\leqslant T$.
 
 %\noindent $\sim_{d,T}$ is obviously an equivalence relation and when $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the emboss package , we will simply denote $\sim_{d,0.1}$ by $\sim$.
 
-Let be given a \emph{similarity} threshold $T$  and a distance $d$, 
+Consider a \emph{similarity} threshold $T$  and a distance $d$
 (Needleman-Wunch released by EMBL for instance).
 The method begins by building  an undirected graph 
 between all the DNA~sequences $g$ of the set  of genomes as follows:
@@ -23,10 +23,10 @@ if  $g_i \sim_{d,T} g_j$ is established.
 This graph is further denoted as the ``similarity'' graph.
 
 We thus consider that the pair of two coding sequences 
-$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ an,d 
+$(g_i,g_j)$ belongs in the relation $\mathcal{R}$ if both $g_i$ and 
 $g_j$  belong in the same 
 connected component (CC), \textit{i.e.} if there is a path between $g_i$ 
-and $g_j$ in the similarity graph. It is not hard to see this relation is an
+and $g_j$ in the similarity graph. It is not hard to see that this relation is an
 equivalence relation whereas $\sim$ is not.
 
 
@@ -51,7 +51,7 @@ the projected  genomes.
 
 \begin{figure}
 \begin{center}
-\includegraphics[scale=0.4]{stats.png}
+\includegraphics[scale=0.5]{stats.png}
 \end{center}
 \caption{Size of core and pan genomes w.r.t. the similarity threshold}\label{Fig:sim:core:pan}
 \end{figure}
diff --git a/intro.tex b/intro.tex
index 84c9ed8..a0a6050 100644
--- a/intro.tex
+++ b/intro.tex
@@ -1,7 +1,7 @@
 Identifying  core genes  is important  to understand  evolutionary and
 functional phylogenies. Therefore, in  this work we present methods to
 build a genes  content evolutionary tree. More precisely,  we focus on
-the following  questions considering a  collection of 107~chloroplasts
+the following  questions considering a  collection of 99~chloroplasts
 annotated from NCBI \cite{Sayers01012011} and Dogma \cite{RDogma}: how
 can  we identify the  best core  genome and  what is  the evolutionary
 scenario of these chloroplasts.
diff --git a/main.tex b/main.tex
index 65e70d4..ae1e9d7 100755
--- a/main.tex
+++ b/main.tex
@@ -46,7 +46,7 @@ University of Franche-Comt\'{e}, France \\
 % Main author : jfc
 \input{classEquiv}
 
-\section{Annotations-based approaches}
+\section{Annotation-based approaches}
 % Main author : bassam
 \input{annotated}
 
diff --git a/population_Table.tex b/population_Table.tex
index b184f60..6a65f00 100644
--- a/population_Table.tex
+++ b/population_Table.tex
@@ -1,5 +1,4 @@
 \begin{center}  
-  
   \begin{table}
     \tiny
     \caption[NCBI Genomes Families]{List of family groups of Chloroplast Genomes from NCBI\label{Tab2}}
-- 
2.39.5