update section two

author bassam al-kindy <bassam.al-kindy@lifc>

Mon, 18 Nov 2013 10:28:58 +0000 (11:28 +0100)

committer bassam al-kindy <bassam.al-kindy@lifc>

Mon, 18 Nov 2013 10:28:58 +0000 (11:28 +0100)
diff --git a/Whole_system.png b/Whole_system.png

index 91f9909ed1a8f8c533d16ab2f7f195920be8c092..3741af355ae4c4f74b363a8a86ce5e00f599f96e 100644 (file)

Binary files a/Whole_system.png and b/Whole_system.png differ
diff --git a/annotated.tex b/annotated.tex

index e1889730f1cbbd38d2464dba9abf353905155088..762d50c60baa5a2db681c9f03b93d2467c74390c 100644 (file)
--- a/annotated.tex
+++ b/annotated.tex
@@ -12,9 +12,8 @@ A local database attached with each pipe stage is used to store all the informat
  
  \subsection{Genomes Samples}
  In this research, we retrieve chloroplast genomes from NCBI. Ninety-nine of these genomes are considered in this work. They belong to eleven chloroplast families. The distribution of genomes is detailed in Table \ref{Tab2}.
-
-\input{population_Table}       
-
+       
+\input{population_Table}
  \subsection{Genome Annotation Techniques}
  Genome annotation is the second stage in the model pipeline. Many techniques have been developed to annotate chloroplast genomes, but they vary in the number and type of predicted genes (for example, in their ability to predict \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques, from NCBI and Dogma, are used here to analyse chloroplast genomes and to examine the accuracy of the predicted coding genes.   
  
@@ -77,12 +76,15 @@ The second pre-processing method states: we can predict the best annotated genom
  
  \subsubsection{Intersection Core Matrix (\textit{ICM})}
  
-The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores the \textit{Intersection Scores(IS)}. IS is the cardinality number of a core genes which comes from intersecting one genome with other ones. Maximum cardinality results to select two genomes with their maximum core. Mathematically speaking, if we have an $n \times n$ matrix where $n \text{is the number of genomes in local database}$, then lets consider:\\
+The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. The ICM is a two-dimensional symmetric matrix in which each row and each column represents one genome. Each position in the ICM stores an \textit{Intersection Score (IS)}, that is, the cardinality of the core gene set obtained by intersecting one genome with another. The maximum cardinality selects the two genomes with the largest common core. Mathematically speaking, if we have an $n \times n$ matrix where $n$  
+is the number of genomes in the local database, then let us consider:\\
+
  \begin{equation}
  Score=\max_{i<j}\vert x_i \cap x_j\vert
  \label{Eq1}
  \end{equation}
-where $x_i, x_j$ are elements in the matrix. The generation of a new core genes is depending on the cardinality value of intersection scores, we call it \textit{Score}:
+
+\noindent where $x_i, x_j$ are elements in the matrix. Whether a new core is generated depends on the maximum cardinality of the intersection scores, which we call \textit{Score}:
  $$\text{New Core} = \begin{cases} 
  \text{Ignored} & \text{if $\textit{Score}=0$;} \\
  \text{new Core id} & \text{if $\textit{Score}>0$.}
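
As an illustration of Equation \ref{Eq1} and of the case rule on \textit{Score}, the pair selection could be sketched as follows. This is only a minimal sketch and not the implementation used in the pipeline: it assumes each genome is encoded as a Python set of gene names, and the function names \texttt{best\_pair} and \texttt{new\_core} are hypothetical.

```python
from itertools import combinations


def best_pair(genomes):
    """Return (score, i, j) for the pair of genomes sharing the most genes.

    `genomes` is a list of sets of gene names (an assumed encoding).
    """
    best = (0, None, None)
    for i, j in combinations(range(len(genomes)), 2):  # all pairs with i < j
        score = len(genomes[i] & genomes[j])           # |x_i ∩ x_j|
        if score > best[0]:
            best = (score, i, j)
    return best


def new_core(genomes):
    """Apply the case rule: ignore when Score = 0, otherwise build a core."""
    score, i, j = best_pair(genomes)
    if score == 0:
        return None                     # "Ignored"
    return genomes[i] & genomes[j]      # gene set of the new core


# Toy data, invented for illustration only.
genomes = [{"rbcL", "psbA", "atpB"}, {"rbcL", "psbA"}, {"matK"}]
print(best_pair(genomes))
print(new_core(genomes))
```

In an iterative version, the two selected genomes would presumably be replaced by their core before the matrix is rebuilt; that step is left out here because the hunk above does not show it.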
diff --git a/biblio.bib b/biblio.bib

index 38ab7d7e9946662941b4249390a094e3a4940cca..510d794936e608d109e181f3a81a97b1e20f2c5b 100644 (file)
--- a/biblio.bib
+++ b/biblio.bib
@@ -1,3 +1,28 @@
+@article{Sayers01012011,
+author = {Sayers, Eric W. and Barrett, Tanya and Benson, Dennis A. and Bolton, Evan and Bryant, Stephen H. and Canese, Kathi and Chetvernin, Vyacheslav and Church, Deanna M. and DiCuccio, Michael and Federhen, Scott and Feolo, Michael and Fingerman, Ian M. and Geer, Lewis Y. and Helmberg, Wolfgang and Kapustin, Yuri and Landsman, David and Lipman, David J. and Lu, Zhiyong and Madden, Thomas L. and Madej, Tom and Maglott, Donna R. and Marchler-Bauer, Aron and Miller, Vadim and Mizrachi, Ilene and Ostell, James and Panchenko, Anna and Phan, Lon and Pruitt, Kim D. and Schuler, Gregory D. and Sequeira, Edwin and Sherry, Stephen T. and Shumway, Martin and Sirotkin, Karl and Slotta, Douglas and Souvorov, Alexandre and Starchenko, Grigory and Tatusova, Tatiana A. and Wagner, Lukas and Wang, Yanli and Wilbur, W. John and Yaschenko, Eugene and Ye, Jian}, 
+title = {Database resources of the National Center for Biotechnology Information},
+volume = {39}, 
+number = {suppl 1}, 
+pages = {D38-D51}, 
+year = {2011}, 
+doi = {10.1093/nar/gkq1172},  
+URL = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.abstract}, 
+eprint = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.full.pdf+html}, 
+journal = {Nucleic Acids Research} 
+}
+
+@Article{RDogma,
+AUTHOR = {Stacia K. Wyman and Robert K. Jansen and Jeffrey L. Boore},
+TITLE = {Automatic annotation of organellar genomes
+with DOGMA},
+JOURNAL = {BIOINFORMATICS, oxford Press},
+VOLUME = {20},
+YEAR = {2004},
+NUMBER = {17},
+PAGES = {3252-3255},
+URL={http://www.biosci.utexas.edu/ib/faculty/jansen/pubs/Wyman%20et%20al.%202004.pdf},
+}
+
  @article{SMMR+13, 
  title={Genomic analysis of smooth tubercle bacilli provides insights into ancestry and pathoadaptation of Mycobacterium tuberculosis}, 
  url={http://www.nature.com/ng/journal/v45/n2/full/ng.2517.html}, 
@@ -38,31 +63,6 @@ DOI={10.1089/cmb.2010.0092}
      doi = {10.1371/journal.pbio.0050082}
  }        
  
-@article{Sayers01012011,
-author = {Sayers, Eric W. and Barrett, Tanya and Benson, Dennis A. and Bolton, Evan and Bryant, Stephen H. and Canese, Kathi and Chetvernin, Vyacheslav and Church, Deanna M. and DiCuccio, Michael and Federhen, Scott and Feolo, Michael and Fingerman, Ian M. and Geer, Lewis Y. and Helmberg, Wolfgang and Kapustin, Yuri and Landsman, David and Lipman, David J. and Lu, Zhiyong and Madden, Thomas L. and Madej, Tom and Maglott, Donna R. and Marchler-Bauer, Aron and Miller, Vadim and Mizrachi, Ilene and Ostell, James and Panchenko, Anna and Phan, Lon and Pruitt, Kim D. and Schuler, Gregory D. and Sequeira, Edwin and Sherry, Stephen T. and Shumway, Martin and Sirotkin, Karl and Slotta, Douglas and Souvorov, Alexandre and Starchenko, Grigory and Tatusova, Tatiana A. and Wagner, Lukas and Wang, Yanli and Wilbur, W. John and Yaschenko, Eugene and Ye, Jian}, 
-title = {Database resources of the National Center for Biotechnology Information},
-volume = {39}, 
-number = {suppl 1}, 
-pages = {D38-D51}, 
-year = {2011}, 
-doi = {10.1093/nar/gkq1172},  
-URL = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.abstract}, 
-eprint = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.full.pdf+html}, 
-journal = {Nucleic Acids Research} 
-}
-
-@Article{RDogma,
-AUTHOR = {Stacia K. Wyman, Robert K. Jansen and Jeffrey L. Boore},
-TITLE = {Automatic annotation of organellar genomes
-with DOGMA},
-JOURNAL = {BIOINFORMATICS, oxford Press},
-VOLUME = {20},
-YEAR = {2004},
-NUMBER = {172004},
-PAGES = {3252-3255},
-URL={http://www.biosci.utexas.edu/ib/faculty/jansen/pubs/Wyman%20et%20al.%202004.pdf},
-}
-
  @article{de2002comparative,
    title={Comparative analysis of chloroplast genomes: functional annotation, genome-based phylogeny, and deduced evolutionary patterns},
    author={De Las Rivas, Javier and Lozano, Juan Jose and Ortiz, Angel R},
diff --git a/classEquiv.tex b/classEquiv.tex

index b77c51ac3bddcd20373d15cbd79facfd35360925..41abda49eafb820da73bfe905fea10d0361d7be4 100644 (file)
--- a/classEquiv.tex
+++ b/classEquiv.tex
@@ -1,55 +1,28 @@
-This step considers as input the set 
-$\{((g_1,g_2),r_{12}), (g_1,g_3),r_{13}), (g_{n-1},g{n}),r_{n-1.n})\}$ of 
-$\frac{n(n-1)}{2}$ elements. 
-Each one $(g_i,g_j),r_{ij})$ where $i < j$, 
-is a pair that gives the similarity rate $r_{ij}$ between the two genes  
-$g_{i}$ and $g_{j}$.
-
-The first step of this stage consists in building the following non-oriented
-graph further denoted as to \emph{similarity graph}.
-In this one, the vertices are the genes. There is an edge between 
-$g_{i}$ and $g_{j}$ if the rate $r_{ij}$ is greater than a given similarity 
-threshold $t$.
-
-We then define the relation $\sim$  such that
-$ x \sim y$ if $x$ and $y$ belong in the same connected component.
-Mathematically speaking, it is obvious that this 
-defines an equivalence relation. 
-Let $\dot{x}= \{y  | x \sim y\}$
-denotes the equivalence class to which $x$ belongs.
-All the genes which are  equivalent to each other
-are also elements of the same equivalence class.
-Let us then consider the set of all equivalence classes of the set of genes 
-by $\sim$, denoted $X/\sim = \{\dot{x} | x \textrm{ is a gene}\}$. 
-defined by $\pi(x) = \dot{x}$
-which maps each gene  into it respective equivalence class by $\sim$.
-
-
-
-
-For each genome $[g_l,\ldots,g{l+m}]$, the second step computes 
-the projection of each gene according to $\pi$. 
-The resulting genome  which is 
-$$
-[\pi(g_l),\ldots,\pi(g{l+m})]
-$$ 
-is again of size $m$.
-
-Intuitively speaking, for two genes $g_i$ and $g_j$ 
-in the same equivalence class, there is path from  $g_i$ and $g_j$.
-It signifies that  each evolution step 
-(represented by an edge in the similarity graph) 
-has produced a gene s.t. the similarity with the previous one 
-is greater than $t$. 
-Genes $g_i$ and $g_j$ may thus have a common ancestor.
-
-
-We compute the core genome as follow.
-Each genome is projected according to $\pi$. We then consider the 
-intersection of all the projected genomes which are considered as sets of genes
-and not as sequences of genes.
-This results as the set of all the class $\dot{x}$
-such that each genome has an gene $x$ in  $\dot{x}$.
-The pan genome is computed similarly: the union of all the 
-projected genomes in computed here.
-
+Identifying  core genes  is important  to understand  evolutionary and
+functional phylogenies. Therefore, in this work we present two methods
+to build a  gene content evolutionary tree. More  precisely, we focus
+on   the    following   questions,   considering    a   collection   of
+99~chloroplasts  annotated with  NCBI \cite{Sayers01012011} and  Dogma
+\cite{RDogma}: how can we identify the best core genome, and what
+is the evolutionary scenario of these chloroplasts?
+Two methods are considered here. The first one, explained below, is based on the NCBI annotation.
+We start with the following definition.
+\begin{definition}
+\label{def1}
+Let $A=\{A,T,C,G\}$ be the nucleotide alphabet, and $A^\ast$ be the set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let $d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on $A^{\ast}$. Consider a given value $T\in[0,1]$ called a threshold. For all $x,y\in A^{\ast}$, we will say that $x\sim_{d,T}y$ if $d(x,y)\leqslant T$. 
+\end{definition}
+
+\noindent$\sim_{d,T}$ is reflexive and symmetric but not transitive in general; an equivalence relation is obtained from it below through connected components. When $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the EMBOSS package (Needleman-Wunsch released by EMBL), we will simply denote $\sim_{d,0.1}$ by $\sim$. The method starts by building an undirected graph based on
+the similarity rates $r_{ij}$  between sequences $g_{i}$ and $g_{j}$ (\emph{i.e.}, $r_{ij}=\Delta(g_{i},g_{j})$).
+In this graph, the nodes are all the coding sequences of the set of genomes under consideration, and there is an edge between $g_{i}$ and $g_{j}$ if the 
+similarity rate $r_{ij}$ is
+greater than the given similarity threshold. The Connected Components
+(CC) of the ``similarity'' graph are then computed.
+This produces an equivalence 
+relation between sequences in the same CC based on Definition~\ref{def1}.
+Any class of this relation is called a ``gene'' here, and its representatives (DNA sequences) are the ``alleles'' of this gene. Thus, for each genome $G$, which is a set $\{g_{1}^G,...,g_{m_G}^G\}$ of $m_{G}$ DNA coding sequences, this first method produces the projection of each sequence according to $\pi$, where $\pi$ maps each sequence
+into its gene (class) according to $\sim$. In other words, $G$ is mapped into $\{\pi(g_{1}^G),...,\pi(g_{m_G}^G)\}$.  
+Note that a projected genome has no duplicated gene, as it is a set. The core  genome (resp. the pan genome) of $G_{1}$ and $G_{2}$ is thus defined as the intersection (resp. the union) of their projected genomes.\\
+We then consider the intersection of all the projected genomes, which is the set of all the genes $\dot{x}$
+such that each genome has at least one allele in $\dot{x}$. The pan genome is computed similarly as the union of all the projected genomes. However, this approach produces core genomes that are too small, 
+for any chosen similarity threshold, compared to what biologists usually expect for these chloroplasts. We are then left with the following questions: how can we improve the confidence put in the produced core? Can we thus guess the evolution scenario of these genomes?
+\ No newline at end of file
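
The similarity graph, its connected components, and the resulting core and pan genomes described in classEquiv.tex could be sketched as below. This is a hedged illustration under stated assumptions, not the authors' code: sequences are represented by string identifiers, \texttt{similarity} is a hypothetical callable standing in for $\Delta$ (no EMBOSS call is made), and the names \texttt{gene\_classes} and \texttt{core\_and\_pan} are invented for this example. Connected components are computed with a small union-find structure.

```python
def gene_classes(sequences, similarity, threshold):
    """Map each sequence id to a representative of its connected component."""
    parent = {s: s for s in sequences}

    def find(s):
        # Follow parent links to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    ids = list(sequences)
    for a in range(len(ids)):
        for b in range(a + 1, len(ids)):
            # Edge of the similarity graph: rate above the threshold.
            if similarity(ids[a], ids[b]) > threshold:
                parent[find(ids[a])] = find(ids[b])
    return {s: find(s) for s in ids}


def core_and_pan(genomes, classes):
    """Project each genome (a set of sequence ids) and intersect / unite."""
    projected = [{classes[s] for s in g} for g in genomes]
    return set.intersection(*projected), set.union(*projected)


# Toy similarity rates, invented for illustration only.
rates = {
    frozenset(("a1", "b1")): 0.95,
    frozenset(("b1", "c1")): 0.93,
    frozenset(("a2", "b2")): 0.40,
}


def similarity(x, y):
    return rates.get(frozenset((x, y)), 0.0)


genomes = [{"a1", "a2"}, {"b1", "b2"}, {"c1"}]
all_sequences = sorted(set().union(*genomes))
classes = gene_classes(all_sequences, similarity, threshold=0.9)
core, pan = core_and_pan(genomes, classes)
print(len(core), len(pan))
```

In this toy data, a1 and c1 have no direct edge but end up in the same class through b1, which is the effect of taking connected components rather than the pairwise relation alone. The all-pairs loop is quadratic in the number of sequences and assumes at least one genome is given.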
diff --git a/main.tex b/main.tex

index 162054bae25892c41960d305f5644b02dad8dc35..921954bd512db7ccad7d4fe25f573fbed33063f7 100755 (executable)
--- a/main.tex
+++ b/main.tex
@@ -9,7 +9,12 @@
  \usepackage{pdflscape}
  \usepackage{multirow,longtable}
  \usepackage{amsmath,mathtools}
+\usepackage{amssymb}
+\usepackage[standard]{ntheorem}
+\usepackage{stmaryrd}
  \usepackage[utf8]{inputenc}
+\usepackage{tikz}
+\usetikzlibrary{shapes,arrows}
  
  
  % correct bad hyphenation here
diff --git a/population_Table.tex b/population_Table.tex

index 9915d1a937f064cd401e4b0b1f16e40a5cec3a1f..b184f60c174f7c452ba27fa9caea4f1cf316a51f 100644 (file)
--- a/population_Table.tex
+++ b/population_Table.tex
@@ -2,6 +2,7 @@
    
    \begin{table}
      \tiny
+    \caption[NCBI Genomes Families]{List of family groups of Chloroplast Genomes from NCBI\label{Tab2}}
      \begin{minipage}{0.50\textwidth}
        \setlength{\tabcolsep}{4pt}
        \begin{tabular}{|p{0.1cm}|p{0.1cm}|p{1.3cm}|p{3cm}|}
@@ -160,11 +161,7 @@
    Dinoflagellates,
    Euglena,
    Haptophytes, and Lycopodiophyta respectively.
-
    \normalsize
-  \caption[NCBI Genomes Families]{List of family groups of Chloroplast Genomes from NCBI\label{Tab2}}
-
-
    \end{table}
  \end{center}  
  