\subsection{Genome Samples}
In this research, we retrieved chloroplast genomes from NCBI, ninety-nine of which are considered in this work. These genomes fall into eleven types of chloroplast families. Their distribution is detailed in Table~\ref{Tab2}.
-
-\input{population_Table}
-
+
+\input{population_Table}
\subsection{Genome Annotation Techniques}
Genome annotation is the second stage in the model pipeline. Many techniques have been developed to annotate chloroplast genomes, but they vary in the number and type of predicted genes (\emph{e.g.}, in their ability to predict \textit{Transfer RNA (tRNA)} and \textit{Ribosomal RNA (rRNA)} genes). Two annotation techniques, from NCBI and Dogma, are considered here to analyse chloroplast genomes and to examine the accuracy of the predicted coding genes.
\subsubsection{Intersection Core Matrix (\textit{ICM})}
-The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. ICM is a two dimensional symmetric matrix where each row and each column represents one genome. Each position in ICM stores the \textit{Intersection Scores(IS)}. IS is the cardinality number of a core genes which comes from intersecting one genome with other ones. Maximum cardinality results to select two genomes with their maximum core. Mathematically speaking, if we have an $n \times n$ matrix where $n \text{is the number of genomes in local database}$, then lets consider:\\
+The idea behind extracting core genes is to iteratively collect the maximum number of common genes between two genomes. To do so, the system builds an \textit{Intersection Core Matrix (ICM)}. The ICM is a two-dimensional symmetric matrix in which each row and each column represents one genome. Each position in the ICM stores an \textit{Intersection Score (IS)}, that is, the cardinality of the core genes obtained by intersecting one genome with another. The maximum cardinality selects the two genomes that share the largest core. Mathematically speaking, if we have an $n \times n$ matrix where $n$
+is the number of genomes in the local database, then let us consider:\\
+
\begin{equation}
Score=\max_{i<j}\vert x_i \cap x_j\vert
\label{Eq1}
\end{equation}
-where $x_i, x_j$ are elements in the matrix. The generation of a new core genes is depending on the cardinality value of intersection scores, we call it \textit{Score}:
+
+\noindent where $x_i, x_j$ are elements in the matrix. The generation of a new core depends on the cardinality of the intersection, which we call \textit{Score}:
$$\text{New Core} = \begin{cases}
\text{Ignored} & \text{if $\textit{Score}=0$;} \\
\text{new Core id} & \text{if $\textit{Score}>0$.}
+@article{Sayers01012011,
+author = {Sayers, Eric W. and Barrett, Tanya and Benson, Dennis A. and Bolton, Evan and Bryant, Stephen H. and Canese, Kathi and Chetvernin, Vyacheslav and Church, Deanna M. and DiCuccio, Michael and Federhen, Scott and Feolo, Michael and Fingerman, Ian M. and Geer, Lewis Y. and Helmberg, Wolfgang and Kapustin, Yuri and Landsman, David and Lipman, David J. and Lu, Zhiyong and Madden, Thomas L. and Madej, Tom and Maglott, Donna R. and Marchler-Bauer, Aron and Miller, Vadim and Mizrachi, Ilene and Ostell, James and Panchenko, Anna and Phan, Lon and Pruitt, Kim D. and Schuler, Gregory D. and Sequeira, Edwin and Sherry, Stephen T. and Shumway, Martin and Sirotkin, Karl and Slotta, Douglas and Souvorov, Alexandre and Starchenko, Grigory and Tatusova, Tatiana A. and Wagner, Lukas and Wang, Yanli and Wilbur, W. John and Yaschenko, Eugene and Ye, Jian},
+title = {Database resources of the National Center for Biotechnology Information},
+volume = {39},
+number = {suppl 1},
+pages = {D38-D51},
+year = {2011},
+doi = {10.1093/nar/gkq1172},
+URL = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.abstract},
+eprint = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.full.pdf+html},
+journal = {Nucleic Acids Research}
+}
+
+@Article{RDogma,
+AUTHOR = {Stacia K. Wyman and Robert K. Jansen and Jeffrey L. Boore},
+TITLE = {Automatic annotation of organellar genomes
+with DOGMA},
+JOURNAL = {Bioinformatics},
+VOLUME = {20},
+YEAR = {2004},
+NUMBER = {17},
+PAGES = {3252-3255},
+URL={http://www.biosci.utexas.edu/ib/faculty/jansen/pubs/Wyman%20et%20al.%202004.pdf},
+}
+
@article{SMMR+13,
title={Genomic analysis of smooth tubercle bacilli provides insights into ancestry and pathoadaptation of Mycobacterium tuberculosis},
url={http://www.nature.com/ng/journal/v45/n2/full/ng.2517.html},
doi = {10.1371/journal.pbio.0050082}
}
-@article{Sayers01012011,
-author = {Sayers, Eric W. and Barrett, Tanya and Benson, Dennis A. and Bolton, Evan and Bryant, Stephen H. and Canese, Kathi and Chetvernin, Vyacheslav and Church, Deanna M. and DiCuccio, Michael and Federhen, Scott and Feolo, Michael and Fingerman, Ian M. and Geer, Lewis Y. and Helmberg, Wolfgang and Kapustin, Yuri and Landsman, David and Lipman, David J. and Lu, Zhiyong and Madden, Thomas L. and Madej, Tom and Maglott, Donna R. and Marchler-Bauer, Aron and Miller, Vadim and Mizrachi, Ilene and Ostell, James and Panchenko, Anna and Phan, Lon and Pruitt, Kim D. and Schuler, Gregory D. and Sequeira, Edwin and Sherry, Stephen T. and Shumway, Martin and Sirotkin, Karl and Slotta, Douglas and Souvorov, Alexandre and Starchenko, Grigory and Tatusova, Tatiana A. and Wagner, Lukas and Wang, Yanli and Wilbur, W. John and Yaschenko, Eugene and Ye, Jian},
-title = {Database resources of the National Center for Biotechnology Information},
-volume = {39},
-number = {suppl 1},
-pages = {D38-D51},
-year = {2011},
-doi = {10.1093/nar/gkq1172},
-URL = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.abstract},
-eprint = {http://nar.oxfordjournals.org/content/39/suppl_1/D38.full.pdf+html},
-journal = {Nucleic Acids Research}
-}
-
-@Article{RDogma,
-AUTHOR = {Stacia K. Wyman, Robert K. Jansen and Jeffrey L. Boore},
-TITLE = {Automatic annotation of organellar genomes
-with DOGMA},
-JOURNAL = {BIOINFORMATICS, oxford Press},
-VOLUME = {20},
-YEAR = {2004},
-NUMBER = {172004},
-PAGES = {3252-3255},
-URL={http://www.biosci.utexas.edu/ib/faculty/jansen/pubs/Wyman%20et%20al.%202004.pdf},
-}
-
@article{de2002comparative,
title={Comparative analysis of chloroplast genomes: functional annotation, genome-based phylogeny, and deduced evolutionary patterns},
author={De Las Rivas, Javier and Lozano, Juan Jose and Ortiz, Angel R},
-This step considers as input the set
-$\{((g_1,g_2),r_{12}), (g_1,g_3),r_{13}), (g_{n-1},g{n}),r_{n-1.n})\}$ of
-$\frac{n(n-1)}{2}$ elements.
-Each one $(g_i,g_j),r_{ij})$ where $i < j$,
-is a pair that gives the similarity rate $r_{ij}$ between the two genes
-$g_{i}$ and $g_{j}$.
-
-The first step of this stage consists in building the following non-oriented
-graph further denoted as to \emph{similarity graph}.
-In this one, the vertices are the genes. There is an edge between
-$g_{i}$ and $g_{j}$ if the rate $r_{ij}$ is greater than a given similarity
-threshold $t$.
-
-We then define the relation $\sim$ such that
-$ x \sim y$ if $x$ and $y$ belong in the same connected component.
-Mathematically speaking, it is obvious that this
-defines an equivalence relation.
-Let $\dot{x}= \{y | x \sim y\}$
-denotes the equivalence class to which $x$ belongs.
-All the genes which are equivalent to each other
-are also elements of the same equivalence class.
-Let us then consider the set of all equivalence classes of the set of genes
-by $\sim$, denoted $X/\sim = \{\dot{x} | x \textrm{ is a gene}\}$.
-defined by $\pi(x) = \dot{x}$
-which maps each gene into it respective equivalence class by $\sim$.
-
-
-
-
-For each genome $[g_l,\ldots,g{l+m}]$, the second step computes
-the projection of each gene according to $\pi$.
-The resulting genome which is
-$$
-[\pi(g_l),\ldots,\pi(g{l+m})]
-$$
-is again of size $m$.
-
-Intuitively speaking, for two genes $g_i$ and $g_j$
-in the same equivalence class, there is path from $g_i$ and $g_j$.
-It signifies that each evolution step
-(represented by an edge in the similarity graph)
-has produced a gene s.t. the similarity with the previous one
-is greater than $t$.
-Genes $g_i$ and $g_j$ may thus have a common ancestor.
-
-
-We compute the core genome as follow.
-Each genome is projected according to $\pi$. We then consider the
-intersection of all the projected genomes which are considered as sets of genes
-and not as sequences of genes.
-This results as the set of all the class $\dot{x}$
-such that each genome has an gene $x$ in $\dot{x}$.
-The pan genome is computed similarly: the union of all the
-projected genomes in computed here.
-
+Identifying core genes is important for understanding evolutionary and
+functional phylogenies. Therefore, in this work we present two methods
+to build a gene-content evolutionary tree. More precisely, we focus
+on the following questions, considering a collection of
+99~chloroplasts annotated from NCBI \cite{Sayers01012011} and Dogma
+\cite{RDogma}: how can we identify the best core genome, and what
+is the evolutionary scenario of these chloroplasts?
+Two methods are considered here. The first one, based on the NCBI annotation, is explained below.
+We start with the following definition.
+\begin{definition}
+\label{def1}
+Let $A=\{A,T,C,G\}$ be the nucleotide alphabet, and $A^\ast$ be the set of finite words on $A$ (\emph{i.e.}, of DNA sequences). Let $d:A^{\ast}\times A^{\ast}\rightarrow[0,1]$ be a distance on $A^{\ast}$. Consider a given value $T\in[0,1]$, called a threshold. For all $x,y\in A^{\ast}$, we say that $x\sim_{d,T}y$ if $d(x,y)\leqslant T$.
+\end{definition}
+
+\noindent The relation $\sim_{d,T}$ is reflexive and symmetric, and it becomes an equivalence relation once closed by transitivity, which is what the connected components computed below provide. When $d=1-\Delta$, where $\Delta$ is the similarity scoring function embedded into the EMBOSS package (Needleman-Wunsch released by EMBL), we simply denote by $\sim$ the equivalence relation obtained from $\sim_{d,0.1}$. The method starts by building an undirected graph based on
+the similarity rates $r_{ij}$ between sequences $g_{i}$ and $g_{j}$ (\emph{i.e.}, $r_{ij}=\Delta(g_{i},g_{j})$).
+In this graph, the nodes are all the coding sequences of the set of genomes under consideration, and there is an edge between $g_{i}$ and $g_{j}$ if the
+similarity rate $r_{ij}$ is
+greater than the given similarity threshold. The Connected Components
+(CC) of this ``similarity'' graph are then computed.
+This produces an equivalence
+relation between sequences in the same CC, based on Definition~\ref{def1}.
+Any class for this relation is called a ``gene'' here, and its representatives (DNA sequences) are the ``alleles'' of this gene. Thus this first method produces, for each genome $G$, which is a set $\{g_{1}^G,\ldots,g_{m_G}^G\}$ of $m_{G}$ DNA coding sequences, the projection of each sequence according to $\pi$, where $\pi$ maps each sequence
+into its gene (class) according to $\sim$. In other words, $G$ is mapped into $\{\pi(g_{1}^G),\ldots,\pi(g_{m_G}^G)\}$.
+Note that a projected genome has no duplicated gene, as it is a set. The core genome (resp. the pan genome) of $G_{1}$ and $G_{2}$ is thus defined as the intersection (resp. the union) of these projected genomes.\\
+We then consider the intersection of all the projected genomes, which is the set of all the genes $\dot{x}$
+such that each genome has at least one allele in $\dot{x}$. The pan genome is computed similarly, as the union of all the projected genomes. However, this approach produces core genomes that are too small,
+for any chosen similarity threshold, compared to what biologists usually expect for these chloroplasts. We are then left with the following questions: how can we improve the confidence placed in the produced core? Can we thus infer the evolutionary scenario of these genomes?
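The similarity-graph construction described above can be sketched in a few lines of Python. This is only an illustrative sketch under stated assumptions, not the authors' implementation: the function names (`similarity`, `gene_classes`, `core_and_pan`) are hypothetical, the toy sequences are invented, and `similarity` is a simple position-wise identity ratio standing in for the Needleman-Wunsch score $\Delta$ from EMBOSS. Connected components are obtained with a union-find structure, each genome is projected onto its gene classes, and the core and pan genomes are the intersection and union of the projected genomes.

```python
from itertools import combinations


def similarity(a: str, b: str) -> float:
    """Stand-in for Delta (NOT Needleman-Wunsch): fraction of identical
    positions, normalised by the longer sequence. Assumption for illustration."""
    if not a and not b:
        return 1.0
    matches = sum(x == y for x, y in zip(a, b))
    return matches / max(len(a), len(b))


def gene_classes(sequences, threshold, sim=similarity):
    """Return a dict mapping each sequence to a representative of its
    connected component in the similarity graph (the projection pi)."""
    seqs = sorted(set(sequences))
    parent = {s: s for s in seqs}

    def find(s):
        # Follow parent links to the root, shortening the path on the way.
        while parent[s] != s:
            parent[s] = parent[parent[s]]
            s = parent[s]
        return s

    # Edge between two sequences when their similarity exceeds the threshold.
    for a, b in combinations(seqs, 2):
        if sim(a, b) > threshold:
            parent[find(a)] = find(b)
    return {s: find(s) for s in seqs}


def core_and_pan(genomes, threshold, sim=similarity):
    """genomes: dict name -> list of coding sequences.
    Returns (core, pan, projected), where projected maps each genome
    name to its set of gene classes."""
    all_seqs = [s for seqs in genomes.values() for s in seqs]
    pi = gene_classes(all_seqs, threshold, sim)
    projected = {name: {pi[s] for s in seqs} for name, seqs in genomes.items()}
    core = set.intersection(*projected.values())
    pan = set.union(*projected.values())
    return core, pan, projected


if __name__ == "__main__":
    # Invented toy data: three "genomes" with two coding sequences each.
    toy = {
        "G1": ["ATGCATGCAT", "GGGGCCCCAA"],
        "G2": ["ATGCATGCAA", "TTTTTTTTTT"],
        "G3": ["ATGCATGCTT", "GGGGCCCCAT"],
    }
    core, pan, projected = core_and_pan(toy, threshold=0.8)
    print("core size:", len(core), "pan size:", len(pan))
```

In the toy data, two of the sequences are not directly similar above the threshold but end up in the same class because both are similar to a third one; this is the transitive grouping that the connected components provide.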
\ No newline at end of file
\usepackage{pdflscape}
\usepackage{multirow,longtable}
\usepackage{amsmath,mathtools}
+\usepackage{amssymb}
+\usepackage[standard]{ntheorem}
+\usepackage{stmaryrd}
\usepackage[utf8]{inputenc}
+\usepackage{tikz}
+\usetikzlibrary{shapes,arrows}
% correct bad hyphenation here
\begin{table}
\tiny
+ \caption[NCBI Genomes Families]{List of family groups of Chloroplast Genomes from NCBI\label{Tab2}}
\begin{minipage}{0.50\textwidth}
\setlength{\tabcolsep}{4pt}
\begin{tabular}{|p{0.1cm}|p{0.1cm}|p{1.3cm}|p{3cm}|}
Dinoflagellates,
Euglena,
Haptophytes, and Lycopodiophyta respectively.
-
\normalsize
- \caption[NCBI Genomes Families]{List of family groups of Chloroplast Genomes from NCBI\label{Tab2}}
-
-
\end{table}
\end{center}