%\subsubsection{Using gene names provided by annotation tools}
%Instead of using the sequences predicted by annotation tools, we can
%try to use the names associated with these sequences, when available.
%The basic idea is thus to annotate all the sequences using a given
%software, and to consider as a core gene each sequence whose name can
%be found in all the genomes.
%Two annotation techniques will be used in the remainder of this article,
%namely DOGMA and NCBI.
%It is true that the NCBI annotations are of varying
%quality, and sometimes such annotations are totally erroneous. As stated before, this is due to the
%large variety of annotation tools that can be used during each
%sequence submission process. However, we also considered it in this
%article, as this database contains human-curated annotations. To say this
%another way, DOGMA automatic annotations are good on average, while
%NCBI contains very good human-based annotations together with very badly
%Let us finally remark that DOGMA also predicts the locations of
%\textit{ribosomal RNA (rRNA)}, while they are not provided in
%gene features from NCBI. Thus core genomes constructed on NCBI
%data will not contain rRNA.
We now investigate the design of core and pan genomes
using each of the two tools separately; this constitutes the second
approach detailed in this article. From now on we consider annotated
genomes: either the ``gene features'' downloaded from the NCBI, or the
%\subsubsection{Names processing}
%As DOGMA is a deterministic annotation tool, when a given gene
%is detected in two genomes, the same name will be attached
%to the two coding sequences: DOGMA spells the two gene names
%in exactly the same manner. So each genome is replaced by a list of gene
%names, and finding the core genes common to two genomes simply
%consists in intersecting the two lists of genes. The sole problem
%we have detected using DOGMA on our 97 chloroplasts is the case
%of the RPS12 gene: some genomes contain RPS12\_3end
%or RPS12\_5end in the DOGMA result. We have manually
%considered that all these representatives belong to the same gene,
%Dealing with NCBI names is more complicated, as various annotation
%tools have been used together with human annotations, and because there is
%no spelling rule for gene names. For instance, the NAD6 mitochondrial gene is
%sometimes written as ND6, while we can find RPOC1, RPOC1A, and RPOC1B in
%our chloroplasts. So if we simply consider NCBI data without
%treatment, intersecting two genomes provided as lists of gene names often
%leads to duplication of misspelled genes. Automatic name homogenization is thus required
%on NCBI annotations, the question being where to draw the line
%when correcting errors in the spelling of genes. In this second approach,
%we propose to automate only obvious modifications like putting all names
%in capital letters and removing useless symbols such as ``\_'', ``('', and ``)''.
%Note that such a simple renaming process cannot handle the situations of NAD6 or
%RPOC1 mentioned above. Going further in automatic correction requires
%edit distances such as the Levenshtein distance; however, such a use will
%raise false positives (different genes with close names will be homogenized).
%To solve this problem, a compromise that reduces the number of false positives, by considering the similarity between DNA sequences of genes having similar names, will be detailed in the third approach.
%At this stage, we now consider that each genome is mapped to a list of gene
%names, where names have been homogenized in the NCBI case.
%\subsubsection{Core genes extraction}
%% The goal of this stage is to extract maximum core genes from sets of
%% genes. To find core genes, the following methodology is applied.
%%\subsubsection{Intersection Core Matrix (\textit{ICM})}
%To extract core genes, we iteratively collect the maximum number of
%common genes between genomes; during this stage
%an \textit{Intersection Core Matrix} (ICM) is built. The ICM is a
%two-dimensional symmetric matrix where each row and each column corresponds
%to one genome. Hence, an element of the matrix stores
%the \textit{Intersection Score} (IS): the cardinality of the core
%gene set obtained by intersecting the two genomes.
%%Maximum cardinality results in selecting the two genomes having
%Mathematically speaking, if we have $n$ genomes in
%the local database, the ICM is an $n \times n$ matrix whose elements
%score_{ij}=\vert g_i \cap g_j\vert
%\noindent where $1 \leq i \leq n$, $1 \leq j \leq n$, and $g_i, g_j$ are
%genomes. The generation of a new core genome obviously depends on the
%value of the intersection scores $score_{ij}$. More precisely, the
%idea is to consider a pair of genomes such that their score is the
%largest element in the ICM. These two genomes are then removed from the matrix
%and the resulting new core genome is added for the next iteration.
%The ICM is then updated to take into account the new core genome: new IS
%values are computed for it. This process is repeated until no new core
%genome can be obtained.
%We can observe that the ICM is relatively large due to the number of
%species. As a consequence, the computation of the intersection scores is
%both time and memory consuming. However, since the ICM is obviously a symmetric
%matrix, we can reduce the computation overhead by considering only its
%upper triangular part. The time complexity for this process %after
%is thus $O\left(\frac{n(n-1)}{2}\right)$. Algorithm~\ref{Alg1:ICM}
%illustrates the construction of the ICM matrix and the extraction of
%the core genomes, where \textit{GenomeList} represents the database
%storing all genomes data. At each iteration, this algorithm computes the maximum
%core genome with its two parents (genomes).
%% ALGORITHM HAS BEEN REWRITTEN
%\begin{algorithm}[H]
%\caption{Extract Maximum Intersection Score}
%\REQUIRE $L \leftarrow \text{genomes sets}$
%\ENSURE $B1 \leftarrow \text{Max Core set}$
%\FOR{$i \leftarrow 1:len(L)-1$}
% \STATE $score \leftarrow 0$
% \STATE $core1 \leftarrow set(GenomeList[L[i]])$
% \STATE $g1 \leftarrow L[i]$
% \FOR{$j \leftarrow i+1:len(L)$}
% \STATE $core2 \leftarrow set(GenomeList[L[j]])$
% \STATE $core \leftarrow core1 \cap core2$
% \IF{$len(core) > score$}
% \STATE $score \leftarrow len(core)$
% \STATE $g2 \leftarrow L[j]$
% \STATE $B1[score] \leftarrow (g1,g2)$
%For complete core trees based either on NCBI names or on DOGMA ones, see \url{http://members.femto-st.fr/christophe-guyeux/}.
%%\color{red} The second approach is dependent on gene names spelling. When realizing simple homogenization of names provided by NCBI, we miss core genes which have slightly different name formats. A good annotation tool is therefore highly required. \color{black}
%%\subsection{Features visualization}
%%The last stage of the proposed pipeline is naturally to take advantage
%%of the produced core and pan genomes for biological studies. As
%%this key stage is not directly related to the methodology for core
%%and pan genomes discovery, we will only outline a few tasks that
%%can be operated on the produced data.
%%\includegraphics[scale=0.215]{tree}
%%\caption{Part of a core genomes evolutionary tree (NCBI gene names)}
%%Obtained results may be visualized by building a core genomes evolutionary tree.
%%% All core genes generated represent an important information in the tree,
%%% because they provide ancestor information of two or more
%%Each node in this tree represents a chloroplast genome or
%%a predicted core, as depicted in Figure~\ref{coreTree}. In this
%%figure, node labels are of the form
%%\textit{(Genes number:Family name\_Scientific name\_Accession number)},
%%while an edge is labeled with the number of
%%genes lost when compared to its parents (a leaf genome or an intermediate
%%core genome). Such numbers can answer questions like:
%%how many genes are different between two species? Which functionality has
%%been lost between an ancestor and its children? For complete core trees
%%based either on NCBI names or on DOGMA ones, see supplementary data.
%%A second application of such data is obviously to build accurate phylogenetic
%%trees, using tools like
%%PHYML~\cite{guindon2005phyml} or
%%RAxML~\cite{stamatakis2008raxml,stamatakis2005raxml}.
%%Consider a set of species: the last common core genome in the core tree
%%contains all the genes shared by these species. These genes may be
%%multi-aligned to serve as input of the phylogenetic tools mentioned above.
%%An example of such a phylogenetic tree on core 58 (NCBI cores tree, see
%%supplementary data) is provided in Appendix~\ref{philoTree}. Note that, in
%%order to constitute a relevant outgroup, we have simply blasted each gene
%%of this core on a chosen \emph{Cyanobacteria}.