The approach further relies on the ability to decide how far each
genome is from the others. To achieve this, we combine XXX metrics, which are
detailed in this part.

\subsection{Core SNP based Metric}
By definition of the core genome, for each element $\dot{x}$
of this set, there is a gene $x \in \dot{x}$ in each genome.
Let us consider a class
$\dot{x} = \{y \mid x \sim y\}$.

\JFC{We should be consistent: two close genomes should everywhere have
either a high metric value or a very low one}

%1/ On SNPs of the strict core genome
All the genes $y$ of such a class are thus aligned
thanks to a global alignment tool, and the SNPs can then be extracted.
For each genome, one can thus compute the vector of Boolean values
storing at index $i$ whether SNP $i$ is present in one of its genes
(positive value) or not (null value).
The Hamming distance between two such vectors gives the distance
between the two corresponding genomes.
This metric is hereafter referred to as $m_S$.
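
As an illustration, $m_S$ can be written explicitly. The notation below
($s$ for the number of extracted SNPs, $v_A$ and $v_B$ for the Boolean vectors
of two genomes $A$ and $B$) is introduced here only for the example:
$$
m_S(A,B) = \sum_{i=0}^{s-1} \left| v_A[i] - v_B[i] \right|.
$$
For instance, with $s=4$, $v_A = (1,0,1,1)$ and $v_B = (1,1,0,1)$, the two
vectors differ at indices $1$ and $2$ only, so that $m_S(A,B) = 2$.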

% the more differences there are, the higher the value

%2/ On SNPs of the strict core genome, each gene having the same weight
With $m_S$, genes do not all have the same influence on the
metric value: a gene with many SNPs weighs more in
the computation than a gene with fewer ones.
The metric hereafter referred to as $m_{|S|}$ gives the same weight to each gene,
regardless of the number of SNPs it contains.
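
The text does not fix how this equal weighting is obtained. One possible
formalization, given purely as an assumption, normalizes the contribution of
each core gene by its number of SNPs. The notation is hypothetical: $n$ is the
number of core genes, $S_g$ is the set of SNP indices of gene $g$ (assumed
non-empty), and $v_A$, $v_B$ are the Boolean SNP vectors of two genomes $A$
and $B$:
$$
m_{|S|}(A,B) = \sum_{g=1}^{n} \frac{1}{|S_g|} \sum_{i \in S_g} \left| v_A[i] - v_B[i] \right|.
$$
Under this assumption, a gene with $10$ SNPs of which $5$ differ and a gene
with $2$ SNPs of which $1$ differs both contribute $1/2$ to the metric.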

% the more differences there are, the higher the value

\subsection{Symmetric Difference based Metric}
The third metric considers the symmetric difference $\Delta$
between the two sets $G_1$ and $G_2$ of genes, recalled hereafter:
$$
G_1\Delta G_2 =
(G_1\cup G_2)\setminus (G_1\cap G_2) = (G_1\setminus G_2)\cup(G_2\setminus G_1).
$$
The cardinality of $G_1\Delta G_2$ gives the metric.
This metric is hereafter referred to as $m_{\Delta}$.

In practice, let $k$ be the number of equivalence classes. By definition of the pan genome, this number is equal to the cardinality of the pan genome.
For each genome, if we only consider which genes belong to it, \textit{i.e.}, if we abstract away the positions at which these genes appear, this genome can be
stored as a vector of $k$ Boolean values. The element at index $i$, $0 \le i \le k-1$, is true if and only if the $i$-th gene of the pan genome belongs to this
genome.
The metric $m_{\Delta}$ is equal to the Hamming distance between the two corresponding
vectors of Boolean values.
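
As a small worked example with hypothetical values, take $k=5$,
$G_1 = \{0,1,3\}$ and $G_2 = \{1,2,3,4\}$, where genes are identified by their
index in the pan genome. The corresponding vectors are
$(1,1,0,1,0)$ and $(0,1,1,1,1)$, which differ at indices $0$, $2$ and $4$.
Accordingly,
$$
G_1 \Delta G_2 = \{0,1,2,3,4\} \setminus \{1,3\} = \{0,2,4\},
$$
and the metric equals $3$ in both formulations.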

% the more differences there are, the higher the value

% 4/ Using EPFL method
\subsection{Adjacency based Metric}
Following~\cite{23424133}, a sequence
of all the adjacencies that are present in
at least one genome is computed. This sequence
is augmented with the pan genome content.
Then, each genome is compared
with this sequence and a Boolean vector is produced with the following rule.
%If the element $i$ in the sequence of adjacencies or content is present
%in the genome, the

\subsection{Shared Synteny based Metric}
Given two genomes abstracted as sequences of classes, it is classical
to compute all the maximum shared synteny chains.

% Caution: here, the fewer differences there are, the higher the value
Three metrics can then be derived from such a set of shared synteny chains:
\begin{itemize}
\item let $m_{Y}$ be the metric that returns the
length of the longest chain;
\item let $m_{\overline{Y}}$ be the metric that returns the
average length of the synteny chains;
\item finally, let $m_{|Y|}$ be the metric that returns the
number of synteny chains.
\end{itemize}
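
These three definitions can be written explicitly. The notation
$C = \{c_1,\dots,c_n\}$ for the set of shared synteny chains of two genomes,
with $n \ge 1$ and $|c_j|$ the length of chain $c_j$, is an assumption made
for this illustration:
$$
m_{Y} = \max_{1\le j\le n} |c_j|, \qquad
m_{\overline{Y}} = \frac{1}{n}\sum_{j=1}^{n} |c_j|, \qquad
m_{|Y|} = n.
$$
For instance, three chains of lengths $5$, $3$ and $1$ give
$m_{Y} = 5$, $m_{\overline{Y}} = 3$ and $m_{|Y|} = 3$.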