correc preface

[book_gpu.git] / BookGPU / Chapters / chapter12 / ch12.tex~
diff --git a/BookGPU/Chapters/chapter12/ch12.tex~ b/BookGPU/Chapters/chapter12/ch12.tex~

index a3deec08ff28e5a244029f02f026b7381cd611fa..eaa4f9cf5463126e3e4102134007aa09f5797d3f 100755 (executable)
--- a/BookGPU/Chapters/chapter12/ch12.tex~
+++ b/BookGPU/Chapters/chapter12/ch12.tex~
@@ -4,138 +4,158 @@
  %%                          %%
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   
-\chapterauthor{}{}
+%\chapterauthor{}{}
+\chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
+
  \chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters}
+\label{ch12}
  
  %%--------------------------%%
  %%       SECTION 1          %%
  %%--------------------------%%
  \section{Introduction}
-\label{sec:01}
-The sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or
-the industrial processing of the complex or non-Newtonian fluids. Moreover, the resolution of these problems often involves the
-solving of such linear systems which is considered as the most expensive process in terms of time execution and memory space.
-Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing size.
-
-There are, in the jargon of numerical analysis, different methods of solving sparse linear systems that we can classify in two
-classes: the direct and iterative methods. However, the iterative methods are often more suitable than their counterpart, direct
-methods, for solving large sparse linear systems. Indeed, they are less memory consuming and easier to parallelize on parallel
-computers than direct methods. Different computing platforms, sequential and parallel computers, are used for solving sparse
-linear systems with iterative solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving these
-linear systems, due to their computing power and their ability to compute faster than traditional CPUs.
-
-In Section~\ref{sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and
-the generalized minimal residual method. In Section~\ref{sec:03}, we give the main key points of the parallel implementation of both
-methods on a cluster of GPUs. Then, in Section~\ref{sec:04}, we present the experimental results obtained on a CPU cluster and on
-a GPU cluster, for solving sparse linear systems associated to matrices of different structures. Finally, in Section~\ref{sec:05},
-we apply the hypergraph partitioning technique to reduce the total communication volume between the computing nodes and, thus, to
-improve the execution times of the parallel algorithms of both iterative methods.   
+\label{ch12:sec:01}
+The sparse linear systems are used to model many scientific and industrial problems,
+such as the environmental simulations or the industrial processing of the complex or
+non-Newtonian fluids. Moreover, the resolution of these problems often involves the
+solving of such linear systems which is considered as the most expensive process in
+terms of execution time and memory space. Therefore, solving sparse linear systems
+must be as efficient as possible in order to deal with problems of ever increasing
+size.
+
+There are, in the jargon of numerical analysis, different methods of solving sparse
+linear systems that can be classified in two classes: the direct and iterative methods.
+However, the iterative methods are often more suitable than their counterpart, direct
+methods, for solving these systems. Indeed, they are less memory consuming and easier
+to parallelize on parallel computers than direct methods. Different computing platforms,
+sequential and parallel computers, are used for solving sparse linear systems with iterative
+solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving
+these systems, due to their computing power and their ability to compute faster than
+traditional CPUs.
+
+In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
+methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
+we give the main key points of the parallel implementation of both methods on a cluster of
+GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
+CPU cluster and on a GPU cluster, for solving sparse linear systems associated to matrices
+of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning
+technique to reduce the total communication volume between the computing nodes and, thus,
+to improve the execution times of the parallel algorithms of both iterative methods.   
  
  
  %%--------------------------%%
  %%       SECTION 2          %%
  %%--------------------------%%
  \section{Krylov iterative methods}
-\label{sec:02}
-Let us consider the following system of $n$ linear equations in $\mathbb{R}$: 
+\label{ch12:sec:02}
+Let us consider the following system of $n$ linear equations\index{Sparse~linear~system}
+in $\mathbb{R}$: 
  \begin{equation}
  Ax=b,
-\label{eq:01}
+\label{ch12:eq:01}
  \end{equation}
-where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ is the solution vector,
-$b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a large integer number. 
-
-The iterative methods for solving the large sparse linear system~(\ref{eq:01}) proceed by successive iterations of a same
-block of elementary operations, during which an infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed.
-Indeed, from an initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate solution $x_k$
-which, gradually, converges to the exact solution $x^{*}$ as follows:
+where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$
+is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a
+large integer number. 
+
+The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01})
+proceed by successive iterations of a same block of elementary operations, during which an
+infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. Indeed, from an
+initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate
+solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows:
  \begin{equation}
  x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b.
-\label{eq:02}
+\label{ch12:eq:02}
  \end{equation}
-The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand and can be infinite. In
-practice, an iterative method often finds an approximate solution $\tilde{x}$ after a fixed number of iterations and/or
-when a given convergence criterion is satisfied as follows:   
+The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand
+and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$
+after a fixed number of iterations and/or when a given convergence criterion\index{Convergence}
+is satisfied as follows:   
  \begin{equation}
  \|b-A\tilde{x}\| < \varepsilon,
-\label{eq:03}
+\label{ch12:eq:03}
  \end{equation}
-where $\varepsilon<1$ is the required convergence tolerance threshold. 
-
-Some of the most iterative methods that have proven their efficiency for solving large sparse linear systems are those
-called \textit{Krylov sub-space methods}~\cite{ref1}. In the present chapter, we describe two Krylov methods which are
-widely used: the conjugate gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
-Krylov sub-space methods are usually used with preconditioners that allow to improve their convergence. So, in what
-follows, the CG and GMRES methods are used for solving the left-preconditioned sparse linear system:
+where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}. 
+
+Some of the most iterative methods that have proven their efficiency for solving large sparse
+linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}.
+In the present chapter, we describe two Krylov methods which are widely used: the conjugate
+gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
+Krylov subspace methods are usually used with preconditioners that allow to improve their
+convergence. So, in what follows, the CG and GMRES methods are used for solving the left-preconditioned\index{Sparse~linear~system!Preconditioned}
+sparse linear system:
  \begin{equation}
  M^{-1}Ax=M^{-1}b,
-\label{eq:11}
+\label{ch12:eq:11}
  \end{equation}
  where $M$ is the preconditioning matrix.
  
+
  %%****************%%
  %%****************%%
  \subsection{CG method}
-\label{sec:02.01}
-The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ref2}. It is one of the well
-known iterative method for solving large sparse linear systems. In addition, it can be adapted for solving nonlinear 
-equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices.
-
-The main idea of the CG method is the computation of a sequence of approximate solutions $\{x_k\}_{k\geq 0}$ in a Krylov
-sub-space of order $k$ as follows: 
+\label{ch12:sec:02.01}
+The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}.
+It is one of the well known iterative method for solving large sparse linear systems. In addition, it
+can be adapted for solving nonlinear equations and optimization problems. However, it can only be applied
+to problems with positive definite symmetric matrices.
+
+The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate
+solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as
+follows: 
  \begin{equation}
  x_k \in x_0 + \mathcal{K}_k(A,r_0),
-\label{eq:04}
+\label{ch12:eq:04}
  \end{equation}
-such that the Galerkin condition must be satisfied:
+such that the Galerkin condition\index{Galerkin~condition} must be satisfied:
  \begin{equation}
  r_k \bot \mathcal{K}_k(A,r_0),
-\label{eq:05}
+\label{ch12:eq:05}
  \end{equation}
-where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$ the Krylov
-sub-space of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
+where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$
+the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
  In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$
  which are pairwise $A$-conjugate ($A$-orthogonal):
  \begin{equation}
  \begin{array}{ll}
  p_i^T A p_j = 0, & i\neq j. 
  \end{array} 
-\label{eq:06}
+\label{ch12:eq:06}
  \end{equation}
  At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows:  
  \begin{equation}
  \begin{array}{ll}
  x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}.
  \end{array} 
-\label{eq:07}
+\label{ch12:eq:07}
  \end{equation}
  Consequently, the residuals $r_k$ are computed in the same way:
  \begin{equation}
  r_k = r_{k-1} - \alpha_k A p_k. 
-\label{eq:08}
+\label{ch12:eq:08}
  \end{equation}
-In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that the following recurrence 
-holds:
+In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that
+the following recurrence holds:
  \begin{equation}
  \begin{array}{lll}
  p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}.
  \end{array} 
-\label{eq:09}
+\label{ch12:eq:09}
  \end{equation}
-Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ over the Krylov
-sub-space $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure that the direction vectors are
-pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and the recurrences~(\ref{eq:08}) and~(\ref{eq:09})
-allow to deduce that:
+Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$
+over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure
+that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and
+the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that:
  \begin{equation}
  \begin{array}{ll}
  \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}.
  \end{array}
-\label{eq:10}
+\label{ch12:eq:10}
  \end{equation}
  
  \begin{algorithm}[!t]
-  \SetLine
-  \linesnumbered
    Choose an initial guess $x_0$\;
    $r_{0} = b - A x_{0}$\;
    $convergence$ = false\;
@@ -160,62 +180,70 @@ allow to deduce that:
      }
    }
  \caption{Left-preconditioned CG method}
-\label{alg:01}
+\label{ch12:alg:01}
  \end{algorithm}
  
-Algorithm~\ref{alg:01} shows the main key points of the preconditioned CG method. It allows to solve the left-preconditioned
-sparse linear system~(\ref{eq:11}). In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
-number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. At every iteration, a direction
-vector $p_k$ is determined, so that it is orthogonal to the preconditioned residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$
-previously determined (from line~$8$ to line~$13$). Then, at lines~$16$ and~$17$ , the iterate $x_k$ and the residual $r_k$ are computed
-using formulas~(\ref{eq:07}) and~(\ref{eq:08}), respectively. The CG method converges after, at most, $n$ iterations. In practice, the CG
-algorithm stops when the tolerance threshold $\varepsilon$ and/or the maximum number of iterations $maxiter$ are reached.
+Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows
+to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}).
+In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
+number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$.
+At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned
+residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$ previously determined (from line~$8$ to
+line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using
+formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. The CG method converges after, at
+most, $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{Convergence!Tolerance~threshold}
+$\varepsilon$ and/or the maximum number of iterations\index{Convergence!Maximum~number~of~iterations}
+$maxiter$ are reached.
+
  
  %%****************%%
  %%****************%%
  \subsection{GMRES method} 
-\label{sec:02.02}
-The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ref3} as a generalization of the minimum residual method
-MINRES~\cite{ref4}. Indeed, GMRES can be applied for solving symmetric or nonsymmetric linear systems. 
-
-The main principle of the GMRES method is to find an approximation minimizing at best the residual norm. In fact, GMRES
-computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in a Krylov sub-space $\mathcal{K}_k$ as follows:
+\label{ch12:sec:02.02}
+The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
+of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can
+be applied for solving symmetric or nonsymmetric linear systems. 
+
+The main principle of the GMRES method\index{Iterative~method!GMRES} is to find an approximation minimizing
+at best the residual norm. In fact, GMRES computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in
+a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows:
  \begin{equation}
  \begin{array}{ll}
  x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2},
  \end{array}
-\label{eq:12}
+\label{ch12:eq:12}
  \end{equation} 
-so that the Petrov-Galerkin condition is satisfied:
+so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied:
  \begin{equation}
  \begin{array}{ll}
  r_k \bot A \mathcal{K}_k(A, v_1).
  \end{array}
-\label{eq:13}
+\label{ch12:eq:13}
  \end{equation}
-GMRES uses the Arnoldi process~\cite{ref5} to construct an orthonormal basis $V_k$ for the Krylov sub-space $\mathcal{K}_k$
-and an upper Hessenberg matrix $\bar{H}_k$ of order $(k+1)\times k$:
+GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an
+orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix}
+$\bar{H}_k$ of order $(k+1)\times k$:
  \begin{equation}
  \begin{array}{ll}
  V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1,
  \end{array}
-\label{eq:14}
+\label{ch12:eq:14}
  \end{equation}
  and
  \begin{equation}
  V_k A = V_{k+1} \bar{H}_k.
-\label{eq:15}
+\label{ch12:eq:15}
  \end{equation}
  
-Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_k$ spanned by $V_k$
-as follows:
+Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov subspace $\mathcal{K}_k$
+spanned by $V_k$ as follows:
  \begin{equation}
  \begin{array}{ll}
  x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}.
  \end{array}
-\label{eq:16}
+\label{ch12:eq:16}
  \end{equation}
-From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can deduce that:
+From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that:
  \begin{equation}
  \begin{array}{lll}
    r_{k} & = & b - A (x_{0} + V_{k}y) \\
@@ -223,34 +251,34 @@ From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can dedu
          & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
          & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
  \end{array}
-\label{eq:17}
+\label{ch12:eq:17}
  \end{equation}
-such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of $\mathbb{R}^k$. So,
-the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean norm of the residual $r_k$. Consequently,
-a linear least-squares problem of size $k$ is solved:
+such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of
+$\mathbb{R}^k$. So, the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean
+norm of the residual $r_k$. Consequently, a linear least-squares problem of size $k$ is solved:
  \begin{equation}
  \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
-\label{eq:18}
+\label{ch12:eq:18}
  \end{equation}
-The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using Givens rotations~\cite{ref1,ref3},
-such that:
+The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using
+Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that:
  \begin{equation}
  \begin{array}{lll}
  \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
  \end{array}
-\label{eq:19}
+\label{ch12:eq:19}
  \end{equation}
  where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix.
  
-The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$ iterations ($n$ is the size of the 
-sparse linear system to be solved). However, the GMRES algorithm must construct and store in the memory an orthonormal basis $V_k$ whose
-size is proportional to the number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the GMRES
-method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and with $x_m$ as the initial guess to the
-next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal vectors.
+The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$
+iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm
+must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the
+number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the
+GMRES method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and
+with $x_m$ as the initial guess to the next iteration. This allows to limit the size of the basis
+$V$ to $m$ orthogonal vectors.
  
  \begin{algorithm}[!t]
-  \SetLine
-  \linesnumbered
    Choose an initial guess $x_0$\;
    $convergence$ = false\;
    $k = 1$\;
@@ -281,190 +309,245 @@ next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal
      }
    }
  \caption{Left-preconditioned GMRES method with restarts}
-\label{alg:02}
+\label{ch12:alg:02}
  \end{algorithm}
  
-Algorithm~\ref{alg:02} shows the main key points of the GMRES method with restarts. It solves the left-preconditioned sparse linear
-system~(\ref{eq:11}), such that $M$ is the preconditioning matrix. At each iteration $k$, GMRES uses the Arnoldi process (defined
-from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper Hessenberg matrix $\bar{H}_m$ of size
-$(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ which minimizes
-at best the residual norm (line~$18$). Finally, it computes an approximate solution $x_m$ in the Krylov sub-space spanned by $V_m$ 
-(line~$19$). The GMRES algorithm is stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the maximum
-number of iterations ($maxiter$) is reached.
+Algorithm~\ref{ch12:alg:02} shows the main key points of the GMRES method with restarts.
+It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear
+system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration
+$k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from
+line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper
+Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it
+solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$
+which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate
+solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is
+stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the
+maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$)
+is reached.
+
  
  %%--------------------------%%
  %%       SECTION 3          %%
  %%--------------------------%%
  \section{Parallel implementation on a GPU cluster}
-\label{sec:03}
-In this section, we present the parallel algorithms of both iterative CG and GMRES methods for GPU clusters.
-The implementation is performed on a GPU cluster composed of different computing nodes, such that each node
-is a CPU core managed by a MPI process and equipped with a GPU card. The parallelization of these algorithms
-is carried out by using the MPI communication routines between the GPU computing nodes and the CUDA programming
-environment inside each node. In what follows, the algorithms of the iterative methods are called iterative
-solvers.
+\label{ch12:sec:03}
+In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG}
+and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on
+a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by a
+MPI process and equipped with a GPU card. The parallelization of these algorithms is carried out by
+using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the
+CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods
+are called iterative solvers.
+
  
  %%****************%%
  %%****************%%
  \subsection{Data partitioning}
-\label{sec:03.01}
-The parallel solving of the large sparse linear system~(\ref{eq:11}) requires a data partitioning between the computing
-nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the GPU cluster. The partitioning operation
-consists in the decomposition of the vectors and matrices, involved in the iterative solver, in $p$ portions. Indeed, this
-operation allows to assign to each computing node $i$:
-\begin{itemize*}
+\label{ch12:sec:03.01}
+The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning
+between the computing nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the
+GPU cluster. The partitioning operation consists in the decomposition of the vectors and matrices, involved
+in the iterative solver, in $p$ portions. Indeed, this operation allows to assign to each computing node
+$i$:
+\begin{itemize}
  \item a portion of size $\frac{n}{p}$ elements of each vector,
  \item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and,
  \item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, 
-\end{itemize*} 
-where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive row-wise partitioning
-(decomposition row-by-row) on the data of the sparse linear systems to be solved. Figure~\ref{fig:01} shows an example of a row-wise
-data partitioning between four computing nodes of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand
-side $b$) of size $16$ unknown values. 
+\end{itemize} 
+where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive
+row-wise partitioning (decomposition row-by-row) on the data of the sparse linear systems to be solved.
+Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes
+of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$
+unknown values. 
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}}
  \caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.}
-\label{fig:01}
+\label{ch12:fig:01}
  \end{figure}
  
+
  %%****************%%
  %%****************%%
  \subsection{GPU computing}
-\label{sec:03.02}
-After the partitioning operation, all the data involved from this operation must be transferred from the CPU memories to the GPU
-memories, in order to be processed by GPUs. We use two functions of the CUBLAS library (CUDA Basic Linear Algebra Subroutines),
-developed by Nvidia~\cite{ref6}: \verb+cublasAlloc()+ for the memory allocations on GPUs and \verb+cublasSetVector()+ for the
-memory copies from the CPUs to the GPUs.
-
-An efficient implementation of CG and GMRES solvers on a GPU cluster requires to determine all parts of their codes that can be
-executed in parallel and, thus, take advantage of the GPU acceleration. As many Krylov sub-space methods, the CG and GMRES methods
-are mainly based on arithmetic operations dealing with vectors or matrices: sparse matrix-vector multiplications, scalar-vector
-multiplications, dot products, Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a
-scalar) and so on. These vector operations are often easy to parallelize and they are more efficient on parallel computers when
-they work on large vectors. Therefore, all the vector operations used in CG and GMRES solvers must be executed by the GPUs as kernels.
-
-We use the kernels of the CUBLAS library to compute some vector operations of CG and GMRES solvers. The following kernels of CUBLAS
-(dealing with double floating point) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the Euclidean
-norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of the data-parallel operations, we code their kernels in CUDA.
-In the CG solver, we develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in Algorithm~\ref{alg:01}. In the
-GMRES solver, we program a kernel for the scalar-vector multiplication (lines~$7$ and~$15$ in Algorithm~\ref{alg:02}), a kernel for
-solving the least-squares problem and a kernel for the elements updates of the solution vector $x$.
-
-The least-squares problem in the GMRES method is solved by performing a QR factorization on the Hessenberg matrix $\bar{H}_m$ with
-plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, solving the least-squares
-problem on the GPU is not interesting. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However,
-the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. Therefore, we develop
-an inexpensive kernel which must be executed in sequential by a single CUDA thread. 
-
-The most important operation in CG and GMRES methods is the sparse matrix-vector multiplication (SpMV), because it is often an
-expensive operation in terms of execution time and memory space. Moreover, it requires to take care of the storage format of the
-sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix can cause a significant
-waste of memory space and execution time. In addition, the sparsity nature of the matrix often leads to irregular memory accesses
-to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced accesses to
-the global memory, which slows down even more its performances. One of the most efficient compressed storage formats of sparse
-matrices on GPUs is HYB format~\cite{ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
-a typical number of nonzero values per row in ELL format and remaining entries of exceptional rows in COO format. It combines
-the efficiency of ELL due to the regularity of its memory accesses and the flexibility of COO which is insensitive to the matrix
-structure. Consequently, we use the HYB kernel~\cite{ref8} developed by Nvidia to implement the SpMV multiplication of CG and
-GMRES methods on GPUs. Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill the elements of
-the iterate vector $x$ in the cached texture memory.
+\label{ch12:sec:03.02}
+After the partitioning operation, all the data involved from this operation must be
+transferred from the CPU memories to the GPU memories, in order to be processed by
+GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear
+Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+
+for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory
+copies from the CPUs to the GPUs.
+
+An efficient implementation of CG and GMRES solvers on a GPU cluster requires to
+determine all parts of their codes that can be executed in parallel and, thus, take
+advantage of the GPU acceleration. As many Krylov subspace methods, the CG and GMRES
+methods are mainly based on arithmetic operations dealing with vectors or matrices:
+sparse matrix-vector multiplications, scalar-vector multiplications, dot products,
+Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors
+and $a$ is a scalar) and so on. These vector operations are often easy to parallelize
+and they are more efficient on parallel computers when they work on large vectors.
+Therefore, all the vector operations used in CG and GMRES solvers must be executed
+by the GPUs as kernels.
+
+We use the kernels of the CUBLAS library to compute some vector operations of CG and
+GMRES solvers. The following kernels of CUBLAS (dealing with double floating point)
+are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the
+Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of
+the data-parallel operations, we code their kernels in CUDA. In the CG solver, we
+develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in
+Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector
+multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel for
+solving the least-squares problem and a kernel for the elements updates of the solution
+vector $x$.
+
+The least-squares problem in the GMRES method is solved by performing a QR factorization
+on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and,
+then, solving the triangular system by backward substitutions to compute $y$. Consequently,
+solving the least-squares problem on the GPU is not interesting. Indeed, the triangular
+solves are not easy to parallelize and inefficient on GPUs. However, the least-squares
+problem to solve in the GMRES method with restarts has, generally, a very small size $m$.
+Therefore, we develop an inexpensive kernel which must be executed in sequential by a
+single CUDA thread. 
+
+The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES}
+methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication},
+because it is often an expensive operation in terms of execution time and memory space.
+Moreover, it requires to take care of the storage format of the sparse matrix in the
+memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
+can cause a significant waste of memory space and execution time. In addition, the sparsity
+nature of the matrix often leads to irregular memory accesses to read the matrix nonzero
+values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced
+accesses to the global memory, which slows down even more its performances. One of the
+most efficient compressed storage formats\index{Compressed~storage~format} of sparse
+matrices on GPUs is HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}.
+It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
+a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL}
+format and remaining entries of exceptional rows in COO format. It combines the efficiency
+of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO}
+which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
+developed by Nvidia to implement the SpMV multiplication of CG and GMRES methods on GPUs.
+Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill
+the elements of the iterate vector $x$ in the cached texture memory.
+
  
  %%****************%%
  %%****************%%
  \subsection{Data communications}
-\label{sec:03.03}
-All the computing nodes of the GPU cluster execute in parallel the same iterative solver (Algorithm~\ref{alg:01} or Algorithm~\ref{alg:02})
-adapted to GPUs, but on their own portions of the sparse linear system: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, $0\leq i<p$. However in order to solve
-the complete sparse linear system~(\ref{eq:11}), synchronizations must be performed between the local computations of the computing nodes
-over the cluster. In what follows, two computing nodes sharing data are called neighboring nodes.
-
-As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication. In the parallel implementation of
-the iterative methods, each computing node $i$ performs the SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it
-has only sub-vectors of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires the vector elements
-of its neighbors, corresponding to the column indices on which its sub-matrix has nonzero values (see Figure~\ref{fig:01}). So, in addition
-to the local vectors, each node must also manage vector elements shared with neighbors and required to compute the SpMV multiplication.
-Therefore, the iterate vector $x$ managed by each computing node is composed of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a
-sub-vector of shared elements $x^{shared}$. In the same way, the vector used to construct the orthonormal basis of the Krylov sub-space 
-(vectors $p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared sub-vector. 
-
-Therefore, before computing the SpMV multiplication, the neighboring nodes over the GPU cluster must exchange between them the shared 
-vector elements necessary to compute this multiplication. First, each computing node determines, in its local sub-vector, the vector
-elements needed by other nodes. Then, the neighboring nodes exchange between them these shared vector elements. The data exchanges 
-are implemented by using the MPI point-to-point communication routines: blocking sends with \verb+MPI_Send()+ and nonblocking receives
-with \verb+MPI_Irecv()+. Figure~\ref{fig:02} shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0},
-\textit{Node 2} and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing nodes is that presented
-in Figure~\ref{fig:01}.
+\label{ch12:sec:03.03}
+All the computing nodes of the GPU cluster execute in parallel the same iterative solver
+(Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their
+own portions of the sparse linear system\index{Sparse~linear~system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$,
+$0\leq i<p$. However, in order to solve the complete sparse linear system~(\ref{ch12:eq:11}),
+synchronizations must be performed between the local computations of the computing nodes over
+the cluster. In what follows, two computing nodes sharing data are called neighboring nodes\index{Neighboring~node}.
+
+As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication.
+In the parallel implementation of the iterative methods, each computing node $i$ performs the
+SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it has only sub-vectors
+of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires
+the vector elements of its neighbors, corresponding to the column indices on which its sub-matrix
+has nonzero values (see Figure~\ref{ch12:fig:01}). So, in addition to the local vectors, each
+node must also manage vector elements shared with neighbors and required to compute the SpMV
+multiplication. Therefore, the iterate vector $x$ managed by each computing node is composed
+of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a sub-vector of shared elements $x^{shared}$.
+In the same way, the vector used to construct the orthonormal basis of the Krylov subspace (vectors
+$p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared
+sub-vector. 
+
+Therefore, before computing the SpMV multiplication\index{SpMV~multiplication}, the neighboring
+nodes\index{Neighboring~node} over the GPU cluster must exchange between them the shared vector
+elements necessary to compute this multiplication. First, each computing node determines, in its
+local sub-vector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
+between them these shared vector elements. The data exchanges are implemented by using the MPI
+point-to-point communication routines: blocking\index{MPI~subroutines!Blocking} sends with \verb+MPI_Send()+
+and nonblocking\index{MPI~subroutines!Nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
+shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}
+and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing
+nodes is that presented in Figure~\ref{ch12:fig:01}.
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}}
  \caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.}
-\label{fig:02}
+\label{ch12:fig:02}
  \end{figure}
  
-After the synchronization operation, the computing nodes receive, from their respective neighbors, the shared elements in a sub-vector
-stored in a compressed format. However, in order to compute the SpMV multiplication, the computing nodes operate on sparse global vectors
-(see Figure~\ref{fig:02}). In this case, the received vector elements must be copied to the corresponding indices in the global vector.
-So as not to need to perform this at each iteration, we propose to reorder the columns of each sub-matrix $\{A_i\}_{0\leq i<p}$, so that
-the shared sub-vectors could be used in their compressed storage formats. Figure~\ref{fig:03} shows a reordering of a sparse sub-matrix
-(sub-matrix of \textit{Node 1}). 
+After the synchronization operation, the computing nodes receive, from their respective neighbors,
+the shared elements in a sub-vector stored in a compressed format. However, in order to compute the
+SpMV multiplication, the computing nodes operate on sparse global vectors (see Figure~\ref{ch12:fig:02}).
+In this case, the received vector elements must be copied to the corresponding indices in the global
+vector. So as not to need to perform this at each iteration, we propose to reorder the columns of
+each sub-matrix $\{A_i\}_{0\leq i<p}$, so that the shared sub-vectors could be used in their compressed
+storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse sub-matrix (sub-matrix of
+\textit{Node 1}). 
  
  \begin{figure}
-\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/reorder}}
+\centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/reorder}}
  \caption{Columns reordering of a sparse sub-matrix.}
-\label{fig:03}
+\label{ch12:fig:03}
  \end{figure}
  
-A GPU cluster is a parallel platform with a distributed memory. So, the synchronizations and communication data between GPU nodes are
-carried out by passing messages. However, GPUs can not communicate between them in direct way. Then, CPUs via MPI processes are in charge
-of the synchronizations within the GPU cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory
-to the CPU memory and vice-versa before and after the synchronization operation between CPUs. We have used the CBLAS communication subroutines
-to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+ and \verb+cublasSetVector()+. Finally, in addition
-to the data exchanges, GPU nodes perform reduction operations to compute in parallel the dot products and Euclidean norms. This is implemented
-by using the MPI global communication \verb+MPI_Allreduce()+.
+A GPU cluster\index{GPU~cluster} is a parallel platform with a distributed memory. So, the synchronizations
+and communication data between GPU nodes are carried out by passing messages. However, GPUs can not communicate
+between them in direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU
+cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory
+and vice-versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
+communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
+and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
+to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI~subroutines!Global}
+\verb+MPI_Allreduce()+.
+
  
  
  %%--------------------------%%
  %%       SECTION 4          %%
  %%--------------------------%%
  \section{Experimental results}
-\label{sec:04}
-In this section, we present the performances of the parallel CG and GMRES linear solvers obtained on a cluster of $12$ GPUs. Indeed, this GPU
-cluster of tests is composed of six machines connected by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at
-$2.4$GHz and providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are connected to each machine via
-a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a
-global memory of $4$GB with a memory bandwidth of $102$GB/s. Figure~\ref{fig:04} shows the general scheme of the GPU cluster that we used in
-the experimental tests.
+\label{ch12:sec:04}
+In this section, we present the performances of the parallel CG and GMRES linear solvers obtained
+on a cluster of $12$ GPUs. Indeed, this GPU cluster of tests is composed of six machines connected
+by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz and
+providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are
+connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A
+Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global memory of $4$GB with
+a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster}
+that we used in the experimental tests.
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
  \caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
-\label{fig:04}
+\label{ch12:fig:04}
  \end{figure}
  
-Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of both methods on the
-GPU cluster. CUDA version 4.0~\cite{ref9} is used for programming GPUs, using CUBLAS library~\cite{ref6} to deal with vector operations in GPUs
-and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between CPU cores. Indeed, the experiments are done on a
-cluster of $12$ computing nodes, where each node is managed by a MPI process and it is composed of one CPU core and one GPU card.
-
-All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual
-tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
-initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process used in the GMRES method to $16$ iterations ($m=16$). For
-the sake of simplicity, we have chosen the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily compute
-the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for not too ill-conditioned matrices. In the GPU computing,
-the size of thread blocks is fixed to $512$ threads. Finally, the performance results, presented hereafter, are obtained from the mean value over
-$10$ executions of the same parallel linear solver and for the same input data.
-
-To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's collection~\cite{ref10}, that arise in a wide
-spectrum of real-world applications. We chose six symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{fig:05},
-we show structures of these matrices and in Table~\ref{tab:01} we present their main characteristics which are the number of rows, the total number
-of nonzero values (nnz) and the maximal bandwidth. In the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns
-separating the first and the last nonzero value on a matrix row.
+Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding
+the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9}
+is used for programming GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations
+in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between
+CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
+is managed by a MPI process and it is composed of one CPU core and one GPU card.
+
+All tests are made on double-precision floating point operations. The parameters of both linear
+solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
+maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
+initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{Iterative~method!Arnoldi~process}
+used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen
+the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily
+compute the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for
+not too ill-conditioned matrices. In the GPU computing, the size of thread blocks is fixed to $512$
+threads. Finally, the performance results, presented hereafter, are obtained from the mean value
+over $10$ executions of the same parallel linear solver and for the same input data.
+
+To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's
+collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six
+symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
+we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
+which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
+the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
+the first and the last nonzero value on a matrix row.
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
  \caption{Sketches of sparse matrices chosen from the Davis's collection.}
-\label{fig:05}
+\label{ch12:fig:05}
  \end{figure}
  
  \begin{table}
@@ -499,7 +582,7 @@ separating the first and the last nonzero value on a matrix row.
  \end{tabular}
  \vspace{0.5cm}
  \caption{Main characteristics of sparse matrices chosen from the Davis's collection.}
-\label{tab:01}
+\label{ch12:tab:01}
  \end{table}
  
  \begin{table}
@@ -521,7 +604,7 @@ shallow\_water2   & $0.017s$           & $0.010s$            & $1.68$        & $
  thermal2          & $1.172s$           & $0.622s$            & $1.88$        & $15$           & $5.11e$-$09$     & $3.33e$-$16$ \\ \hline   
  \end{tabular}
  \caption{Performances of the parallel CG method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
-\label{tab:02}
+\label{ch12:tab:02}
  \end{center}
  \end{table}
  
@@ -556,49 +639,63 @@ poli\_large       & $0.097s$           & $0.095s$            & $1.02$        & $
  torso3            & $4.242s$           & $2.030s$            & $2.09$        & $175$          & $2.69e$-$10$     & $1.78e$-$14$ \\ \hline
  \end{tabular}
  \caption{Performances of the parallel GMRES method on a cluster 24 CPU cores vs. on cluster of 12 GPUs.}
-\label{tab:03}
+\label{ch12:tab:03}
  \end{center}
  \end{table}
  
-Tables~\ref{tab:02} and~\ref{tab:03} shows the performances of the parallel CG and GMRES solvers, respectively, for solving linear systems associated to
-the sparse matrices presented in Tables~\ref{tab:01}. They allow to compare the performances obtained on a cluster of $24$ CPU cores and on a cluster
-of $12$ GPUs. However, Table~\ref{tab:02} shows only the performances of solving symmetric sparse linear systems, due to the inability of the CG method
-to solve the nonsymmetric systems. In both tables, the second and third columns give, respectively, the execution times in seconds obtained on $24$ CPU
-cores ($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account the relative gains $\tau$ of a solver implemented on the
-GPU cluster compared to the same solver implemented on the CPU cluster. The relative gains, presented in the fourth column, are computed as a ratio of
-the CPU execution time over the GPU execution time:
+Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} shows the performances of the parallel
+CG and GMRES solvers, respectively, for solving linear systems associated to the sparse
+matrices presented in Tables~\ref{ch12:tab:01}. They allow to compare the performances
+obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02}
+shows only the performances of solving symmetric sparse linear systems, due to the inability
+of the CG method to solve the nonsymmetric systems. In both tables, the second and third
+columns give, respectively, the execution times in seconds obtained on $24$ CPU cores
+($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account
+the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same
+solver implemented on the CPU cluster. The relative gains\index{Relative~gain}, presented
+in the fourth column, are computed as a ratio of the CPU execution time over the GPU
+execution time:
  \begin{equation}
  \tau = \frac{Time_{cpu}}{Time_{gpu}}.
-\label{eq:20}
+\label{ch12:eq:20}
  \end{equation}
-In addition, Tables~\ref{tab:02} and~\ref{tab:03} give the number of iterations ($iter$), the precision $prec$ of the solution computed on the GPU cluster
-and the difference $\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster. Both parameters $prec$ and $\Delta$
-allow to validate and verify the accuracy of the solution computed on the GPU cluster. We have computed them as follows:
+In addition, Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} give the number of iterations
+($iter$), the precision $prec$ of the solution computed on the GPU cluster and the difference
+$\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster.
+Both parameters $prec$ and $\Delta$ allow to validate and verify the accuracy of the solution
+computed on the GPU cluster. We have computed them as follows:
  \begin{eqnarray}
  \Delta = max|x^{cpu}-x^{gpu}|,\\
  prec = max|M^{-1}r^{gpu}|,
  \end{eqnarray}
-where $\Delta$ is the maximum vector element, in absolute value, of the difference between the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively,
-on CPU and GPU clusters and $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$ of the solution $x^{gpu}$.
-Thus, we can see that the solutions obtained on the GPU cluster were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less,
-equivalent to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$. However, we can notice from the relative
-gains $\tau$ that is not interesting to use multiple GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to maximize
-utilization of GPU cores. In addition, the communications required to synchronize the computations over the cluster increase the idle times of GPUs and
-slow down further the parallel computations.
-
-Consequently, in order to test the performances of the parallel solvers, we developed in C programming language a generator of large sparse matrices. 
-This generator takes a matrix from the Davis's collection~\cite{ref10} as an initial matrix to construct large sparse matrices exceeding ten million
-of rows. It must be executed in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse sub-matrix. In
-first experimental tests, we are focused on sparse matrices having a banded structure, because they are those arise in the most of numerical problems.
-So to generate the global sparse matrix, each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
-from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix (see Figure~\ref{fig:06}). Moreover, the empty
-spaces between two successive copies in the main diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{fig:06}) of the same
+where $\Delta$ is the maximum vector element, in absolute value, of the difference between
+the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively, on CPU and GPU clusters and
+$prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$
+of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster
+were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
+to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$.
+However, we can notice from the relative gains $\tau$ that is not interesting to use multiple
+GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to
+maximize utilization of GPU cores. In addition, the communications required to synchronize the
+computations over the cluster increase the idle times of GPUs and slow down further the parallel
+computations.
+
+Consequently, in order to test the performances of the parallel solvers, we developed in C programming
+language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10}
+as an initial matrix to construct large sparse matrices exceeding ten million of rows. It must be executed
+in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse
+sub-matrix. In first experimental tests, we are focused on sparse matrices having a banded structure,
+because they are those arise in the most of numerical problems. So to generate the global sparse matrix,
+each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
+from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix
+(see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
+diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
  initial matrix.
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
  \caption{Parallel generation of a large sparse matrix by four computing nodes.}
-\label{fig:06}
+\label{ch12:fig:06}
  \end{figure}
  
  \begin{table}[!h]
@@ -633,17 +730,21 @@ initial matrix.
  \end{tabular}
  \vspace{0.5cm}
  \caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
-\label{tab:04}
+\label{ch12:tab:04}
  \end{table}
  
-We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$ million unknown values. The sparse matrices associated 
-to these linear systems are generated from those presented in Table~\ref{tab:01}. Their main characteristics are given in Table~\ref{tab:04}. Tables~\ref{tab:05}
-and~\ref{tab:06} shows the performances of the parallel CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster 
-of $12$ GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on a GPU cluster is more efficient than on a CPU 
-cluster (see relative gains $\tau$). We can also notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster, are
-better than those the GMRES method for solving large symmetric linear systems. In fact, the CG method is characterized by a better convergence rate 
-and a shorter execution time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES method requires more data
-exchanges between computing nodes compared to the parallel CG method.
+We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
+million unknown values. The sparse matrices associated to these linear systems are generated
+from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
+Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and
+GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
+GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
+a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
+notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster,
+are better than those the GMRES method for solving large symmetric linear systems. In fact, the
+CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
+time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
+method requires more data exchanges between computing nodes compared to the parallel CG method.
   
  \begin{table}[!h]
  \begin{center}
@@ -665,7 +766,7 @@ thermal2        & $1.411s$             & $0.244s$              & $5.78$
  \end{tabular}
  \caption{Performances of the parallel CG method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs. 
  on a cluster of 12 GPUs.}
-\label{tab:05}
+\label{ch12:tab:05}
  \end{center}
  \end{table}
  
@@ -701,7 +802,7 @@ torso3            & $31.463s$            & $3.681s$              & $8.55$
  \end{tabular}
  \caption{Performances of the parallel GMRES method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs. 
  on a cluster of 12 GPUs.}
-\label{tab:06}
+\label{ch12:tab:06}
  \end{center}
  \end{table}
  
@@ -710,14 +811,15 @@ on a cluster of 12 GPUs.}
  %%       SECTION 5          %%
  %%--------------------------%%
  \section{Hypergraph partitioning}
-\label{sec:05}
-In this section, we present the performances of both parallel CG and GMRES solvers for solving linear systems associated to sparse matrices having
-large bandwidths. Indeed, we are interested on sparse matrices having the nonzero values distributed along their bandwidths. 
+\label{ch12:sec:05}
+In this section, we present the performances of both parallel CG and GMRES solvers for solving linear
+systems associated to sparse matrices having large bandwidths. Indeed, we are interested on sparse
+matrices having the nonzero values distributed along their bandwidths. 
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}}
  \caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
-\label{fig:07}
+\label{ch12:fig:07}
  \end{figure}
  
  \begin{table}[!h]
@@ -751,15 +853,20 @@ large bandwidths. Indeed, we are interested on sparse matrices having the nonzer
                                & torso3            & $872,029,998$ & $25,000,000$\\ \hline
  \end{tabular}
  \caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
-\label{tab:07}
+\label{ch12:tab:07}
  \end{center}
  \end{table}
  
-We have developed in C programming language a generator of large sparse matrices having five bands distributed along their bandwidths (see Figure~\ref{fig:07}).
-The principle of this generator is equivalent to that in Section~\ref{sec:04}. However, the copies performed on the initial matrix (chosen from the
-Davis's collection) are placed on the main diagonal and on four off-diagonals, two on the right and two on the left of the main diagonal. Figure~\ref{fig:07}
-shows an example of a generation of a sparse five-bands matrix by four computing nodes. Table~\ref{tab:07} shows the main characteristics of sparse
-five-bands matrices generated from those presented in Table~\ref{tab:01} and associated to linear systems of $25$ million unknown values.   
+We have developed in C programming language a generator of large sparse matrices
+having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}).
+The principle of this generator is equivalent to that in Section~\ref{ch12:sec:04}.
+However, the copies performed on the initial matrix (chosen from the Davis's collection)
+are placed on the main diagonal and on four off-diagonals, two on the right and two
+on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of a
+generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07}
+shows the main characteristics of sparse five-bands matrices generated from those
+presented in Table~\ref{ch12:tab:01} and associated to linear systems of $25$ million
+unknown values.   
  
  \begin{table}[!h]
  \begin{center}
@@ -781,7 +888,7 @@ thermal2          & $2.549s$     & $1.705s$      & $1.49$ & $14$     & $2.36e$-$
  \end{tabular}
  \caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
  on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{tab:08}
+\label{ch12:tab:08}
  \end{center}
  \end{table}
  
@@ -817,27 +924,34 @@ torso3            & $49.074s$    & $19.397s$     & $2.53$  & $175$    & $2.69e$-
  \end{tabular}
  \caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
  on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{tab:09}
+\label{ch12:tab:09}
  \end{center}
  \end{table}
  
-Tables~\ref{tab:08} and~\ref{tab:09} shows the performaces of the parallel CG and GMRES solvers, respectively, obtained on
-a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
-sparse five-bands matrices presented on Table~\ref{tab:07}. We can notice from both Tables~\ref{tab:08} and~\ref{tab:09} that 
-using a GPU cluster is not efficient for solving these kind of sparse linear systems. We can see that the execution times obtained
-on the GPU cluster are almost equivalent to those obtained on the CPU cluster (see the relative gains presented in column~$4$
-of each table). This is due to the large number of communications necessary to synchronize the computations over the cluster.
-Indeed, the naive partitioning, row-by-row or column-by-column, of sparse matrices having large bandwidths can link a computing
-node to many neighbors and then generate a large number of data dependencies between these computing nodes in the cluster. 
-
-Therefore, we have chosen to use a hypergraph partitioning method, which is well-suited to numerous kinds of sparse matrices~\cite{ref11}.
-Indeed, it can well model the communications between the computing nodes, particularly in the case of nonsymmetric and irregular
-matrices, and it gives good reduction of the total communication volume. In contrast, it is an expensive operation in terms of 
-execution time and memory space. 
-
-The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ as
-follows:
-\begin{itemize*}
+Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} shows the performances of the parallel
+CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a
+cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
+sparse five-bands matrices presented on Table~\ref{ch12:tab:07}. We can notice from
+both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not
+efficient for solving these kind of sparse linear systems\index{Sparse~linear~system}.
+We can see that the execution times obtained on the GPU cluster are almost equivalent
+to those obtained on the CPU cluster (see the relative gains presented in column~$4$
+of each table). This is due to the large number of communications necessary to synchronize
+the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column,
+of sparse matrices having large bandwidths can link a computing node to many neighbors
+and then generate a large number of data dependencies between these computing nodes in
+the cluster. 
+
+Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph},
+which is well-suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed,
+it can well model the communications between the computing nodes, particularly in the
+case of nonsymmetric and irregular matrices, and it gives good reduction of the total
+communication volume. In contrast, it is an expensive operation in terms of execution
+time and memory space. 
+
+The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph
+$\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows:
+\begin{itemize}
  \item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
  \item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where:
  \begin{equation}
@@ -845,56 +959,71 @@ follows:
  \end{equation} 
  \item $w_i$ is the weight of vertex $v_i$ and,
  \item $c_j$ is the cost of hyperedge $e_j$.
-\end{itemize*}
-A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$
-a set of pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each subset is attributed to a computing node.
-Figure~\ref{fig:08} shows an example of the hypergraph model of a  $(9\times 9)$ sparse matrix in three parts. The circles and squares correspond,
-respectively, to the vertices and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two different parts. 
-The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different parts spanned by $e_j$.
+\end{itemize}
+A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is
+defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$ a set of pairwise
+disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each
+subset is attributed to a computing node. Figure~\ref{ch12:fig:08} shows an example
+of the hypergraph model of a  $(9\times 9)$ sparse matrix in three parts. The circles
+and squares correspond, respectively, to the vertices and hyperedges of the hypergraph.
+The solid squares define the cut hyperedges connecting at least two different parts. 
+The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different
+parts spanned by $e_j$.
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
  \caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
-\label{fig:08}
+\label{ch12:fig:08}
  \end{figure}
  
-The cut hyperedges model the total communication volume between the different computing nodes in the cluster, necessary to perform the parallel SpMV
-multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$, $0\leq i,j<n$, of the SpMV multiplication
-$Ax=b$ that need the $j^{th}$ unknown value of solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix rows sharing
-and requiring the same unknown value $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the
-dependency of matrix rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$,
-$b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$. However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
-So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform two communications.
-
-The hypergraph partitioning allows to reduce the total communication volume required to perform the parallel SpMV multiplication, while maintaining the 
-load balancing between the computing nodes. In fact, it allows to minimize at best the following amount:
+The cut hyperedges model the total communication volume between the different computing
+nodes in the cluster, necessary to perform the parallel SpMV multiplication\index{SpMV~multiplication}.
+Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$,
+$0\leq i,j<n$, of the SpMV multiplication $Ax=b$ that need the $j^{th}$ unknown value of
+solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix
+rows sharing and requiring the same unknown value $x_j$. For example in Figure~\ref{ch12:fig:08},
+hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the dependency of matrix
+rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations:
+$b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$.
+However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
+So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform
+two communications.
+
+The hypergraph partitioning\index{Hypergraph} allows to reduce the total communication volume
+required to perform the parallel SpMV multiplication, while maintaining the load balancing between
+the computing nodes. In fact, it allows to minimize at best the following amount:
  \begin{equation}
  \mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
  \end{equation}
-where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning $\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively,
-the cost and the connectivity of cut hyperedge $e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows: 
+where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning
+$\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively, the cost and the connectivity of cut hyperedge
+$e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows: 
  \begin{equation}
    W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
  \end{equation} 
-where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the average weight of all $K$ parts and $\epsilon$ is the 
-maximum allowed imbalanced ratio.
-
-The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed, for example: hMETIS~\cite{ref12}, PaToH~\cite{ref13}
-and Zoltan~\cite{ref14}. Since our objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must be performed by
-at least two MPI processes. It allows to accelerate the data partitioning of large sparse matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned
-in $p$ (number of MPI processes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the parallel hypergraph
-partitioning method using some functions of the MPI library between the $p$ processes.
-
-Tables~\ref{tab:10} and~\ref{tab:11} shows the performances of the parallel CG and GMRES solvers, respectively, using the hypergraph partitioning for solving
-large linear systems associated to the sparse five-bands matrices presented in Table~\ref{tab:07}. For these experimental tests, we have applied the parallel
-hypergraph partitioning~\cite{ref15} developed in Zoltan tool~\cite{ref14}. We have initialized the parameters of the partitioning operation as follows:
-\begin{itemize*}
+where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the
+average weight of all $K$ parts and $\epsilon$ is the maximum allowed imbalanced ratio.
+
+The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed,
+for example: hMETIS~\cite{ch12:ref12}, PaToH~\cite{ch12:ref13} and Zoltan~\cite{ch12:ref14}. Since our
+objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must
+be performed by at least two MPI processes. It allows to accelerate the data partitioning of large sparse
+matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned in $p$ (number of MPI processes)
+sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the
+parallel hypergraph partitioning method using some functions of the MPI library between the $p$ processes.
+
+Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} shows the performances of the parallel CG and GMRES solvers,
+respectively, using the hypergraph partitioning for solving large linear systems associated to the sparse
+five-bands matrices presented in Table~\ref{ch12:tab:07}. For these experimental tests, we have applied the
+parallel hypergraph partitioning~\cite{ch12:ref15} developed in Zoltan tool~\cite{ch12:ref14}. We have initialized
+the parameters of the partitioning operation as follows:
+\begin{itemize}
  \item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
  \item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
  \item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\
-\end{itemize*}  
+\end{itemize}  
  
-\begin{table}[!h]
+\begin{table}
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|} 
  \hline
@@ -914,11 +1043,11 @@ thermal2        & $1.889s$             & $0.348s$              & $5.43$
  \end{tabular}
  \caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
  sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{tab:10}
+\label{ch12:tab:10}
  \end{center}
  \end{table}
  
-\begin{table}[!h]
+\begin{table}
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|} 
  \hline
@@ -950,23 +1079,30 @@ torso3            & $48.459s$            & $6.234s$              & $7.77$
  \end{tabular}
  \caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
  sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{tab:11}
+\label{ch12:tab:11}
  \end{center}
  \end{table}
  
-We can notice from both Tables~\ref{tab:10} and~\ref{tab:11} that the hypergraph partitioning has improved the performances of both
-parallel CG and GMRES algorithms. The execution times
-on the GPU cluster of both parallel solvers are significantly improved compared to those obtained by using the partitioning row-by-row.
-For these examples of sparse matrices, the execution times of CG and GMRES solvers are reduced about $76\%$ and $65\%$ respectively
-(see column~$5$ of each table) compared to those obtained in Tables~\ref{tab:08} and~\ref{tab:09}.
-
-In fact, the hypergraph partitioning applied to sparse matrices having large bandwidths allows to reduce the total communication volume
-necessary to synchronize the computations between the computing nodes in the GPU cluster. Table~\ref{tab:12} presents, for each sparse
-matrix, the total communication volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row (column~$2$), the
-total communication volume obtained by using the hypergraph partitioning (column~$3$) and the execution times in minutes of the hypergraph
-partitioning operation performed by $12$ MPI processes (column~$4$). The total communication volume defines the total number of the vector
-elements exchanged by the computing nodes. Then, Table~\ref{tab:12} shows that the hypergraph partitioning method can split the sparse 
-matrix so as to minimize the data dependencies between the computing nodes and thus to reduce the total communication volume.
+We can notice from both Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} that the
+hypergraph partitioning has improved the performances of both parallel CG and GMRES
+algorithms. The execution times on the GPU cluster of both parallel solvers are
+significantly improved compared to those obtained by using the partitioning row-by-row.
+For these examples of sparse matrices, the execution times of CG and GMRES solvers
+are reduced about $76\%$ and $65\%$ respectively (see column~$5$ of each table)
+compared to those obtained in Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09}.
+
+In fact, the hypergraph partitioning\index{Hypergraph} applied to sparse matrices
+having large bandwidths allows to reduce the total communication volume necessary
+to synchronize the computations between the computing nodes in the GPU cluster.
+Table~\ref{ch12:tab:12} presents, for each sparse matrix, the total communication
+volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row
+(column~$2$), the total communication volume obtained by using the hypergraph partitioning
+(column~$3$) and the execution times in minutes of the hypergraph partitioning
+operation performed by $12$ MPI processes (column~$4$). The total communication
+volume defines the total number of the vector elements exchanged by the computing
+nodes. Then, Table~\ref{ch12:tab:12} shows that the hypergraph partitioning method
+can split the sparse matrix so as to minimize the data dependencies between the
+computing nodes and thus to reduce the total communication volume.
  
  \begin{table}
  \begin{center}
@@ -1002,45 +1138,50 @@ poli\_large                  & $25,053,554$            & $7,388,883$
  torso3                       & $25,682,514$            & $613,250$               & $61.51$         \\ \hline       
  \end{tabular}
  \caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
-\label{tab:12}
+\label{ch12:tab:12}
  \end{center}
  \end{table}
  
-Nevertheless, as we can see from the fourth column of Table~\ref{tab:12}, the hypergraph partitioning takes longer compared
-to the execution times of the resolutions. As previously mentioned, the hypergraph partitioning method is less efficient in
-terms of memory consumption and partitioning time than its graph counterpart, but the hypergraph well models the nonsymmetric
-and irregular problems. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning
-on these matrices only once for each and then, we save the traces of these partitionings in files to be reused several times.
-Therefore, this allows to avoid the partitioning of the sparse matrices at each resolution of the linear systems.
-
-\begin{figure}[!t]
+Nevertheless, as we can see from the fourth column of Table~\ref{ch12:tab:12},
+the hypergraph partitioning takes longer compared to the execution times of the
+resolutions. As previously mentioned, the hypergraph partitioning method is less
+efficient in terms of memory consumption and partitioning time than its graph
+counterpart, but the hypergraph well models the nonsymmetric and irregular problems.
+So for the applications which often use the same sparse matrices, we can perform
+the hypergraph partitioning on these matrices only once for each and then, we save
+the traces of these partitionings in files to be reused several times. Therefore,
+this allows to avoid the partitioning of the sparse matrices at each resolution
+of the linear systems.
+
+\begin{figure}[!h]
  \centering
-\begin{tabular}{c}
-\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band} \\
-\small{(a) Sparse band matrices} \\
-\\
-\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band} \\
-\small{(b) Sparse five-bands matrices}
-\end{tabular}
+  \mbox{\subfigure[Sparse band matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band}\label{ch12:fig:09.01}}}
+\vfill 
+  \mbox{\subfigure[Sparse five-bands matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band}\label{ch12:fig:09.02}}}
  \caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
-\label{fig:09}
+\label{ch12:fig:09}
  \end{figure}
  
-However, the most important performance parameter is the scalability of the parallel CG and GMRES solvers on a GPU cluster.
-Particularly, we have taken into account the weak-scaling of both parallel algorithms on a cluster of one to 12 GPU computing
-nodes. We have performed a set of experiments on both matrix structures: band matrices and five-bands matrices. The
-sparse matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen from the Davis's collection.
-Figures~\ref{fig:09}-$(a)$ and~\ref{fig:09}-$(b)$ show the execution times of both parallel methods for solving large linear
-systems associated to band matrices and those associated to five-bands matrices, respectively. The size of a sparse sub-matrix
-per computing node, for each matrix structure, is fixed as follows:
-\begin{itemize*}
+However, the most important performance parameter is the scalability of the parallel
+CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} solvers on a GPU
+cluster. Particularly, we have taken into account the weak-scaling of both parallel
+algorithms on a cluster of one to 12 GPU computing nodes. We have performed a set of
+experiments on both matrix structures: band matrices and five-bands matrices. The sparse
+matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen
+from the Davis's collection. Figures~\ref{ch12:fig:09.01} and~\ref{ch12:fig:09.02}
+show the execution times of both parallel methods for solving large linear systems
+associated to band matrices and those associated to five-bands matrices, respectively.
+The size of a sparse sub-matrix per computing node, for each matrix structure, is fixed
+as follows:
+\begin{itemize}
  \item band matrix: $15$ million of rows and $105,166,557$ of nonzero values,
  \item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values. 
-\end{itemize*}
-We can see from these figures that both parallel solvers are quite scalable on a GPU cluster. Indeed, the execution times remains
-almost constant while the size of the sparse linear systems to be solved increases proportionally with the number of 
-the GPU computing nodes. This means that the communication cost is relatively constant regardless of the number the computing nodes
-in the GPU cluster.
+\end{itemize}
+We can see from these figures that both parallel solvers are quite scalable on a GPU
+cluster. Indeed, the execution times remains almost constant while the size of the
+sparse linear systems to be solved increases proportionally with the number of the
+GPU computing nodes. This means that the communication cost is relatively constant
+regardless of the number the computing nodes in the GPU cluster.
  
  
  
@@ -1048,31 +1189,44 @@ in the GPU cluster.
  %%       SECTION 6          %%
  %%--------------------------%%
  \section{Conclusion}
-\label{sec:06}
-In this chapter, we have aimed at harnessing the computing power of a cluster of GPUs for solving large sparse linear systems.
-For this, we have used two Krylov sub-space iterative methods: the CG and GMRES methods. The first method is well-known to its
-efficiency for solving symmetric linear systems and the second one is used, particularly, for solving nonsymmetric linear systems. 
-
-We have presentend the parallel implementation of both iterative methods on a GPU cluster. Particularly, the operations dealing with
-the vectors and/or matrices, of these methods, are parallelized between the different GPU computing nodes of the cluster. Indeed,
-the data-parallel vector operations are accelerated by GPUs and the communications required to synchronize the parallel computations
-are carried out by CPU cores. For this, we have used a heterogeneous CUDA/MPI programming to implement the parallel iterative
+\label{ch12:sec:06}
+In this chapter, we have aimed at harnessing the computing power of a
+cluster of GPUs for solving large sparse linear systems. For this, we
+have used two Krylov subspace iterative methods: the CG and GMRES methods.
+The first method is well-known to its efficiency for solving symmetric
+linear systems and the second one is used, particularly, for solving
+nonsymmetric linear systems. 
+
+We have presented the parallel implementation of both iterative methods
+on a GPU cluster. Particularly, the operations dealing with the vectors
+and/or matrices, of these methods, are parallelized between the different
+GPU computing nodes of the cluster. Indeed, the data-parallel vector operations
+are accelerated by GPUs and the communications required to synchronize the
+parallel computations are carried out by CPU cores. For this, we have used
+a heterogeneous CUDA/MPI programming to implement the parallel iterative
  algorithms.
  
-In the experimental tests, we have shown that using a GPU cluster is efficient for solving linear systems associated to very large
-sparse matrices. The experimental results, obtained in the present chapter, showed that a cluster of $12$ GPUs is about $7$
-times faster than a cluster of $24$ CPU cores for solving large sparse linear systems of $25$ million unknown values. This is due
-to the GPU ability to compute the data-parallel operations faster than the CPUs. However, we have shown that solving linear systems
-associated to matrices having large bandwidths uses many communications to synchronize the computations of GPUs, which slow down 
-even more the resolution. Moreover, there are two kinds of communications: between a CPU and its GPU and between CPUs of the computing
-nodes, such that the first ones are the slowest communications on a GPU cluster. So, we have proposed to use the hypergraph partitioning
-instead of the row-by-row partitioning. This allows to minimize the data dependencies between the GPU computing nodes and thus to
-reduce the total communication volume. The experimental results showed that using the hypergraph partitioning technique improve the
-execution times on average of $76\%$ to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs. 
-
-In the recent GPU hardware and software architectures, the GPU-Direct system with CUDA version 5.0 is used so that two GPUs located on
-the same node or on distant nodes can communicate between them directly without CPUs. This allows to improve the data transfers between
-GPUs.          
+In the experimental tests, we have shown that using a GPU cluster is efficient
+for solving linear systems associated to very large sparse matrices. The experimental
+results, obtained in the present chapter, showed that a cluster of $12$ GPUs is
+about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
+linear systems of $25$ million unknown values. This is due to the GPU ability to
+compute the data-parallel operations faster than the CPUs. However, we have shown
+that solving linear systems associated to matrices having large bandwidths uses
+many communications to synchronize the computations of GPUs, which slow down even
+more the resolution. Moreover, there are two kinds of communications: between a
+CPU and its GPU and between CPUs of the computing nodes, such that the first ones
+are the slowest communications on a GPU cluster. So, we have proposed to use the
+hypergraph partitioning instead of the row-by-row partitioning. This allows to
+minimize the data dependencies between the GPU computing nodes and thus to reduce
+the total communication volume. The experimental results showed that using the
+hypergraph partitioning technique improve the execution times on average of $76\%$
+to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs. 
+
+In the recent GPU hardware and software architectures, the GPU-Direct system with
+CUDA version 5.0 is used so that two GPUs located on the same node or on distant
+nodes can communicate between them directly without CPUs. This allows to improve
+the data transfers between GPUs.          
  
  \putbib[Chapters/chapter12/biblio12]