ch15

diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex

index 0b743ebe0fa770e8f01ec185d2ee2cc436756745..384359721df34ab9f01677ea7a37890a2bb066b6 100755 (executable)
--- a/BookGPU/Chapters/chapter12/ch12.tex
+++ b/BookGPU/Chapters/chapter12/ch12.tex
@@ -5,7 +5,7 @@
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   
  %\chapterauthor{}{}
-\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier, and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
  %\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
  %\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
  
@@ -20,15 +20,15 @@
  Sparse linear systems are used to model many scientific and industrial problems,
  such as the environmental simulations or the industrial processing of the complex or
  non-Newtonian fluids. Moreover, the resolution of these problems often involves the
-solving of such linear systems which are considered as the most expensive process in
+solving of such linear systems, which is considered the most expensive process in
  terms of execution time and memory space. Therefore, solving sparse linear systems
  must be as efficient as possible in order to deal with problems of ever increasing
  size.
  
  There are, in the jargon of numerical analysis, different methods of solving sparse
-linear systems that can be classified in two classes: the direct and iterative methods.
-However, the iterative methods are often more suitable than their counterpart, direct
-methods, to  solve these systems. Indeed, they are less memory consuming and easier
+linear systems that can be classified in two classes: direct and iterative methods.
+However, the iterative methods are often more suitable than their counterparts, direct
+methods, to  solve these systems. Indeed, they are less memory-consuming and easier
  to parallelize on parallel computers than direct methods. Different computing platforms,
  sequential and parallel computers, are used to solve sparse linear systems with iterative
  solutions. Nowadays, graphics processing units (GPUs) have become attractive to solve
@@ -38,8 +38,8 @@ traditional CPUs.
  In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
  methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
  we give the main key points of the parallel implementation of both methods on a cluster of
-GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
-CPU cluster and on a GPU cluster, to solve large sparse linear systems.    
+GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results, obtained on a
+CPU cluster and on a GPU cluster, of solving large sparse linear systems.
  
  
  %%--------------------------%%
@@ -54,12 +54,12 @@ Ax=b,
  \label{ch12:eq:01}
  \end{equation}
  where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$
-is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a
+is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side, and $n\in\mathbb{N}$ is a
  large integer number. 
  
  The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01})
  proceed by successive iterations of the same block of elementary operations, during which an
-infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. Indeed, from an
+infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ is computed. Indeed, from an
  initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate
  solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows:
  \begin{equation}
@@ -78,9 +78,9 @@ where $\varepsilon<1$ is the required convergence tolerance threshold\index{Conv
  
  Some of the iterative methods that have proven their efficiency for solving large sparse
  linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}.
-In the present chapter, we describe two Krylov methods which are widely used: the conjugate
-gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
-Krylov subspace methods are usually used with preconditioners that allow to improve their
+In the present chapter, we describe two Krylov methods which are widely used: the CG method (conjugate
+gradient method) and the GMRES method (generalized minimal residual method). In practice, the
+Krylov subspace methods are usually used with preconditioners that improve their
  convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned}
  sparse linear system:
  \begin{equation}
@@ -95,7 +95,7 @@ where $M$ is the preconditioning matrix.
  \subsection{CG method}
  \label{ch12:sec:02.01}
  The conjugate gradient method was initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}.
-It is one of the well known iterative method to solve large sparse linear systems. In addition, it
+It is one of the well-known iterative methods to solve large sparse linear systems. In addition, it
  can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied
  to problems with positive definite symmetric matrices.
  
@@ -111,7 +111,7 @@ such that the Galerkin condition\index{Galerkin~condition} must be satisfied:
  r_k \bot \mathcal{K}_k(A,r_0),
  \label{ch12:eq:05}
  \end{equation}
-where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$
+where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$, and $\mathcal{K}_k$
  the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
  In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$
  which are pairwise $A$-conjugate ($A$-orthogonal):
@@ -142,9 +142,9 @@ p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}.
  \label{ch12:eq:09}
  \end{equation}
  Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$
-over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure
+over the Krylov subspace $\mathcal{K}_{k}$, and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure
  that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and
-the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that:
+the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) imply that:
  \begin{equation}
  \begin{array}{ll}
  \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}.
@@ -176,21 +176,21 @@ the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that:
        $k = k + 1$\;
      }
    }
-\caption{Left-preconditioned CG method}
+\caption{left-preconditioned CG method}
  \label{ch12:alg:01}
  \end{algorithm}
  
  Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows
-to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}).
+the solving of the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}).
  In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
-number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$.
+number of iterations, and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$.
  At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned
  residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$ previously determined (from line~$8$ to
  line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using
  formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. The CG method converges after, at
  most, $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{Convergence!Tolerance~threshold}
  $\varepsilon$ and/or the maximum number of iterations\index{Convergence!Maximum~number~of~iterations}
-$maxiter$ are reached.
+$maxiter$ is reached.
  
  
  %%****************%%
@@ -240,7 +240,7 @@ x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}.
  \end{array}
  \label{ch12:eq:16}
  \end{equation}
-From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that:
+From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that
  \begin{equation}
  \begin{array}{lll}
    r_{k} & = & b - A (x_{0} + V_{k}y) \\
@@ -257,22 +257,23 @@ norm of the residual $r_k$. Consequently, a linear least-squares problem of size
  \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
  \label{ch12:eq:18}
  \end{equation}
-The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using
-Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that:
+The QR factorization of matrix $\bar{H}_k$ (the decomposition of $\bar{H}_k$ into matrices $Q_k$ and $R_k$) is used
+to compute the solution of this problem by using
+Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that
  \begin{equation}
  \begin{array}{lll}
  \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
  \end{array}
  \label{ch12:eq:19}
  \end{equation}
-where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix.
+where $Q_k$ is an orthogonal matrix and $R_k$ is an upper triangular matrix.
  
  The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$
  iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm
  must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the
  number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the
  GMRES method must be restarted every $m$ iterations, such that $m$ is very small ($m\ll n$), and
-with $x_m$ as the initial guess to the next iteration. This allows to limit the size of the basis
+with $x_m$ as the initial guess to the next iteration. This limits the size of the basis
  $V$ to $m$ orthogonal vectors.
  
  \begin{algorithm}[!t]
@@ -292,7 +293,7 @@ $V$ to $m$ orthogonal vectors.
        $h_{j+1,j} = \|w_{j}\|_{2}$\;
        $v_{j+1} = w_{j}/h_{j+1,j}$\;
      }
-    Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ a $(m+1)\times m$ upper Hessenberg matrix\;
+    Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$, an upper Hessenberg matrix of size $(m+1)\times m$\;
      Solve a least-squares problem of size $m$: $\min_{y\in\mathbb{R}^{m}}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
      $x_{m} = x_{0}+V_{m}y_{m}$\;
      $r_{m} = M^{-1}(b-Ax_{m})$\;
@@ -305,7 +306,7 @@ $V$ to $m$ orthogonal vectors.
        $k = k + 1$\;
      }
    }
-\caption{Left-preconditioned GMRES method with restarts}
+\caption{left-preconditioned GMRES method with restarts}
  \label{ch12:alg:02}
  \end{algorithm}
  
@@ -330,10 +331,10 @@ is reached.
  \label{ch12:sec:03}
  In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG}
  and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on
-a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by a
-MPI process and equipped with a GPU card. The parallelization of these algorithms is carried out by
+a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by one
+MPI (Message Passing Interface) process and equipped with a GPU card. The parallelization of these algorithms is carried out by
  using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the
-CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods
+CUDA (Compute Unified Device Architecture) programming environment inside each node. In what follows, the algorithms of the iterative methods
  are called iterative solvers.
  
  
@@ -342,24 +343,24 @@ are called iterative solvers.
  \subsection{Data partitioning}
  \label{ch12:sec:03.01}
  The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning
-between the computing nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the
-GPU cluster. The partitioning operation consists in the decomposition of the vectors and matrices, involved
-in the iterative solver, in $p$ portions. Indeed, this operation allows to assign to each computing node
+between the computing nodes of the GPU cluster. Let $p$ denote the number of the computing nodes on the
+GPU cluster. The partitioning operation consists of the decomposition of the vectors and matrices, involved
+in the iterative solver, into $p$ portions. This operation assigns to each computing node
  $i$:
  \begin{itemize}
  \item a portion of size $\frac{n}{p}$ elements of each vector,
-\item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and,
-\item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, 
+\item a sparse rectangular submatrix $A_i$ of size $(\frac{n}{p},n)$, and
+\item a square preconditioning submatrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, 
  \end{itemize} 
  where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive
  row-wise partitioning (row-by-row decomposition) on the data of the sparse linear systems to be solved.
  Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes
-of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$
+of a sparse linear system (sparse matrix $A$, solution vector $x$, and right-hand side $b$) of size $16$
  unknown values. 
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}}
-\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.}
+\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$, and the right-hand side $b$ into four portions.}
  \label{ch12:fig:01}
  \end{figure}
  
@@ -371,17 +372,17 @@ unknown values.
  After the partitioning operation, all the data resulting from this operation must be
  transferred from the CPU memories to the GPU memories, in order to be processed by
  GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear
-Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+
+Algebra Subroutines) developed by NVIDIA~\cite{ch12:ref6}: \verb+cublasAlloc()+
  for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory
  copies from the CPUs to the GPUs.
  
-An efficient implementation of CG and GMRES solvers on a GPU cluster requires to
-determine all parts of their codes that can be executed in parallel and, thus, take
+An efficient implementation of CG and GMRES solvers on a GPU cluster requires
+determining all parts of their codes that can be executed in parallel and, thus, take
  advantage of the GPU acceleration. Like many Krylov subspace methods, the CG and GMRES
  methods are mainly based on arithmetic operations dealing with vectors or matrices:
  sparse matrix-vector multiplications, scalar-vector multiplications, dot products,
-Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors
-and $a$ is a scalar) and so on. These vector operations are often easy to parallelize
+Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a scalar),
+and so on. These vector operations are often easy to parallelize
  and they are more efficient on parallel computers when they work on large vectors.
  Therefore, all the vector operations used in CG and GMRES solvers must be executed
  by the GPUs as kernels.
@@ -389,41 +390,41 @@ by the GPUs as kernels.
  We use the kernels of the CUBLAS library to compute some vector operations of CG and
  GMRES solvers. The following kernels of CUBLAS (dealing with double-precision floating point)
  are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the
-Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of
+Euclidean norms, and \verb+cublasDaxpy()+ for the AXPY operations ($y\leftarrow ax+y$, compute a scalar-vector product and add 
+the result to a vector). For the rest of
  the data-parallel operations, we code their kernels in CUDA. In the CG solver, we
-develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used  line~$12$ in
+develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used in line~$12$ in
  Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector
  multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel to
-solve the least-squares problem and a kernel to update the elements  of the solution
+solve the least-squares problem, and a kernel to update the elements of the solution
  vector $x$.
  
  The least-squares problem in the GMRES method is solved by performing a QR factorization
  on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and,
  then, solving the triangular system by backward substitutions to compute $y$. Consequently,
-solving the least-squares problem on the GPU is not interesting. Indeed, the triangular
+solving the least-squares problem on the GPU is not efficient. Indeed, the triangular
  solves are not easy to parallelize and are inefficient on GPUs. However, the least-squares
  problem to solve in the GMRES method with restarts has, generally, a very small size $m$.
-Therefore, we develop an inexpensive kernel which must be executed in sequential by a
-single CUDA thread. 
+Therefore, we develop an inexpensive kernel which must be executed by a single CUDA thread. 
  
  The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES}
-methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication},
+methods is the SpMV multiplication (sparse matrix-vector multiplication)\index{SpMV~multiplication},
  because it is often an expensive operation in terms of execution time and memory space.
-Moreover, it requires to take care of the storage format of the sparse matrix in the
+Moreover, it requires taking care of the storage format of the sparse matrix in the
  memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
  can cause a significant waste of memory space and execution time. In addition, the sparse
  nature of the matrix often leads to irregular memory accesses to read the matrix nonzero
-values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced
+values. So, the computation of the SpMV multiplication on GPUs can involve noncoalesced
  accesses to the global memory, which slows down its performance even more. One of the
  most efficient compressed storage formats\index{Compressed~storage~format} of sparse
-matrices on GPUs is the HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}.
+matrices on GPUs is the HYB (hybrid)\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}.
  It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
  a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL}
-format and remaining entries of exceptional rows in COO format. It combines the efficiency
+format and the remaining entries of exceptional rows in COO format. It combines the efficiency
  of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO}
  which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
-developed by Nvidia to implement the SpMV multiplication of CG and GMRES methods on GPUs.
-Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill
+developed by NVIDIA to implement the SpMV multiplication of CG and GMRES methods on GPUs.
+Moreover, to avoid the noncoalesced accesses to the high-latency global memory, we fill
  the elements of the iterate vector $x$ in the cached texture memory.
  
  
@@ -440,54 +441,54 @@ the cluster. In what follows, two computing nodes sharing data are called neighb
  
  As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication.
  In the parallel implementation of the iterative methods, each computing node $i$ performs the
-SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it has only sub-vectors
-of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires
-the vector elements of its neighbors, corresponding to the column indices on which its sub-matrix
+SpMV multiplication on its own sparse rectangular submatrix $A_i$. Locally, it has only subvectors
+of size $\frac{n}{p}$ corresponding to rows of its submatrix $A_i$. However, it also requires
+the vector elements of its neighbors, corresponding to the column indices on which its submatrix
  has nonzero values (see Figure~\ref{ch12:fig:01}). So, in addition to the local vectors, each
  node must also manage vector elements shared with neighbors and required to compute the SpMV
  multiplication. Therefore, the iterate vector $x$ managed by each computing node is composed
-of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a sub-vector of shared elements $x^{shared}$.
+of a local subvector $x^{local}$ of size $\frac{n}{p}$ and a subvector of shared elements $x^{shared}$.
  In the same way, the vector used to construct the orthonormal basis of the Krylov subspace (vectors
-$p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared
-sub-vector. 
+$p$ and $v$ in CG and GMRES methods, respectively) is composed of a local subvector and a shared
+subvector. 
  
  Therefore, before computing the SpMV multiplication\index{SpMV~multiplication}, the neighboring
  nodes\index{Neighboring~node} over the GPU cluster must exchange between them the shared vector
  elements necessary to compute this multiplication. First, each computing node determines, in its
-local sub-vector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
+local subvector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
  between them these shared vector elements. The data exchanges are implemented by using the MPI
  point-to-point communication routines: blocking\index{MPI~subroutines!Blocking} sends with \verb+MPI_Send()+
  and nonblocking\index{MPI~subroutines!Nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
-shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}
+shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2},
+and \textit{Node 3}. In this example, the sparse matrix $A$ split between these four computing
  nodes is that presented in Figure~\ref{ch12:fig:01}.
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}}
-\caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.}
+\caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}, and \textit{Node 3}.}
  \label{ch12:fig:02}
  \end{figure}
  
  After the synchronization operation, the computing nodes receive, from their respective neighbors,
-the shared elements in a sub-vector stored in a compressed format. However, in order to compute the
+the shared elements in a subvector stored in a compressed format. However, in order to compute the
  SpMV multiplication, the computing nodes operate on sparse global vectors (see Figure~\ref{ch12:fig:02}).
  In this case, the received vector elements must be copied to the corresponding indices in the global
  vector. So as not to need to perform this at each iteration, we propose to reorder the columns of
-each sub-matrix $\{A_i\}_{0\leq i<p}$, so that the shared sub-vectors could be used in their compressed
-storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse sub-matrix (sub-matrix of
+each submatrix $\{A_i\}_{0\leq i<p}$, so that the shared subvectors could be used in their compressed
+storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse submatrix (submatrix of
  \textit{Node 1}). 
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/reorder}}
-\caption{Columns reordering of a sparse sub-matrix.}
+\caption{Columns reordering of a sparse submatrix.}
  \label{ch12:fig:03}
  \end{figure}
  
  A GPU cluster\index{GPU~cluster} is a parallel platform with a distributed memory. So, the synchronizations
-and communication data between GPU nodes are carried out by passing messages. However, GPUs can not communicate
-between them in a direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU
+and communication data between GPU nodes are carried out by passing messages. However, a GPU cannot exchange data
+with other GPUs in a direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU
  cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory
-and vice-versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
+and vice versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
  communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
  and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
  to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI~subroutines!Global}
@@ -502,7 +503,7 @@ to compute in parallel the dot products and Euclidean norms. This is implemented
  \label{ch12:sec:04}
  In this section, we present the performances of the parallel CG and GMRES linear solvers obtained
  on a cluster of $12$ GPUs. Indeed, this GPU cluster of tests is composed of six machines connected
-by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz and
+by a $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz and
  providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are
  connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A
  Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global memory of $4$GB with
@@ -511,12 +512,12 @@ that we used in the experimental tests.
  
  Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used to code
  the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9}
-is used to program GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations
+is used to program GPUs, using the CUBLAS library~\cite{ch12:ref6} to deal with vector operations
  in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between
  CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
-is managed by a MPI process and it is composed of one CPU core and one GPU card.
+is managed by one MPI process and is composed of one CPU core and one GPU card.
  
-\begin{figure}[!h]
+\begin{figure}
  \centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
  \caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
  \label{ch12:fig:04}
@@ -524,26 +525,34 @@ is managed by a MPI process and it is composed of one CPU core and one GPU card.
  
  All tests are made on double-precision floating point operations. The parameters of both linear
  solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
-maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
+maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$, and the
  initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{Iterative~method!Arnoldi~process}
  used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen
-the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily
-compute the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for
+the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows us to easily
+compute the required inverse matrix $M^{-1}$, and it provides a relatively good preconditioning for
  not too ill-conditioned matrices. In the GPU computing, the size of thread blocks is fixed to $512$
  threads. Finally, the performance results, presented hereafter, are obtained from the mean value
  over $10$ executions of the same parallel linear solver and for the same input data.
  
  \begin{figure}
  \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
-\caption{Sketches of sparse matrices chosen from the Davis collection.}
+\caption{Sketches of sparse matrices chosen from the University of Florida collection.}
  \label{ch12:fig:05}
  \end{figure}
  
+To get more realistic results, we have tested the CG and GMRES algorithms on sparse matrices of the University of Florida
+collection~\cite{ch12:ref10}, which arise in a wide spectrum of real-world applications. We have chosen six
+symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
+we show the structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
+which are the number of rows, the total number of nonzero values, and the maximal bandwidth. In
+the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
+the first and the last nonzero value on a matrix row.
+
  \begin{table}
  \centering
  \begin{tabular}{|c|c|c|c|c|}
  \hline
-{\bf Matrix type}             & {\bf Matrix name} & {\bf \# rows} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
+{\bf Matrix Type}             & {\bf Matrix Name} & {\bf \# Rows} & {\bf \# Nonzeros} & {\bf Bandwidth} \\ \hline \hline
  
  \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $101,492$     & $1,647,264$  & $100,464$ \\
  
@@ -569,23 +578,15 @@ over $10$ executions of the same parallel linear solver and for the same input d
  
                                & torso3            & $259,156$     & $4,429,042$  & $216,854$  \\ \hline
  \end{tabular}
-\caption{Main characteristics of sparse matrices chosen from the Davis collection.}
+\caption{Main characteristics of sparse matrices chosen from the University of Florida collection.}
  \label{ch12:tab:01}
  \end{table}
  
-To get more realistic results, we have tested the CG and GMRES algorithms on sparse matrices of the Davis
-collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We have chosen six
-symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
-we show the structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
-which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
-the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
-the first and the last nonzero value on a matrix row.
-
-\begin{table}
+\begin{table}[!h]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
-{\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
+{\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
  
  2cubes\_sphere    & $0.132s$           & $0.069s$            & $1.93$        & $12$           & $1.14e$-$09$     & $3.47e$-$18$ \\
  
@@ -604,11 +605,11 @@ thermal2          & $1.172s$           & $0.622s$            & $1.88$        & $
  \end{center}
  \end{table}
  
-\begin{table}
+\begin{table}[!h]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
-{\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
+{\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
  
  2cubes\_sphere    & $0.234s$           & $0.124s$            & $1.88$        & $21$           & $2.10e$-$14$     & $3.47e$-$18$ \\
  
@@ -640,13 +641,13 @@ torso3            & $4.242s$           & $2.030s$            & $2.09$        & $
  \end{table}
  
  Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} show the performances of the parallel
-CG and GMRES solvers, respectively, for solving linear systems associated to the sparse
-matrices presented in Tables~\ref{ch12:tab:01}. They allow to compare the performances
+CG and~GMRES solvers, respectively, for solving linear systems associated with the sparse
+matrices presented in Table~\ref{ch12:tab:01}. They allow us to compare the performances
  obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02}
-only shows the performances of solving symmetric sparse linear systems, due to the inability
+shows the performances of solving only symmetric sparse linear systems, due to the inability
  of the CG method to solve the nonsymmetric systems. In both tables, the second and third
  columns give, respectively, the execution times in seconds obtained on $24$ CPU cores
-($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account
+($Time_{cpu}$) and those obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account
  the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same
  solver implemented on the CPU cluster. The relative gains\index{Relative~gain}, presented
  in the fourth column, are computed as a ratio of the CPU execution time over the GPU
@@ -656,9 +657,9 @@ execution time:
  \label{ch12:eq:20}
  \end{equation}
  In addition, Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} give the number of iterations
-($iter$), the precision $prec$ of the solution computed on the GPU cluster and the difference
+($iter$), the precision ($prec$) of the solution computed on the GPU cluster, and the difference
  $\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster.
-Both parameters $prec$ and $\Delta$ allow to validate and verify the accuracy of the solution
+Both parameters $prec$ and $\Delta$ allow us to validate and verify the accuracy of the solution
  computed on the GPU cluster. We have computed them as follows:
  \begin{eqnarray}
  \Delta = \max|x^{cpu}-x^{gpu}|,\\
@@ -670,35 +671,35 @@ $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu
  of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster
  were computed with sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
  to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$.
-However, we can notice from the relative gains $\tau$ that it is not interesting to use multiple
-GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow to
+However, we can notice from the relative gains $\tau$ that it is not efficient to use multiple
+GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow us to
  maximize utilization of GPU cores. In addition, the communications required to synchronize the
  computations over the cluster increase the idle times of GPUs and slow down  the parallel
  computations further.
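The three quantities reported in the tables (the relative gain $\tau$, the difference $\Delta$, and the precision $prec$) can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the matrix is held in CSR form, the residual is taken to be the usual $r = b - Ax$, and all function names are hypothetical rather than those of the chapter's solver.

```python
def relative_gain(time_cpu, time_gpu):
    """tau: ratio of the CPU execution time over the GPU execution time."""
    return time_cpu / time_gpu


def max_abs_difference(x_cpu, x_gpu):
    """Delta: largest absolute componentwise gap between the two solutions."""
    return max(abs(a - b) for a, b in zip(x_cpu, x_gpu))


def csr_matvec(row_ptr, col_idx, values, x):
    """Sparse matrix-vector product y = A x for a CSR matrix."""
    y = []
    for i in range(len(row_ptr) - 1):
        acc = 0.0
        for k in range(row_ptr[i], row_ptr[i + 1]):
            acc += values[k] * x[col_idx[k]]
        y.append(acc)
    return y


def max_abs_residual(row_ptr, col_idx, values, x, b):
    """prec: largest absolute entry of the residual, assumed to be b - A x."""
    ax = csr_matvec(row_ptr, col_idx, values, x)
    return max(abs(bi - axi) for bi, axi in zip(b, ax))


# Timings of the first row of the large-system CG table: 1.625 s and 0.401 s.
tau = relative_gain(1.625, 0.401)

# A = [[4, 1, 0], [1, 3, 0], [0, 0, 2]] in CSR form (illustrative data).
row_ptr = [0, 2, 4, 5]
col_idx = [0, 1, 0, 1, 2]
values = [4.0, 1.0, 1.0, 3.0, 2.0]
prec = max_abs_residual(row_ptr, col_idx, values, [1.0, 1.0, 1.0], [5.0, 4.0, 2.25])
delta = max_abs_difference([1.0, 2.0], [1.0, 2.5])
```

With the timings quoted in the comment, `tau` is approximately $4.05$, which matches the relative gain listed for that matrix.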
  
  Consequently, in order to test the performances of the parallel solvers, we developed in the C programming
-language a generator of large sparse matrices. This generator takes a matrix from the Davis collection~\cite{ch12:ref10}
-as an initial matrix to build large sparse matrices exceeding ten million of rows. It must be executed
-in parallel by the MPI processes of the computing nodes, so that each process could build its sparse
-sub-matrix. In the first experimental tests, we focused on sparse matrices having a banded structure,
+language a generator of large sparse matrices. This generator takes a matrix from the University of Florida collection~\cite{ch12:ref10}
+as an initial matrix to build large sparse matrices exceeding ten million rows. It must be executed
+in parallel by the MPI processes of the computing nodes, so that each process can build its sparse
+submatrix. In the first experimental tests, we focused on sparse matrices having a banded structure,
  because they arise in the majority of numerical problems. To generate the global sparse matrix,
-each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
-from the Davis collection. Then, it puts all these copies on the main diagonal of the global matrix
+each MPI process constructs its submatrix by performing several copies of an initial sparse matrix chosen
+from the University of Florida collection. Then, it puts all these copies on the main diagonal of the global matrix
  (see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
-diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
+diagonal are filled with subcopies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
  initial matrix.
  
-\begin{figure}[htbp]
+\begin{figure}
  \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
  \caption{Parallel generation of a large sparse matrix by four computing nodes.}
  \label{ch12:fig:06}
  \end{figure}
  
-\begin{table}[htbp]
+\begin{table}[!h]
  \centering
  \begin{tabular}{|c|c|c|c|}
  \hline
-{\bf Matrix type}             & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
+{\bf Matrix Type}             & {\bf Matrix Name} & {\bf \# Nonzeros} & {\bf Bandwidth} \\ \hline \hline
  
  \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $413,703,602$ & $198,836$     \\
  
@@ -725,15 +726,28 @@ initial matrix.
                                & torso3            & $433,795,264$ & $328,757$        \\ \hline
  \end{tabular}
  \vspace{0.5cm}
-\caption{Main characteristics of sparse banded matrices generated from those of the Davis collection.}
+\caption{Main characteristics of sparse banded matrices generated from those of the University of Florida collection.}
  \label{ch12:tab:04}
  \end{table}
  
-\begin{table}[htbp]
+We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
+million unknown values. The sparse matrices associated with these linear systems are generated
+from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
+Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} show the performances of the parallel CG and
+GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
+GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
+a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
+notice that the execution times of the CG method, whether on a CPU cluster or on a GPU cluster,
+are shorter than those of the GMRES method for solving large symmetric linear systems. In fact, the
+CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
+time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
+method requires more data exchanges between computing nodes compared to the parallel CG method.
+
+\begin{table}[!h]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
-{\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
+{\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
  
  2cubes\_sphere  & $1.625s$             & $0.401s$              & $4.05$          & $14$                & $5.73e$-$11$     & $5.20e$-$18$ \\
  
@@ -753,11 +767,11 @@ on a cluster of 12 GPUs.}
  \end{center}
  \end{table}
  
-\begin{table}
+\begin{table}[!h]
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|c|} 
  \hline
-{\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
+{\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
  
  2cubes\_sphere    & $3.597s$             & $0.514s$              & $6.99$          & $21$                & $2.11e$-$14$     & $8.67e$-$18$ \\
  
@@ -787,22 +801,7 @@ torso3            & $31.463s$            & $3.681s$              & $8.55$
  on a cluster of 12 GPUs.}
  \label{ch12:tab:06}
  \end{center}
-\end{table}
-
-
-We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
-million unknown values. The sparse matrices associated to these linear systems are generated
-from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
-Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and
-GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
-GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
-a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
-notice that the execution times of the CG method, whether in a CPU cluster or in a GPU cluster,
-are better than those of the GMRES method for solving large symmetric linear systems. In fact, the
-CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
-time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
-method requires more data exchanges between computing nodes compared to the parallel CG method.
- 
+\end{table} 
  
  %%--------------------------%%
  %%       SECTION 5          %%
@@ -812,7 +811,7 @@ method requires more data exchanges between computing nodes compared to the para
  In this chapter, we have aimed at harnessing the computing power of a
  cluster of GPUs for solving large sparse linear systems. For this, we
  have used two Krylov subspace iterative methods: the CG and GMRES methods.
-The first method is well-known for its efficiency to solve symmetric
+The first method is well known for its efficiency in solving symmetric
  linear systems and the second one is used, particularly, to solve
  nonsymmetric linear systems. 
  
@@ -820,28 +819,28 @@ We have presented the parallel implementation of both iterative methods
  on a GPU cluster. Particularly, the operations dealing with the vectors
  and/or matrices, of these methods, are parallelized between the different
  GPU computing nodes of the cluster. Indeed, the data-parallel vector operations
-are accelerated by GPUs and the communications required to synchronize the
+are accelerated by GPUs, and the communications required to synchronize the
  parallel computations are carried out by CPU cores. For this, we have used
-a heterogeneous CUDA/MPI programming to implement the parallel iterative
+heterogeneous CUDA/MPI programming to implement the parallel iterative
  algorithms.
  
  In the experimental tests, we have shown that using a GPU cluster is efficient
  for solving linear systems associated with very large sparse matrices. The experimental
-results, obtained in the present chapter, show that a cluster of $12$ GPUs is
+results, discussed in the present chapter, show that a cluster of $12$ GPUs is
  about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
-linear systems of $25$ million unknown values. This is due to the GPU ability to
+linear systems of $25$ million unknown values. This is due to the GPU's ability to
  compute the data-parallel operations faster than the CPUs.
  
-In our future works, we plan to test the parallel algorithms of CG and GMRES methods, adapted
+In future work, we plan to test the parallel CG and~GMRES algorithms, adapted
  to GPUs, for solving large linear systems associated with sparse matrices of different structures.
-For example, the matrices having large bandwidths, which can lead to many data dependencies
+For example, matrices having large bandwidths can lead to many data dependencies
  between the computing nodes and, thus, degrade the performances of both algorithms. So in
  this case, it would be interesting to study the different data partitioning techniques, in
  order to minimize the dependencies between the computing nodes and thus to reduce the total
  communication volume. This may improve the performances of both algorithms implemented on
  a GPU cluster. Moreover, in the recent GPU hardware and software architectures, the GPU-Direct
  system with CUDA version 5.0 is used so that two GPUs located on the same node or on distant
-nodes can communicate between them directly without CPUs. This allows to improve the data
+nodes can communicate with each other directly without CPUs. This allows us to improve the data
  transfers between GPUs.