X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/6f31b862f879417ba764d85798f4ce9197c6cfe6..55ce7168c6e69a2462d76c95dc9a5298ceedb04f:/BookGPU/Chapters/chapter12/ch12.tex diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex index 8b869b3..4bc95a6 100755 --- a/BookGPU/Chapters/chapter12/ch12.tex +++ b/BookGPU/Chapters/chapter12/ch12.tex @@ -1,15 +1,15 @@ -1%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% +%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %% %% %% CHAPTER 12 %% %% %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\chapterauthor{}{} -\chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier, and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} -\chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters} +\chapter[Solving linear systems with GMRES and CG methods on GPU clusters]{Solving sparse linear systems with GMRES and CG methods on GPU clusters} \label{ch12} %%--------------------------%% @@ -19,16 +19,16 @@ \label{ch12:sec:01} Sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or the industrial processing of the complex or -non-Newtonian fluids. Moreover, the resolution of these problems often involves the -solving of such linear systems which are considered as the most expensive process in +nonNewtonian fluids. Moreover, the resolution of these problems often involves the +solving of such linear systems that are considered the most expensive process in terms of execution time and memory space. Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing size. There are, in the jargon of numerical analysis, different methods of solving sparse -linear systems that can be classified in two classes: the direct and iterative methods. -However, the iterative methods are often more suitable than their counterpart, direct -methods, to solve these systems. Indeed, they are less memory consuming and easier +linear systems that can be classified in two classes: direct and iterative methods. +However, the iterative methods are often more suitable than their counterparts, direct +methods, to solve these systems. Indeed, they are less memory-consuming and easier to parallelize on parallel computers than direct methods. Different computing platforms, sequential and parallel computers, are used to solve sparse linear systems with iterative solutions. Nowadays, graphics processing units (GPUs) have become attractive to solve @@ -38,8 +38,8 @@ traditional CPUs. In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03}, we give the main key points of the parallel implementation of both methods on a cluster of -GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a -CPU cluster and on a GPU cluster, to solve large sparse linear systems. +GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results, obtained on a +CPU cluster and on a GPU cluster of solving large sparse linear systems. %%--------------------------%% @@ -47,19 +47,19 @@ CPU cluster and on a GPU cluster, to solve large sparse linear systems. %%--------------------------%% \section{Krylov iterative methods} \label{ch12:sec:02} -Let us consider the following system of $n$ linear equations\index{Sparse~linear~system} +Let us consider the following system of $n$ linear equations\index{sparse linear system} in $\mathbb{R}$: \begin{equation} Ax=b, \label{ch12:eq:01} \end{equation} where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ -is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a +is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side, and $n\in\mathbb{N}$ is a large integer number. -The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01}) +The iterative methods\index{iterative method} for solving the large sparse linear system~(\ref{ch12:eq:01}) proceed by successive iterations of a same block of elementary operations, during which an -infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. Indeed, from an +infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ is computed. Indeed, from an initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows: \begin{equation} @@ -68,20 +68,20 @@ x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b. \end{equation} The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$ -after a fixed number of iterations and/or when a given convergence criterion\index{Convergence} +after a fixed number of iterations and/or when a given convergence criterion\index{convergence} is satisfied as follows: \begin{equation} \|b-A\tilde{x}\| < \varepsilon, \label{ch12:eq:03} \end{equation} -where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}. +where $\varepsilon<1$ is the required convergence tolerance threshold\index{convergence!tolerance threshold}. Some of the most iterative methods that have proven their efficiency for solving large sparse -linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}. -In the present chapter, we describe two Krylov methods which are widely used: the conjugate -gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the -Krylov subspace methods are usually used with preconditioners that allow to improve their -convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} +linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{iterative method!Krylov subspace}. +In the present chapter, we describe two Krylov methods which are widely used: the CG method (conjugate +gradient method) and the GMRES method (generalized minimal residual method). In practice, the +Krylov subspace methods are usually used with preconditioners that allow the improvement of their +convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system: \begin{equation} M^{-1}Ax=M^{-1}b, @@ -95,23 +95,23 @@ where $M$ is the preconditioning matrix. \subsection{CG method} \label{ch12:sec:02.01} The conjugate gradient method was initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}. -It is one of the well known iterative method to solve large sparse linear systems. In addition, it +It is one of the well-known iterative methods to solve large sparse linear systems. In addition, it can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices. -The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate -solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as +The main idea of the CG method\index{iterative method!CG} is the computation of a sequence of approximate +solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{iterative method!Krylov~subspace} of order $k$ as follows: \begin{equation} x_k \in x_0 + \mathcal{K}_k(A,r_0), \label{ch12:eq:04} \end{equation} -such that the Galerkin condition\index{Galerkin~condition} must be satisfied: +such that the Galerkin condition\index{Galerkin condition} must be satisfied: \begin{equation} r_k \bot \mathcal{K}_k(A,r_0), \label{ch12:eq:05} \end{equation} -where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$ +where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$, and $\mathcal{K}_k$ the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\] In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$ which are pairwise $A$-conjugate ($A$-orthogonal): @@ -142,9 +142,9 @@ p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}. \label{ch12:eq:09} \end{equation} Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ -over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure +over the Krylov subspace $\mathcal{K}_{k}$, and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and -the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that: +the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow the deduction that: \begin{equation} \begin{array}{ll} \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}. @@ -176,21 +176,21 @@ the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that: $k = k + 1$\; } } -\caption{Left-preconditioned CG method} +\caption{left-preconditioned CG method} \label{ch12:alg:01} \end{algorithm} Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows -to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}). +the solving the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system~(\ref{ch12:eq:11}). In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum -number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. +number of iterations, and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned residual $z_k$ and to the direction vectors $\{p_i\}_{i0}$ in -a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows: +a Krylov subspace\index{iterative method!Krylov subspace} $\mathcal{K}_k$ as follows: \begin{equation} \begin{array}{ll} x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2}, \end{array} \label{ch12:eq:12} \end{equation} -so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied: +so that the Petrov-Galerkin condition\index{Petrov-Galerkin condition} is satisfied: \begin{equation} \begin{array}{ll} r_k \bot A \mathcal{K}_k(A, v_1). \end{array} \label{ch12:eq:13} \end{equation} -GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an -orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix} +GMRES uses the Arnoldi iterations~\cite{ch12:ref5}\index{iterative method!Arnoldi iterations} to construct an +orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_k$ of order $(k+1)\times k$: \begin{equation} \begin{array}{ll} @@ -240,7 +240,7 @@ x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}. \end{array} \label{ch12:eq:16} \end{equation} -From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that: +From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that \begin{equation} \begin{array}{lll} r_{k} & = & b - A (x_{0} + V_{k}y) \\ @@ -257,22 +257,23 @@ norm of the residual $r_k$. Consequently, a linear least-squares problem of size \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}. \label{ch12:eq:18} \end{equation} -The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using -Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that: +The QR factorization of matrix $\bar{H}_k$ is used (the decomposition of the matrix $\bar{H}$ into $Q$ and $R$ matrices) +to compute the solution of this problem by using +Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that \begin{equation} \begin{array}{lll} \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k}, \end{array} \label{ch12:eq:19} \end{equation} -where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix. +where $Q_k$ is an orthogonal matrix and $R_k$ is an upper triangular matrix. The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$ iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the GMRES method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and -with $x_m$ as the initial guess to the next iteration. This allows to limit the size of the basis +with $x_m$ as the initial guess to the next iteration. This allows the limitation of the size of the basis $V$ to $m$ orthogonal vectors. \begin{algorithm}[!t] @@ -292,7 +293,7 @@ $V$ to $m$ orthogonal vectors. $h_{j+1,j} = \|w_{j}\|_{2}$\; $v_{j+1} = w_{j}/h_{j+1,j}$\; } - Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ a $(m+1)\times m$ upper Hessenberg matrix\; + Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ is an upper Hessenberg matrix of size $(m+1)\times m$\; Solve a least-squares problem of size $m$: $min_{y\in\mathrm{I\!R}^{m}}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\; $x_{m} = x_{0}+V_{m}y_{m}$\; $r_{m} = M^{-1}(b-Ax_{m})$\; @@ -305,21 +306,21 @@ $V$ to $m$ orthogonal vectors. $k = k + 1$\; } } -\caption{Left-preconditioned GMRES method with restarts} +\caption{left-preconditioned GMRES method with restarts} \label{ch12:alg:02} \end{algorithm} Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts. -It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear +It solves the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration -$k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from +$k$, GMRES uses the Arnoldi iterations\index{iterative method!Arnoldi iterations} (defined from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper -Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it +Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the -maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$) +maximum number of iterations\index{convergence!maximum number of iterations} ($maxiter$) is reached. @@ -328,12 +329,12 @@ is reached. %%--------------------------%% \section{Parallel implementation on a GPU cluster} \label{ch12:sec:03} -In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG} -and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on -a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by a -MPI process and equipped with a GPU card. The parallelization of these algorithms is carried out by -using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the -CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods +In this section, we present the parallel algorithms of both iterative CG\index{iterative method!CG} +and GMRES\index{iterative method!GMRES} methods for GPU clusters. The implementation is performed on +a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by one +MPI (message passing interface) process and equipped with a GPU card. The parallelization of these algorithms is carried out by +using the MPI communication routines between the GPU computing nodes\index{computing node} and the +CUDA (compute unified device architecture) programming environment inside each node. In what follows, the algorithms of the iterative methods are called iterative solvers. @@ -342,24 +343,24 @@ are called iterative solvers. \subsection{Data partitioning} \label{ch12:sec:03.01} The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning -between the computing nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the -GPU cluster. The partitioning operation consists in the decomposition of the vectors and matrices, involved -in the iterative solver, in $p$ portions. Indeed, this operation allows to assign to each computing node +between the computing nodes of the GPU cluster. Let $p$ denote the number of the computing nodes on the +GPU cluster. The partitioning operation consists of the decomposition of the vectors and matrices, involved +in the iterative solver, in $p$ portions. Indeed, this operation allows the assignment to each computing node $i$: \begin{itemize} \item a portion of size $\frac{n}{p}$ elements of each vector, -\item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and, -\item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, +\item a sparse rectangular submatrix $A_i$ of size $(\frac{n}{p},n)$, and +\item a square preconditioning submatrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, \end{itemize} where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive row-wise partitioning (row-by-row decomposition) on the data of the sparse linear systems to be solved. Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes -of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$ +of a sparse linear system (sparse matrix $A$, solution vector $x$, and right-hand side $b$) of size $16$ unknown values. \begin{figure} \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}} -\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.} +\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$, and the right-hand side $b$ into four portions.} \label{ch12:fig:01} \end{figure} @@ -371,17 +372,17 @@ unknown values. After the partitioning operation, all the data involved from this operation must be transferred from the CPU memories to the GPU memories, in order to be processed by GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear -Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+ +Algebra Subroutines) developed by NVIDIA~\cite{ch12:ref6}: \verb+cublasAlloc()+ for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory copies from the CPUs to the GPUs. -An efficient implementation of CG and GMRES solvers on a GPU cluster requires to -determine all parts of their codes that can be executed in parallel and, thus, take +An efficient implementation of CG and GMRES solvers on a GPU cluster requires the +determining of all parts of their codes that can be executed in parallel and, thus, takes advantage of the GPU acceleration. As many Krylov subspace methods, the CG and GMRES methods are mainly based on arithmetic operations dealing with vectors or matrices: sparse matrix-vector multiplications, scalar-vector multiplications, dot products, -Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors -and $a$ is a scalar) and so on. These vector operations are often easy to parallelize +Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a scalar), +and so on. These vector operations are often easy to parallelize and they are more efficient on parallel computers when they work on large vectors. Therefore, all the vector operations used in CG and GMRES solvers must be executed by the GPUs as kernels. @@ -389,41 +390,41 @@ by the GPUs as kernels. We use the kernels of the CUBLAS library to compute some vector operations of CG and GMRES solvers. The following kernels of CUBLAS (dealing with double floating point) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the -Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of +Euclidean norms, and \verb+cublasDaxpy()+ for the AXPY operations ($y\leftarrow ax+y$, compute a scalar-vector product and add +the result to a vector). For the rest of the data-parallel operations, we code their kernels in CUDA. In the CG solver, we -develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used line~$12$ in +develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used in line~$12$ in Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel to -solve the least-squares problem and a kernel to update the elements of the solution +solve the least-squares problem, and a kernel to update the elements of the solution vector $x$. The least-squares problem in the GMRES method is solved by performing a QR factorization -on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and, +on the Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ with plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, -solving the least-squares problem on the GPU is not interesting. Indeed, the triangular +solving the least-squares problem on the GPU is not efficient. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However, the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. -Therefore, we develop an inexpensive kernel which must be executed in sequential by a -single CUDA thread. +Therefore, we develop an inexpensive kernel which must be executed by a single CUDA thread. -The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} -methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication}, +The most important operation in CG\index{iterative method!CG} and GMRES\index{iterative method!GMRES} +methods is the SpMV multiplication (sparse matrix-vector multiplication)\index{SpMV multiplication}, because it is often an expensive operation in terms of execution time and memory space. -Moreover, it requires to take care of the storage format of the sparse matrix in the +Moreover, it requires taking care of the storage format of the sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix can cause a significant waste of memory space and execution time. In addition, the sparse nature of the matrix often leads to irregular memory accesses to read the matrix nonzero -values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced +values. So, the computation of the SpMV multiplication on GPUs can involve noncoalesced accesses to the global memory, which slows down its performances even more. One of the -most efficient compressed storage formats\index{Compressed~storage~format} of sparse -matrices on GPUs is the HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}. +most efficient compressed storage formats\index{compressed storage format} of sparse +matrices on GPUs is the HYB (hybrid)\index{compressed storage format!HYB} format~\cite{ch12:ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores -a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL} -format and remaining entries of exceptional rows in COO format. It combines the efficiency -of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO} +a typical number of nonzero values per row in ELL\index{compressed storage format!ELL} +format and the remaining entries of exceptional rows in COO format. It combines the efficiency +of ELL due to the regularity of its memory accesses and the flexibility of COO\index{compressed storage format!COO} which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8} -developed by Nvidia to implement the SpMV multiplication of CG and GMRES methods on GPUs. -Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill +developed by NVIDIA to implement the SpMV multiplication of CG and GMRES methods on GPUs. +Moreover, to avoid the noncoalesced accesses to the high-latency global memory, we fill the elements of the iterate vector $x$ in the cached texture memory. @@ -433,64 +434,64 @@ the elements of the iterate vector $x$ in the cached texture memory. \label{ch12:sec:03.03} All the computing nodes of the GPU cluster execute in parallel the same iterative solver (Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their -own portions of the sparse linear system\index{Sparse~linear~system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, +own portions of the sparse linear system\index{sparse linear system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, $0\leq i