-\label{sec:03.02}
-After the partitioning operation, all the data involved from this operation must be transferred from the CPU memories to the GPU
-memories, in order to be processed by GPUs. We use two functions of the CUBLAS library (CUDA Basic Linear Algebra Subroutines),
-developed by Nvidia~\cite{ref6}: \verb+cublasAlloc()+ for the memory allocations on GPUs and \verb+cublasSetVector()+ for the
-memory copies from the CPUs to the GPUs.
-
-An efficient implementation of CG and GMRES solvers on a GPU cluster requires to determine all parts of their codes that can be
-executed in parallel and, thus, take advantage of the GPU acceleration. As many Krylov sub-space methods, the CG and GMRES methods
-are mainly based on arithmetic operations dealing with vectors or matrices: sparse matrix-vector multiplications, scalar-vector
-multiplications, dot products, Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a
-scalar) and so on. These vector operations are often easy to parallelize and they are more efficient on parallel computers when
-they work on large vectors. Therefore, all the vector operations used in CG and GMRES solvers must be executed by the GPUs as kernels.
-
-We use the kernels of the CUBLAS library to compute some vector operations of CG and GMRES solvers. The following kernels of CUBLAS
-(dealing with double floating point) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the Euclidean
-norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of the data-parallel operations, we code their kernels in CUDA.
-In the CG solver, we develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in Algorithm~\ref{alg:01}. In the
-GMRES solver, we program a kernel for the scalar-vector multiplication (lines~$7$ and~$15$ in Algorithm~\ref{alg:02}), a kernel for
-solving the least-squares problem and a kernel for the elements updates of the solution vector $x$.
-
-The least-squares problem in the GMRES method is solved by performing a QR factorization on the Hessenberg matrix $\bar{H}_m$ with
-plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, solving the least-squares
-problem on the GPU is not interesting. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However,
-the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. Therefore, we develop
-an inexpensive kernel which must be executed in sequential by a single CUDA thread.
-
-The most important operation in CG and GMRES methods is the sparse matrix-vector multiplication (SpMV), because it is often an
-expensive operation in terms of execution time and memory space. Moreover, it requires to take care of the storage format of the
-sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix can cause a significant
-waste of memory space and execution time. In addition, the sparsity nature of the matrix often leads to irregular memory accesses
-to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced accesses to
-the global memory, which slows down even more its performances. One of the most efficient compressed storage formats of sparse
-matrices on GPUs is HYB format~\cite{ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
-a typical number of nonzero values per row in ELL format and remaining entries of exceptional rows in COO format. It combines
-the efficiency of ELL due to the regularity of its memory accesses and the flexibility of COO which is insensitive to the matrix
-structure. Consequently, we use the HYB kernel~\cite{ref8} developed by Nvidia to implement the SpMV multiplication of CG and
-GMRES methods on GPUs. Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill the elements of
-the iterate vector $x$ in the cached texture memory.
+\label{ch12:sec:03.02}
+After the partitioning operation, all the data involved in this operation must be
+transferred from the CPU memories to the GPU memories in order to be processed by
+the GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear
+Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+
+for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory
+copies from the CPUs to the GPUs.
+
+An efficient implementation of the CG and GMRES solvers on a GPU cluster requires
+identifying all parts of their codes that can be executed in parallel and can thus take
+advantage of GPU acceleration. Like many Krylov subspace methods, the CG and GMRES
+methods are mainly based on arithmetic operations dealing with vectors or matrices:
+sparse matrix-vector multiplications, scalar-vector multiplications, dot products,
+Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors
+and $a$ is a scalar) and so on. These vector operations are often easy to parallelize,
+and they are more efficient on parallel computers when they operate on large vectors.
+Therefore, all the vector operations used in the CG and GMRES solvers are executed
+by the GPUs as kernels.
+
+We use the kernels of the CUBLAS library to compute some vector operations of the CG and
+GMRES solvers. The following CUBLAS kernels (operating on double-precision floating-point
+data) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the
+Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of
+the data-parallel operations, we write our own kernels in CUDA. In the CG solver, we
+develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in
+Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector
+multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel for
+solving the least-squares problem and a kernel for updating the elements of the solution
+vector $x$.
+
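As a minimal illustration of what the XPAY update computes, the following sequential C routine applies $y\leftarrow x+ay$ element by element. It is only a CPU reference sketch: the name `xpay_reference` is hypothetical, and the kernel described in the text is written in CUDA, where each loop iteration would typically be assigned to one GPU thread.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical CPU reference for the XPAY update y <- x + a*y.
 * Each iteration is independent of the others, which is why the
 * operation maps naturally onto one GPU thread per vector element. */
void xpay_reference(size_t n, double a, const double *x, double *y)
{
    assert(x != NULL && y != NULL);
    for (size_t i = 0; i < n; ++i) {
        y[i] = x[i] + a * y[i];
    }
}
```

Note the difference from AXPY: here the scalar multiplies the vector that is overwritten, so the operation cannot be expressed as a single `cublasDaxpy()` call without an extra scaling or copy step.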
+The least-squares problem in the GMRES method is solved by performing a QR factorization
+on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and
+then solving the resulting triangular system by backward substitution to compute $y$.
+Solving this problem on the GPU brings little benefit, because triangular
+solves are not easy to parallelize and are inefficient on GPUs. However, the least-squares
+problem to solve in the GMRES method with restarts generally has a very small size $m$.
+Therefore, we develop an inexpensive kernel which is executed sequentially by a
+single CUDA thread.
+
+The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES}
+methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication},
+because it is often an expensive operation in terms of execution time and memory space.
+Moreover, it requires care in choosing the storage format of the sparse matrix in
+memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
+can cause a significant waste of memory space and execution time. In addition, the sparse
+nature of the matrix often leads to irregular memory accesses when reading the matrix nonzero
+values. So, the computation of the SpMV multiplication on GPUs can involve non-coalesced
+accesses to the global memory, which further degrades its performance. One of the
+most efficient compressed storage formats\index{Compressed~storage~format} of sparse
+matrices on GPUs is the HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7},
+a combination of the ELLpack (ELL) and Coordinate (COO) formats. It stores
+a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL}
+format and the remaining entries of exceptional rows in COO format. It thus combines the efficiency
+of ELL, due to the regularity of its memory accesses, with the flexibility of COO\index{Compressed~storage~format!COO},
+which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
+developed by Nvidia to implement the SpMV multiplication of the CG and GMRES methods on GPUs.
+Moreover, to avoid non-coalesced accesses to the high-latency global memory, we store
+the elements of the iterate vector $x$ in the cached texture memory.
+
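To make the split between the two parts of the HYB format concrete, the following C sketch computes $y\leftarrow Ax$ from a matrix held as an ELL part plus a COO part. It is a sequential CPU reference written for illustration only, not the Nvidia HYB kernel cited above: the `hyb_matrix` structure, its field names, the column-major ELL layout and the use of `-1` as a padding marker are assumptions made for this example.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical HYB container: a fixed number of entries per row in ELL,
 * plus the overflow entries of exceptional rows in COO. */
typedef struct {
    int n;                 /* number of rows */
    int ell_width;         /* entries stored per row in the ELL part */
    const int *ell_col;    /* n * ell_width column indices, column-major;
                              -1 marks a padding slot */
    const double *ell_val; /* n * ell_width values, same layout */
    int coo_nnz;           /* number of entries in the COO part */
    const int *coo_row;
    const int *coo_col;
    const double *coo_val;
} hyb_matrix;

/* Sequential reference for y <- A*x with A in HYB format. */
void hyb_spmv_reference(const hyb_matrix *a, const double *x, double *y)
{
    assert(a != NULL && x != NULL && y != NULL);

    /* ELL part: every row does the same number of steps. With the
     * column-major layout, consecutive rows read consecutive memory
     * locations at each step k, which is what allows coalesced
     * accesses when one GPU thread handles one row. */
    for (int i = 0; i < a->n; ++i) {
        double sum = 0.0;
        for (int k = 0; k < a->ell_width; ++k) {
            int col = a->ell_col[k * a->n + i];
            if (col >= 0) {
                sum += a->ell_val[k * a->n + i] * x[col];
            }
        }
        y[i] = sum;
    }

    /* COO part: the remaining entries are added one by one. */
    for (int e = 0; e < a->coo_nnz; ++e) {
        y[a->coo_row[e]] += a->coo_val[e] * x[a->coo_col[e]];
    }
}
```

In this sketch the reads of `x[col]` are the irregular accesses: their addresses depend on the sparsity pattern, which is the motivation given in the text for reading $x$ through the cached texture memory.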