X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/e2f7ea69b2321fbf77291f35360751e460a99f44..55ce7168c6e69a2462d76c95dc9a5298ceedb04f:/BookGPU/Chapters/chapter12/ch12.tex diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex index 3843597..4bc95a6 100755 --- a/BookGPU/Chapters/chapter12/ch12.tex +++ b/BookGPU/Chapters/chapter12/ch12.tex @@ -19,7 +19,7 @@ \label{ch12:sec:01} Sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or the industrial processing of the complex or -non-Newtonian fluids. Moreover, the resolution of these problems often involves the +nonNewtonian fluids. Moreover, the resolution of these problems often involves the solving of such linear systems that are considered the most expensive process in terms of execution time and memory space. Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing @@ -47,7 +47,7 @@ CPU cluster and on a GPU cluster of solving large sparse linear systems. %%--------------------------%% \section{Krylov iterative methods} \label{ch12:sec:02} -Let us consider the following system of $n$ linear equations\index{Sparse~linear~system} +Let us consider the following system of $n$ linear equations\index{sparse linear system} in $\mathbb{R}$: \begin{equation} Ax=b, @@ -57,7 +57,7 @@ where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\ is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side, and $n\in\mathbb{N}$ is a large integer number. 
-The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01}) +The iterative methods\index{iterative method} for solving the large sparse linear system~(\ref{ch12:eq:01}) proceed by successive iterations of the same block of elementary operations, during which an infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ is computed. Indeed, from an initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate @@ -68,20 +68,20 @@ x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b. \end{equation} The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$ -after a fixed number of iterations and/or when a given convergence criterion\index{Convergence} +after a fixed number of iterations and/or when a given convergence criterion\index{convergence} is satisfied as follows: \begin{equation} \|b-A\tilde{x}\| < \varepsilon, \label{ch12:eq:03} \end{equation} -where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}. +where $\varepsilon<1$ is the required convergence tolerance threshold\index{convergence!tolerance threshold}. Some of the iterative methods that have proven most efficient for solving large sparse -linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}. +linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{iterative method!Krylov subspace}. In the present chapter, we describe two Krylov methods which are widely used: the CG method (conjugate gradient method) and the GMRES method (generalized minimal residual method). In practice, the Krylov subspace methods are usually used with preconditioners that allow the improvement of their -convergence. 
So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} +convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system: \begin{equation} M^{-1}Ax=M^{-1}b, @@ -99,14 +99,14 @@ It is one of the well-known iterative methods to solve large sparse linear syste can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices. -The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate -solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as +The main idea of the CG method\index{iterative method!CG} is the computation of a sequence of approximate +solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{iterative method!Krylov~subspace} of order $k$ as follows: \begin{equation} x_k \in x_0 + \mathcal{K}_k(A,r_0), \label{ch12:eq:04} \end{equation} -such that the Galerkin condition\index{Galerkin~condition} must be satisfied: +such that the Galerkin condition\index{Galerkin condition} must be satisfied: \begin{equation} r_k \bot \mathcal{K}_k(A,r_0), \label{ch12:eq:05} @@ -181,15 +181,15 @@ the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow the deduction th \end{algorithm} Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows -solving the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}). +solving the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system~(\ref{ch12:eq:11}). 
In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum number of iterations, and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned residual $z_k$ and to the direction vectors $\{p_i\}_{i0}$ in -a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows: +a Krylov subspace\index{iterative method!Krylov subspace} $\mathcal{K}_k$ as follows: \begin{equation} \begin{array}{ll} x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2}, \end{array} \label{ch12:eq:12} \end{equation} -so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied: +so that the Petrov-Galerkin condition\index{Petrov-Galerkin condition} is satisfied: \begin{equation} \begin{array}{ll} r_k \bot A \mathcal{K}_k(A, v_1). \end{array} \label{ch12:eq:13} \end{equation} -GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an -orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix} +GMRES uses the Arnoldi iterations~\cite{ch12:ref5}\index{iterative method!Arnoldi iterations} to construct an +orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_k$ of order $(k+1)\times k$: \begin{equation} \begin{array}{ll} @@ -311,16 +311,16 @@ $V$ to $m$ orthogonal vectors. \end{algorithm} Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts. -It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear +It solves the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. 
At each iteration -$k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from +$k$, GMRES uses the Arnoldi iterations\index{iterative method!Arnoldi iterations} (defined from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper -Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it +Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the -maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$) +maximum number of iterations\index{convergence!maximum number of iterations} ($maxiter$) is reached. @@ -329,11 +329,11 @@ is reached. %%--------------------------%% \section{Parallel implementation on a GPU cluster} \label{ch12:sec:03} -In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG} -and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on +In this section, we present the parallel algorithms of both iterative CG\index{iterative method!CG} +and GMRES\index{iterative method!GMRES} methods for GPU clusters. The implementation is performed on a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by one MPI (message passing interface) process and equipped with a GPU card. 
The parallelization of these algorithms is carried out by -using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the +using the MPI communication routines between the GPU computing nodes\index{computing node} and the CUDA (compute unified device architecture) programming environment inside each node. In what follows, the algorithms of the iterative methods are called iterative solvers. @@ -400,15 +400,15 @@ solve the least-squares problem, and a kernel to update the elements of the solu vector $x$. The least-squares problem in the GMRES method is solved by performing a QR factorization -on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and, +on the Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ with plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, solving the least-squares problem on the GPU is not efficient. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However, the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. Therefore, we develop an inexpensive kernel which must be executed by a single CUDA thread. -The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} -methods is the SpMV multiplication (sparse matrix-vector multiplication)\index{SpMV~multiplication}, +The most important operation in CG\index{iterative method!CG} and GMRES\index{iterative method!GMRES} +methods is the SpMV multiplication (sparse matrix-vector multiplication)\index{SpMV multiplication}, because it is often an expensive operation in terms of execution time and memory space. Moreover, it requires taking care of the storage format of the sparse matrix in the memory. 
Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix @@ -416,12 +416,12 @@ can cause a significant waste of memory space and execution time. In addition, t nature of the matrix often leads to irregular memory accesses to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve noncoalesced accesses to the global memory, which slows down its performances even more. One of the -most efficient compressed storage formats\index{Compressed~storage~format} of sparse -matrices on GPUs is the HYB (hybrid)\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}. +most efficient compressed storage formats\index{compressed storage format} of sparse +matrices on GPUs is the HYB (hybrid)\index{compressed storage format!HYB} format~\cite{ch12:ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores -a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL} +a typical number of nonzero values per row in ELL\index{compressed storage format!ELL} format and the remaining entries of exceptional rows in COO format. It combines the efficiency -of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO} +of ELL due to the regularity of its memory accesses and the flexibility of COO\index{compressed storage format!COO} which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8} developed by NVIDIA to implement the SpMV multiplication of CG and GMRES methods on GPUs. Moreover, to avoid the noncoalesced accesses to the high-latency global memory, we fill @@ -434,10 +434,10 @@ the elements of the iterate vector $x$ in the cached texture memory. 
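As an illustration of the HYB idea described above, here is a minimal sequential sketch in Python, not the NVIDIA kernel the chapter relies on. It assumes a small dense input matrix and a hand-picked ELL width; the names `to_hyb`, `hyb_spmv`, and `ell_width` are hypothetical and introduced only for this example. Each row keeps up to `ell_width` nonzeros in fixed-width ELL arrays, and the leftover entries of the exceptional rows go into a COO list.

```python
# Illustrative sketch only: a sequential HYB (ELL + COO) SpMV in pure Python.
# The names to_hyb, hyb_spmv, and ell_width are hypothetical; they are not
# taken from the chapter or from the NVIDIA kernel it cites.

def to_hyb(dense, ell_width):
    """Split a dense row-major matrix into an ELL part and a COO part.

    The first `ell_width` nonzeros of each row are stored in ELL
    (padded with column index -1); the remaining nonzeros go to COO.
    """
    ell_cols, ell_vals, coo = [], [], []
    for i, row in enumerate(dense):
        nonzeros = [(j, v) for j, v in enumerate(row) if v != 0]
        head, tail = nonzeros[:ell_width], nonzeros[ell_width:]
        pad = ell_width - len(head)
        ell_cols.append([j for j, _ in head] + [-1] * pad)
        ell_vals.append([v for _, v in head] + [0.0] * pad)
        coo.extend((i, j, v) for j, v in tail)
    return ell_cols, ell_vals, coo


def hyb_spmv(ell_cols, ell_vals, coo, x):
    """Compute y = A x from the ELL and COO parts of A."""
    y = [0.0] * len(ell_cols)
    # ELL part: every row has the same number of slots (regular accesses).
    for i, (cols, vals) in enumerate(zip(ell_cols, ell_vals)):
        for j, v in zip(cols, vals):
            if j >= 0:  # skip padding slots
                y[i] += v * x[j]
    # COO part: leftover entries of the exceptional rows.
    for i, j, v in coo:
        y[i] += v * x[j]
    return y


A = [
    [4.0, 0.0, 0.0, 1.0],
    [1.0, 3.0, 2.0, 5.0],  # exceptional row: more nonzeros than ell_width
    [0.0, 0.0, 2.0, 0.0],
    [0.0, 1.0, 0.0, 6.0],
]
x = [1.0, 2.0, 3.0, 4.0]
ell_cols, ell_vals, coo = to_hyb(A, ell_width=2)
y = hyb_spmv(ell_cols, ell_vals, coo, x)
```

In this sketch only the second row overflows into COO. A GPU implementation would typically store the ELL arrays column by column so that consecutive threads read consecutive addresses; the row-wise lists used here are a simplification for readability.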
\label{ch12:sec:03.03} All the computing nodes of the GPU cluster execute in parallel the same iterative solver (Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their -own portions of the sparse linear system\index{Sparse~linear~system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, +own portions of the sparse linear system\index{sparse linear system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, $0\leq i