X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/59263d81c5f07e8eff22c9091a0847e79b4fbf2c..d74981733767eca78dbbe5be810b9b2e239e8aee:/BookGPU/Chapters/chapter12/ch12.tex?ds=inline diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex index 7cd99f4..254c0cb 100755 --- a/BookGPU/Chapters/chapter12/ch12.tex +++ b/BookGPU/Chapters/chapter12/ch12.tex @@ -4,138 +4,155 @@ %% %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\chapterauthor{}{} +%\chapterauthor{}{} +\chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} + \chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters} +\label{ch12} %%--------------------------%% %% SECTION 1 %% %%--------------------------%% \section{Introduction} -\label{sec:01} -The sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or -the industrial processing of the complex or non-Newtonian fluids. Moreover, the resolution of these problems often involves the -solving of such linear systems which is considered as the most expensive process in terms of time execution and memory space. -Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing size. - -There are, in the jargon of numerical analysis, different methods of solving sparse linear systems that we can classify in two -classes: the direct and iterative methods. However, the iterative methods are often more suitable than their counterpart, direct -methods, for solving large sparse linear systems. Indeed, they are less memory consuming and easier to parallelize on parallel -computers than direct methods. 
Different computing platforms, sequential and parallel computers, are used for solving sparse -linear systems with iterative solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving these -linear systems, due to their computing power and their ability to compute faster than traditional CPUs. - -In Section~\ref{sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and -the generalized minimal residual method. In Section~\ref{sec:03}, we give the main key points of the parallel implementation of both -methods on a cluster of GPUs. Then, in Section~\ref{sec:04}, we present the experimental results obtained on a CPU cluster and on -a GPU cluster, for solving sparse linear systems associated to matrices of different structures. Finally, in Section~\ref{sec:05}, -we apply the hypergraph partitioning technique to reduce the total communication volume between the computing nodes and, thus, to -improve the execution times of the parallel algorithms of both iterative methods. +\label{ch12:sec:01} +Sparse linear systems are used to model many scientific and industrial problems, +such as the environmental simulations or the industrial processing of the complex or +non-Newtonian fluids. Moreover, the resolution of these problems often involves the +solving of such linear systems which are considered as the most expensive process in +terms of execution time and memory space. Therefore, solving sparse linear systems +must be as efficient as possible in order to deal with problems of ever increasing +size. + +There are, in the jargon of numerical analysis, different methods of solving sparse +linear systems that can be classified in two classes: the direct and iterative methods. +However, the iterative methods are often more suitable than their counterpart, direct +methods, to solve these systems. 
Indeed, they are less memory consuming and easier +to parallelize on parallel computers than direct methods. Different computing platforms, +sequential and parallel computers, are used to solve sparse linear systems with iterative +solutions. Nowadays, graphics processing units (GPUs) have become attractive to solve +these systems, due to their computing power and their ability to compute faster than +traditional CPUs. + +In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative +methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03}, +we give the main key points of the parallel implementation of both methods on a cluster of +GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a +CPU cluster and on a GPU cluster, to solve large sparse linear systems. %%--------------------------%% %% SECTION 2 %% %%--------------------------%% \section{Krylov iterative methods} -\label{sec:02} -Let us consider the following system of $n$ linear equations in $\mathbb{R}$: +\label{ch12:sec:02} +Let us consider the following system of $n$ linear equations\index{Sparse~linear~system} +in $\mathbb{R}$: \begin{equation} Ax=b, -\label{eq:01} +\label{ch12:eq:01} \end{equation} -where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ is the solution vector, -$b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a large integer number. - -The iterative methods for solving the large sparse linear system~(\ref{eq:01}) proceed by successive iterations of a same -block of elementary operations, during which an infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. 
-Indeed, from an initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate solution $x_k$ -which, gradually, converges to the exact solution $x^{*}$ as follows: +where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ +is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a +large integer number. + +The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01}) +proceed by successive iterations of a same block of elementary operations, during which an +infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. Indeed, from an +initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate +solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows: \begin{equation} x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b. -\label{eq:02} +\label{ch12:eq:02} \end{equation} -The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand and can be infinite. In -practice, an iterative method often finds an approximate solution $\tilde{x}$ after a fixed number of iterations and/or -when a given convergence criterion is satisfied as follows: +The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand +and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$ +after a fixed number of iterations and/or when a given convergence criterion\index{Convergence} +is satisfied as follows: \begin{equation} \|b-A\tilde{x}\| < \varepsilon, -\label{eq:03} +\label{ch12:eq:03} \end{equation} -where $\varepsilon<1$ is the required convergence tolerance threshold. - -Some of the most iterative methods that have proven their efficiency for solving large sparse linear systems are those -called \textit{Krylov sub-space methods}~\cite{ref1}. 
In the present chapter, we describe two Krylov methods which are -widely used: the conjugate gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the -Krylov sub-space methods are usually used with preconditioners that allow to improve their convergence. So, in what -follows, the CG and GMRES methods are used for solving the left-preconditioned sparse linear system: +where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}. + +Some of the iterative methods that have proven most efficient for solving large sparse +linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}. +In the present chapter, we describe two Krylov methods which are widely used: the conjugate +gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the +Krylov subspace methods are usually used with preconditioners that improve their +convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} +sparse linear system: \begin{equation} M^{-1}Ax=M^{-1}b, -\label{eq:11} +\label{ch12:eq:11} \end{equation} where $M$ is the preconditioning matrix. + %%****************%% %%****************%% \subsection{CG method} -\label{sec:02.01} -The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ref2}. It is one of the well -known iterative method for solving large sparse linear systems. In addition, it can be adapted for solving nonlinear -equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices.
- -The main idea of the CG method is the computation of a sequence of approximate solutions $\{x_k\}_{k\geq 0}$ in a Krylov -sub-space of order $k$ as follows: +\label{ch12:sec:02.01} +The conjugate gradient method was initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}. +It is one of the well known iterative method to solve large sparse linear systems. In addition, it +can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied +to problems with positive definite symmetric matrices. + +The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate +solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as +follows: \begin{equation} x_k \in x_0 + \mathcal{K}_k(A,r_0), -\label{eq:04} +\label{ch12:eq:04} \end{equation} -such that the Galerkin condition must be satisfied: +such that the Galerkin condition\index{Galerkin~condition} must be satisfied: \begin{equation} r_k \bot \mathcal{K}_k(A,r_0), -\label{eq:05} +\label{ch12:eq:05} \end{equation} -where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$ the Krylov -sub-space of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\] +where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$ +the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\] In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$ which are pairwise $A$-conjugate ($A$-orthogonal): \begin{equation} \begin{array}{ll} p_i^T A p_j = 0, & i\neq j. 
\end{array} -\label{eq:06} +\label{ch12:eq:06} \end{equation} At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows: \begin{equation} \begin{array}{ll} x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}. \end{array} -\label{eq:07} +\label{ch12:eq:07} \end{equation} Consequently, the residuals $r_k$ are computed in the same way: \begin{equation} r_k = r_{k-1} - \alpha_k A p_k. -\label{eq:08} +\label{ch12:eq:08} \end{equation} -In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that the following recurrence -holds: +In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that +the following recurrence holds: \begin{equation} \begin{array}{lll} p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}. \end{array} -\label{eq:09} +\label{ch12:eq:09} \end{equation} -Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ over the Krylov -sub-space $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure that the direction vectors are -pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and the recurrences~(\ref{eq:08}) and~(\ref{eq:09}) -allow to deduce that: +Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ +over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure +that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and +the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that: \begin{equation} \begin{array}{ll} \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}. 
\end{array} -\label{eq:10} +\label{ch12:eq:10} \end{equation} \begin{algorithm}[!t] - %\SetLine - %\linesnumbered Choose an initial guess $x_0$\; $r_{0} = b - A x_{0}$\; $convergence$ = false\; @@ -160,62 +177,70 @@ allow to deduce that: } } \caption{Left-preconditioned CG method} -\label{alg:01} +\label{ch12:alg:01} \end{algorithm} -Algorithm~\ref{alg:01} shows the main key points of the preconditioned CG method. It allows to solve the left-preconditioned -sparse linear system~(\ref{eq:11}). In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum -number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. At every iteration, a direction -vector $p_k$ is determined, so that it is orthogonal to the preconditioned residual $z_k$ and to the direction vectors $\{p_i\}_{i0}$ in a Krylov sub-space $\mathcal{K}_k$ as follows: +\label{ch12:sec:02.02} +The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization +of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can +be applied for solving symmetric or nonsymmetric linear systems. + +The main principle of the GMRES method\index{Iterative~method!GMRES} is to find an approximation minimizing +at best the residual norm. In fact, GMRES computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in +a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows: \begin{equation} \begin{array}{ll} x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2}, \end{array} -\label{eq:12} +\label{ch12:eq:12} \end{equation} -so that the Petrov-Galerkin condition is satisfied: +so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied: \begin{equation} \begin{array}{ll} r_k \bot A \mathcal{K}_k(A, v_1). 
\end{array} -\label{eq:13} +\label{ch12:eq:13} \end{equation} -GMRES uses the Arnoldi process~\cite{ref5} to construct an orthonormal basis $V_k$ for the Krylov sub-space $\mathcal{K}_k$ -and an upper Hessenberg matrix $\bar{H}_k$ of order $(k+1)\times k$: +GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an +orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix} +$\bar{H}_k$ of order $(k+1)\times k$: \begin{equation} \begin{array}{ll} V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1, \end{array} -\label{eq:14} +\label{ch12:eq:14} \end{equation} and \begin{equation} -V_k A = V_{k+1} \bar{H}_k. -\label{eq:15} +A V_k = V_{k+1} \bar{H}_k. +\label{ch12:eq:15} \end{equation} -Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_k$ spanned by $V_k$ -as follows: +Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov subspace $\mathcal{K}_k$ +spanned by $V_k$ as follows: \begin{equation} \begin{array}{ll} x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}. \end{array} -\label{eq:16} +\label{ch12:eq:16} \end{equation} -From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can deduce that: +From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that: \begin{equation} \begin{array}{lll} r_{k} & = & b - A (x_{0} + V_{k}y) \\ @@ -223,34 +248,34 @@ From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can dedu & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\ & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y), \end{array} -\label{eq:17} +\label{ch12:eq:17} \end{equation} -such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of $\mathbb{R}^k$. So, -the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean norm of the residual $r_k$. 
Consequently, -a linear least-squares problem of size $k$ is solved: +such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of +$\mathbb{R}^{k+1}$. So, the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize the Euclidean +norm of the residual $r_k$. Consequently, a linear least-squares problem of size $k$ is solved: \begin{equation} \underset{y\in\mathbb{R}^{k}}{\min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{\min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}. -\label{eq:18} +\label{ch12:eq:18} \end{equation} -The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using Givens rotations~\cite{ref1,ref3}, -such that: +The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using +Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that: \begin{equation} \begin{array}{lll} \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k}, \end{array} -\label{eq:19} +\label{ch12:eq:19} \end{equation} where $Q_kQ_k^T=I_{k+1}$ and $R_k$ is an upper triangular matrix. -The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$ iterations ($n$ is the size of the -sparse linear system to be solved). However, the GMRES algorithm must construct and store in the memory an orthonormal basis $V_k$ whose -size is proportional to the number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the GMRES -method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and with $x_m$ as the initial guess to the -next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal vectors. +The GMRES method computes an approximate solution with sufficient precision after, at most, $n$ +iterations ($n$ is the size of the sparse linear system to be solved).
However, the GMRES algorithm +must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the +number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the +GMRES method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and +with $x_m$ as the initial guess to the next iteration. This allows to limit the size of the basis +$V$ to $m$ orthogonal vectors. \begin{algorithm}[!t] - %\SetLine - %\linesnumbered Choose an initial guess $x_0$\; $convergence$ = false\; $k = 1$\; @@ -281,190 +306,237 @@ next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal } } \caption{Left-preconditioned GMRES method with restarts} -\label{alg:02} +\label{ch12:alg:02} \end{algorithm} -Algorithm~\ref{alg:02} shows the main key points of the GMRES method with restarts. It solves the left-preconditioned sparse linear -system~(\ref{eq:11}), such that $M$ is the preconditioning matrix. At each iteration $k$, GMRES uses the Arnoldi process (defined -from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper Hessenberg matrix $\bar{H}_m$ of size -$(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ which minimizes -at best the residual norm (line~$18$). Finally, it computes an approximate solution $x_m$ in the Krylov sub-space spanned by $V_m$ -(line~$19$). The GMRES algorithm is stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the maximum -number of iterations ($maxiter$) is reached. +Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts. +It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear +system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. 
At each iteration +$k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from +line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper +Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it +solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ +which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate +solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is +stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the +maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$) +is reached. + %%--------------------------%% %% SECTION 3 %% %%--------------------------%% \section{Parallel implementation on a GPU cluster} -\label{sec:03} -In this section, we present the parallel algorithms of both iterative CG and GMRES methods for GPU clusters. -The implementation is performed on a GPU cluster composed of different computing nodes, such that each node -is a CPU core managed by a MPI process and equipped with a GPU card. The parallelization of these algorithms -is carried out by using the MPI communication routines between the GPU computing nodes and the CUDA programming -environment inside each node. In what follows, the algorithms of the iterative methods are called iterative -solvers. +\label{ch12:sec:03} +In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG} +and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on +a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by a +MPI process and equipped with a GPU card. 
The parallelization of these algorithms is carried out by +using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the +CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods +are called iterative solvers. + %%****************%% %%****************%% \subsection{Data partitioning} -\label{sec:03.01} -The parallel solving of the large sparse linear system~(\ref{eq:11}) requires a data partitioning between the computing -nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the GPU cluster. The partitioning operation -consists in the decomposition of the vectors and matrices, involved in the iterative solver, in $p$ portions. Indeed, this -operation allows to assign to each computing node $i$: +\label{ch12:sec:03.01} +The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning +between the computing nodes of the GPU cluster. Let $p$ denote the number of computing nodes on the +GPU cluster. The partitioning operation consists in decomposing the vectors and matrices involved +in the iterative solver into $p$ portions. Indeed, this operation assigns to each computing node +$i$: \begin{itemize} \item a portion of size $\frac{n}{p}$ elements of each vector, \item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and, \item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, \end{itemize} -where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive row-wise partitioning -(decomposition row-by-row) on the data of the sparse linear systems to be solved. Figure~\ref{fig:01} shows an example of a row-wise -data partitioning between four computing nodes of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand -side $b$) of size $16$ unknown values.
+where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive +row-wise partitioning (row-by-row decomposition) on the data of the sparse linear systems to be solved. +Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes +of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$ +unknown values. \begin{figure} \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}} \caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.} -\label{fig:01} +\label{ch12:fig:01} \end{figure} + %%****************%% %%****************%% \subsection{GPU computing} -\label{sec:03.02} -After the partitioning operation, all the data involved from this operation must be transferred from the CPU memories to the GPU -memories, in order to be processed by GPUs. We use two functions of the CUBLAS library (CUDA Basic Linear Algebra Subroutines), -developed by Nvidia~\cite{ref6}: \verb+cublasAlloc()+ for the memory allocations on GPUs and \verb+cublasSetVector()+ for the -memory copies from the CPUs to the GPUs. - -An efficient implementation of CG and GMRES solvers on a GPU cluster requires to determine all parts of their codes that can be -executed in parallel and, thus, take advantage of the GPU acceleration. As many Krylov sub-space methods, the CG and GMRES methods -are mainly based on arithmetic operations dealing with vectors or matrices: sparse matrix-vector multiplications, scalar-vector -multiplications, dot products, Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a -scalar) and so on. These vector operations are often easy to parallelize and they are more efficient on parallel computers when -they work on large vectors. 
Therefore, all the vector operations used in CG and GMRES solvers must be executed by the GPUs as kernels. - -We use the kernels of the CUBLAS library to compute some vector operations of CG and GMRES solvers. The following kernels of CUBLAS -(dealing with double floating point) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the Euclidean -norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of the data-parallel operations, we code their kernels in CUDA. -In the CG solver, we develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in Algorithm~\ref{alg:01}. In the -GMRES solver, we program a kernel for the scalar-vector multiplication (lines~$7$ and~$15$ in Algorithm~\ref{alg:02}), a kernel for -solving the least-squares problem and a kernel for the elements updates of the solution vector $x$. - -The least-squares problem in the GMRES method is solved by performing a QR factorization on the Hessenberg matrix $\bar{H}_m$ with -plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, solving the least-squares -problem on the GPU is not interesting. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However, -the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. Therefore, we develop -an inexpensive kernel which must be executed in sequential by a single CUDA thread. - -The most important operation in CG and GMRES methods is the sparse matrix-vector multiplication (SpMV), because it is often an -expensive operation in terms of execution time and memory space. Moreover, it requires to take care of the storage format of the -sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix can cause a significant -waste of memory space and execution time. 
In addition, the sparsity nature of the matrix often leads to irregular memory accesses -to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced accesses to -the global memory, which slows down even more its performances. One of the most efficient compressed storage formats of sparse -matrices on GPUs is HYB format~\cite{ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores -a typical number of nonzero values per row in ELL format and remaining entries of exceptional rows in COO format. It combines -the efficiency of ELL due to the regularity of its memory accesses and the flexibility of COO which is insensitive to the matrix -structure. Consequently, we use the HYB kernel~\cite{ref8} developed by Nvidia to implement the SpMV multiplication of CG and -GMRES methods on GPUs. Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill the elements of -the iterate vector $x$ in the cached texture memory. +\label{ch12:sec:03.02} +After the partitioning operation, all the data involved from this operation must be +transferred from the CPU memories to the GPU memories, in order to be processed by +GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear +Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+ +for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory +copies from the CPUs to the GPUs. + +An efficient implementation of CG and GMRES solvers on a GPU cluster requires to +determine all parts of their codes that can be executed in parallel and, thus, take +advantage of the GPU acceleration. 
Like many Krylov subspace methods, the CG and GMRES +methods are mainly based on arithmetic operations dealing with vectors or matrices: +sparse matrix-vector multiplications, scalar-vector multiplications, dot products, +Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors +and $a$ is a scalar) and so on. These vector operations are often easy to parallelize +and they are more efficient on parallel computers when they work on large vectors. +Therefore, all the vector operations used in CG and GMRES solvers must be executed +by the GPUs as kernels. + +We use the kernels of the CUBLAS library to compute some vector operations of CG and +GMRES solvers. The following kernels of CUBLAS (operating on double-precision floating-point data) +are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the +Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of +the data-parallel operations, we code their kernels in CUDA. In the CG solver, we +develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in +Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector +multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel to +solve the least-squares problem and a kernel to update the elements of the solution +vector $x$. + +The least-squares problem in the GMRES method is solved by performing a QR factorization +on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and, +then, solving the triangular system by backward substitutions to compute $y$. Consequently, +solving the least-squares problem on the GPU is not worthwhile. Indeed, the triangular +solves are not easy to parallelize and are inefficient on GPUs. However, the least-squares +problem to solve in the GMRES method with restarts has, generally, a very small size $m$.
+Therefore, we develop an inexpensive kernel which must be executed sequentially by a +single CUDA thread. + +The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} +methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication}, +because it is often an expensive operation in terms of execution time and memory space. +Moreover, it requires a careful choice of the storage format of the sparse matrix in +memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix +can cause a significant waste of memory space and execution time. In addition, the sparse +nature of the matrix often leads to irregular memory accesses to read the matrix nonzero +values. So, the computation of the SpMV multiplication on GPUs can involve non-coalesced +accesses to the global memory, which degrades its performance even further. One of the +most efficient compressed storage formats\index{Compressed~storage~format} of sparse +matrices on GPUs is the HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}. +It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores +a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL} +format and remaining entries of exceptional rows in COO format. It combines the efficiency +of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO} +which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8} +developed by Nvidia to implement the SpMV multiplication of CG and GMRES methods on GPUs. +Moreover, to avoid the non-coalesced accesses to the high-latency global memory, we place +the elements of the iterate vector $x$ in the cached texture memory.
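As a rough illustration of the HYB layout described in this subsection, the following sequential C sketch computes $y \leftarrow Ax$ from an ELL part and a COO part. It is not the Nvidia HYB kernel cited in the text and it is not CUDA code. The names `hyb_matrix` and `hyb_spmv`, the column-major ELL arrays and the use of `-1` as a padding column index are all assumptions made for this example only.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical HYB container (illustrative names, not a library API). */
typedef struct {
    size_t n_rows;          /* number of matrix rows                         */
    size_t ell_width;       /* "typical" number of nonzeros kept per row     */
    const int *ell_col;     /* n_rows * ell_width column indices, stored     */
                            /* column-major; -1 marks a padding slot         */
    const double *ell_val;  /* values matching ell_col                       */
    size_t coo_nnz;         /* number of overflow entries held in COO        */
    const int *coo_row;     /* row index of each COO entry                   */
    const int *coo_col;     /* column index of each COO entry                */
    const double *coo_val;  /* value of each COO entry                       */
} hyb_matrix;

/* y <- A * x for a matrix stored as ELL + COO. */
void hyb_spmv(const hyb_matrix *A, const double *x, double *y)
{
    assert(A != NULL && x != NULL && y != NULL);

    /* ELL part: every row has the same number of slots, so accesses are
       regular. In the column-major layout, slot k of row i is at k*n_rows+i,
       which is what lets consecutive GPU threads read consecutive addresses. */
    for (size_t i = 0; i < A->n_rows; ++i) {
        double sum = 0.0;
        for (size_t k = 0; k < A->ell_width; ++k) {
            size_t idx = k * A->n_rows + i;
            int col = A->ell_col[idx];
            if (col >= 0) {              /* skip padding slots */
                sum += A->ell_val[idx] * x[col];
            }
        }
        y[i] = sum;
    }

    /* COO part: the remaining entries of the exceptional rows. */
    for (size_t e = 0; e < A->coo_nnz; ++e) {
        y[A->coo_row[e]] += A->coo_val[e] * x[A->coo_col[e]];
    }
}
```

For instance, a $4\times 4$ tridiagonal matrix stored with `ell_width = 2` keeps two entries of every row in ELL, while the third entry of each of the two interior rows goes to COO.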
+ %%****************%% %%****************%% \subsection{Data communications} -\label{sec:03.03} -All the computing nodes of the GPU cluster execute in parallel the same iterative solver (Algorithm~\ref{alg:01} or Algorithm~\ref{alg:02}) -adapted to GPUs, but on their own portions of the sparse linear system: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, $0\leq i