X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/59263d81c5f07e8eff22c9091a0847e79b4fbf2c..b4a21f0b9226126a2c50f54a5518be5ef7c60749:/BookGPU/Chapters/chapter12/ch12.tex?ds=inline diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex index 7cd99f4..4fe0eb9 100755 --- a/BookGPU/Chapters/chapter12/ch12.tex +++ b/BookGPU/Chapters/chapter12/ch12.tex @@ -4,138 +4,155 @@ %% %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\chapterauthor{}{} -\chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters} +%\chapterauthor{}{} +\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier, and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} + +\chapter[Solving linear systems with GMRES and CG methods on GPU clusters]{Solving sparse linear systems with GMRES and CG methods on GPU clusters} +\label{ch12} %%--------------------------%% %% SECTION 1 %% %%--------------------------%% \section{Introduction} -\label{sec:01} -The sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or -the industrial processing of the complex or non-Newtonian fluids. Moreover, the resolution of these problems often involves the -solving of such linear systems which is considered as the most expensive process in terms of time execution and memory space. -Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing size. - -There are, in the jargon of numerical analysis, different methods of solving sparse linear systems that we can classify in two -classes: the direct and iterative methods. However, the iterative methods are often more suitable than their counterpart, direct -methods, for solving large sparse linear systems. Indeed, they are less memory consuming and easier to parallelize on parallel -computers than direct methods. Different computing platforms, sequential and parallel computers, are used for solving sparse -linear systems with iterative solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving these -linear systems, due to their computing power and their ability to compute faster than traditional CPUs. - -In Section~\ref{sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and -the generalized minimal residual method. In Section~\ref{sec:03}, we give the main key points of the parallel implementation of both -methods on a cluster of GPUs. Then, in Section~\ref{sec:04}, we present the experimental results obtained on a CPU cluster and on -a GPU cluster, for solving sparse linear systems associated to matrices of different structures. Finally, in Section~\ref{sec:05}, -we apply the hypergraph partitioning technique to reduce the total communication volume between the computing nodes and, thus, to -improve the execution times of the parallel algorithms of both iterative methods. +\label{ch12:sec:01} +Sparse linear systems are used to model many scientific and industrial problems, +such as the environmental simulations or the industrial processing of the complex or +non-Newtonian fluids. 
Moreover, the resolution of these problems often involves the +solving of such linear systems that are considered the most expensive process in +terms of execution time and memory space. Therefore, solving sparse linear systems +must be as efficient as possible in order to deal with problems of ever increasing +size. + +There are, in the jargon of numerical analysis, different methods of solving sparse +linear systems that can be classified in two classes: direct and iterative methods. +However, the iterative methods are often more suitable than their counterparts, direct +methods, to solve these systems. Indeed, they are less memory-consuming and easier +to parallelize on parallel computers than direct methods. Different computing platforms, +sequential and parallel computers, are used to solve sparse linear systems with iterative +solutions. Nowadays, graphics processing units (GPUs) have become attractive to solve +these systems, due to their computing power and their ability to compute faster than +traditional CPUs. + +In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative +methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03}, +we give the main key points of the parallel implementation of both methods on a cluster of +GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results, obtained on a +CPU cluster and on a GPU cluster of solving large sparse linear systems. %%--------------------------%% %% SECTION 2 %% %%--------------------------%% \section{Krylov iterative methods} -\label{sec:02} -Let us consider the following system of $n$ linear equations in $\mathbb{R}$: +\label{ch12:sec:02} +Let us consider the following system of $n$ linear equations\index{sparse linear system} +in $\mathbb{R}$: \begin{equation} Ax=b, -\label{eq:01} +\label{ch12:eq:01} \end{equation} -where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ is the solution vector, -$b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a large integer number. - -The iterative methods for solving the large sparse linear system~(\ref{eq:01}) proceed by successive iterations of a same -block of elementary operations, during which an infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. -Indeed, from an initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate solution $x_k$ -which, gradually, converges to the exact solution $x^{*}$ as follows: +where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ +is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side, and $n\in\mathbb{N}$ is a +large integer number. + +The iterative methods\index{iterative method} for solving the large sparse linear system~(\ref{ch12:eq:01}) +proceed by successive iterations of a same block of elementary operations, during which an +infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ is computed. Indeed, from an +initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate +solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows: \begin{equation} x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b. -\label{eq:02} +\label{ch12:eq:02} \end{equation} -The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand and can be infinite. 
In -practice, an iterative method often finds an approximate solution $\tilde{x}$ after a fixed number of iterations and/or -when a given convergence criterion is satisfied as follows: +The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand +and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$ +after a fixed number of iterations and/or when a given convergence criterion\index{convergence} +is satisfied as follows: \begin{equation} \|b-A\tilde{x}\| < \varepsilon, -\label{eq:03} +\label{ch12:eq:03} \end{equation} -where $\varepsilon<1$ is the required convergence tolerance threshold. - -Some of the most iterative methods that have proven their efficiency for solving large sparse linear systems are those -called \textit{Krylov sub-space methods}~\cite{ref1}. In the present chapter, we describe two Krylov methods which are -widely used: the conjugate gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the -Krylov sub-space methods are usually used with preconditioners that allow to improve their convergence. So, in what -follows, the CG and GMRES methods are used for solving the left-preconditioned sparse linear system: +where $\varepsilon<1$ is the required convergence tolerance threshold\index{convergence!tolerance threshold}. + +Some of the most iterative methods that have proven their efficiency for solving large sparse +linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{iterative method!Krylov subspace}. +In the present chapter, we describe two Krylov methods which are widely used: the CG method (conjugate +gradient method) and the GMRES method (generalized minimal residual method). In practice, the +Krylov subspace methods are usually used with preconditioners that allow the improvement of their +convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{sparse linear system!preconditioned} +sparse linear system: \begin{equation} M^{-1}Ax=M^{-1}b, -\label{eq:11} +\label{ch12:eq:11} \end{equation} where $M$ is the preconditioning matrix. + %%****************%% %%****************%% \subsection{CG method} -\label{sec:02.01} -The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ref2}. It is one of the well -known iterative method for solving large sparse linear systems. In addition, it can be adapted for solving nonlinear -equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices. - -The main idea of the CG method is the computation of a sequence of approximate solutions $\{x_k\}_{k\geq 0}$ in a Krylov -sub-space of order $k$ as follows: +\label{ch12:sec:02.01} +The conjugate gradient method was initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}. +It is one of the well-known iterative methods to solve large sparse linear systems. In addition, it +can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied +to problems with positive definite symmetric matrices. 
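In Algorithms~\ref{ch12:alg:01} and~\ref{ch12:alg:02} below, the preconditioner of system~(\ref{ch12:eq:11}) only appears through products with $M^{-1}$. As a minimal illustration (a sketch, not the code of this chapter), assuming the preconditioner $M$ is chosen as the main diagonal of the sparse matrix $A$, as is done for the experiments of Section~\ref{ch12:sec:04}, applying $z=M^{-1}r$ reduces to an elementwise division by the diagonal of $A$:
\begin{verbatim}
/* Illustrative sketch only: applying the left preconditioner of system (11)
 * when M = diag(A). Solving M z = r amounts to dividing the residual r by
 * the diagonal entries of A. */
void apply_diag_prec(int n, const double *diagA, const double *r, double *z)
{
  int i;
  for (i = 0; i < n; i++)
    z[i] = r[i] / diagA[i];   /* z = M^{-1} r */
}
\end{verbatim}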
+ +The main idea of the CG method\index{iterative method!CG} is the computation of a sequence of approximate +solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{iterative method!Krylov~subspace} of order $k$ as +follows: \begin{equation} x_k \in x_0 + \mathcal{K}_k(A,r_0), -\label{eq:04} +\label{ch12:eq:04} \end{equation} -such that the Galerkin condition must be satisfied: +such that the Galerkin condition\index{Galerkin condition} must be satisfied: \begin{equation} r_k \bot \mathcal{K}_k(A,r_0), -\label{eq:05} +\label{ch12:eq:05} \end{equation} -where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$ the Krylov -sub-space of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\] +where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$, and $\mathcal{K}_k$ +the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\] In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$ which are pairwise $A$-conjugate ($A$-orthogonal): \begin{equation} \begin{array}{ll} p_i^T A p_j = 0, & i\neq j. \end{array} -\label{eq:06} +\label{ch12:eq:06} \end{equation} At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows: \begin{equation} \begin{array}{ll} x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}. \end{array} -\label{eq:07} +\label{ch12:eq:07} \end{equation} Consequently, the residuals $r_k$ are computed in the same way: \begin{equation} r_k = r_{k-1} - \alpha_k A p_k. -\label{eq:08} +\label{ch12:eq:08} \end{equation} -In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that the following recurrence -holds: +In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that +the following recurrence holds: \begin{equation} \begin{array}{lll} p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}. \end{array} -\label{eq:09} +\label{ch12:eq:09} \end{equation} -Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ over the Krylov -sub-space $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure that the direction vectors are -pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and the recurrences~(\ref{eq:08}) and~(\ref{eq:09}) -allow to deduce that: +Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ +over the Krylov subspace $\mathcal{K}_{k}$, and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure +that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and +the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow the deduction that: \begin{equation} \begin{array}{ll} \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}. 
\end{array} -\label{eq:10} +\label{ch12:eq:10} \end{equation} \begin{algorithm}[!t] - %\SetLine - %\linesnumbered Choose an initial guess $x_0$\; $r_{0} = b - A x_{0}$\; $convergence$ = false\; @@ -159,63 +176,71 @@ allow to deduce that: $k = k + 1$\; } } -\caption{Left-preconditioned CG method} -\label{alg:01} +\caption{left-preconditioned CG method} +\label{ch12:alg:01} \end{algorithm} -Algorithm~\ref{alg:01} shows the main key points of the preconditioned CG method. It allows to solve the left-preconditioned -sparse linear system~(\ref{eq:11}). In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum -number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. At every iteration, a direction -vector $p_k$ is determined, so that it is orthogonal to the preconditioned residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$ -previously determined (from line~$8$ to line~$13$). Then, at lines~$16$ and~$17$ , the iterate $x_k$ and the residual $r_k$ are computed -using formulas~(\ref{eq:07}) and~(\ref{eq:08}), respectively. The CG method converges after, at most, $n$ iterations. In practice, the CG -algorithm stops when the tolerance threshold $\varepsilon$ and/or the maximum number of iterations $maxiter$ are reached. +Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows +the solving the left-preconditioned\index{sparse linear system!preconditioned} sparse linear system~(\ref{ch12:eq:11}). +In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum +number of iterations, and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. +At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned +residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$ previously determined (from line~$8$ to +line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using +formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. The CG method converges after, at +most, $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{convergence!tolerance threshold} +$\varepsilon$ and/or the maximum number of iterations\index{convergence!maximum number of iterations} +$maxiter$ is reached. + %%****************%% %%****************%% \subsection{GMRES method} -\label{sec:02.02} -The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ref3} as a generalization of the minimum residual method -MINRES~\cite{ref4}. Indeed, GMRES can be applied for solving symmetric or nonsymmetric linear systems. - -The main principle of the GMRES method is to find an approximation minimizing at best the residual norm. In fact, GMRES -computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in a Krylov sub-space $\mathcal{K}_k$ as follows: +\label{ch12:sec:02.02} +The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization +of the minimum residual method MINRES~\cite{ch12:ref4}\index{iterative method!MINRES}. Indeed, GMRES can +be applied for solving symmetric or nonsymmetric linear systems. + +The main principle of the GMRES method\index{iterative method!GMRES} is to find an approximation minimizing +at best the residual norm. 
In fact, GMRES computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in +a Krylov subspace\index{iterative method!Krylov subspace} $\mathcal{K}_k$ as follows: \begin{equation} \begin{array}{ll} x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2}, \end{array} -\label{eq:12} +\label{ch12:eq:12} \end{equation} -so that the Petrov-Galerkin condition is satisfied: +so that the Petrov-Galerkin condition\index{Petrov-Galerkin condition} is satisfied: \begin{equation} \begin{array}{ll} r_k \bot A \mathcal{K}_k(A, v_1). \end{array} -\label{eq:13} +\label{ch12:eq:13} \end{equation} -GMRES uses the Arnoldi process~\cite{ref5} to construct an orthonormal basis $V_k$ for the Krylov sub-space $\mathcal{K}_k$ -and an upper Hessenberg matrix $\bar{H}_k$ of order $(k+1)\times k$: +GMRES uses the Arnoldi iterations~\cite{ch12:ref5}\index{iterative method!Arnoldi iterations} to construct an +orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg matrix} +$\bar{H}_k$ of order $(k+1)\times k$: \begin{equation} \begin{array}{ll} V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1, \end{array} -\label{eq:14} +\label{ch12:eq:14} \end{equation} and \begin{equation} -V_k A = V_{k+1} \bar{H}_k. -\label{eq:15} +A V_k = V_{k+1} \bar{H}_k. +\label{ch12:eq:15} \end{equation} -Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_k$ spanned by $V_k$ -as follows: +Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov subspace $\mathcal{K}_k$ +spanned by $V_k$ as follows: \begin{equation} \begin{array}{ll} x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}. \end{array} -\label{eq:16} +\label{ch12:eq:16} \end{equation} -From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can deduce that: +From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that \begin{equation} \begin{array}{lll} r_{k} & = & b - A (x_{0} + V_{k}y) \\ @@ -223,34 +248,35 @@ From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can dedu & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\ & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y), \end{array} -\label{eq:17} +\label{ch12:eq:17} \end{equation} -such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of $\mathbb{R}^k$. So, -the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean norm of the residual $r_k$. Consequently, -a linear least-squares problem of size $k$ is solved: +such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of +$\mathbb{R}^k$. So, the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean +norm of the residual $r_k$. Consequently, a linear least-squares problem of size $k$ is solved: \begin{equation} \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}. 
-\label{eq:18} +\label{ch12:eq:18} \end{equation} -The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using Givens rotations~\cite{ref1,ref3}, -such that: +The QR factorization of matrix $\bar{H}_k$ is used (the decomposition of the matrix $\bar{H}$ into $Q$ and $R$ matrices) +to compute the solution of this problem by using +Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that \begin{equation} \begin{array}{lll} \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k}, \end{array} -\label{eq:19} +\label{ch12:eq:19} \end{equation} -where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix. +where $Q_k$ is an orthogonal matrix and $R_k$ is an upper triangular matrix. -The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$ iterations ($n$ is the size of the -sparse linear system to be solved). However, the GMRES algorithm must construct and store in the memory an orthonormal basis $V_k$ whose -size is proportional to the number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the GMRES -method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and with $x_m$ as the initial guess to the -next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal vectors. +The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$ +iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm +must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the +number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the +GMRES method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and +with $x_m$ as the initial guess to the next iteration. This allows the limitation of the size of the basis +$V$ to $m$ orthogonal vectors. \begin{algorithm}[!t] - %\SetLine - %\linesnumbered Choose an initial guess $x_0$\; $convergence$ = false\; $k = 1$\; @@ -267,7 +293,7 @@ next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal $h_{j+1,j} = \|w_{j}\|_{2}$\; $v_{j+1} = w_{j}/h_{j+1,j}$\; } - Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ a $(m+1)\times m$ upper Hessenberg matrix\; + Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ is an upper Hessenberg matrix of size $(m+1)\times m$\; Solve a least-squares problem of size $m$: $min_{y\in\mathrm{I\!R}^{m}}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\; $x_{m} = x_{0}+V_{m}y_{m}$\; $r_{m} = M^{-1}(b-Ax_{m})$\; @@ -280,198 +306,253 @@ next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal $k = k + 1$\; } } -\caption{Left-preconditioned GMRES method with restarts} -\label{alg:02} +\caption{left-preconditioned GMRES method with restarts} +\label{ch12:alg:02} \end{algorithm} -Algorithm~\ref{alg:02} shows the main key points of the GMRES method with restarts. It solves the left-preconditioned sparse linear -system~(\ref{eq:11}), such that $M$ is the preconditioning matrix. At each iteration $k$, GMRES uses the Arnoldi process (defined -from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper Hessenberg matrix $\bar{H}_m$ of size -$(m+1)\times m$. 
Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ which minimizes -at best the residual norm (line~$18$). Finally, it computes an approximate solution $x_m$ in the Krylov sub-space spanned by $V_m$ -(line~$19$). The GMRES algorithm is stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the maximum -number of iterations ($maxiter$) is reached. +Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts. +It solves the left-preconditioned\index{sparse linear system!preconditioned} sparse linear +system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration +$k$, GMRES uses the Arnoldi iterations\index{iterative method!Arnoldi iterations} (defined from +line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper +Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it +solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ +which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate +solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is +stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the +maximum number of iterations\index{convergence!maximum number of iterations} ($maxiter$) +is reached. + %%--------------------------%% %% SECTION 3 %% %%--------------------------%% \section{Parallel implementation on a GPU cluster} -\label{sec:03} -In this section, we present the parallel algorithms of both iterative CG and GMRES methods for GPU clusters. -The implementation is performed on a GPU cluster composed of different computing nodes, such that each node -is a CPU core managed by a MPI process and equipped with a GPU card. The parallelization of these algorithms -is carried out by using the MPI communication routines between the GPU computing nodes and the CUDA programming -environment inside each node. In what follows, the algorithms of the iterative methods are called iterative -solvers. +\label{ch12:sec:03} +In this section, we present the parallel algorithms of both iterative CG\index{iterative method!CG} +and GMRES\index{iterative method!GMRES} methods for GPU clusters. The implementation is performed on +a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by one +MPI (message passing interface) process and equipped with a GPU card. The parallelization of these algorithms is carried out by +using the MPI communication routines between the GPU computing nodes\index{computing node} and the +CUDA (compute unified device architecture) programming environment inside each node. In what follows, the algorithms of the iterative methods +are called iterative solvers. + %%****************%% %%****************%% \subsection{Data partitioning} -\label{sec:03.01} -The parallel solving of the large sparse linear system~(\ref{eq:11}) requires a data partitioning between the computing -nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the GPU cluster. The partitioning operation -consists in the decomposition of the vectors and matrices, involved in the iterative solver, in $p$ portions. 
Indeed, this -operation allows to assign to each computing node $i$: +\label{ch12:sec:03.01} +The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning +between the computing nodes of the GPU cluster. Let $p$ denote the number of the computing nodes on the +GPU cluster. The partitioning operation consists of the decomposition of the vectors and matrices, involved +in the iterative solver, in $p$ portions. Indeed, this operation allows the assignment to each computing node +$i$: \begin{itemize} \item a portion of size $\frac{n}{p}$ elements of each vector, -\item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and, -\item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, +\item a sparse rectangular submatrix $A_i$ of size $(\frac{n}{p},n)$, and +\item a square preconditioning submatrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, \end{itemize} -where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive row-wise partitioning -(decomposition row-by-row) on the data of the sparse linear systems to be solved. Figure~\ref{fig:01} shows an example of a row-wise -data partitioning between four computing nodes of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand -side $b$) of size $16$ unknown values. +where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive +row-wise partitioning (row-by-row decomposition) on the data of the sparse linear systems to be solved. +Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes +of a sparse linear system (sparse matrix $A$, solution vector $x$, and right-hand side $b$) of size $16$ +unknown values. \begin{figure} \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}} -\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.} -\label{fig:01} +\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$, and the right-hand side $b$ into four portions.} +\label{ch12:fig:01} \end{figure} + %%****************%% %%****************%% \subsection{GPU computing} -\label{sec:03.02} -After the partitioning operation, all the data involved from this operation must be transferred from the CPU memories to the GPU -memories, in order to be processed by GPUs. We use two functions of the CUBLAS library (CUDA Basic Linear Algebra Subroutines), -developed by Nvidia~\cite{ref6}: \verb+cublasAlloc()+ for the memory allocations on GPUs and \verb+cublasSetVector()+ for the -memory copies from the CPUs to the GPUs. - -An efficient implementation of CG and GMRES solvers on a GPU cluster requires to determine all parts of their codes that can be -executed in parallel and, thus, take advantage of the GPU acceleration. As many Krylov sub-space methods, the CG and GMRES methods -are mainly based on arithmetic operations dealing with vectors or matrices: sparse matrix-vector multiplications, scalar-vector -multiplications, dot products, Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a -scalar) and so on. These vector operations are often easy to parallelize and they are more efficient on parallel computers when -they work on large vectors. Therefore, all the vector operations used in CG and GMRES solvers must be executed by the GPUs as kernels. 
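As an illustration of such a data-parallel vector kernel, the following CUDA sketch (a possible implementation, not necessarily the authors' code) computes the XPAY operation $y\leftarrow x+ay$ mentioned below for line~$12$ of Algorithm~\ref{ch12:alg:01}; the $512$-thread blocks correspond to the block size used in the experiments of Section~\ref{ch12:sec:04}.
\begin{verbatim}
/* Possible CUDA implementation of the XPAY kernel: y <- x + a*y,
 * where d_x and d_y are device pointers to vectors of n doubles. */
__global__ void xpay(int n, double a, const double *d_x, double *d_y)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    d_y[i] = d_x[i] + a * d_y[i];
}

/* Host-side launch with 512-thread blocks:
 *   xpay<<<(n + 511) / 512, 512>>>(n, a, d_x, d_y);          */
\end{verbatim}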
- -We use the kernels of the CUBLAS library to compute some vector operations of CG and GMRES solvers. The following kernels of CUBLAS -(dealing with double floating point) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the Euclidean -norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of the data-parallel operations, we code their kernels in CUDA. -In the CG solver, we develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in Algorithm~\ref{alg:01}. In the -GMRES solver, we program a kernel for the scalar-vector multiplication (lines~$7$ and~$15$ in Algorithm~\ref{alg:02}), a kernel for -solving the least-squares problem and a kernel for the elements updates of the solution vector $x$. - -The least-squares problem in the GMRES method is solved by performing a QR factorization on the Hessenberg matrix $\bar{H}_m$ with -plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, solving the least-squares -problem on the GPU is not interesting. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However, -the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. Therefore, we develop -an inexpensive kernel which must be executed in sequential by a single CUDA thread. - -The most important operation in CG and GMRES methods is the sparse matrix-vector multiplication (SpMV), because it is often an -expensive operation in terms of execution time and memory space. Moreover, it requires to take care of the storage format of the -sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix can cause a significant -waste of memory space and execution time. In addition, the sparsity nature of the matrix often leads to irregular memory accesses -to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced accesses to -the global memory, which slows down even more its performances. One of the most efficient compressed storage formats of sparse -matrices on GPUs is HYB format~\cite{ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores -a typical number of nonzero values per row in ELL format and remaining entries of exceptional rows in COO format. It combines -the efficiency of ELL due to the regularity of its memory accesses and the flexibility of COO which is insensitive to the matrix -structure. Consequently, we use the HYB kernel~\cite{ref8} developed by Nvidia to implement the SpMV multiplication of CG and -GMRES methods on GPUs. Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill the elements of -the iterate vector $x$ in the cached texture memory. +\label{ch12:sec:03.02} +After the partitioning operation, all the data involved from this operation must be +transferred from the CPU memories to the GPU memories, in order to be processed by +GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear +Algebra Subroutines) developed by NVIDIA~\cite{ch12:ref6}: \verb+cublasAlloc()+ +for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory +copies from the CPUs to the GPUs. + +An efficient implementation of CG and GMRES solvers on a GPU cluster requires the +determining of all parts of their codes that can be executed in parallel and, thus, takes +advantage of the GPU acceleration. 
As many Krylov subspace methods, the CG and GMRES +methods are mainly based on arithmetic operations dealing with vectors or matrices: +sparse matrix-vector multiplications, scalar-vector multiplications, dot products, +Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a scalar), +and so on. These vector operations are often easy to parallelize +and they are more efficient on parallel computers when they work on large vectors. +Therefore, all the vector operations used in CG and GMRES solvers must be executed +by the GPUs as kernels. + +We use the kernels of the CUBLAS library to compute some vector operations of CG and +GMRES solvers. The following kernels of CUBLAS (dealing with double floating point) +are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the +Euclidean norms, and \verb+cublasDaxpy()+ for the AXPY operations ($y\leftarrow ax+y$, compute a scalar-vector product and add +the result to a vector). For the rest of +the data-parallel operations, we code their kernels in CUDA. In the CG solver, we +develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used in line~$12$ in +Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector +multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel to +solve the least-squares problem, and a kernel to update the elements of the solution +vector $x$. + +The least-squares problem in the GMRES method is solved by performing a QR factorization +on the Hessenberg matrix\index{Hessenberg matrix} $\bar{H}_m$ with plane rotations and, +then, solving the triangular system by backward substitutions to compute $y$. Consequently, +solving the least-squares problem on the GPU is not efficient. Indeed, the triangular +solves are not easy to parallelize and inefficient on GPUs. However, the least-squares +problem to solve in the GMRES method with restarts has, generally, a very small size $m$. +Therefore, we develop an inexpensive kernel which must be executed by a single CUDA thread. + +The most important operation in CG\index{iterative method!CG} and GMRES\index{iterative method!GMRES} +methods is the SpMV multiplication (sparse matrix-vector multiplication)\index{SpMV multiplication}, +because it is often an expensive operation in terms of execution time and memory space. +Moreover, it requires taking care of the storage format of the sparse matrix in the +memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix +can cause a significant waste of memory space and execution time. In addition, the sparse +nature of the matrix often leads to irregular memory accesses to read the matrix nonzero +values. So, the computation of the SpMV multiplication on GPUs can involve noncoalesced +accesses to the global memory, which slows down its performances even more. One of the +most efficient compressed storage formats\index{compressed storage format} of sparse +matrices on GPUs is the HYB (hybrid)\index{compressed storage format!HYB} format~\cite{ch12:ref7}. +It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores +a typical number of nonzero values per row in ELL\index{compressed storage format!ELL} +format and the remaining entries of exceptional rows in COO format. It combines the efficiency +of ELL due to the regularity of its memory accesses and the flexibility of COO\index{compressed storage format!COO} +which is insensitive to the matrix structure. 
Consequently, we use the HYB kernel~\cite{ch12:ref8} +developed by NVIDIA to implement the SpMV multiplication of CG and GMRES methods on GPUs. +Moreover, to avoid the noncoalesced accesses to the high-latency global memory, we fill +the elements of the iterate vector $x$ in the cached texture memory. + %%****************%% %%****************%% \subsection{Data communications} -\label{sec:03.03} -All the computing nodes of the GPU cluster execute in parallel the same iterative solver (Algorithm~\ref{alg:01} or Algorithm~\ref{alg:02}) -adapted to GPUs, but on their own portions of the sparse linear system: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, $0\leq i<p$. However in order to solve -the complete sparse linear system~(\ref{eq:11}), synchronizations must be performed between the local computations of the computing nodes -over the cluster. In what follows, two computing nodes sharing data are called neighboring nodes. - -As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication. In the parallel implementation of -the iterative methods, each computing node $i$ performs the SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it -has only sub-vectors of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires the vector elements -of its neighbors, corresponding to the column indices on which its sub-matrix has nonzero values (see Figure~\ref{fig:01}). So, in addition -to the local vectors, each node must also manage vector elements shared with neighbors and required to compute the SpMV multiplication. -Therefore, the iterate vector $x$ managed by each computing node is composed of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a -sub-vector of shared elements $x^{shared}$. In the same way, the vector used to construct the orthonormal basis of the Krylov sub-space -(vectors $p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared sub-vector. - -Therefore, before computing the SpMV multiplication, the neighboring nodes over the GPU cluster must exchange between them the shared -vector elements necessary to compute this multiplication. First, each computing node determines, in its local sub-vector, the vector -elements needed by other nodes. Then, the neighboring nodes exchange between them these shared vector elements. The data exchanges -are implemented by using the MPI point-to-point communication routines: blocking sends with \verb+MPI_Send()+ and nonblocking receives -with \verb+MPI_Irecv()+. Figure~\ref{fig:02} shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, -\textit{Node 2} and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing nodes is that presented -in Figure~\ref{fig:01}. +\label{ch12:sec:03.03} +All the computing nodes of the GPU cluster execute in parallel the same iterative solver +(Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their +own portions of the sparse linear system\index{sparse linear system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, +$0\leq i<p$. However, in order to solve the complete sparse linear system~(\ref{ch12:eq:11}), +synchronizations must be performed between the local computations of the computing nodes over +the cluster. In what follows, two computing nodes sharing data are called neighboring nodes\index{neighboring node}. 
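For instance, the dot products of Algorithms~\ref{ch12:alg:01} and~\ref{ch12:alg:02} become global reduction operations: each node computes a partial dot product on its local subvectors with CUBLAS, and the partial results are summed over the cluster with \verb+MPI_Allreduce()+, as mentioned at the end of this section. The following sketch (a possible implementation, assuming the legacy CUBLAS API used elsewhere in this chapter) illustrates this synchronization.
\begin{verbatim}
#include <cublas.h>
#include <mpi.h>

/* Sketch of a distributed dot product (p,q): d_p and d_q are device
 * pointers to the local subvectors of size local_n = n/p on each node. */
double parallel_dot(int local_n, const double *d_p, const double *d_q)
{
  double local_dot, global_dot;
  local_dot = cublasDdot(local_n, d_p, 1, d_q, 1);   /* local part on the GPU  */
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE,
                MPI_SUM, MPI_COMM_WORLD);            /* sum over all the nodes */
  return global_dot;
}
\end{verbatim}
The Euclidean norms can be obtained in the same way, for instance by reducing the local squared norms and taking the square root of the global sum.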
+ +As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication. +In the parallel implementation of the iterative methods, each computing node $i$ performs the +SpMV multiplication on its own sparse rectangular submatrix $A_i$. Locally, it has only subvectors +of size $\frac{n}{p}$ corresponding to rows of its submatrix $A_i$. However, it also requires +the vector elements of its neighbors, corresponding to the column indices on which its submatrix +has nonzero values (see Figure~\ref{ch12:fig:01}). So, in addition to the local vectors, each +node must also manage vector elements shared with neighbors and required to compute the SpMV +multiplication. Therefore, the iterate vector $x$ managed by each computing node is composed +of a local subvector $x^{local}$ of size $\frac{n}{p}$ and a subvector of shared elements $x^{shared}$. +In the same way, the vector used to construct the orthonormal basis of the Krylov subspace (vectors +$p$ and $v$ in CG and GMRES methods, respectively) is composed of a local subvector and a shared +subvector. + +Therefore, before computing the SpMV multiplication\index{SpMV multiplication}, the neighboring +nodes\index{neighboring node} over the GPU cluster must exchange between them the shared vector +elements necessary to compute this multiplication. First, each computing node determines, in its +local subvector, the vector elements needed by other nodes. Then, the neighboring nodes exchange +between them these shared vector elements. The data exchanges are implemented by using the MPI +point-to-point communication routines: blocking\index{MPI!blocking} sends with \verb+MPI_Send()+ +and nonblocking\index{MPI!nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02} +shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}, +and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing +nodes is that presented in Figure~\ref{ch12:fig:01}. \begin{figure} \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}} -\caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.} -\label{fig:02} +\caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}, and \textit{Node 3}.} +\label{ch12:fig:02} \end{figure} -After the synchronization operation, the computing nodes receive, from their respective neighbors, the shared elements in a sub-vector -stored in a compressed format. However, in order to compute the SpMV multiplication, the computing nodes operate on sparse global vectors -(see Figure~\ref{fig:02}). In this case, the received vector elements must be copied to the corresponding indices in the global vector. -So as not to need to perform this at each iteration, we propose to reorder the columns of each sub-matrix $\{A_i\}_{0\leq i<p}$, so that -the shared sub-vectors could be used in their compressed storage formats. Figure~\ref{fig:03} shows a reordering of a sparse sub-matrix -(sub-matrix of \textit{Node 1}). +After the synchronization operation, the computing nodes receive, from their respective neighbors, +the shared elements in a subvector stored in a compressed format. However, in order to compute the +SpMV multiplication, the computing nodes operate on sparse global vectors (see Figure~\ref{ch12:fig:02}). 
+In this case, the received vector elements must be copied to the corresponding indices in the global +vector. So as not to need to perform this at each iteration, we propose to reorder the columns of +each submatrix $\{A_i\}_{0\leq i<p}$, so that the shared subvectors could be used in their compressed +storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse submatrix (submatrix of +\textit{Node 1}). \begin{figure} -\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/reorder}} -\caption{Columns reordering of a sparse sub-matrix.} -\label{fig:03} +\centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/reorder}} +\caption{Columns reordering of a sparse submatrix.} +\label{ch12:fig:03} \end{figure} -A GPU cluster is a parallel platform with a distributed memory. So, the synchronizations and communication data between GPU nodes are -carried out by passing messages. However, GPUs can not communicate between them in direct way. Then, CPUs via MPI processes are in charge -of the synchronizations within the GPU cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory -to the CPU memory and vice-versa before and after the synchronization operation between CPUs. We have used the CBLAS communication subroutines -to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+ and \verb+cublasSetVector()+. Finally, in addition -to the data exchanges, GPU nodes perform reduction operations to compute in parallel the dot products and Euclidean norms. This is implemented -by using the MPI global communication \verb+MPI_Allreduce()+. +A GPU cluster\index{GPU!cluster}\index{multi-GPU} is a parallel platform with a distributed memory. So, the synchronizations +and communication data between GPU nodes are carried out by passing messages. However, a GPU cannot exchange data +with other GPUs in a direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU +cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory +and vice versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS} +communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+ +and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations +to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI!global} +\verb+MPI_Allreduce()+. + %%--------------------------%% %% SECTION 4 %% %%--------------------------%% \section{Experimental results} -\label{sec:04} -In this section, we present the performances of the parallel CG and GMRES linear solvers obtained on a cluster of $12$ GPUs. Indeed, this GPU -cluster of tests is composed of six machines connected by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at -$2.4$GHz and providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are connected to each machine via -a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a -global memory of $4$GB with a memory bandwidth of $102$GB/s. Figure~\ref{fig:04} shows the general scheme of the GPU cluster that we used in -the experimental tests. 
+\label{ch12:sec:04} +In this section, we present the performances of the parallel CG and GMRES linear solvers obtained +on a cluster of $12$ GPUs. Indeed, this GPU cluster of tests is composed of six machines connected +by a $20$GB/s InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz and +providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are +connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A +Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global memory of $4$GB with +a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU!cluster} +that we used in the experimental tests. + +Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used to code +the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9} +is used to program GPUs, using the CUBLAS library~\cite{ch12:ref6} to deal with vector operations +in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between +CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node +is managed by one MPI process and is composed of one CPU core and one GPU card. \begin{figure} \centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}} \caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.} -\label{fig:04} +\label{ch12:fig:04} \end{figure} -Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of both methods on the -GPU cluster. CUDA version 4.0~\cite{ref9} is used for programming GPUs, using CUBLAS library~\cite{ref6} to deal with vector operations in GPUs -and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between CPU cores. Indeed, the experiments are done on a -cluster of $12$ computing nodes, where each node is managed by a MPI process and it is composed of one CPU core and one GPU card. - -All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual -tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the -initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process used in the GMRES method to $16$ iterations ($m=16$). For -the sake of simplicity, we have chosen the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily compute -the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for not too ill-conditioned matrices. In the GPU computing, -the size of thread blocks is fixed to $512$ threads. Finally, the performance results, presented hereafter, are obtained from the mean value over -$10$ executions of the same parallel linear solver and for the same input data. - -To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's collection~\cite{ref10}, that arise in a wide -spectrum of real-world applications. We chose six symmetric sparse matrices and six nonsymmetric ones from this collection. 
In Figure~\ref{fig:05}, -we show structures of these matrices and in Table~\ref{tab:01} we present their main characteristics which are the number of rows, the total number -of nonzero values (nnz) and the maximal bandwidth. In the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns -separating the first and the last nonzero value on a matrix row. +All tests are made on double-precision floating point operations. The parameters of both linear +solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the +maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$, and the +initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi iterations\index{iterative method!Arnoldi iterations} +used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen +the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows us to easily +compute the required inverse matrix $M^{-1}$, and it provides a relatively good preconditioning for +not too ill-conditioned matrices. In the GPU computing, the size of thread blocks is fixed to $512$ +threads. Finally, the performance results, presented hereafter, are obtained from the mean value +over $10$ executions of the same parallel linear solver and for the same input data. \begin{figure} \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}} -\caption{Sketches of sparse matrices chosen from the Davis's collection.} -\label{fig:05} +\caption{Sketches of sparse matrices chosen from the University of Florida collection.} +\label{ch12:fig:05} \end{figure} +To get more realistic results, we have tested the CG and GMRES algorithms on sparse matrices of the University of Florida +collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We have chosen six +symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05}, +we show the structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics +which are the number of rows, the total number of nonzero values, and the maximal bandwidth. In +the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating +the first and the last nonzero value on a matrix row. + \begin{table} \centering \begin{tabular}{|c|c|c|c|c|} \hline -{\bf Matrix type} & {\bf Matrix name} & {\bf \# rows} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline +{\bf Matrix Type} & {\bf Matrix Name} & {\bf \# Rows} & {\bf \# Nonzeros} & {\bf Bandwidth} \\ \hline \hline \multirow{6}{*}{Symmetric} & 2cubes\_sphere & $101,492$ & $1,647,264$ & $100,464$ \\ @@ -497,16 +578,15 @@ separating the first and the last nonzero value on a matrix row. 
& torso3 & $259,156$ & $4,429,042$ & $216,854$ \\ \hline \end{tabular} -\vspace{0.5cm} -\caption{Main characteristics of sparse matrices chosen from the Davis's collection.} -\label{tab:01} +\caption{Main characteristics of sparse matrices chosen from the University of Florida collection.} +\label{ch12:tab:01} \end{table} -\begin{table} +\begin{table}[!h] \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline +{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline 2cubes\_sphere & $0.132s$ & $0.069s$ & $1.93$ & $12$ & $1.14e$-$09$ & $3.47e$-$18$ \\ @@ -521,7 +601,7 @@ shallow\_water2 & $0.017s$ & $0.010s$ & $1.68$ & $ thermal2 & $1.172s$ & $0.622s$ & $1.88$ & $15$ & $5.11e$-$09$ & $3.33e$-$16$ \\ \hline \end{tabular} \caption{Performances of the parallel CG method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.} -\label{tab:02} +\label{ch12:tab:02} \end{center} \end{table} @@ -529,7 +609,7 @@ thermal2 & $1.172s$ & $0.622s$ & $1.88$ & $ \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline +{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline 2cubes\_sphere & $0.234s$ & $0.124s$ & $1.88$ & $21$ & $2.10e$-$14$ & $3.47e$-$18$ \\ @@ -556,56 +636,70 @@ poli\_large & $0.097s$ & $0.095s$ & $1.02$ & $ torso3 & $4.242s$ & $2.030s$ & $2.09$ & $175$ & $2.69e$-$10$ & $1.78e$-$14$ \\ \hline \end{tabular} \caption{Performances of the parallel GMRES method on a cluster 24 CPU cores vs. on cluster of 12 GPUs.} -\label{tab:03} +\label{ch12:tab:03} \end{center} \end{table} -Tables~\ref{tab:02} and~\ref{tab:03} shows the performances of the parallel CG and GMRES solvers, respectively, for solving linear systems associated to -the sparse matrices presented in Tables~\ref{tab:01}. They allow to compare the performances obtained on a cluster of $24$ CPU cores and on a cluster -of $12$ GPUs. However, Table~\ref{tab:02} shows only the performances of solving symmetric sparse linear systems, due to the inability of the CG method -to solve the nonsymmetric systems. In both tables, the second and third columns give, respectively, the execution times in seconds obtained on $24$ CPU -cores ($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account the relative gains $\tau$ of a solver implemented on the -GPU cluster compared to the same solver implemented on the CPU cluster. The relative gains, presented in the fourth column, are computed as a ratio of -the CPU execution time over the GPU execution time: +Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} show the performances of the parallel +CG and~GMRES solvers, respectively, for solving linear systems associated to the sparse +matrices presented in Table~\ref{ch12:tab:01}. They allow us to compare the performances +obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02} +shows the performances of solving only symmetric sparse linear systems, due to the inability +of the CG method to solve the nonsymmetric systems. 
In both tables, the second and third +columns give, respectively, the execution times in seconds obtained on $24$ CPU cores +($Time_{cpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account +the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same +solver implemented on the CPU cluster. The relative gains\index{relative gain}, presented +in the fourth column, are computed as a ratio of the CPU execution time over the GPU +execution time: \begin{equation} \tau = \frac{Time_{cpu}}{Time_{gpu}}. -\label{eq:20} +\label{ch12:eq:20} \end{equation} -In addition, Tables~\ref{tab:02} and~\ref{tab:03} give the number of iterations ($iter$), the precision $prec$ of the solution computed on the GPU cluster -and the difference $\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster. Both parameters $prec$ and $\Delta$ -allow to validate and verify the accuracy of the solution computed on the GPU cluster. We have computed them as follows: +In addition, Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} give the number of iterations +($iter$), the precision ($prec$) of the solution computed on the GPU cluster, and the difference +$\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster. +Both parameters $prec$ and $\Delta$ allow us to validate and verify the accuracy of the solution +computed on the GPU cluster. We have computed them as follows: \begin{eqnarray} \Delta = max|x^{cpu}-x^{gpu}|,\\ prec = max|M^{-1}r^{gpu}|, \end{eqnarray} -where $\Delta$ is the maximum vector element, in absolute value, of the difference between the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively, -on CPU and GPU clusters and $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$ of the solution $x^{gpu}$. -Thus, we can see that the solutions obtained on the GPU cluster were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, -equivalent to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$. However, we can notice from the relative -gains $\tau$ that is not interesting to use multiple GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to maximize -utilization of GPU cores. In addition, the communications required to synchronize the computations over the cluster increase the idle times of GPUs and -slow down further the parallel computations. - -Consequently, in order to test the performances of the parallel solvers, we developed in C programming language a generator of large sparse matrices. -This generator takes a matrix from the Davis's collection~\cite{ref10} as an initial matrix to construct large sparse matrices exceeding ten million -of rows. It must be executed in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse sub-matrix. In -first experimental tests, we are focused on sparse matrices having a banded structure, because they are those arise in the most of numerical problems. -So to generate the global sparse matrix, each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen -from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix (see Figure~\ref{fig:06}). 
Moreover, the empty
-spaces between two successive copies in the main diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{fig:06}) of the same
+where $\Delta$ is the maximum vector element, in absolute value, of the difference between
+the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively, on the CPU and GPU clusters, and
+$prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$
+of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster
+were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
+to those computed on the CPU cluster, with a small difference ranging from $10^{-10}$ to $10^{-26}$.
+However, we can notice from the relative gains $\tau$ that it is not efficient to use multiple
+GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow us to
+maximize the utilization of the GPU cores. In addition, the communications required to synchronize the
+computations over the cluster increase the idle times of the GPUs and slow down the parallel
+computations further.
+
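The two validation parameters defined above can be obtained with a single pass over the local
vector elements followed by a max-reduction over all computing nodes. The following C/MPI sketch
illustrates this computation; the array names are illustrative assumptions, where \texttt{m\_inv}
stands for the inverse of the diagonal preconditioner $M$ and \texttt{r\_gpu} for the residual
vector of the solution computed on the GPU cluster.

\begin{lstlisting}[language=C]
#include <math.h>
#include <mpi.h>

/* Local contributions to Delta = max|x_cpu - x_gpu| and prec = max|M^{-1} r_gpu|,
   followed by a global max-reduction over all computing nodes. */
void check_solution(int n_local, const double *x_cpu, const double *x_gpu,
                    const double *m_inv, const double *r_gpu,
                    double *delta, double *prec)
{
  double local[2] = {0.0, 0.0}, global[2];
  for (int i = 0; i < n_local; i++) {
    double d = fabs(x_cpu[i] - x_gpu[i]);  /* difference of the two solutions */
    double p = fabs(m_inv[i] * r_gpu[i]);  /* preconditioned residual element */
    if (d > local[0]) local[0] = d;
    if (p > local[1]) local[1] = p;
  }
  MPI_Allreduce(local, global, 2, MPI_DOUBLE, MPI_MAX, MPI_COMM_WORLD);
  *delta = global[0];
  *prec  = global[1];
}
\end{lstlisting}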
+Consequently, in order to test the performances of the parallel solvers, we developed in C programming
+language a generator of large sparse matrices. This generator takes a matrix from the University of Florida collection~\cite{ch12:ref10}
+as an initial matrix to build large sparse matrices exceeding ten million rows. It must be executed
+in parallel by the MPI processes of the computing nodes, so that each process can build its sparse
+submatrix. In the first experimental tests, we focused on sparse matrices having a banded structure,
+because they are the ones that arise most frequently in numerical problems. So to generate the global sparse matrix,
+each MPI process constructs its submatrix by performing several copies of an initial sparse matrix chosen
+from the University of Florida collection. Then, it puts all these copies on the main diagonal of the global matrix
+(see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
+diagonal are filled with subcopies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
initial matrix.

\begin{figure}
\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
\caption{Parallel generation of a large sparse matrix by four computing nodes.}
-\label{fig:06}
+\label{ch12:fig:06}
\end{figure}

\begin{table}[!h]
\centering
\begin{tabular}{|c|c|c|c|}
\hline
-{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
+{\bf Matrix Type} & {\bf Matrix Name} & {\bf \# Nonzeros} & {\bf Bandwidth} \\ \hline \hline

\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $413,703,602$ & $198,836$ \\

@@ -632,24 +726,28 @@ initial matrix.
 & torso3 & $433,795,264$ & $328,757$ \\ \hline

\end{tabular}
\vspace{0.5cm}
-\caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
-\label{tab:04}
+\caption{Main characteristics of sparse banded matrices generated from those of the University of Florida collection.}
+\label{ch12:tab:04}
\end{table}

-We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$ million unknown values. The sparse matrices associated
-to these linear systems are generated from those presented in Table~\ref{tab:01}. Their main characteristics are given in Table~\ref{tab:04}. Tables~\ref{tab:05}
-and~\ref{tab:06} shows the performances of the parallel CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster
-of $12$ GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on a GPU cluster is more efficient than on a CPU
-cluster (see relative gains $\tau$). We can also notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster, are
-better than those the GMRES method for solving large symmetric linear systems. In fact, the CG method is characterized by a better convergence rate
-and a shorter execution time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES method requires more data
-exchanges between computing nodes compared to the parallel CG method.
-
+We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
+million unknown values. The sparse matrices associated to these linear systems are generated
+from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
+Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} show the performances of the parallel CG and
+GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
+GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
+a GPU cluster is more efficient than on a CPU cluster (see the relative gains $\tau$). We can also
+notice that the execution times of the CG method, whether on a CPU cluster or on a GPU cluster,
+are better than those of the GMRES method for solving large symmetric linear systems. In fact, the
+CG method is characterized by a better convergence\index{convergence} rate and a shorter execution
+time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
+method requires more data exchanges between computing nodes compared to the parallel CG method.
+
\begin{table}[!h]
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
+{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline

2cubes\_sphere & $1.625s$ & $0.401s$ & $4.05$ & $14$ & $5.73e$-$11$ & $5.20e$-$18$ \\

@@ -665,7 +763,7 @@ thermal2 & $1.411s$ & $0.244s$ & $5.78$
\end{tabular}
\caption{Performances of the parallel CG method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
on a cluster of 12 GPUs.}
-\label{tab:05}
+\label{ch12:tab:05}
\end{center}
\end{table}

@@ -673,7 +771,7 @@ on a cluster of 12 GPUs.}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
+{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline

2cubes\_sphere & $3.597s$ & $0.514s$ & $6.99$ & $21$ & $2.11e$-$14$ & $8.67e$-$18$ \\

@@ -701,378 +799,51 @@ torso3 & $31.463s$ & $3.681s$ & $8.55$
\end{tabular}
\caption{Performances of the parallel GMRES method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
on a cluster of 12 GPUs.} -\label{tab:06} +\label{ch12:tab:06} \end{center} -\end{table} - +\end{table} %%--------------------------%% %% SECTION 5 %% %%--------------------------%% -\section{Hypergraph partitioning} -\label{sec:05} -In this section, we present the performances of both parallel CG and GMRES solvers for solving linear systems associated to sparse matrices having -large bandwidths. Indeed, we are interested on sparse matrices having the nonzero values distributed along their bandwidths. - -\begin{figure} -\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}} -\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.} -\label{fig:07} -\end{figure} - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|} -\hline -{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline - -\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $829,082,728$ & $24,999,999$ \\ - - & ecology2 & $254,892,056$ & $25,000,000$ \\ - - & finan512 & $556,982,339$ & $24,999,973$ \\ - - & G3\_circuit & $257,982,646$ & $25,000,000$ \\ - - & shallow\_water2 & $200,798,268$ & $25,000,000$ \\ - - & thermal2 & $359,340,179$ & $24,999,998$ \\ \hline \hline - -\multirow{6}{*}{Nonsymmetric} & cage13 & $879,063,379$ & $24,999,998$ \\ - - & crashbasis & $820,373,286$ & $24,999,803$ \\ - - & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$ \\ - - & language & $155,261,826$ & $24,999,492$ \\ - - & poli\_large & $106,680,819$ & $25,000,000$ \\ - - & torso3 & $872,029,998$ & $25,000,000$\\ \hline -\end{tabular} -\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.} -\label{tab:07} -\end{center} -\end{table} - -We have developed in C programming language a generator of large sparse matrices having five bands distributed along their bandwidths (see Figure~\ref{fig:07}). -The principle of this generator is equivalent to that in Section~\ref{sec:04}. However, the copies performed on the initial matrix (chosen from the -Davis's collection) are placed on the main diagonal and on four off-diagonals, two on the right and two on the left of the main diagonal. Figure~\ref{fig:07} -shows an example of a generation of a sparse five-bands matrix by four computing nodes. Table~\ref{tab:07} shows the main characteristics of sparse -five-bands matrices generated from those presented in Table~\ref{tab:01} and associated to linear systems of $25$ million unknown values. - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline - -2cubes\_sphere & $6.041s$ & $3.338s$ & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\ - -ecology2 & $1.404s$ & $1.301s$ & $1.08$ & $13$ & $5.22e$-$11$ & $2.17e$-$18$ \\ - -finan512 & $1.822s$ & $1.299s$ & $1.40$ & $12$ & $3.52e$-$11$ & $3.47e$-$18$ \\ - -G3\_circuit & $2.331s$ & $2.129s$ & $1.09$ & $15$ & $1.36e$-$11$ & $5.20e$-$18$ \\ - -shallow\_water2 & $0.541s$ & $0.504s$ & $1.07$ & $6$ & $2.12e$-$16$ & $5.05e$-$28$ \\ - -thermal2 & $2.549s$ & $1.705s$ & $1.49$ & $14$ & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline -\end{tabular} -\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices -on a cluster of 24 CPU cores vs. 
on a cluster of 12 GPUs} -\label{tab:08} -\end{center} -\end{table} - -\begin{table} -\begin{center} -\begin{tabular}{|c|c|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline - -2cubes\_sphere & $15.963s$ & $7.250s$ & $2.20$ & $58$ & $6.23e$-$16$ & $3.25e$-$19$ \\ - -ecology2 & $3.549s$ & $2.176s$ & $1.63$ & $21$ & $4.78e$-$15$ & $1.06e$-$15$ \\ - -finan512 & $3.862s$ & $1.934s$ & $1.99$ & $17$ & $3.21e$-$14$ & $8.43e$-$17$ \\ - -G3\_circuit & $4.636s$ & $2.811s$ & $1.65$ & $22$ & $1.08e$-$14$ & $1.77e$-$16$ \\ - -shallow\_water2 & $2.738s$ & $1.539s$ & $1.78$ & $17$ & $5.54e$-$23$ & $3.82e$-$26$ \\ - -thermal2 & $5.017s$ & $2.587s$ & $1.94$ & $21$ & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline - -cage13 & $9.315s$ & $3.227s$ & $2.89$ & $26$ & $3.38e$-$13$ & $2.08e$-$16$ \\ - -crashbasis & $35.980s$ & $14.770s$ & $2.43$ & $127$ & $1.17e$-$12$ & $1.56e$-$17$ \\ - -FEM\_3D\_thermal2 & $24.611s$ & $7.749s$ & $3.17$ & $64$ & $3.87e$-$11$ & $2.84e$-$14$ \\ - -language & $16.859s$ & $9.697s$ & $1.74$ & $89$ & $2.17e$-$12$ & $1.70e$-$12$ \\ - -poli\_large & $10.200s$ & $6.534s$ & $1.56$ & $69$ & $5.14e$-$13$ & $1.63e$-$13$ \\ - -torso3 & $49.074s$ & $19.397s$ & $2.53$ & $175$ & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline -\end{tabular} -\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices -on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs} -\label{tab:09} -\end{center} -\end{table} - -Tables~\ref{tab:08} and~\ref{tab:09} shows the performaces of the parallel CG and GMRES solvers, respectively, obtained on -a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. The linear systems solved in these tables are associated to the -sparse five-bands matrices presented on Table~\ref{tab:07}. We can notice from both Tables~\ref{tab:08} and~\ref{tab:09} that -using a GPU cluster is not efficient for solving these kind of sparse linear systems. We can see that the execution times obtained -on the GPU cluster are almost equivalent to those obtained on the CPU cluster (see the relative gains presented in column~$4$ -of each table). This is due to the large number of communications necessary to synchronize the computations over the cluster. -Indeed, the naive partitioning, row-by-row or column-by-column, of sparse matrices having large bandwidths can link a computing -node to many neighbors and then generate a large number of data dependencies between these computing nodes in the cluster. - -Therefore, we have chosen to use a hypergraph partitioning method, which is well-suited to numerous kinds of sparse matrices~\cite{ref11}. -Indeed, it can well model the communications between the computing nodes, particularly in the case of nonsymmetric and irregular -matrices, and it gives good reduction of the total communication volume. In contrast, it is an expensive operation in terms of -execution time and memory space. 
- -The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ as -follows: -\begin{itemize} -\item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and, -\item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where: -\begin{equation} -\forall a_{ij} \neq 0 \mbox{~is a nonzero value of matrix~} A \mbox{~:~} v_i \in pins[e_j], -\end{equation} -\item $w_i$ is the weight of vertex $v_i$ and, -\item $c_j$ is the cost of hyperedge $e_j$. -\end{itemize} -A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$ -a set of pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each subset is attributed to a computing node. -Figure~\ref{fig:08} shows an example of the hypergraph model of a $(9\times 9)$ sparse matrix in three parts. The circles and squares correspond, -respectively, to the vertices and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two different parts. -The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different parts spanned by $e_j$. - -\begin{figure} -\centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}} -\caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.} -\label{fig:08} -\end{figure} - -The cut hyperedges model the total communication volume between the different computing nodes in the cluster, necessary to perform the parallel SpMV -multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$, $0\leq i,j<n$, of the SpMV multiplication -$Ax=b$ that need the $j^{th}$ unknown value of solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix rows sharing -and requiring the same unknown value $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the -dependency of matrix rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$, -$b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$. However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$. -So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform two communications. - -The hypergraph partitioning allows to reduce the total communication volume required to perform the parallel SpMV multiplication, while maintaining the -load balancing between the computing nodes. In fact, it allows to minimize at best the following amount: -\begin{equation} -\mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1), -\end{equation} -where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning $\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively, -the cost and the connectivity of cut hyperedge $e_j$. 
Moreover, it also ensures the load balancing between the $K$ parts as follows: -\begin{equation} - W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1), -\end{equation} -where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the average weight of all $K$ parts and $\epsilon$ is the -maximum allowed imbalanced ratio. - -The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed, for example: hMETIS~\cite{ref12}, PaToH~\cite{ref13} -and Zoltan~\cite{ref14}. Since our objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must be performed by -at least two MPI processes. It allows to accelerate the data partitioning of large sparse matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned -in $p$ (number of MPI processes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the parallel hypergraph -partitioning method using some functions of the MPI library between the $p$ processes. - -Tables~\ref{tab:10} and~\ref{tab:11} shows the performances of the parallel CG and GMRES solvers, respectively, using the hypergraph partitioning for solving -large linear systems associated to the sparse five-bands matrices presented in Table~\ref{tab:07}. For these experimental tests, we have applied the parallel -hypergraph partitioning~\cite{ref15} developed in Zoltan tool~\cite{ref14}. We have initialized the parameters of the partitioning operation as follows: -\begin{itemize} -\item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$, -\item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$, -\item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\ -\end{itemize} - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline - -2cubes\_sphere & $5.935s$ & $1.213s$ & $4.89$ & $63.66\%$ \\ - -ecology2 & $1.093s$ & $0.136s$ & $8.00$ & $89.55\%$ \\ - -finan512 & $1.762s$ & $0.475s$ & $3.71$ & $63.43\%$ \\ - -G3\_circuit & $2.095s$ & $0.558s$ & $3.76$ & $73.79\%$ \\ - -shallow\_water2 & $0.498s$ & $0.068s$ & $7.31$ & $86.51\%$ \\ - -thermal2 & $1.889s$ & $0.348s$ & $5.43$ & $79.59\%$ \\ \hline -\end{tabular} -\caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to -sparse five-bands matrices on a cluster of 24 CPU cores vs. 
on a cluster of 12 GPU.} -\label{tab:10} -\end{center} -\end{table} - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline - -2cubes\_sphere & $16.430s$ & $2.840s$ & $5.78$ & $60.83\%$ \\ - -ecology2 & $3.152s$ & $0.367s$ & $8.59$ & $83.13\%$ \\ - -finan512 & $3.672s$ & $0.723s$ & $5.08$ & $62.62\%$ \\ - -G3\_circuit & $4.468s$ & $0.971s$ & $4.60$ & $65.46\%$ \\ - -shallow\_water2 & $2.647s$ & $0.312s$ & $8.48$ & $79.73\%$ \\ - -thermal2 & $4.190s$ & $0.666s$ & $6.29$ & $74.25\%$ \\ \hline \hline - -cage13 & $8.077s$ & $1.584s$ & $5.10$ & $50.91\%$ \\ - -crashbasis & $35.173s$ & $5.546s$ & $6.34$ & $62.43\%$ \\ - -FEM\_3D\_thermal2 & $24.825s$ & $3.113s$ & $7.97$ & $59.83\%$ \\ - -language & $16.706s$ & $2.522s$ & $6.62$ & $73.99\%$ \\ - -poli\_large & $12.715s$ & $3.989s$ & $3.19$ & $38.95\%$ \\ - -torso3 & $48.459s$ & $6.234s$ & $7.77$ & $67.86\%$ \\ \hline -\end{tabular} -\caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to -sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.} -\label{tab:11} -\end{center} -\end{table} - -We can notice from both Tables~\ref{tab:10} and~\ref{tab:11} that the hypergraph partitioning has improved the performances of both -parallel CG and GMRES algorithms. The execution times -on the GPU cluster of both parallel solvers are significantly improved compared to those obtained by using the partitioning row-by-row. -For these examples of sparse matrices, the execution times of CG and GMRES solvers are reduced about $76\%$ and $65\%$ respectively -(see column~$5$ of each table) compared to those obtained in Tables~\ref{tab:08} and~\ref{tab:09}. - -In fact, the hypergraph partitioning applied to sparse matrices having large bandwidths allows to reduce the total communication volume -necessary to synchronize the computations between the computing nodes in the GPU cluster. Table~\ref{tab:12} presents, for each sparse -matrix, the total communication volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row (column~$2$), the -total communication volume obtained by using the hypergraph partitioning (column~$3$) and the execution times in minutes of the hypergraph -partitioning operation performed by $12$ MPI processes (column~$4$). The total communication volume defines the total number of the vector -elements exchanged by the computing nodes. Then, Table~\ref{tab:12} shows that the hypergraph partitioning method can split the sparse -matrix so as to minimize the data dependencies between the computing nodes and thus to reduce the total communication volume. 
- -\begin{table} -\begin{center} -\begin{tabular}{|c|c|c|c|} -\hline -\multirow{4}{*}{\bf Matrix} & {\bf Total comms.} & {\bf Total comms.} & {\bf Execution} \\ - & {\bf volume without} & {\bf volume with} & {\bf trime} \\ - & {\bf hypergraph} & {\bf hypergraph } & {\bf of the parti.} \\ - & {\bf parti.} & {\bf parti.} & {\bf in minutes}\\ \hline \hline - -2cubes\_sphere & $25,360,543$ & $240,679$ & $68.98$ \\ - -ecology2 & $26,044,002$ & $73,021$ & $4.92$ \\ - -finan512 & $26,087,431$ & $900,729$ & $33.72$ \\ - -G3\_circuit & $31,912,003$ & $5,366,774$ & $11.63$ \\ - -shallow\_water2 & $25,105,108$ & $60,899$ & $5.06$ \\ - -thermal2 & $30,012,846$ & $1,077,921$ & $17.88$ \\ \hline \hline - -cage13 & $28,254,282$ & $3,845,440$ & $196.45$ \\ - -crashbasis & $29,020,060$ & $2,401,876$ & $33.39$ \\ - -FEM\_3D\_thermal2 & $25,263,767$ & $250,105$ & $49.89$ \\ - -language & $27,291,486$ & $1,537,835$ & $9.07$ \\ - -poli\_large & $25,053,554$ & $7,388,883$ & $5.92$ \\ - -torso3 & $25,682,514$ & $613,250$ & $61.51$ \\ \hline -\end{tabular} -\caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.} -\label{tab:12} -\end{center} -\end{table} - -Nevertheless, as we can see from the fourth column of Table~\ref{tab:12}, the hypergraph partitioning takes longer compared -to the execution times of the resolutions. As previously mentioned, the hypergraph partitioning method is less efficient in -terms of memory consumption and partitioning time than its graph counterpart, but the hypergraph well models the nonsymmetric -and irregular problems. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning -on these matrices only once for each and then, we save the traces of these partitionings in files to be reused several times. -Therefore, this allows to avoid the partitioning of the sparse matrices at each resolution of the linear systems. - -\begin{figure}[!t] -\centering -\begin{tabular}{c} -\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band} \\ -\small{(a) Sparse band matrices} \\ -\\ -\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band} \\ -\small{(b) Sparse five-bands matrices} -\end{tabular} -\caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.} -\label{fig:09} -\end{figure} - -However, the most important performance parameter is the scalability of the parallel CG and GMRES solvers on a GPU cluster. -Particularly, we have taken into account the weak-scaling of both parallel algorithms on a cluster of one to 12 GPU computing -nodes. We have performed a set of experiments on both matrix structures: band matrices and five-bands matrices. The -sparse matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen from the Davis's collection. -Figures~\ref{fig:09}-$(a)$ and~\ref{fig:09}-$(b)$ show the execution times of both parallel methods for solving large linear -systems associated to band matrices and those associated to five-bands matrices, respectively. The size of a sparse sub-matrix -per computing node, for each matrix structure, is fixed as follows: -\begin{itemize} -\item band matrix: $15$ million of rows and $105,166,557$ of nonzero values, -\item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values. -\end{itemize} -We can see from these figures that both parallel solvers are quite scalable on a GPU cluster. 
Indeed, the execution times remains
-almost constant while the size of the sparse linear systems to be solved increases proportionally with the number of
-the GPU computing nodes. This means that the communication cost is relatively constant regardless of the number the computing nodes
-in the GPU cluster.
-
-
-
-%%--------------------------%%
-%%      SECTION 6           %%
-%%--------------------------%%
\section{Conclusion}
-\label{sec:06}
-In this chapter, we have aimed at harnessing the computing power of a cluster of GPUs for solving large sparse linear systems.
-For this, we have used two Krylov sub-space iterative methods: the CG and GMRES methods. The first method is well-known to its
-efficiency for solving symmetric linear systems and the second one is used, particularly, for solving nonsymmetric linear systems.
-
-We have presentend the parallel implementation of both iterative methods on a GPU cluster. Particularly, the operations dealing with
-the vectors and/or matrices, of these methods, are parallelized between the different GPU computing nodes of the cluster. Indeed,
-the data-parallel vector operations are accelerated by GPUs and the communications required to synchronize the parallel computations
-are carried out by CPU cores. For this, we have used a heterogeneous CUDA/MPI programming to implement the parallel iterative
+\label{ch12:sec:05}
+In this chapter, we have aimed at harnessing the computing power of a
+cluster of GPUs for solving large sparse linear systems. For this, we
+have used two Krylov subspace iterative methods: the CG and GMRES methods.
+The first method is well known for its efficiency in solving symmetric
+linear systems and the second one is used, particularly, to solve
+nonsymmetric linear systems.
+
+We have presented the parallel implementation of both iterative methods
+on a GPU cluster. In particular, the operations dealing with the vectors
+and/or matrices of these methods are parallelized among the different
+GPU computing nodes of the cluster. Indeed, the data-parallel vector operations
+are accelerated by GPUs, and the communications required to synchronize the
+parallel computations are carried out by CPU cores. For this, we have used
+heterogeneous CUDA/MPI programming to implement the parallel iterative
algorithms.

-In the experimental tests, we have shown that using a GPU cluster is efficient for solving linear systems associated to very large
-sparse matrices. The experimental results, obtained in the present chapter, showed that a cluster of $12$ GPUs is about $7$
-times faster than a cluster of $24$ CPU cores for solving large sparse linear systems of $25$ million unknown values. This is due
-to the GPU ability to compute the data-parallel operations faster than the CPUs. However, we have shown that solving linear systems
-associated to matrices having large bandwidths uses many communications to synchronize the computations of GPUs, which slow down
-even more the resolution. Moreover, there are two kinds of communications: between a CPU and its GPU and between CPUs of the computing
-nodes, such that the first ones are the slowest communications on a GPU cluster. So, we have proposed to use the hypergraph partitioning
-instead of the row-by-row partitioning. This allows to minimize the data dependencies between the GPU computing nodes and thus to
-reduce the total communication volume.
The experimental results showed that using the hypergraph partitioning technique improve the
-execution times on average of $76\%$ to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs.
-
-In the recent GPU hardware and software architectures, the GPU-Direct system with CUDA version 5.0 is used so that two GPUs located on
-the same node or on distant nodes can communicate between them directly without CPUs. This allows to improve the data transfers between
-GPUs.
+In the experimental tests, we have shown that using a GPU cluster is efficient
+for solving linear systems associated to very large sparse matrices. The experimental
+results, discussed in the present chapter, show that a cluster of $12$ GPUs is
+about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
+linear systems of $25$ million unknown values. This is due to the ability of the GPUs to
+compute the data-parallel operations faster than the CPUs.
+
+In our future work, we plan to test the parallel algorithms of the CG and~GMRES methods, adapted
+to GPUs, for solving large linear systems associated to sparse matrices of different structures.
+For example, matrices having large bandwidths can lead to many data dependencies
+between the computing nodes and, thus, degrade the performances of both algorithms. So in
+this case, it would be interesting to study different data partitioning techniques, in
+order to minimize the dependencies between the computing nodes and thus to reduce the total
+communication volume. This may improve the performances of both algorithms implemented on
+a GPU cluster. Moreover, in the recent GPU hardware and software architectures, the GPU-Direct
+system with CUDA version 5.0 is used so that two GPUs located on the same node or on distant
+nodes can communicate with each other directly without CPUs. This allows us to improve the data
+transfers between GPUs.
+
+
\putbib[Chapters/chapter12/biblio12]