BookGPU/Chapters/chapter12/ch12.tex

   1 1%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2 %%                          %%
   3 %%       CHAPTER 12         %%
   4 %%                          %%
   5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   6
   7 %\chapterauthor{}{}
   8 \chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France}
   9 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
  10 \chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
  11
  12 \chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters}
  13 \label{ch12}
  14
  15 %%--------------------------%%
  16 %%       SECTION 1          %%
  17 %%--------------------------%%
  18 \section{Introduction}
  19 \label{ch12:sec:01}
  20 Sparse linear systems are used to model many scientific and industrial problems,
  21 such as the environmental simulations or the industrial processing of the complex or
  22 non-Newtonian fluids. Moreover, the resolution of these problems often involves the
  23 solving of such linear systems which are considered as the most expensive process in
  24 terms of execution time and memory space. Therefore, solving sparse linear systems
  25 must be as efficient as possible in order to deal with problems of ever increasing
  26 size.
  27
  28 There are, in the jargon of numerical analysis, different methods of solving sparse
  29 linear systems that can be classified in two classes: the direct and iterative methods.
  30 However, the iterative methods are often more suitable than their counterpart, direct
  31 methods, to  solve these systems. Indeed, they are less memory consuming and easier
  32 to parallelize on parallel computers than direct methods. Different computing platforms,
  33 sequential and parallel computers, are used to solve sparse linear systems with iterative
  34 solutions. Nowadays, graphics processing units (GPUs) have become attractive to solve
  35 these systems, due to their computing power and their ability to compute faster than
  36 traditional CPUs.
  37
  38 In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
  39 methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
  40 we give the main key points of the parallel implementation of both methods on a cluster of
  41 GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
  42 CPU cluster and on a GPU cluster, to solve large sparse linear systems.
  43
  44
  45 %%--------------------------%%
  46 %%       SECTION 2          %%
  47 %%--------------------------%%
  48 \section{Krylov iterative methods}
  49 \label{ch12:sec:02}
  50 Let us consider the following system of $n$ linear equations\index{Sparse~linear~system}
  51 in $\mathbb{R}$:
  52 \begin{equation}
  53 Ax=b,
  54 \label{ch12:eq:01}
  55 \end{equation}
  56 where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$
  57 is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a
  58 large integer number.
  59
  60 The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01})
  61 proceed by successive iterations of a same block of elementary operations, during which an
  62 infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. Indeed, from an
  63 initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate
  64 solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows:
  65 \begin{equation}
  66 x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b.
  67 \label{ch12:eq:02}
  68 \end{equation}
  69 The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand
  70 and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$
  71 after a fixed number of iterations and/or when a given convergence criterion\index{Convergence}
  72 is satisfied as follows:
  73 \begin{equation}
  74 \|b-A\tilde{x}\| < \varepsilon,
  75 \label{ch12:eq:03}
  76 \end{equation}
  77 where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}.
  78
  79 Some of the most iterative methods that have proven their efficiency for solving large sparse
  80 linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}.
  81 In the present chapter, we describe two Krylov methods which are widely used: the conjugate
  82 gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
  83 Krylov subspace methods are usually used with preconditioners that allow to improve their
  84 convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned}
  85 sparse linear system:
  86 \begin{equation}
  87 M^{-1}Ax=M^{-1}b,
  88 \label{ch12:eq:11}
  89 \end{equation}
  90 where $M$ is the preconditioning matrix.
  91
  92
  93 %%****************%%
  94 %%****************%%
  95 \subsection{CG method}
  96 \label{ch12:sec:02.01}
  97 The conjugate gradient method was initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}.
  98 It is one of the well known iterative method to solve large sparse linear systems. In addition, it
  99 can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied
 100 to problems with positive definite symmetric matrices.
 101
 102 The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate
 103 solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as
 104 follows:
 105 \begin{equation}
 106 x_k \in x_0 + \mathcal{K}_k(A,r_0),
 107 \label{ch12:eq:04}
 108 \end{equation}
 109 such that the Galerkin condition\index{Galerkin~condition} must be satisfied:
 110 \begin{equation}
 111 r_k \bot \mathcal{K}_k(A,r_0),
 112 \label{ch12:eq:05}
 113 \end{equation}
 114 where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$
 115 the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
 116 In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$
 117 which are pairwise $A$-conjugate ($A$-orthogonal):
 118 \begin{equation}
 119 \begin{array}{ll}
 120 p_i^T A p_j = 0, & i\neq j.
 121 \end{array}
 122 \label{ch12:eq:06}
 123 \end{equation}
 124 At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows:
 125 \begin{equation}
 126 \begin{array}{ll}
 127 x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}.
 128 \end{array}
 129 \label{ch12:eq:07}
 130 \end{equation}
 131 Consequently, the residuals $r_k$ are computed in the same way:
 132 \begin{equation}
 133 r_k = r_{k-1} - \alpha_k A p_k.
 134 \label{ch12:eq:08}
 135 \end{equation}
 136 In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that
 137 the following recurrence holds:
 138 \begin{equation}
 139 \begin{array}{lll}
 140 p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}.
 141 \end{array}
 142 \label{ch12:eq:09}
 143 \end{equation}
 144 Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$
 145 over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure
 146 that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and
 147 the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that:
 148 \begin{equation}
 149 \begin{array}{ll}
 150 \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}.
 151 \end{array}
 152 \label{ch12:eq:10}
 153 \end{equation}
 154
 155 \begin{algorithm}[!t]
 156   Choose an initial guess $x_0$\;
 157   $r_{0} = b - A x_{0}$\;
 158   $convergence$ = false\;
 159   $k = 1$\;
 160   \Repeat{convergence}{
 161     $z_{k} = M^{-1} r_{k-1}$\;
 162     $\rho_{k} = (r_{k-1},z_{k})$\;
 163     \eIf{$k = 1$}{
 164       $p_{k} = z_{k}$\;
 165     }{
 166       $\beta_{k} = \rho_{k} / \rho_{k-1}$\;
 167       $p_{k} = z_{k} + \beta_{k} \times p_{k-1}$\;
 168     }
 169     $q_{k} = A \times p_{k}$\;
 170     $\alpha_{k} = \rho_{k} / (p_{k},q_{k})$\;
 171     $x_{k} = x_{k-1} + \alpha_{k} \times p_{k}$\;
 172     $r_{k} = r_{k-1} - \alpha_{k} \times q_{k}$\;
 173     \eIf{$(\rho_{k} < \varepsilon)$ {\bf or} $(k \geq maxiter)$}{
 174       $convergence$ = true\;
 175     }{
 176       $k = k + 1$\;
 177     }
 178   }
 179 \caption{Left-preconditioned CG method}
 180 \label{ch12:alg:01}
 181 \end{algorithm}
 182
 183 Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows
 184 to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}).
 185 In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
 186 number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$.
 187 At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned
 188 residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$ previously determined (from line~$8$ to
 189 line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using
 190 formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. The CG method converges after, at
 191 most, $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{Convergence!Tolerance~threshold}
 192 $\varepsilon$ and/or the maximum number of iterations\index{Convergence!Maximum~number~of~iterations}
 193 $maxiter$ are reached.
 194
 195
 196 %%****************%%
 197 %%****************%%
 198 \subsection{GMRES method}
 199 \label{ch12:sec:02.02}
 200 The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
 201 of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can
 202 be applied for solving symmetric or nonsymmetric linear systems.
 203
 204 The main principle of the GMRES method\index{Iterative~method!GMRES} is to find an approximation minimizing
 205 at best the residual norm. In fact, GMRES computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in
 206 a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows:
 207 \begin{equation}
 208 \begin{array}{ll}
 209 x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2},
 210 \end{array}
 211 \label{ch12:eq:12}
 212 \end{equation}
 213 so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied:
 214 \begin{equation}
 215 \begin{array}{ll}
 216 r_k \bot A \mathcal{K}_k(A, v_1).
 217 \end{array}
 218 \label{ch12:eq:13}
 219 \end{equation}
 220 GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an
 221 orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix}
 222 $\bar{H}_k$ of order $(k+1)\times k$:
 223 \begin{equation}
 224 \begin{array}{ll}
 225 V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1,
 226 \end{array}
 227 \label{ch12:eq:14}
 228 \end{equation}
 229 and
 230 \begin{equation}
 231 A V_k = V_{k+1} \bar{H}_k.
 232 \label{ch12:eq:15}
 233 \end{equation}
 234
 235 Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov subspace $\mathcal{K}_k$
 236 spanned by $V_k$ as follows:
 237 \begin{equation}
 238 \begin{array}{ll}
 239 x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}.
 240 \end{array}
 241 \label{ch12:eq:16}
 242 \end{equation}
 243 From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that:
 244 \begin{equation}
 245 \begin{array}{lll}
 246   r_{k} & = & b - A (x_{0} + V_{k}y) \\
 247         & = & r_{0} - AV_{k}y \\
 248         & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
 249         & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
 250 \end{array}
 251 \label{ch12:eq:17}
 252 \end{equation}
 253 such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of
 254 $\mathbb{R}^k$. So, the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean
 255 norm of the residual $r_k$. Consequently, a linear least-squares problem of size $k$ is solved:
 256 \begin{equation}
 257 \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
 258 \label{ch12:eq:18}
 259 \end{equation}
 260 The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using
 261 Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that:
 262 \begin{equation}
 263 \begin{array}{lll}
 264 \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
 265 \end{array}
 266 \label{ch12:eq:19}
 267 \end{equation}
 268 where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix.
 269
 270 The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$
 271 iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm
 272 must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the
 273 number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the
 274 GMRES method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and
 275 with $x_m$ as the initial guess to the next iteration. This allows to limit the size of the basis
 276 $V$ to $m$ orthogonal vectors.
 277
 278 \begin{algorithm}[!t]
 279   Choose an initial guess $x_0$\;
 280   $convergence$ = false\;
 281   $k = 1$\;
 282   $r_{0} = M^{-1}(b-Ax_{0})$\;
 283   $\beta = \|r_{0}\|_{2}$\;
 284   \While{$\neg convergence$}{
 285     $v_{1} = r_{0}/\beta$\;
 286     \For{$j=1$ \KwTo $m$}{
 287       $w_{j} = M^{-1}Av_{j}$\;
 288       \For{$i=1$ \KwTo $j$}{
 289         $h_{i,j} = (w_{j},v_{i})$\;
 290         $w_{j} = w_{j}-h_{i,j}v_{i}$\;
 291       }
 292       $h_{j+1,j} = \|w_{j}\|_{2}$\;
 293       $v_{j+1} = w_{j}/h_{j+1,j}$\;
 294     }
 295     Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ a $(m+1)\times m$ upper Hessenberg matrix\;
 296     Solve a least-squares problem of size $m$: $min_{y\in\mathrm{I\!R}^{m}}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
 297     $x_{m} = x_{0}+V_{m}y_{m}$\;
 298     $r_{m} = M^{-1}(b-Ax_{m})$\;
 299     $\beta = \|r_{m}\|_{2}$\;
 300     \eIf{ $(\beta<\varepsilon)$ {\bf or} $(k\geq maxiter)$}{
 301       $convergence$ = true\;
 302     }{
 303       $x_{0} = x_{m}$\;
 304       $r_{0} = r_{m}$\;
 305       $k = k + 1$\;
 306     }
 307   }
 308 \caption{Left-preconditioned GMRES method with restarts}
 309 \label{ch12:alg:02}
 310 \end{algorithm}
 311
 312 Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts.
 313 It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear
 314 system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration
 315 $k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from
 316 line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper
 317 Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it
 318 solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$
 319 which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate
 320 solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is
 321 stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the
 322 maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$)
 323 is reached.
 324
 325
 326 %%--------------------------%%
 327 %%       SECTION 3          %%
 328 %%--------------------------%%
 329 \section{Parallel implementation on a GPU cluster}
 330 \label{ch12:sec:03}
 331 In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG}
 332 and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on
 333 a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by a
 334 MPI process and equipped with a GPU card. The parallelization of these algorithms is carried out by
 335 using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the
 336 CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods
 337 are called iterative solvers.
 338
 339
 340 %%****************%%
 341 %%****************%%
 342 \subsection{Data partitioning}
 343 \label{ch12:sec:03.01}
 344 The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning
 345 between the computing nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the
 346 GPU cluster. The partitioning operation consists in the decomposition of the vectors and matrices, involved
 347 in the iterative solver, in $p$ portions. Indeed, this operation allows to assign to each computing node
 348 $i$:
 349 \begin{itemize}
 350 \item a portion of size $\frac{n}{p}$ elements of each vector,
 351 \item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and,
 352 \item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$,
 353 \end{itemize}
 354 where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive
 355 row-wise partitioning (row-by-row decomposition) on the data of the sparse linear systems to be solved.
 356 Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes
 357 of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$
 358 unknown values.
 359
 360 \begin{figure}
 361 \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}}
 362 \caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.}
 363 \label{ch12:fig:01}
 364 \end{figure}
 365
 366
 367 %%****************%%
 368 %%****************%%
 369 \subsection{GPU computing}
 370 \label{ch12:sec:03.02}
 371 After the partitioning operation, all the data involved from this operation must be
 372 transferred from the CPU memories to the GPU memories, in order to be processed by
 373 GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear
 374 Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+
 375 for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory
 376 copies from the CPUs to the GPUs.
 377
 378 An efficient implementation of CG and GMRES solvers on a GPU cluster requires to
 379 determine all parts of their codes that can be executed in parallel and, thus, take
 380 advantage of the GPU acceleration. As many Krylov subspace methods, the CG and GMRES
 381 methods are mainly based on arithmetic operations dealing with vectors or matrices:
 382 sparse matrix-vector multiplications, scalar-vector multiplications, dot products,
 383 Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors
 384 and $a$ is a scalar) and so on. These vector operations are often easy to parallelize
 385 and they are more efficient on parallel computers when they work on large vectors.
 386 Therefore, all the vector operations used in CG and GMRES solvers must be executed
 387 by the GPUs as kernels.
 388
 389 We use the kernels of the CUBLAS library to compute some vector operations of CG and
 390 GMRES solvers. The following kernels of CUBLAS (dealing with double floating point)
 391 are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the
 392 Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of
 393 the data-parallel operations, we code their kernels in CUDA. In the CG solver, we
 394 develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used  line~$12$ in
 395 Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector
 396 multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel to
 397 solve the least-squares problem and a kernel to update the elements  of the solution
 398 vector $x$.
 399
 400 The least-squares problem in the GMRES method is solved by performing a QR factorization
 401 on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and,
 402 then, solving the triangular system by backward substitutions to compute $y$. Consequently,
 403 solving the least-squares problem on the GPU is not interesting. Indeed, the triangular
 404 solves are not easy to parallelize and inefficient on GPUs. However, the least-squares
 405 problem to solve in the GMRES method with restarts has, generally, a very small size $m$.
 406 Therefore, we develop an inexpensive kernel which must be executed in sequential by a
 407 single CUDA thread.
 408
 409 The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES}
 410 methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication},
 411 because it is often an expensive operation in terms of execution time and memory space.
 412 Moreover, it requires to take care of the storage format of the sparse matrix in the
 413 memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
 414 can cause a significant waste of memory space and execution time. In addition, the sparse
 415 nature of the matrix often leads to irregular memory accesses to read the matrix nonzero
 416 values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced
 417 accesses to the global memory, which slows down its performances even more. One of the
 418 most efficient compressed storage formats\index{Compressed~storage~format} of sparse
 419 matrices on GPUs is the HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}.
 420 It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
 421 a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL}
 422 format and remaining entries of exceptional rows in COO format. It combines the efficiency
 423 of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO}
 424 which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
 425 developed by Nvidia to implement the SpMV multiplication of CG and GMRES methods on GPUs.
 426 Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill
 427 the elements of the iterate vector $x$ in the cached texture memory.
 428
 429
 430 %%****************%%
 431 %%****************%%
 432 \subsection{Data communications}
 433 \label{ch12:sec:03.03}
 434 All the computing nodes of the GPU cluster execute in parallel the same iterative solver
 435 (Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their
 436 own portions of the sparse linear system\index{Sparse~linear~system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$,
 437 $0\leq i<p$. However, in order to solve the complete sparse linear system~(\ref{ch12:eq:11}),
 438 synchronizations must be performed between the local computations of the computing nodes over
 439 the cluster. In what follows, two computing nodes sharing data are called neighboring nodes\index{Neighboring~node}.
 440
 441 As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication.
 442 In the parallel implementation of the iterative methods, each computing node $i$ performs the
 443 SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it has only sub-vectors
 444 of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires
 445 the vector elements of its neighbors, corresponding to the column indices on which its sub-matrix
 446 has nonzero values (see Figure~\ref{ch12:fig:01}). So, in addition to the local vectors, each
 447 node must also manage vector elements shared with neighbors and required to compute the SpMV
 448 multiplication. Therefore, the iterate vector $x$ managed by each computing node is composed
 449 of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a sub-vector of shared elements $x^{shared}$.
 450 In the same way, the vector used to construct the orthonormal basis of the Krylov subspace (vectors
 451 $p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared
 452 sub-vector.
 453
 454 Therefore, before computing the SpMV multiplication\index{SpMV~multiplication}, the neighboring
 455 nodes\index{Neighboring~node} over the GPU cluster must exchange between them the shared vector
 456 elements necessary to compute this multiplication. First, each computing node determines, in its
 457 local sub-vector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
 458 between them these shared vector elements. The data exchanges are implemented by using the MPI
 459 point-to-point communication routines: blocking\index{MPI~subroutines!Blocking} sends with \verb+MPI_Send()+
 460 and nonblocking\index{MPI~subroutines!Nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
 461 shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}
 462 and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing
 463 nodes is that presented in Figure~\ref{ch12:fig:01}.
 464
 465 \begin{figure}
 466 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}}
 467 \caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.}
 468 \label{ch12:fig:02}
 469 \end{figure}
 470
 471 After the synchronization operation, the computing nodes receive, from their respective neighbors,
 472 the shared elements in a sub-vector stored in a compressed format. However, in order to compute the
 473 SpMV multiplication, the computing nodes operate on sparse global vectors (see Figure~\ref{ch12:fig:02}).
 474 In this case, the received vector elements must be copied to the corresponding indices in the global
 475 vector. So as not to need to perform this at each iteration, we propose to reorder the columns of
 476 each sub-matrix $\{A_i\}_{0\leq i<p}$, so that the shared sub-vectors could be used in their compressed
 477 storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse sub-matrix (sub-matrix of
 478 \textit{Node 1}).
 479
 480 \begin{figure}
 481 \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/reorder}}
 482 \caption{Columns reordering of a sparse sub-matrix.}
 483 \label{ch12:fig:03}
 484 \end{figure}
 485
 486 A GPU cluster\index{GPU~cluster} is a parallel platform with a distributed memory. So, the synchronizations
 487 and communication data between GPU nodes are carried out by passing messages. However, GPUs can not communicate
 488 between them in a direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU
 489 cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory
 490 and vice-versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
 491 communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
 492 and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
 493 to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI~subroutines!Global}
 494 \verb+MPI_Allreduce()+.
 495
 496
 497
 498 %%--------------------------%%
 499 %%       SECTION 4          %%
 500 %%--------------------------%%
 501 \section{Experimental results}
 502 \label{ch12:sec:04}
 503 In this section, we present the performances of the parallel CG and GMRES linear solvers obtained
 504 on a cluster of $12$ GPUs. Indeed, this GPU cluster of tests is composed of six machines connected
 505 by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz and
 506 providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are
 507 connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A
 508 Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global memory of $4$GB with
 509 a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster}
 510 that we used in the experimental tests.
 511
 512 Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used to code
 513 the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9}
 514 is used to program GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations
 515 in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between
 516 CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
 517 is managed by a MPI process and it is composed of one CPU core and one GPU card.
 518
 519 \begin{figure}[!h]
 520 \centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
 521 \caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
 522 \label{ch12:fig:04}
 523 \end{figure}
 524
 525 All tests are made on double-precision floating point operations. The parameters of both linear
 526 solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
 527 maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
 528 initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{Iterative~method!Arnoldi~process}
 529 used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen
 530 the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily
 531 compute the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for
 532 not too ill-conditioned matrices. In the GPU computing, the size of thread blocks is fixed to $512$
 533 threads. Finally, the performance results, presented hereafter, are obtained from the mean value
 534 over $10$ executions of the same parallel linear solver and for the same input data.
 535
 536 \begin{figure}
 537 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
 538 \caption{Sketches of sparse matrices chosen from the Davis collection.}
 539 \label{ch12:fig:05}
 540 \end{figure}
 541
 542 \begin{table}
 543 \centering
 544 \begin{tabular}{|c|c|c|c|c|}
 545 \hline
 546 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# rows} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
 547
 548 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $101,492$     & $1,647,264$  & $100,464$ \\
 549
 550                               & ecology2          & $999,999$     & $4,995,991$  & $2,001$   \\
 551
 552                               & finan512          & $74,752$      & $596,992$    & $74,725$  \\
 553
 554                               & G3\_circuit       & $1,585,478$   & $7,660,826$  & $1,219,059$ \\
 555
 556                               & shallow\_water2   & $81,920$      & $327,680$    & $58,710$ \\
 557
 558                               & thermal2          & $1,228,045$   & $8,580,313$  & $1,226,629$ \\ \hline \hline
 559
 560 \multirow{6}{*}{Nonsymmetric} & cage13            & $445,315$     & $7,479,343$  & $318,788$\\
 561
 562                               & crashbasis        & $160,000$     & $1,750,416$  & $120,202$ \\
 563
 564                               & FEM\_3D\_thermal2 & $147,900$     & $3,489.300$  & $117,827$ \\
 565
 566                               & language          & $399,130$     & $1,216,334$  & $398,622$\\
 567
 568                               & poli\_large       & $15,575$      & $33,074$     & $15,575$ \\
 569
 570                               & torso3            & $259,156$     & $4,429,042$  & $216,854$  \\ \hline
 571 \end{tabular}
 572 \caption{Main characteristics of sparse matrices chosen from the Davis collection.}
 573 \label{ch12:tab:01}
 574 \end{table}
 575
 576 To get more realistic results, we have tested the CG and GMRES algorithms on sparse matrices of the Davis
 577 collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We have chosen six
 578 symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
 579 we show the structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
 580 which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
 581 the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
 582 the first and the last nonzero value on a matrix row.
 583
 584 \begin{table}
 585 \begin{center}
 586 \begin{tabular}{|c|c|c|c|c|c|c|}
 587 \hline
 588 {\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
 589
 590 2cubes\_sphere    & $0.132s$           & $0.069s$            & $1.93$        & $12$           & $1.14e$-$09$     & $3.47e$-$18$ \\
 591
 592 ecology2          & $0.026s$           & $0.017s$            & $1.52$        & $13$           & $5.06e$-$09$     & $8.33e$-$17$ \\
 593
 594 finan512          & $0.053s$           & $0.036s$            & $1.49$        & $12$           & $3.52e$-$09$     & $1.66e$-$16$ \\
 595
 596 G3\_circuit       & $0.704s$           & $0.466s$            & $1.51$        & $16$           & $4.16e$-$10$     & $4.44e$-$16$ \\
 597
 598 shallow\_water2   & $0.017s$           & $0.010s$            & $1.68$        & $5$            & $2.24e$-$14$     & $3.88e$-$26$ \\
 599
 600 thermal2          & $1.172s$           & $0.622s$            & $1.88$        & $15$           & $5.11e$-$09$     & $3.33e$-$16$ \\ \hline
 601 \end{tabular}
 602 \caption{Performances of the parallel CG method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
 603 \label{ch12:tab:02}
 604 \end{center}
 605 \end{table}
 606
 607 \begin{table}
 608 \begin{center}
 609 \begin{tabular}{|c|c|c|c|c|c|c|}
 610 \hline
 611 {\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
 612
 613 2cubes\_sphere    & $0.234s$           & $0.124s$            & $1.88$        & $21$           & $2.10e$-$14$     & $3.47e$-$18$ \\
 614
 615 ecology2          & $0.076s$           & $0.035s$            & $2.15$        & $21$           & $4.30e$-$13$     & $4.38e$-$15$ \\
 616
 617 finan512          & $0.073s$           & $0.052s$            & $1.40$        & $17$           & $3.21e$-$12$     & $5.00e$-$16$ \\
 618
 619 G3\_circuit       & $1.016s$           & $0.649s$            & $1.56$        & $22$           & $1.04e$-$12$     & $2.00e$-$15$ \\
 620
 621 shallow\_water2   & $0.061s$           & $0.044s$            & $1.38$        & $17$           & $5.42e$-$22$     & $2.71e$-$25$ \\
 622
 623 thermal2          & $1.666s$           & $0.880s$            & $1.89$        & $21$           & $6.58e$-$12$     & $2.77e$-$16$ \\ \hline \hline
 624
 625 cage13            & $0.721s$           & $0.338s$            & $2.13$        & $26$           & $3.37e$-$11$     & $2.66e$-$15$ \\
 626
 627 crashbasis        & $1.349s$           & $0.830s$            & $1.62$        & $121$          & $9.10e$-$12$     & $6.90e$-$12$ \\
 628
 629 FEM\_3D\_thermal2 & $0.797s$           & $0.419s$            & $1.90$        & $64$           & $3.87e$-$09$     & $9.09e$-$13$ \\
 630
 631 language          & $2.252s$           & $1.204s$            & $1.87$        & $90$           & $1.18e$-$10$     & $8.00e$-$11$ \\
 632
 633 poli\_large       & $0.097s$           & $0.095s$            & $1.02$        & $69$           & $4.98e$-$11$     & $1.14e$-$12$ \\
 634
 635 torso3            & $4.242s$           & $2.030s$            & $2.09$        & $175$          & $2.69e$-$10$     & $1.78e$-$14$ \\ \hline
 636 \end{tabular}
 637 \caption{Performances of the parallel GMRES method on a cluster 24 CPU cores vs. on cluster of 12 GPUs.}
 638 \label{ch12:tab:03}
 639 \end{center}
 640 \end{table}
 641
 642 Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} show the performances of the parallel
 643 CG and GMRES solvers, respectively, for solving linear systems associated to the sparse
 644 matrices presented in Tables~\ref{ch12:tab:01}. They allow to compare the performances
 645 obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02}
 646 only shows the performances of solving symmetric sparse linear systems, due to the inability
 647 of the CG method to solve the nonsymmetric systems. In both tables, the second and third
 648 columns give, respectively, the execution times in seconds obtained on $24$ CPU cores
 649 ($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account
 650 the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same
 651 solver implemented on the CPU cluster. The relative gains\index{Relative~gain}, presented
 652 in the fourth column, are computed as a ratio of the CPU execution time over the GPU
 653 execution time:
 654 \begin{equation}
 655 \tau = \frac{Time_{cpu}}{Time_{gpu}}.
 656 \label{ch12:eq:20}
 657 \end{equation}
 658 In addition, Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} give the number of iterations
 659 ($iter$), the precision $prec$ of the solution computed on the GPU cluster and the difference
 660 $\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster.
 661 Both parameters $prec$ and $\Delta$ allow to validate and verify the accuracy of the solution
 662 computed on the GPU cluster. We have computed them as follows:
 663 \begin{eqnarray}
 664 \Delta = max|x^{cpu}-x^{gpu}|,\\
 665 prec = max|M^{-1}r^{gpu}|,
 666 \end{eqnarray}
 667 where $\Delta$ is the maximum vector element, in absolute value, of the difference between
 668 the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively, on CPU and GPU clusters and
 669 $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$
 670 of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster
 671 were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
 672 to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$.
 673 However, we can notice from the relative gains $\tau$ that it is not interesting to use multiple
 674 GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow to
 675 maximize utilization of GPU cores. In addition, the communications required to synchronize the
 676 computations over the cluster increase the idle times of GPUs and slow down  the parallel
 677 computations further.
 678
 679 Consequently, in order to test the performances of the parallel solvers, we developed in C programming
 680 language a generator of large sparse matrices. This generator takes a matrix from the Davis collection~\cite{ch12:ref10}
 681 as an initial matrix to build large sparse matrices exceeding ten million of rows. It must be executed
 682 in parallel by the MPI processes of the computing nodes, so that each process could build its sparse
 683 sub-matrix. In the first experimental tests, we focused on sparse matrices having a banded structure,
 684 because they are those arising the most in the majority of numerical problems. So to generate the global sparse matrix,
 685 each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
 686 from the Davis collection. Then, it puts all these copies on the main diagonal of the global matrix
 687 (see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
 688 diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
 689 initial matrix.
 690
 691 \begin{figure}
 692 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
 693 \caption{Parallel generation of a large sparse matrix by four computing nodes.}
 694 \label{ch12:fig:06}
 695 \end{figure}
 696
 697 \begin{table}[!h]
 698 \centering
 699 \begin{tabular}{|c|c|c|c|}
 700 \hline
 701 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
 702
 703 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $413,703,602$ & $198,836$     \\
 704
 705                               & ecology2          & $124,948,019$ & $2,002$          \\
 706
 707                               & finan512          & $278,175,945$ & $123,900$        \\
 708
 709                               & G3\_circuit       & $125,262,292$ & $1,891,887$      \\
 710
 711                               & shallow\_water2   & $100,235,292$ & $62,806$      \\
 712
 713                               & thermal2          & $175,300,284$ & $2,421,285$ \\ \hline \hline
 714
 715 \multirow{6}{*}{Nonsymmetric} & cage13            & $435,770,480$ & $352,566$        \\
 716
 717                               & crashbasis        & $409,291,236$ & $200,203$        \\
 718
 719                               & FEM\_3D\_thermal2 & $595,266,787$ & $206,029$ \\
 720
 721                               & language          & $76,912,824$  & $398,626$ \\
 722
 723                               & poli\_large       & $53,322,580$  & $15,576$         \\
 724
 725                               & torso3            & $433,795,264$ & $328,757$        \\ \hline
 726 \end{tabular}
 727 \vspace{0.5cm}
 728 \caption{Main characteristics of sparse banded matrices generated from those of the Davis collection.}
 729 \label{ch12:tab:04}
 730 \end{table}
 731
 732 We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
 733 million unknown values. The sparse matrices associated to these linear systems are generated
 734 from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
 735 Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and
 736 GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
 737 GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
 738 a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
 739 notice that the execution times of the CG method, whether in a CPU cluster or in a GPU cluster,
 740 are better than those of the GMRES method for solving large symmetric linear systems. In fact, the
 741 CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
 742 time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
 743 method requires more data exchanges between computing nodes compared to the parallel CG method.
 744
 745 \begin{table}
 746 \begin{center}
 747 \begin{tabular}{|c|c|c|c|c|c|c|}
 748 \hline
 749 {\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 750
 751 2cubes\_sphere  & $1.625s$             & $0.401s$              & $4.05$          & $14$                & $5.73e$-$11$     & $5.20e$-$18$ \\
 752
 753 ecology2        & $0.856s$             & $0.103s$              & $8.27$          & $15$                & $3.75e$-$10$     & $1.11e$-$16$ \\
 754
 755 finan512        & $1.210s$             & $0.354s$              & $3.42$          & $14$                & $1.04e$-$10$     & $2.77e$-$16$ \\
 756
 757 G3\_circuit     & $1.346s$             & $0.263s$              & $5.12$          & $17$                & $1.10e$-$10$     & $5.55e$-$16$ \\
 758
 759 shallow\_water2 & $0.397s$             & $0.055s$              & $7.23$          & $7$                 & $3.43e$-$15$     & $5.17e$-$26$ \\
 760
 761 thermal2        & $1.411s$             & $0.244s$              & $5.78$          & $16$                & $1.67e$-$09$     & $3.88e$-$16$ \\ \hline
 762 \end{tabular}
 763 \caption{Performances of the parallel CG method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
 764 on a cluster of 12 GPUs.}
 765 \label{ch12:tab:05}
 766 \end{center}
 767 \end{table}
 768
 769 \begin{table}
 770 \begin{center}
 771 \begin{tabular}{|c|c|c|c|c|c|c|}
 772 \hline
 773 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 774
 775 2cubes\_sphere    & $3.597s$             & $0.514s$              & $6.99$          & $21$                & $2.11e$-$14$     & $8.67e$-$18$ \\
 776
 777 ecology2          & $2.549s$             & $0.288s$              & $8.83$          & $21$                & $4.88e$-$13$     & $2.08e$-$14$ \\
 778
 779 finan512          & $2.660s$             & $0.377s$              & $7.05$          & $17$                & $3.22e$-$12$     & $8.82e$-$14$ \\
 780
 781 G3\_circuit       & $3.139s$             & $0.480s$              & $6.53$          & $22$                & $1.04e$-$12$     & $5.00e$-$15$ \\
 782
 783 shallow\_water2   & $2.195s$             & $0.253s$              & $8.68$          & $17$                & $5.54e$-$21$     & $7.92e$-$24$ \\
 784
 785 thermal2          & $3.206s$             & $0.463s$              & $6.93$          & $21$                & $8.89e$-$12$     & $3.33e$-$16$ \\ \hline \hline
 786
 787 cage13            & $5.560s$             & $0.663s$              & $8.39$          & $26$                & $3.29e$-$11$     & $1.59e$-$14$ \\
 788
 789 crashbasis        & $25.802s$            & $3.511s$              & $7.35$          & $135$               & $6.81e$-$11$     & $4.61e$-$15$ \\
 790
 791 FEM\_3D\_thermal2 & $13.281s$            & $1.572s$              & $8.45$          & $64$                & $3.88e$-$09$     & $1.82e$-$12$ \\
 792
 793 language          & $12.553s$            & $1.760s$              & $7.13$          & $89$                & $2.11e$-$10$     & $1.60e$-$10$ \\
 794
 795 poli\_large       & $8.515s$             & $1.053s$              & $8.09$          & $69$                & $5.05e$-$11$     & $6.59e$-$12$ \\
 796
 797 torso3            & $31.463s$            & $3.681s$              & $8.55$          & $175$               & $2.69e$-$10$     & $2.66e$-$14$ \\ \hline
 798 \end{tabular}
 799 \caption{Performances of the parallel GMRES method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
 800 on a cluster of 12 GPUs.}
 801 \label{ch12:tab:06}
 802 \end{center}
 803 \end{table}
 804
 805 %%--------------------------%%
 806 %%       SECTION 5          %%
 807 %%--------------------------%%
 808 \section{Conclusion}
 809 \label{ch12:sec:05}
 810 In this chapter, we have aimed at harnessing the computing power of a
 811 cluster of GPUs for solving large sparse linear systems. For this, we
 812 have used two Krylov subspace iterative methods: the CG and GMRES methods.
 813 The first method is well-known for its efficiency to solve symmetric
 814 linear systems and the second one is used, particularly, to solve
 815 nonsymmetric linear systems.
 816
 817 We have presented the parallel implementation of both iterative methods
 818 on a GPU cluster. Particularly, the operations dealing with the vectors
 819 and/or matrices, of these methods, are parallelized between the different
 820 GPU computing nodes of the cluster. Indeed, the data-parallel vector operations
 821 are accelerated by GPUs and the communications required to synchronize the
 822 parallel computations are carried out by CPU cores. For this, we have used
 823 a heterogeneous CUDA/MPI programming to implement the parallel iterative
 824 algorithms.
 825
 826 In the experimental tests, we have shown that using a GPU cluster is efficient
 827 for solving linear systems associated to very large sparse matrices. The experimental
 828 results, obtained in the present chapter, show that a cluster of $12$ GPUs is
 829 about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
 830 linear systems of $25$ million unknown values. This is due to the GPU ability to
 831 compute the data-parallel operations faster than the CPUs.
 832
 833 In our future works, we plan to test the parallel algorithms of CG and GMRES methods, adapted
 834 to GPUs, for solving large linear systems associated to sparse matrices of different structures.
 835 For example, the matrices having large bandwidths, which can lead to many data dependencies
 836 between the computing nodes and, thus, degrade the performances of both algorithms. So in
 837 this case, it would be interesting to study the different data partitioning techniques, in
 838 order to minimize the dependencies between the computing nodes and thus to reduce the total
 839 communication volume. This may improve the performances of both algorithms implemented on
 840 a GPU cluster. Moreover, in the recent GPU hardware and software architectures, the GPU-Direct
 841 system with CUDA version 5.0 is used so that two GPUs located on the same node or on distant
 842 nodes can communicate between them directly without CPUs. This allows to improve the data
 843 transfers between GPUs.
 844
 845
 846
 847 \putbib[Chapters/chapter12/biblio12]
 848