BookGPU/Chapters/chapter13/ch12.tex~

   1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2 %%                          %%
   3 %%       CHAPTER 12         %%
   4 %%                          %%
   5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   6
   7 \chapterauthor{}{}
   8 \chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters}
   9
  10 %%--------------------------%%
  11 %%       SECTION 1          %%
  12 %%--------------------------%%
  13 \section{Introduction}
  14 \label{sec:01}
  15 The sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or
  16 the industrial processing of the complex or non-Newtonian fluids. Moreover, the resolution of these problems often involves the
  17 solving of such linear systems which is considered as the most expensive process in terms of time execution and memory space.
  18 Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing size.
  19
  20 There are, in the jargon of numerical analysis, different methods of solving sparse linear systems that we can classify in two
  21 classes: the direct and iterative methods. However, the iterative methods are often more suitable than their counterpart, direct
  22 methods, for solving large sparse linear systems. Indeed, they are less memory consuming and easier to parallelize on parallel
  23 computers than direct methods. Different computing platforms, sequential and parallel computers, are used for solving sparse
  24 linear systems with iterative solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving these
  25 linear systems, due to their computing power and their ability to compute faster than traditional CPUs.
  26
  27 In Section~\ref{sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and
  28 the generalized minimal residual method. In Section~\ref{sec:03}, we give the main key points of the parallel implementation of both
  29 methods on a cluster of GPUs. Then, in Section~\ref{sec:04}, we present the experimental results obtained on a CPU cluster and on
  30 a GPU cluster, for solving sparse linear systems associated to matrices of different structures. Finally, in Section~\ref{sec:05},
  31 we apply the hypergraph partitioning technique to reduce the total communication volume between the computing nodes and, thus, to
  32 improve the execution times of the parallel algorithms of both iterative methods.
  33
  34
  35 %%--------------------------%%
  36 %%       SECTION 2          %%
  37 %%--------------------------%%
  38 \section{Krylov iterative methods}
  39 \label{sec:02}
  40 Let us consider the following system of $n$ linear equations in $\mathbb{R}$:
  41 \begin{equation}
  42 Ax=b,
  43 \label{eq:01}
  44 \end{equation}
  45 where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ is the solution vector,
  46 $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a large integer number.
  47
  48 The iterative methods for solving the large sparse linear system~(\ref{eq:01}) proceed by successive iterations of a same
  49 block of elementary operations, during which an infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed.
  50 Indeed, from an initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate solution $x_k$
  51 which, gradually, converges to the exact solution $x^{*}$ as follows:
  52 \begin{equation}
  53 x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b.
  54 \label{eq:02}
  55 \end{equation}
  56 The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand and can be infinite. In
  57 practice, an iterative method often finds an approximate solution $\tilde{x}$ after a fixed number of iterations and/or
  58 when a given convergence criterion is satisfied as follows:
  59 \begin{equation}
  60 \|b-A\tilde{x}\| < \varepsilon,
  61 \label{eq:03}
  62 \end{equation}
  63 where $\varepsilon<1$ is the required convergence tolerance threshold.
  64
  65 Some of the most iterative methods that have proven their efficiency for solving large sparse linear systems are those
  66 called \textit{Krylov sub-space methods}~\cite{ref1}. In the present chapter, we describe two Krylov methods which are
  67 widely used: the conjugate gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
  68 Krylov sub-space methods are usually used with preconditioners that allow to improve their convergence. So, in what
  69 follows, the CG and GMRES methods are used for solving the left-preconditioned sparse linear system:
  70 \begin{equation}
  71 M^{-1}Ax=M^{-1}b,
  72 \label{eq:11}
  73 \end{equation}
  74 where $M$ is the preconditioning matrix.
  75
  76 %%****************%%
  77 %%****************%%
  78 \subsection{CG method}
  79 \label{sec:02.01}
  80 The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ref2}. It is one of the well
  81 known iterative method for solving large sparse linear systems. In addition, it can be adapted for solving nonlinear
  82 equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices.
  83
  84 The main idea of the CG method is the computation of a sequence of approximate solutions $\{x_k\}_{k\geq 0}$ in a Krylov
  85 sub-space of order $k$ as follows:
  86 \begin{equation}
  87 x_k \in x_0 + \mathcal{K}_k(A,r_0),
  88 \label{eq:04}
  89 \end{equation}
  90 such that the Galerkin condition must be satisfied:
  91 \begin{equation}
  92 r_k \bot \mathcal{K}_k(A,r_0),
  93 \label{eq:05}
  94 \end{equation}
  95 where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$ the Krylov
  96 sub-space of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
  97 In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$
  98 which are pairwise $A$-conjugate ($A$-orthogonal):
  99 \begin{equation}
 100 \begin{array}{ll}
 101 p_i^T A p_j = 0, & i\neq j.
 102 \end{array}
 103 \label{eq:06}
 104 \end{equation}
 105 At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows:
 106 \begin{equation}
 107 \begin{array}{ll}
 108 x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}.
 109 \end{array}
 110 \label{eq:07}
 111 \end{equation}
 112 Consequently, the residuals $r_k$ are computed in the same way:
 113 \begin{equation}
 114 r_k = r_{k-1} - \alpha_k A p_k.
 115 \label{eq:08}
 116 \end{equation}
 117 In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that the following recurrence
 118 holds:
 119 \begin{equation}
 120 \begin{array}{lll}
 121 p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}.
 122 \end{array}
 123 \label{eq:09}
 124 \end{equation}
 125 Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ over the Krylov
 126 sub-space $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure that the direction vectors are
 127 pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and the recurrences~(\ref{eq:08}) and~(\ref{eq:09})
 128 allow to deduce that:
 129 \begin{equation}
 130 \begin{array}{ll}
 131 \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}.
 132 \end{array}
 133 \label{eq:10}
 134 \end{equation}
 135
 136 Algorithm~\ref{alg:01} shows the main key points of the preconditioned CG method. It allows to solve the left-preconditioned
 137 sparse linear system~(\ref{eq:11}). In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
 138 number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. At every iteration, a direction
 139 vector $p_k$ is determined, so that it is orthogonal to the preconditioned residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$
 140 previously determined (from line~$8$ to line~$13$). Then, at lines~$16$ and~$17$ , the iterate $x_k$ and the residual $r_k$ are computed
 141 using formulas~(\ref{eq:07}) and~(\ref{eq:08}), respectively. The CG method converges after, at most, $n$ iterations. In practice, the CG
 142 algorithm stops when the tolerance threshold $\varepsilon$ and/or the maximum number of iterations $maxiter$ are reached.
 143
 144 \begin{algorithm}
 145   \SetLine
 146   \linesnumbered
 147   Choose an initial guess $x_0$\;
 148   $r_{0} = b - A x_{0}$\;
 149   $convergence$ = false\;
 150   $k = 1$\;
 151   \Repeat{convergence}{
 152     $z_{k} = M^{-1} r_{k-1}$\;
 153     $\rho_{k} = (r_{k-1},z_{k})$\;
 154     \eIf{$k = 1$}{
 155       $p_{k} = z_{k}$\;
 156     }{
 157       $\beta_{k} = \rho_{k} / \rho_{k-1}$\;
 158       $p_{k} = z_{k} + \beta_{k} \times p_{k-1}$\;
 159     }
 160     $q_{k} = A \times p_{k}$\;
 161     $\alpha_{k} = \rho_{k} / (p_{k},q_{k})$\;
 162     $x_{k} = x_{k-1} + \alpha_{k} \times p_{k}$\;
 163     $r_{k} = r_{k-1} - \alpha_{k} \times q_{k}$\;
 164     \eIf{$(\rho_{k} < \varepsilon)$ {\bf or} $(k \geq maxiter)$}{
 165       $convergence$ = true\;
 166     }{
 167       $k = k + 1$\;
 168     }
 169   }
 170 \caption{Left-preconditioned CG method}
 171 \label{alg:01}
 172 \end{algorithm}
 173
 174 %%****************%%
 175 %%****************%%
 176 \subsection{GMRES method}
 177 \label{sec:02.02}
 178 The iterative method GMRES is developed by Saad and Schultz in 1986~\cite{ref3} as a generalization of the minimum residual method
 179 MINRES~\cite{ref4}. Indeed, GMRES can be applied for solving symmetric or asymmetric linear systems.
 180
 181 The main principle of the GMRES method is to find an approximation minimizing at best the residual norm. In fact, GMRES
 182 computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in a Krylov sub-space $\mathcal{K}_k$ as follows:
 183 \begin{equation}
 184 \begin{array}{ll}
 185 x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2},
 186 \end{array}
 187 \label{eq:12}
 188 \end{equation}
 189 so that the Petrov-Galerkin condition is satisfied:
 190 \begin{equation}
 191 \begin{array}{ll}
 192 r_k \bot A \mathcal{K}_k(A, v_1).
 193 \end{array}
 194 \label{eq:13}
 195 \end{equation}
 196 GMRES uses the Arnoldi process~\cite{ref5} to construct an orthonormal basis $V_k$ for the Krylov sub-space $\mathcal{K}_k$
 197 and an upper Hessenberg matrix $\bar{H}_k$ of order $(k+1)\times k$:
 198 \begin{equation}
 199 \begin{array}{ll}
 200 V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1,
 201 \end{array}
 202 \label{eq:14}
 203 \end{equation}
 204 and
 205 \begin{equation}
 206 V_k A = V_{k+1} \bar{H}_k.
 207 \label{eq:15}
 208 \end{equation}
 209
 210 Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_k$ spanned by $V_k$
 211 as follows:
 212 \begin{equation}
 213 \begin{array}{ll}
 214 x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}.
 215 \end{array}
 216 \label{eq:16}
 217 \end{equation}
 218 From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can deduce that:
 219 \begin{equation}
 220 \begin{array}{lll}
 221   r_{k} & = & b - A (x_{0} + V_{k}y) \\
 222         & = & r_{0} - AV_{k}y \\
 223         & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
 224         & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
 225 \end{array}
 226 \label{eq:17}
 227 \end{equation}
 228 such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of $\mathbb{R}^k$. So,
 229 the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean norm of the residual $r_k$. Consequently,
 230 a linear least-squares problem of size $k$ is solved:
 231 \begin{equation}
 232 \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
 233 \label{eq:18}
 234 \end{equation}
 235 The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using Givens rotations~\cite{ref1,ref3},
 236 such that:
 237 \begin{equation}
 238 \begin{array}{lll}
 239 \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
 240 \end{array}
 241 \label{eq:19}
 242 \end{equation}
 243 where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix.
 244
 245 The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$ iterations ($n$ is the size of the
 246 sparse linear system to be solved). However, the GMRES algorithm must construct and store in the memory an orthonormal basis $V_k$ whose
 247 size is proportional to the number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the GMRES
 248 method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and with $x_m$ as the initial guess to the
 249 next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal vectors.
 250
 251 Algorithm~\ref{alg:02} shows the main key points of the GMRES method with restarts. It solves the left-preconditioned sparse linear
 252 system~(\ref{eq:11}), such that $M$ is the preconditioning matrix. At each iteration $k$, GMRES uses the Arnoldi process (defined
 253 from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper Hessenberg matrix $\bar{H}_m$ of size
 254 $(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ which minimizes
 255 at best the residual norm (line~$18$). Finally, it computes an approximate solution $x_m$ in the Krylov sub-space spanned by $V_m$
 256 (line~$19$). The GMRES algorithm is stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the maximum
 257 number of iterations ($maxiter$) is reached.
 258
 259 \begin{algorithm}
 260   \SetLine
 261   \linesnumbered
 262   Choose an initial guess $x_0$\;
 263   $convergence$ = false\;
 264   $k = 1$\;
 265   $r_{0} = M^{-1}(b-Ax_{0})$\;
 266   $\beta = \|r_{0}\|_{2}$\;
 267   \While{$\neg convergence$}{
 268     $v_{1} = r_{0}/\beta$\;
 269     \For{$j=1$ \KwTo $m$}{
 270       $w_{j} = M^{-1}Av_{j}$\;
 271       \For{$i=1$ \KwTo $j$}{
 272         $h_{i,j} = (w_{j},v_{i})$\;
 273         $w_{j} = w_{j}-h_{i,j}v_{i}$\;
 274       }
 275       $h_{j+1,j} = \|w_{j}\|_{2}$\;
 276       $v_{j+1} = w_{j}/h_{j+1,j}$\;
 277     }
 278     Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ a $(m+1)\times m$ upper Hessenberg matrix\;
 279     Solve a least-squares problem of size $m$: $min_{y\in\mathrm{I\!R}^{m}}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
 280     $x_{m} = x_{0}+V_{m}y_{m}$\;
 281     $r_{m} = M^{-1}(b-Ax_{m})$\;
 282     $\beta = \|r_{m}\|_{2}$\;
 283     \eIf{ $(\beta<\varepsilon)$ {\bf or} $(k\geq maxiter)$}{
 284       $convergence$ = true\;
 285     }{
 286       $x_{0} = x_{m}$\;
 287       $r_{0} = r_{m}$\;
 288       $k = k + 1$\;
 289     }
 290   }
 291 \caption{Left-preconditioned GMRES method with restarts}
 292 \label{alg:02}
 293 \end{algorithm}
 294
 295
 296 %%--------------------------%%
 297 %%       SECTION 3          %%
 298 %%--------------------------%%
 299 \section{Parallel implementation on a GPU cluster}
 300 \label{sec:03}
 301 In this section, we present the parallel algorithms of both iterative CG and GMRES methods for GPU clusters.
 302 The implementation is performed on a GPU cluster composed of different computing nodes, such that each node
 303 is a CPU core managed by a MPI process and equipped with a GPU card. The parallelization of these algorithms
 304 is carried out by using the MPI communication routines between the GPU computing nodes and the CUDA programming
 305 environment inside each node. In what follows, the algorithms of the iterative methods are called iterative
 306 solvers.
 307
 308 %%****************%%
 309 %%****************%%
 310 \subsection{Data partitioning}
 311 \label{sec:03.01}
 312 The parallel solving of the large sparse linear system~(\ref{eq:11}) requires a data partitioning between the computing
 313 nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the GPU cluster. The partitioning operation
 314 consists in the decomposition of the vectors and matrices, involved in the iterative solver, in $p$ portions. Indeed, this
 315 operation allows to assign to each computing node $i$:
 316 \begin{itemize*}
 317 \item a portion of size $\frac{n}{p}$ elements of each vector,
 318 \item a sparse rectangular sub-matrix $A_i$ of size $(n,\frac{n}{p})$ and,
 319 \item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$,
 320 \end{itemize*}
 321 where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive row-wise partitioning
 322 (decomposition row-by-row) on the data of the sparse linear systems to be solved. Figure~\ref{fig:01} shows an example of a row-wise
 323 data partitioning between four computing nodes of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand
 324 side $b$) of size $16$ unknown values.
 325
 326 \begin{figure}
 327 \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}}
 328 \caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.}
 329 \label{fig:01}
 330 \end{figure}
 331
 332 %%****************%%
 333 %%****************%%
 334 \subsection{GPU computing}
 335 \label{sec:03.02}
 336 After the partitioning operation, all the data involved from this operation must be transferred from the CPU memories to the GPU
 337 memories, in order to be processed by GPUs. We use two functions of the CUBLAS library (CUDA Basic Linear Algebra Subroutines),
 338 developed by Nvidia~\cite{ref6}: \verb+cublasAlloc()+ for the memory allocations on GPUs and \verb+cublasSetVector()+ for the
 339 memory copies from the CPUs to the GPUs.
 340
 341 An efficient implementation of CG and GMRES solvers on a GPU cluster requires to determine all parts of their codes that can be
 342 executed in parallel and, thus, take advantage of the GPU acceleration. As many Krylov sub-space methods, the CG and GMRES methods
 343 are mainly based on arithmetic operations dealing with vectors or matrices: sparse matrix-vector multiplications, scalar-vector
 344 multiplications, dot products, Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a
 345 scalar) and so on. These vector operations are often easy to parallelize and they are more efficient on parallel computers when
 346 they work on large vectors. Therefore, all the vector operations used in CG and GMRES solvers must be executed by the GPUs as kernels.
 347
 348 We use the kernels of the CUBLAS library to compute some vector operations of CG and GMRES solvers. The following kernels of CUBLAS
 349 (dealing with double floating point) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the Euclidean
 350 norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of the data-parallel operations, we code their kernels in CUDA.
 351 In the CG solver, we develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in Algorithm~\ref{alg:01}. In the
 352 GMRES solver, we program a kernel for the scalar-vector multiplication (lines~$7$ and~$15$ in Algorithm~\ref{alg:02}), a kernel for
 353 solving the least-squares problem and a kernel for the elements updates of the solution vector $x$.
 354
 355 The least-squares problem in the GMRES method is solved by performing a QR factorization on the Hessenberg matrix $\bar{H}_m$ with
 356 plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, solving the least-squares
 357 problem on the GPU is not interesting. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However,
 358 the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. Therefore, we develop
 359 an inexpensive kernel which must be executed in sequential by a single CUDA thread.
 360
 361 The most important operation in CG and GMRES methods is the sparse matrix-vector multiplication (SpMV), because it is often an
 362 expensive operation in terms of execution time and memory space. Moreover, it requires to take care of the storage format of the
 363 sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix can cause a significant
 364 waste of memory space and execution time. In addition, the sparsity nature of the matrix often leads to irregular memory accesses
 365 to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced accesses to
 366 the global memory, which slows down even more its performances. One of the most efficient compressed storage formats of sparse
 367 matrices on GPUs is HYB format~\cite{ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
 368 a typical number of nonzero values per row in ELL format and remaining entries of exceptional rows in COO format. It combines
 369 the efficiency of ELL due to the regularity of its memory accesses and the flexibility of COO which is insensitive to the matrix
 370 structure. Consequently, we use the HYB kernel~\cite{ref8} developed by Nvidia to implement the SpMV multiplication of CG and
 371 GMRES methods on GPUs. Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill the elements of
 372 the iterate vector $x$ in the cached texture memory.
 373
 374 %%****************%%
 375 %%****************%%
 376 \subsection{Data communications}
 377 \label{sec:03.03}
 378 All the computing nodes of the GPU cluster execute in parallel the same iterative solver (Algorithm~\ref{alg:01} or Algorithm~\ref{alg:02})
 379 adapted to GPUs, but on their own portions of the sparse linear system: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, $0\leq i<p$. However in order to solve
 380 the complete sparse linear system~(\ref{eq:11}), synchronizations must be performed between the local computations of the computing nodes
 381 over the cluster. In what follows, two computing nodes sharing data are called neighboring nodes.
 382
 383 As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication. In the parallel implementation of
 384 the iterative methods, each computing node $i$ performs the SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it
 385 has only sub-vectors of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires the vector elements
 386 of its neighbors, corresponding to the column indices on which its sub-matrix has nonzero values (see Figure~\ref{fig:01}). So, in addition
 387 to the local vectors, each node must also manage vector elements shared with neighbors and required to compute the SpMV multiplication.
 388 Therefore, the iterate vector $x$ managed by each computing node is composed of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a
 389 sub-vector of shared elements $x^{shared}$. In the same way, the vector used to construct the orthonormal basis of the Krylov sub-space
 390 (vectors $p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared sub-vector.
 391
 392 Therefore, before computing the SpMV multiplication, the neighboring nodes over the GPU cluster must exchange between them the shared
 393 vector elements necessary to compute this multiplication. First, each computing node determines, in its local sub-vector, the vector
 394 elements needed by other nodes. Then, the neighboring nodes exchange between them these shared vector elements. The data exchanges
 395 are implemented by using the MPI point-to-point communication routines: blocking sends with \verb+MPI_Send()+ and nonblocking receives
 396 with \verb+MPI_Irecv()+. Figure~\ref{fig:02} shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0},
 397 \textit{Node 2} and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing nodes is that presented
 398 in Figure~\ref{fig:01}.
 399
 400 \begin{figure}
 401 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}}
 402 \caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.}
 403 \label{fig:02}
 404 \end{figure}
 405
 406 After the synchronization operation, the computing nodes receive, from their respective neighbors, the shared elements in a sub-vector
 407 stored in a compressed format. However, in order to compute the SpMV multiplication, the computing nodes operate on sparse global vectors
 408 (see Figure~\ref{fig:02}). In this case, the received vector elements must be copied to the corresponding indices in the global vector.
 409 So as not to need to perform this at each iteration, we propose to reorder the columns of each sub-matrix $\{A_i\}_{0\leq i<p}$, so that
 410 the shared sub-vectors could be used in their compressed storage formats. Figure~\ref{fig:03} shows a reordering of a sparse sub-matrix
 411 (sub-matrix of \textit{Node 1}). Furthermore, we use the texture memory to cache the global vector. This allows to avoid the non coalesced
 412 accesses to the GPU global memory.
 413
 414 \begin{figure}
 415 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/reorder}}
 416 \caption{Columns reordering of a sparse sub-matrix.}
 417 \label{fig:03}
 418 \end{figure}
 419
 420 A GPU cluster is a parallel platform with a distributed memory. So, the synchronizations and communication data between GPU nodes are
 421 carried out by passing messages. However, GPUs can not communicate between them in direct way. Then, CPUs via MPI processes are in charge
 422 of the synchronizations within the GPU cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory
 423 to the CPU memory and vice-versa before and after the synchronization operation between CPUs. We have used the CBLAS communication subroutines
 424 to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+ and \verb+cublasSetVector()+. Finally, in addition
 425 to the data exchanges, GPU nodes perform reduction operations to compute in parallel the dot products and Euclidean norms. This is implemented
 426 by using the MPI global communication \verb+MPI_Allreduce()+.
 427
 428
 429 %%--------------------------%%
 430 %%       SECTION 4          %%
 431 %%--------------------------%%
 432 \section{Experimental results}
 433 \label{sec:04}
 434 In this section, we present the performances of the parallel CG and GMRES linear solvers obtained on a cluster of $12$ GPUs. Indeed, this GPU
 435 cluster of tests is composed of six machines connected by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at
 436 $2.4$GHz and providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are connected to each machine via
 437 a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a
 438 global memory of $4$GB with a memory bandwidth of $102$GB/s. Figure~\ref{fig:04} shows the general scheme of the GPU cluster that we used in
 439 the experimental tests.
 440
 441 \begin{figure}
 442 \centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
 443 \caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
 444 \label{fig:04}
 445 \end{figure}
 446
 447 Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of both methods on the
 448 GPU cluster. CUDA version 4.0~\cite{ref9} is used for programming GPUs, using CUBLAS library~\cite{ref6} to deal with vector operations in GPUs
 449 and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between CPU cores. Indeed, the experiments are done on a
 450 cluster of $12$ computing nodes, where each node is managed by a MPI process and it is composed of one CPU core and one GPU card.
 451
 452 All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual
 453 tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
 454 initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process used in the GMRES method to $16$ iterations ($m=16$). For
 455 the sake of simplicity, we have chosen the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily compute
 456 the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for not too ill-conditioned matrices. In the GPU computing,
 457 the size of thread blocks is fixed to $512$ threads. Finally, the performance results, presented hereafter, are obtained from the mean value over
 458 $10$ executions of the same parallel linear solver and for the same input data.
 459
 460 To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's collection~\cite{ref10}, that arise in a wide
 461 spectrum of real-world applications. We chose six symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{fig:05},
 462 we show structures of these matrices and in Table~\ref{tab:01} we present their main characteristics which are the number of rows, the total number
 463 of nonzero values (nnz) and the maximal bandwidth. In the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns
 464 separating the first and the last nonzero value on a matrix row.
 465
 466 \begin{figure}
 467 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
 468 \caption{Sketches of sparse matrices chosen from the Davis's collection.}
 469 \label{fig:05}
 470 \end{figure}
 471
 472 \begin{table}[!h]
 473 \centering
 474 \begin{tabular}{|c|c|c|c|c|}
 475 \hline
 476 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# rows} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
 477
 478 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $101,492$     & $1,647,264$  & $100,464$ \\
 479
 480                               & ecology2          & $999,999$     & $4,995,991$  & $2,001$   \\
 481
 482                               & finan512          & $74,752$      & $596,992$    & $74,725$  \\
 483
 484                               & G3\_circuit       & $1,585,478$   & $7,660,826$  & $1,219,059$ \\
 485
 486                               & shallow\_water2   & $81,920$      & $327,680$    & $58,710$ \\
 487
 488                               & thermal2          & $1,228,045$   & $8,580,313$  & $1,226,629$ \\ \hline \hline
 489
 490 \multirow{6}{*}{Nonsymmetric} & cage13            & $445,315$     & $7,479,343$  & $318,788$\\
 491
 492                               & crashbasis        & $160,000$     & $1,750,416$  & $120,202$ \\
 493
 494                               & FEM\_3D\_thermal2 & $147,900$     & $3,489.300$  & $117,827$ \\
 495
 496                               & language          & $399,130$     & $1,216,334$  & $398,622$\\
 497
 498                               & poli\_large       & $15,575$      & $33,074$     & $15,575$ \\
 499
 500                               & torso3            & $259,156$     & $4,429,042$  & $216,854$  \\ \hline
 501 \end{tabular}
 502 \vspace{0.5cm}
 503 \caption{Main characteristics of sparse matrices chosen from the Davis's collection.}
 504 \label{tab:01}
 505 \end{table}
 506
 507 \begin{table}
 508 \begin{center}
 509 \begin{tabular}{|c|c|c|c|c|c|c|}
 510 \hline
 511 {\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
 512
 513 2cubes\_sphere    & $0.132s$           & $0.069s$            & $1.93$        & $12$           & $1.14e$-$09$     & $3.47e$-$18$ \\
 514
 515 ecology2          & $0.026s$           & $0.017s$            & $1.52$        & $13$           & $5.06e$-$09$     & $8.33e$-$17$ \\
 516
 517 finan512          & $0.053s$           & $0.036s$            & $1.49$        & $12$           & $3.52e$-$09$     & $1.66e$-$16$ \\
 518
 519 G3\_circuit       & $0.704s$           & $0.466s$            & $1.51$        & $16$           & $4.16e$-$10$     & $4.44e$-$16$ \\
 520
 521 shallow\_water2   & $0.017s$           & $0.010s$            & $1.68$        & $5$            & $2.24e$-$14$     & $3.88e$-$26$ \\
 522
 523 thermal2          & $1.172s$           & $0.622s$            & $1.88$        & $15$           & $5.11e$-$09$     & $3.33e$-$16$ \\ \hline
 524 \end{tabular}
 525 \caption{Performances of the parallel CG method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
 526 \label{tab:02}
 527 \end{center}
 528 \end{table}
 529
 530 \begin{table}[!h]
 531 \begin{center}
 532 \begin{tabular}{|c|c|c|c|c|c|c|}
 533 \hline
 534 {\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
 535
 536 2cubes\_sphere    & $0.234s$           & $0.124s$            & $1.88$        & $21$           & $2.10e$-$14$     & $3.47e$-$18$ \\
 537
 538 ecology2          & $0.076s$           & $0.035s$            & $2.15$        & $21$           & $4.30e$-$13$     & $4.38e$-$15$ \\
 539
 540 finan512          & $0.073s$           & $0.052s$            & $1.40$        & $17$           & $3.21e$-$12$     & $5.00e$-$16$ \\
 541
 542 G3\_circuit       & $1.016s$           & $0.649s$            & $1.56$        & $22$           & $1.04e$-$12$     & $2.00e$-$15$ \\
 543
 544 shallow\_water2   & $0.061s$           & $0.044s$            & $1.38$        & $17$           & $5.42e$-$22$     & $2.71e$-$25$ \\
 545
 546 thermal2          & $1.666s$           & $0.880s$            & $1.89$        & $21$           & $6.58e$-$12$     & $2.77e$-$16$ \\ \hline \hline
 547
 548 cage13            & $0.721s$           & $0.338s$            & $2.13$        & $26$           & $3.37e$-$11$     & $2.66e$-$15$ \\
 549
 550 crashbasis        & $1.349s$           & $0.830s$            & $1.62$        & $121$          & $9.10e$-$12$     & $6.90e$-$12$ \\
 551
 552 FEM\_3D\_thermal2 & $0.797s$           & $0.419s$            & $1.90$        & $64$           & $3.87e$-$09$     & $9.09e$-$13$ \\
 553
 554 language          & $2.252s$           & $1.204s$            & $1.87$        & $90$           & $1.18e$-$10$     & $8.00e$-$11$ \\
 555
 556 poli\_large       & $0.097s$           & $0.095s$            & $1.02$        & $69$           & $4.98e$-$11$     & $1.14e$-$12$ \\
 557
 558 torso3            & $4.242s$           & $2.030s$            & $2.09$        & $175$          & $2.69e$-$10$     & $1.78e$-$14$ \\ \hline
 559 \end{tabular}
 560 \caption{Performances of the parallel GMRES method on a cluster 24 CPU cores vs. on cluster of 12 GPUs.}
 561 \label{tab:03}
 562 \end{center}
 563 \end{table}
 564
 565 Tables~\ref{tab:02} and~\ref{tab:03} shows the performances of the parallel CG and GMRES solvers, respectively, for solving linear systems associated to
 566 the sparse matrices presented in Tables~\ref{tab:01}. They allow to compare the performances obtained on a cluster of $24$ CPU cores and on a cluster
 567 of $12$ GPUs. However, Table~\ref{tab:02} shows only the solving performances of symmetric sparse linear systems, due to the inability of the CG method
 568 to solve the nonsymmetric systems. In both tables, the second and third columns give, respectively, the execution times in seconds obtained on $24$ CPU
 569 cores($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account the relative gains $\tau$ of a solver implemented on the
 570 GPU cluster compared to the same solver implemented on the CPU cluster. The relative gains, presented in the fourth column, are computed as a ratio of
 571 the CPU execution time over the GPU execution time:
 572 \begin{equation}
 573 \tau = \frac{Time_{cpu}}{Time_{gpu}}.
 574 \label{eq:20}
 575 \end{equation}
 576 In addition, Tables~\ref{tab:02} and~\ref{tab:03} give the number of iterations ($iter$), the precision $prec$ of the solution computed on the GPU cluster
 577 and the difference $\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster. Both parameters $prec$ and $\Delta$
 578 allow to validate and verify the accuracy of the solution computed on the GPU cluster. We have computed them as follows:
 579 \begin{eqnarray}
 580 \Delta = max|x^{cpu}-x^{gpu}|,\\
 581 prec = max|M^{-1}r^{gpu}|,
 582 \end{eqnarray}
 583 where $\Delta$ is the maximum vector element, in absolute value, of the difference between the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively,
 584 on CPU and GPU cluster and $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$ of the solution $x^{gpu}$.
 585 Thus, we can see that the solutions obtained on the GPU cluster were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less,
 586 equivalent to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ and $10^{-26}$. However, we can notice from the relative
 587 gains $\tau$ that is not interesting to use multiple GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to maximize
 588 utilization of GPU cores. In addition, the communications required to synchronize the computations over the cluster increase the idle times of GPUs and
 589 slow down further the parallel computations.
 590
 591 Consequently, in order to test the performances of the parallel solvers, we developed in C programming language a generator of large sparse matrices.
 592 This generator takes a matrix from the Davis's collection~\cite{ref10} as an initial matrix to construct large sparse matrices exceeding ten million
 593 of rows. It must be executed in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse sub-matrix. In
 594 first experimental tests, we are focused on sparse matrices having a banded structure, because they are those arise in the most of numerical problems.
 595 So to generate the global sparse matrix, each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
 596 from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix (see Figure~\ref{fig:06}). Moreover, the empty
 597 spaces between two successive copies in the main diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{fig:06}) of the same
 598 initial matrix.
 599
 600 \begin{figure}
 601 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
 602 \caption{Parallel generation of a large sparse matrix by four computing nodes.}
 603 \label{fig:06}
 604 \end{figure}
 605
 606 We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$ million of unknown values. The sparse matrices associated
 607 to these linear systems are generated from those presented in Table~\ref{tab:01}. Their main characteristics are given in Table~\ref{tab:04}. Tables~\ref{tab:05}
 608 and~\ref{tab:06} shows the performances of the parallel CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster
 609 of $12$ GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on a GPU cluster is more efficient than on a CPU
 610 cluster (see relative gains $\tau$). We can also notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster, are
 611 better than those the GMRES method for solving large symmetric linear systems. In fact, the CG method is characterized by a better convergence rate
 612 and a shorter execution time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES method requires more data
 613 exchanges between computing nodes compared to the parallel CG method.
 614
 615 \begin{table}
 616 \centering
 617 \begin{tabular}{|c|c|c|c|}
 618 \hline
 619 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
 620
 621 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $413,703,602$ & $198,836$     \\
 622
 623                               & ecology2          & $124,948,019$ & $2,002$          \\
 624
 625                               & finan512          & $278,175,945$ & $123,900$        \\
 626
 627                               & G3\_circuit       & $125,262,292$ & $1,891,887$      \\
 628
 629                               & shallow\_water2   & $100,235,292$ & $62,806$      \\
 630
 631                               & thermal2          & $175,300,284$ & $2,421,285$ \\ \hline \hline
 632
 633 \multirow{6}{*}{Nonsymmetric} & cage13            & $435,770,480$ & $352,566$        \\
 634
 635                               & crashbasis        & $409,291,236$ & $200,203$        \\
 636
 637                               & FEM\_3D\_thermal2 & $595,266,787$ & $206,029$ \\
 638
 639                               & language          & $76,912,824$  & $398,626$ \\
 640
 641                               & poli\_large       & $53,322,580$  & $15,576$         \\
 642
 643                               & torso3            & $433,795,264$ & $328,757$        \\ \hline
 644 \end{tabular}
 645 \vspace{0.5cm}
 646 \caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
 647 \label{tab:04}
 648 \end{table}
 649
 650 \begin{table}
 651 \begin{center}
 652 \begin{tabular}{|c|c|c|c|c|c|c|}
 653 \hline
 654 {\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 655
 656 2cubes\_sphere  & $1.625s$             & $0.401s$              & $4.05$          & $14$                & $5.73e$-$11$     & $5.20e$-$18$ \\
 657
 658 ecology2        & $0.856s$             & $0.103s$              & $8.27$          & $15$                & $3.75e$-$10$     & $1.11e$-$16$ \\
 659
 660 finan512        & $1.210s$             & $0.354s$              & $3.42$          & $14$                & $1.04e$-$10$     & $2.77e$-$16$ \\
 661
 662 G3\_circuit     & $1.346s$             & $0.263s$              & $5.12$          & $17$                & $1.10e$-$10$     & $5.55e$-$16$ \\
 663
 664 shallow\_water2 & $0.397s$             & $0.055s$              & $7.23$          & $7$                 & $3.43e$-$15$     & $5.17e$-$26$ \\
 665
 666 thermal2        & $1.411s$             & $0.244s$              & $5.78$          & $16$                & $1.67e$-$09$     & $3.88e$-$16$ \\ \hline
 667 \end{tabular}
 668 \caption{Performances of the parallel CG method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
 669 on a cluster of 12 GPUs.}
 670 \label{tab:05}
 671 \end{center}
 672 \end{table}
 673
 674 \begin{table}
 675 \begin{center}
 676 \begin{tabular}{|c|c|c|c|c|c|c|}
 677 \hline
 678 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 679
 680 2cubes\_sphere    & $3.597s$             & $0.514s$              & $6.99$          & $21$                & $2.11e$-$14$     & $8.67e$-$18$ \\
 681
 682 ecology2          & $2.549s$             & $0.288s$              & $8.83$          & $21$                & $4.88e$-$13$     & $2.08e$-$14$ \\
 683
 684 finan512          & $2.660s$             & $0.377s$              & $7.05$          & $17$                & $3.22e$-$12$     & $8.82e$-$14$ \\
 685
 686 G3\_circuit       & $3.139s$             & $0.480s$              & $6.53$          & $22$                & $1.04e$-$12$     & $5.00e$-$15$ \\
 687
 688 shallow\_water2   & $2.195s$             & $0.253s$              & $8.68$          & $17$                & $5.54e$-$21$     & $7.92e$-$24$ \\
 689
 690 thermal2          & $3.206s$             & $0.463s$              & $6.93$          & $21$                & $8.89e$-$12$     & $3.33e$-$16$ \\ \hline \hline
 691
 692 cage13            & $5.560s$             & $0.663s$              & $8.39$          & $26$                & $3.29e$-$11$     & $1.59e$-$14$ \\
 693
 694 crashbasis        & $25.802s$            & $3.511s$              & $7.35$          & $135$               & $6.81e$-$11$     & $4.61e$-$15$ \\
 695
 696 FEM\_3D\_thermal2 & $13.281s$            & $1.572s$              & $8.45$          & $64$                & $3.88e$-$09$     & $1.82e$-$12$ \\
 697
 698 language          & $12.553s$            & $1.760s$              & $7.13$          & $89$                & $2.11e$-$10$     & $1.60e$-$10$ \\
 699
 700 poli\_large       & $8.515s$             & $1.053s$              & $8.09$          & $69$                & $5.05e$-$11$     & $6.59e$-$12$ \\
 701
 702 torso3            & $31.463s$            & $3.681s$              & $8.55$          & $175$               & $2.69e$-$10$     & $2.66e$-$14$ \\ \hline
 703 \end{tabular}
 704 \caption{Performances of the parallel GMRES method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
 705 on a cluster of 12 GPUs.}
 706 \label{tab:06}
 707 \end{center}
 708 \end{table}
 709
 710
 711 %%--------------------------%%
 712 %%       SECTION 5          %%
 713 %%--------------------------%%
 714 \section{Hypergraph partitioning}
 715 \label{sec:05}
 716 In this section, we present the performances of both parallel CG and GMRES solvers for solving linear systems associated to sparse matrices having
 717 large bandwidths. Indeed, we are interested on sparse matrices having the nonzero values distributed along their bandwidths.
 718
 719 We have developed in C programming language a generator of large sparse matrices having five bands distributed along their bandwidths (see Figure~\ref{fig:07}).
 720 The principle of this generator is equivalent to that in Section~\ref{sec:04}. However, the copies performed on the initial matrix (chosen from the
 721 Davis's collection) are placed on the main diagonal and on four off-diagonals, two on the right and two on the left of the main diagonal. Figure~\ref{fig:07}
 722 shows an example of a generation of a sparse five-bands matrix by four computing nodes. Table~\ref{tab:07} shows the main characteristics of sparse
 723 five-bands matrices generated from those presented in Table~\ref{tab:01} and associated to linear systems of $25$ million of unknown values.
 724
 725 \begin{figure}[!h]
 726 \centerline{\includegraphics[scale=0.23]{Chapters/chapter12/figures/generation_1}}
 727 \caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
 728 \label{fig:07}
 729 \end{figure}
 730
 731 \begin{table}
 732 \begin{center}
 733 \begin{tabular}{|c|c|c|c|}
 734 \hline
 735 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# nnz}  & {\bf Bandwidth} \\ \hline \hline
 736
 737 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $829,082,728$ & $24,999,999$     \\
 738
 739                               & ecology2          & $254,892,056$ & $25,000,000$     \\
 740
 741                               & finan512          & $556,982,339$ & $24,999,973$     \\
 742
 743                               & G3\_circuit       & $257,982,646$ & $25,000,000$     \\
 744
 745                               & shallow\_water2   & $200,798,268$ & $25,000,000$     \\
 746
 747                               & thermal2          & $359,340,179$ & $24,999,998$     \\ \hline \hline
 748
 749 \multirow{6}{*}{Nonsymmetric} & cage13            & $879,063,379$ & $24,999,998$     \\
 750
 751                               & crashbasis        & $820,373,286$ & $24,999,803$     \\
 752
 753                               & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$     \\
 754
 755                               & language          & $155,261,826$ & $24,999,492$     \\
 756
 757                               & poli\_large       & $106,680,819$ & $25,000,000$    \\
 758
 759                               & torso3            & $872,029,998$ & $25,000,000$\\ \hline
 760 \end{tabular}
 761 \caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
 762 \label{tab:07}
 763 \end{center}
 764 \end{table}
 765
 766
 767 \begin{table}
 768 \begin{center}
 769 \begin{tabular}{|c|c|c|c|c|c|c|}
 770 \hline
 771 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 772
 773 2cubes\_sphere    & $6.041s$     & $3.338s$      & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\
 774
 775 ecology2          & $1.404s$     & $1.301s$      & $1.08$ & $13$     & $5.22e$-$11$ & $2.17e$-$18$ \\
 776
 777 finan512          & $1.822s$     & $1.299s$      & $1.40$ & $12$     & $3.52e$-$11$ & $3.47e$-$18$ \\
 778
 779 G3\_circuit       & $2.331s$     & $2.129s$      & $1.09$ & $15$     & $1.36e$-$11$ & $5.20e$-$18$ \\
 780
 781 shallow\_water2   & $0.541s$     & $0.504s$      & $1.07$ & $6$      & $2.12e$-$16$ & $5.05e$-$28$ \\
 782
 783 thermal2          & $2.549s$     & $1.705s$      & $1.49$ & $14$     & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline
 784 \end{tabular}
 785 \caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
 786 on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
 787 \label{tab:08}
 788 \end{center}
 789 \end{table}
 790
 791 \begin{table}
 792 \begin{center}
 793 \begin{tabular}{|c|c|c|c|c|c|c|}
 794 \hline
 795 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 796
 797 2cubes\_sphere    & $15.963s$    & $7.250s$      & $2.20$  & $58$     & $6.23e$-$16$ & $3.25e$-$19$ \\
 798
 799 ecology2          & $3.549s$     & $2.176s$      & $1.63$  & $21$     & $4.78e$-$15$ & $1.06e$-$15$ \\
 800
 801 finan512          & $3.862s$     & $1.934s$      & $1.99$  & $17$     & $3.21e$-$14$ & $8.43e$-$17$ \\
 802
 803 G3\_circuit       & $4.636s$     & $2.811s$      & $1.65$  & $22$     & $1.08e$-$14$ & $1.77e$-$16$ \\
 804
 805 shallow\_water2   & $2.738s$     & $1.539s$      & $1.78$  & $17$     & $5.54e$-$23$ & $3.82e$-$26$ \\
 806
 807 thermal2          & $5.017s$     & $2.587s$      & $1.94$  & $21$     & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline
 808
 809 cage13            & $9.315s$     & $3.227s$      & $2.89$  & $26$     & $3.38e$-$13$ & $2.08e$-$16$ \\
 810
 811 crashbasis        & $35.980s$    & $14.770s$     & $2.43$  & $127$    & $1.17e$-$12$ & $1.56e$-$17$ \\
 812
 813 FEM\_3D\_thermal2 & $24.611s$    & $7.749s$      & $3.17$  & $64$     & $3.87e$-$11$ & $2.84e$-$14$ \\
 814
 815 language          & $16.859s$    & $9.697s$      & $1.74$  & $89$     & $2.17e$-$12$ & $1.70e$-$12$ \\
 816
 817 poli\_large       & $10.200s$    & $6.534s$      & $1.56$  & $69$     & $5.14e$-$13$ & $1.63e$-$13$ \\
 818
 819 torso3            & $49.074s$    & $19.397s$     & $2.53$  & $175$    & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline
 820 \end{tabular}
 821 \caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
 822 on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
 823 \label{tab:09}
 824 \end{center}
 825 \end{table}
 826
 827 Tables~\ref{tab:08} and~\ref{tab:09} shows the performaces of the parallel CG and GMRES solvers, respectively, obtained on
 828 a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
 829 sparse five-bands matrices presented on Table~\ref{tab:07}. We can notice from both Tables~\ref{tab:08} and~\ref{tab:09} that
 830 using a GPU cluster is not efficient for solving these kind of sparse linear systems. We can see that the execution times obtained
 831 on the GPU cluster are almost equivalent to those obtained on the CPU cluster (see the relative gains presented in column~$4$
 832 of each table). This is due to the large number of communications necessary to synchronize the computations over the cluster.
 833 Indeed, the naive partitioning, row-by-row or column-by-column, of sparse matrices having large bandwidths can link a computing
 834 node to many neighbors and then generate a large number of data dependencies between these computing nodes in the cluster.
 835
 836 Therefore, we have chosen to use a hypergraph partitioning method, which is well-suited to numerous kinds of sparse matrices~\cite{ref11}.
 837 Indeed, it can well model the communications between the computing nodes, particularly in the case of nonsymmetric and irregular
 838 matrices, and it gives good reduction of the total communication volume. In contrast, it is an expensive operation in terms of
 839 execution time and memory space.
 840
 841 The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ as
 842 follows:
 843 \begin{itemize*}
 844 \item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
 845 \item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where:
 846 \begin{equation}
 847 \forall a_{ij} \neq 0 \mbox{~is a nonzero value of matrix~} A \mbox{~:~} v_i \in pins[e_j],
 848 \end{equation}
 849 \item $w_i$ is the weight of vertex $v_i$ and,
 850 \item $c_j$ is the cost of hyperedge $e_j$.
 851 \end{itemize*}
 852 A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$
 853 a set of pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each subset is attributed to a computing node.
 854 Figure~\ref{fig:08} shows an example of the hypergraph model of a  $(9\times 9)$ sparse matrix in three parts. The circles and squares correspond,
 855 respectively, to the vertices and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two different parts.
 856 The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different parts spanned by $e_j$.
 857
 858 \begin{figure}
 859 \centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
 860 \caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
 861 \label{fig:08}
 862 \end{figure}
 863
 864 The cut hyperedges model the total communication volume between the different computing nodes in the cluster, necessary to perform the parallel SpMV
 865 multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$, $0\leq i,j<n$, of the SpMV multiplication
 866 $Ax=b$ that need the $j^{th}$ unknown value of solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix rows sharing
 867 and requiring the same unknown value $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the
 868 dependency of matrix rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$,
 869 $b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$. However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
 870 So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform two communications.
 871
 872 The hypergraph partitioning allows to reduce the total communication volume required to perform the parallel SpMV multiplication, while maintaining the
 873 load balancing between the computing nodes. In fact, it allows to minimize at best the following amount:
 874 \begin{equation}
 875 \mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
 876 \end{equation}
 877 where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning $\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively,
 878 the cost and the connectivity of cut hyperedge $e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows:
 879 \begin{equation}
 880   W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
 881 \end{equation}
 882 where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the average weight of all $K$ parts and $\epsilon$ is the
 883 maximum allowed imbalanced ratio.
 884
 885 The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed, for example: hMETIS~\cite{ref12}, PaToH~\cite{ref13}
 886 and Zoltan~\cite{ref14}. Since our objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must be performed by
 887 at least two MPI processes. It allows us to accelerate the data partitioning of large sparse matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned
 888 in $p$ (number of MPI processes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the parallel hypergraph
 889 partitioning method using some functions of the MPI library between the $p$ processes.
 890
 891 Tables~\ref{tab:10} and~\ref{tab:11} shows the performances of the parallel CG and GMRES solvers, respectively, using the hypergraph partitioning for solving
 892 large linear systems associated to the sparse five-bands matrices presented in Table~\ref{tab:07}. For these experimental tests, we have applied the parallel
 893 hypergraph partitioning~\cite{ref15} developed in Zoltan tool~\ref{ref14}. We have initialized the parameters of the partitioning as follows:
 894 \begin{itemize*}
 895 \item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
 896 \item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
 897 \item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\
 898 \end{itemize*}
 899
 900 \begin{table}[!h]
 901 \begin{center}
 902 \begin{tabular}{|c|c|c|c|c|}
 903 \hline
 904 {\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
 905
 906 2cubes\_sphere  & $5.935s$             & $1.213s$              & $4.89$          & $63.66\%$ \\
 907
 908 ecology2        & $1.093s$             & $0.136s$              & $8.00$          & $89.55\%$ \\
 909
 910 finan512        & $1.762s$             & $0.475s$              & $3.71$          & $63.43\%$ \\
 911
 912 G3\_circuit     & $2.095s$             & $0.558s$              & $3.76$          & $73.79\%$ \\
 913
 914 shallow\_water2 & $0.498s$             & $0.068s$              & $7.31$          & $86.51\%$ \\
 915
 916 thermal2        & $1.889s$             & $0.348s$              & $5.43$          & $79.59\%$ \\ \hline
 917 \end{tabular}
 918 \caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
 919 sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
 920 \label{tab:10}
 921 \end{center}
 922 \end{table}
 923
 924 \begin{table}[!h]
 925 \begin{center}
 926 \begin{tabular}{|c|c|c|c|c|}
 927 \hline
 928 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
 929
 930 2cubes\_sphere    & $16.430s$            & $2.840s$              & $5.78$          & $60.83\%$ \\
 931
 932 ecology2          & $3.152s$             & $0.367s$              & $8.59$          & $83.13\%$ \\
 933
 934 finan512          & $3.672s$             & $0.723s$              & $5.08$          & $62.62\%$ \\
 935
 936 G3\_circuit       & $4.468s$             & $0.971s$              & $4.60$          & $65.46\%$ \\
 937
 938 shallow\_water2   & $2.647s$             & $0.312s$              & $8.48$          & $79.73\%$ \\
 939
 940 thermal2          & $4.190s$             & $0.666s$              & $6.29$          & $74.25\%$ \\ \hline \hline
 941
 942 cage13            & $8.077s$             & $1.584s$              & $5.10$          & $50.91\%$ \\
 943
 944 crashbasis        & $35.173s$            & $5.546s$              & $6.34$          & $62.43\%$ \\
 945
 946 FEM\_3D\_thermal2 & $24.825s$            & $3.113s$              & $7.97$          & $59.83\%$ \\
 947
 948 language          & $16.706s$            & $2.522s$              & $6.62$          & $73.99\%$ \\
 949
 950 poli\_large       & $12.715s$            & $3.989s$              & $3.19$          & $38.95\%$ \\
 951
 952 torso3            & $48.459s$            & $6.234s$              & $7.77$          & $67.86\%$ \\ \hline
 953 \end{tabular}
 954 \caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
 955 sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
 956 \label{tab:11}
 957 \end{center}
 958 \end{table}
 959
 960 We can notice from both Tables~\ref{tab:10} and~\ref{tab:11} that the hypergraph partitioning has improved the performances of both
 961 parallel CG and GMRES algorithms for solving large linear systems associated to matrices having large bandwidths. The execution times
 962 on the GPU cluster of both parallel solvers are significantly improved compared to those obtained by using the partitioning row-by-row.
 963 For these examples of sparse matrices, the execution times of CG and GMRES solvers are reduced on average about $76\%$ and $65\%$ respectively
 964 (see column~$5$ of each table) compared to those obtained in Tables~\ref{tab:08} and~\ref{tab:09}.
 965
 966 In fact, the hypergraph partitioning applied to sparse matrices having large bandwidths allows to reduce the total communication volume
 967 necessary to synchronize the computations between the computing nodes in the GPU cluster. Table~\ref{tab:12} presents, for each sparse
 968 matrix, the total communication volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row (column~$2$), the
 969 total communication volume obtained by using the hypergraph partitioning (column~$3$) and the execution times in minutes of the hypergraph
 970 partitioning operation performed by $12$ MPI processes (column~$4$). The total communication volume defines the total number of the vector
 971 elements exchanged by the computing nodes. Then, Table~\ref{tab:12} shows that the hypergraph partitioning method can split the sparse
 972 matrix so as to minimize the data dependencies between the computing nodes and thus to reduce the total communication volume.
 973
 974
 975 \begin{table}[!h]
 976 \begin{center}
 977 \begin{tabular}{|c|c|c|c|}
 978 \hline
 979 \multirow{4}{*}{\bf Matrix}  & {\bf Total comms.}      & {\bf Total comms.}      & {\bf Execution} \\
 980                              & {\bf volume without}    & {\bf volume with}       & {\bf trime}  \\
 981                              & {\bf hypergraph}        & {\bf hypergraph }       & {\bf of the parti.}  \\
 982                              & {\bf parti.}            & {\bf parti.}            & {\bf in minutes}\\ \hline \hline
 983
 984 2cubes\_sphere               & $25,360,543$            & $240,679$               & $68.98$         \\
 985
 986 ecology2                     & $26,044,002$            & $73,021$                & $4.92$          \\
 987
 988 finan512                     & $26,087,431$            & $900,729$               & $33.72$         \\
 989
 990 G3\_circuit                  & $31,912,003$            & $5,366,774$             & $11.63$         \\
 991
 992 shallow\_water2              & $25,105,108$            & $60,899$                & $5.06$          \\
 993
 994 thermal2                     & $30,012,846$            & $1,077,921$             & $17.88$         \\ \hline \hline
 995
 996 cage13                       & $28,254,282$            & $3,845,440$             & $196.45$        \\
 997
 998 crashbasis                   & $29,020,060$            & $2,401,876$             & $33.39$         \\
 999
1000 FEM\_3D\_thermal2            & $25,263,767$            & $250,105$               & $49.89$         \\
1001
1002 language                     & $27,291,486$            & $1,537,835$             & $9.07$          \\
1003
1004 poli\_large                  & $25,053,554$            & $7,388,883$             & $5.92$          \\
1005
1006 torso3                       & $25,682,514$            & $613,250$               & $61.51$         \\ \hline
1007 \end{tabular}
1008 \caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
1009 \label{tab:12}
1010 \end{center}
1011 \end{table}
1012
1013 Nevertheless, as we can see from the fourth column of Table~\ref{tab:12}, the hypergraph partitioning takes longer compared
1014 to the execution times of the resolutions. As previously mentioned, the hypergraph partitioning method is less efficient in
1015 terms of memory consumption and partitioning time than its graph counterpart, but the hypergraph well models the nonsymmetric
1016 and irregular problems. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning
1017 on these matrices only once for each and then, we save the traces of these partitionings in files to be reused several times.
1018 Therefore, this allows us to avoid the partitioning of the sparse matrices at each resolution of the linear systems.
1019
1020 However, the most important performance parameter is the scalability of the parallel CG and GMRES solvers on a GPU cluster.
1021 Particularly, we have taken into account the weak-scaling of both parallel algorithms on a cluster of one to 12 GPU computing
1022 nodes. We have performed a set of experimental tests on both matrix structures: band matrices and five-bands matrices. The
1023 sparse matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen from the Davis's collection.
1024 Figures~\ref{fig:09}-$(a)$ and~\ref{fig:09}-$(b)$ show the execution times of both parallel methods for solving large linear
1025 systems associated to band matrices and those associated to five-bands matrices, respectively. The size of a sparse sub-matrix
1026 per computing node, for each matrix structure, is fixed as follows:
1027 \begin{itemize*}
1028 \item band matrix: $15$ million of rows and $105,166,557$ of nonzero values,
1029 \item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values.
1030 \end{itemize*}
1031 We can see from these figures that both parallel solvers are quite scalable on a GPU cluster. Indeed, the execution times remains
1032 almost constant while the size of the size of the sparse linear systems to be solved increases proportionally with the number of
1033 the GPU computing nodes. This means that the communication cost is relatively constant regardless of the number the computing nodes
1034 in the GPU cluster.
1035
1036 \begin{figure}
1037 \centering
1038 \begin{tabular}{c}
1039 \includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band} \\
1040 \small{(a) Sparse band matrices} \\
1041 \\
1042 \includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band} \\
1043 \small{(b) Sparse five-bands matrices}
1044 \end{tabular}
1045 \caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
1046 \label{fig:09}
1047 \end{figure}
1048
1049 %%--------------------------%%
1050 %%       SECTION 6          %%
1051 %%--------------------------%%
1052 \section{Conclusion}
1053 \label{sec:06}
1054 In this chapter, we have aimed at harnessing the computing power of a cluster of GPUs for solving large sparse linear systems.
1055 For this, we have used two Krylov sub-space iterative methods: the CG and GMRES methods. The first method is well-known to its
1056 efficiency for solving symmetric linear systems and the second one is used, particularly, for solving nonsymmetric linear systems.
1057
1058 We have presentend the parallel implementation of both iterative methods on a GPU cluster. Particularly, the operations dealing with
1059 the vectors and/or matrices, of these methods, are parallelized between the different GPU computing nodes of the cluster. Indeed,
1060 the data-parallel vector operations are accelerated by GPUs and the communications required to synchronize the parallel computations
1061 are carried out by CPU cores. For this, we have used a heterogeneous CUDA/MPI programming to implement the parallel iterative
1062 algorithms.
1063
1064 In the experimental tests, we have shown that using a GPU cluster is efficient for solving linear systems associated to very large
1065 sparse matrices. The experimental results, obtained in the present chapter, showed that a cluster of $12$ GPUs is about $7$
1066 times faster than a cluster of $24$ CPU cores for solving large sparse linear systems of $25$ million unknown values. This is due
1067 to the GPU ability to compute the data-parallel operations faster than the CPUs. However, we have shown that solving linear systems
1068 associated to matrices having large bandwidths uses many communications to synchronize the computations of GPUs, which slow down
1069 even more the resolution. Moreover, there two kinds of communications: between a CPU and its GPU and between CPUs of the computing
1070 nodes, such that the first ones are the slowest communications on a GPU cluster. So, we have proposed to use the hypergraph partitioning
1071 instead of the row-by-row partitioning. This allows to minimize the data dependencies between the GPU computing nodes and thus to
1072 reduce the total communication volume. The experimental results showed that using the hypergraph partitioning technique improve the
1073 execution times on average of $76\%$ to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs.
1074
1075 In the recent GPU hardware and software architectures, the GPU-Direct system with CUDA version 5.0 is used so that two GPUs located on
1076 the same node or on distant nodes can communicate between them directly without CPUs. This allows to improve the data transfers between
1077 GPUs.
1078
1079 \putbib[Chapters/chapter12/biblio12]
1080