BookGPU/Chapters/chapter12/ch12.tex~

   1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2 %%                          %%
   3 %%       CHAPTER 12         %%
   4 %%                          %%
   5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   6
   7 %\chapterauthor{}{}
   8 \chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France}
   9 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
  10 \chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
  11
  12 \chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters}
  13 \label{ch12}
  14
  15 %%--------------------------%%
  16 %%       SECTION 1          %%
  17 %%--------------------------%%
  18 \section{Introduction}
  19 \label{ch12:sec:01}
  20 The sparse linear systems are used to model many scientific and industrial problems,
  21 such as the environmental simulations or the industrial processing of the complex or
  22 non-Newtonian fluids. Moreover, the resolution of these problems often involves the
  23 solving of such linear systems which is considered as the most expensive process in
  24 terms of execution time and memory space. Therefore, solving sparse linear systems
  25 must be as efficient as possible in order to deal with problems of ever increasing
  26 size.
  27
  28 There are, in the jargon of numerical analysis, different methods of solving sparse
  29 linear systems that can be classified in two classes: the direct and iterative methods.
  30 However, the iterative methods are often more suitable than their counterpart, direct
  31 methods, for solving these systems. Indeed, they are less memory consuming and easier
  32 to parallelize on parallel computers than direct methods. Different computing platforms,
  33 sequential and parallel computers, are used for solving sparse linear systems with iterative
  34 solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving
  35 these systems, due to their computing power and their ability to compute faster than
  36 traditional CPUs.
  37
  38 In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
  39 methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
  40 we give the main key points of the parallel implementation of both methods on a cluster of
  41 GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
  42 CPU cluster and on a GPU cluster, for solving sparse linear systems associated to matrices
  43 of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning
  44 technique to reduce the total communication volume between the computing nodes and, thus,
  45 to improve the execution times of the parallel algorithms of both iterative methods.
  46
  47
  48 %%--------------------------%%
  49 %%       SECTION 2          %%
  50 %%--------------------------%%
  51 \section{Krylov iterative methods}
  52 \label{ch12:sec:02}
  53 Let us consider the following system of $n$ linear equations\index{Sparse~linear~system}
  54 in $\mathbb{R}$:
  55 \begin{equation}
  56 Ax=b,
  57 \label{ch12:eq:01}
  58 \end{equation}
  59 where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$
  60 is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a
  61 large integer number.
  62
  63 The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01})
  64 proceed by successive iterations of a same block of elementary operations, during which an
  65 infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. Indeed, from an
  66 initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate
  67 solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows:
  68 \begin{equation}
  69 x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b.
  70 \label{ch12:eq:02}
  71 \end{equation}
  72 The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand
  73 and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$
  74 after a fixed number of iterations and/or when a given convergence criterion\index{Convergence}
  75 is satisfied as follows:
  76 \begin{equation}
  77 \|b-A\tilde{x}\| < \varepsilon,
  78 \label{ch12:eq:03}
  79 \end{equation}
  80 where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}.
  81
  82 Some of the most iterative methods that have proven their efficiency for solving large sparse
  83 linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}.
  84 In the present chapter, we describe two Krylov methods which are widely used: the conjugate
  85 gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
  86 Krylov subspace methods are usually used with preconditioners that allow to improve their
  87 convergence. So, in what follows, the CG and GMRES methods are used for solving the left-preconditioned\index{Sparse~linear~system!Preconditioned}
  88 sparse linear system:
  89 \begin{equation}
  90 M^{-1}Ax=M^{-1}b,
  91 \label{ch12:eq:11}
  92 \end{equation}
  93 where $M$ is the preconditioning matrix.
  94
  95
  96 %%****************%%
  97 %%****************%%
  98 \subsection{CG method}
  99 \label{ch12:sec:02.01}
 100 The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}.
 101 It is one of the well known iterative method for solving large sparse linear systems. In addition, it
 102 can be adapted for solving nonlinear equations and optimization problems. However, it can only be applied
 103 to problems with positive definite symmetric matrices.
 104
 105 The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate
 106 solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as
 107 follows:
 108 \begin{equation}
 109 x_k \in x_0 + \mathcal{K}_k(A,r_0),
 110 \label{ch12:eq:04}
 111 \end{equation}
 112 such that the Galerkin condition\index{Galerkin~condition} must be satisfied:
 113 \begin{equation}
 114 r_k \bot \mathcal{K}_k(A,r_0),
 115 \label{ch12:eq:05}
 116 \end{equation}
 117 where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$
 118 the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
 119 In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$
 120 which are pairwise $A$-conjugate ($A$-orthogonal):
 121 \begin{equation}
 122 \begin{array}{ll}
 123 p_i^T A p_j = 0, & i\neq j.
 124 \end{array}
 125 \label{ch12:eq:06}
 126 \end{equation}
 127 At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows:
 128 \begin{equation}
 129 \begin{array}{ll}
 130 x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}.
 131 \end{array}
 132 \label{ch12:eq:07}
 133 \end{equation}
 134 Consequently, the residuals $r_k$ are computed in the same way:
 135 \begin{equation}
 136 r_k = r_{k-1} - \alpha_k A p_k.
 137 \label{ch12:eq:08}
 138 \end{equation}
 139 In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that
 140 the following recurrence holds:
 141 \begin{equation}
 142 \begin{array}{lll}
 143 p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}.
 144 \end{array}
 145 \label{ch12:eq:09}
 146 \end{equation}
 147 Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$
 148 over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure
 149 that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and
 150 the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that:
 151 \begin{equation}
 152 \begin{array}{ll}
 153 \alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}.
 154 \end{array}
 155 \label{ch12:eq:10}
 156 \end{equation}
 157
 158 \begin{algorithm}[!t]
 159   Choose an initial guess $x_0$\;
 160   $r_{0} = b - A x_{0}$\;
 161   $convergence$ = false\;
 162   $k = 1$\;
 163   \Repeat{convergence}{
 164     $z_{k} = M^{-1} r_{k-1}$\;
 165     $\rho_{k} = (r_{k-1},z_{k})$\;
 166     \eIf{$k = 1$}{
 167       $p_{k} = z_{k}$\;
 168     }{
 169       $\beta_{k} = \rho_{k} / \rho_{k-1}$\;
 170       $p_{k} = z_{k} + \beta_{k} \times p_{k-1}$\;
 171     }
 172     $q_{k} = A \times p_{k}$\;
 173     $\alpha_{k} = \rho_{k} / (p_{k},q_{k})$\;
 174     $x_{k} = x_{k-1} + \alpha_{k} \times p_{k}$\;
 175     $r_{k} = r_{k-1} - \alpha_{k} \times q_{k}$\;
 176     \eIf{$(\rho_{k} < \varepsilon)$ {\bf or} $(k \geq maxiter)$}{
 177       $convergence$ = true\;
 178     }{
 179       $k = k + 1$\;
 180     }
 181   }
 182 \caption{Left-preconditioned CG method}
 183 \label{ch12:alg:01}
 184 \end{algorithm}
 185
 186 Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows
 187 to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}).
 188 In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
 189 number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$.
 190 At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned
 191 residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$ previously determined (from line~$8$ to
 192 line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using
 193 formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. The CG method converges after, at
 194 most, $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{Convergence!Tolerance~threshold}
 195 $\varepsilon$ and/or the maximum number of iterations\index{Convergence!Maximum~number~of~iterations}
 196 $maxiter$ are reached.
 197
 198
 199 %%****************%%
 200 %%****************%%
 201 \subsection{GMRES method}
 202 \label{ch12:sec:02.02}
 203 The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
 204 of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can
 205 be applied for solving symmetric or nonsymmetric linear systems.
 206
 207 The main principle of the GMRES method\index{Iterative~method!GMRES} is to find an approximation minimizing
 208 at best the residual norm. In fact, GMRES computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in
 209 a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows:
 210 \begin{equation}
 211 \begin{array}{ll}
 212 x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2},
 213 \end{array}
 214 \label{ch12:eq:12}
 215 \end{equation}
 216 so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied:
 217 \begin{equation}
 218 \begin{array}{ll}
 219 r_k \bot A \mathcal{K}_k(A, v_1).
 220 \end{array}
 221 \label{ch12:eq:13}
 222 \end{equation}
 223 GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an
 224 orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix}
 225 $\bar{H}_k$ of order $(k+1)\times k$:
 226 \begin{equation}
 227 \begin{array}{ll}
 228 V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1,
 229 \end{array}
 230 \label{ch12:eq:14}
 231 \end{equation}
 232 and
 233 \begin{equation}
 234 V_k A = V_{k+1} \bar{H}_k.
 235 \label{ch12:eq:15}
 236 \end{equation}
 237
 238 Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov subspace $\mathcal{K}_k$
 239 spanned by $V_k$ as follows:
 240 \begin{equation}
 241 \begin{array}{ll}
 242 x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}.
 243 \end{array}
 244 \label{ch12:eq:16}
 245 \end{equation}
 246 From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that:
 247 \begin{equation}
 248 \begin{array}{lll}
 249   r_{k} & = & b - A (x_{0} + V_{k}y) \\
 250         & = & r_{0} - AV_{k}y \\
 251         & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
 252         & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
 253 \end{array}
 254 \label{ch12:eq:17}
 255 \end{equation}
 256 such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of
 257 $\mathbb{R}^k$. So, the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean
 258 norm of the residual $r_k$. Consequently, a linear least-squares problem of size $k$ is solved:
 259 \begin{equation}
 260 \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
 261 \label{ch12:eq:18}
 262 \end{equation}
 263 The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using
 264 Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that:
 265 \begin{equation}
 266 \begin{array}{lll}
 267 \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
 268 \end{array}
 269 \label{ch12:eq:19}
 270 \end{equation}
 271 where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix.
 272
 273 The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$
 274 iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm
 275 must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the
 276 number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the
 277 GMRES method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and
 278 with $x_m$ as the initial guess to the next iteration. This allows to limit the size of the basis
 279 $V$ to $m$ orthogonal vectors.
 280
 281 \begin{algorithm}[!t]
 282   Choose an initial guess $x_0$\;
 283   $convergence$ = false\;
 284   $k = 1$\;
 285   $r_{0} = M^{-1}(b-Ax_{0})$\;
 286   $\beta = \|r_{0}\|_{2}$\;
 287   \While{$\neg convergence$}{
 288     $v_{1} = r_{0}/\beta$\;
 289     \For{$j=1$ \KwTo $m$}{
 290       $w_{j} = M^{-1}Av_{j}$\;
 291       \For{$i=1$ \KwTo $j$}{
 292         $h_{i,j} = (w_{j},v_{i})$\;
 293         $w_{j} = w_{j}-h_{i,j}v_{i}$\;
 294       }
 295       $h_{j+1,j} = \|w_{j}\|_{2}$\;
 296       $v_{j+1} = w_{j}/h_{j+1,j}$\;
 297     }
 298     Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ a $(m+1)\times m$ upper Hessenberg matrix\;
 299     Solve a least-squares problem of size $m$: $min_{y\in\mathrm{I\!R}^{m}}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
 300     $x_{m} = x_{0}+V_{m}y_{m}$\;
 301     $r_{m} = M^{-1}(b-Ax_{m})$\;
 302     $\beta = \|r_{m}\|_{2}$\;
 303     \eIf{ $(\beta<\varepsilon)$ {\bf or} $(k\geq maxiter)$}{
 304       $convergence$ = true\;
 305     }{
 306       $x_{0} = x_{m}$\;
 307       $r_{0} = r_{m}$\;
 308       $k = k + 1$\;
 309     }
 310   }
 311 \caption{Left-preconditioned GMRES method with restarts}
 312 \label{ch12:alg:02}
 313 \end{algorithm}
 314
 315 Algorithm~\ref{ch12:alg:02} shows the main key points of the GMRES method with restarts.
 316 It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear
 317 system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration
 318 $k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from
 319 line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper
 320 Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it
 321 solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$
 322 which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate
 323 solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is
 324 stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the
 325 maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$)
 326 is reached.
 327
 328
 329 %%--------------------------%%
 330 %%       SECTION 3          %%
 331 %%--------------------------%%
 332 \section{Parallel implementation on a GPU cluster}
 333 \label{ch12:sec:03}
 334 In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG}
 335 and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on
 336 a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by a
 337 MPI process and equipped with a GPU card. The parallelization of these algorithms is carried out by
 338 using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the
 339 CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods
 340 are called iterative solvers.
 341
 342
 343 %%****************%%
 344 %%****************%%
 345 \subsection{Data partitioning}
 346 \label{ch12:sec:03.01}
 347 The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning
 348 between the computing nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the
 349 GPU cluster. The partitioning operation consists in the decomposition of the vectors and matrices, involved
 350 in the iterative solver, in $p$ portions. Indeed, this operation allows to assign to each computing node
 351 $i$:
 352 \begin{itemize}
 353 \item a portion of size $\frac{n}{p}$ elements of each vector,
 354 \item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and,
 355 \item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$,
 356 \end{itemize}
 357 where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive
 358 row-wise partitioning (decomposition row-by-row) on the data of the sparse linear systems to be solved.
 359 Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes
 360 of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$
 361 unknown values.
 362
 363 \begin{figure}
 364 \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}}
 365 \caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.}
 366 \label{ch12:fig:01}
 367 \end{figure}
 368
 369
 370 %%****************%%
 371 %%****************%%
 372 \subsection{GPU computing}
 373 \label{ch12:sec:03.02}
 374 After the partitioning operation, all the data involved from this operation must be
 375 transferred from the CPU memories to the GPU memories, in order to be processed by
 376 GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear
 377 Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+
 378 for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory
 379 copies from the CPUs to the GPUs.
 380
 381 An efficient implementation of CG and GMRES solvers on a GPU cluster requires to
 382 determine all parts of their codes that can be executed in parallel and, thus, take
 383 advantage of the GPU acceleration. As many Krylov subspace methods, the CG and GMRES
 384 methods are mainly based on arithmetic operations dealing with vectors or matrices:
 385 sparse matrix-vector multiplications, scalar-vector multiplications, dot products,
 386 Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors
 387 and $a$ is a scalar) and so on. These vector operations are often easy to parallelize
 388 and they are more efficient on parallel computers when they work on large vectors.
 389 Therefore, all the vector operations used in CG and GMRES solvers must be executed
 390 by the GPUs as kernels.
 391
 392 We use the kernels of the CUBLAS library to compute some vector operations of CG and
 393 GMRES solvers. The following kernels of CUBLAS (dealing with double floating point)
 394 are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the
 395 Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of
 396 the data-parallel operations, we code their kernels in CUDA. In the CG solver, we
 397 develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in
 398 Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector
 399 multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel for
 400 solving the least-squares problem and a kernel for the elements updates of the solution
 401 vector $x$.
 402
 403 The least-squares problem in the GMRES method is solved by performing a QR factorization
 404 on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and,
 405 then, solving the triangular system by backward substitutions to compute $y$. Consequently,
 406 solving the least-squares problem on the GPU is not interesting. Indeed, the triangular
 407 solves are not easy to parallelize and inefficient on GPUs. However, the least-squares
 408 problem to solve in the GMRES method with restarts has, generally, a very small size $m$.
 409 Therefore, we develop an inexpensive kernel which must be executed in sequential by a
 410 single CUDA thread.
 411
 412 The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES}
 413 methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication},
 414 because it is often an expensive operation in terms of execution time and memory space.
 415 Moreover, it requires to take care of the storage format of the sparse matrix in the
 416 memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
 417 can cause a significant waste of memory space and execution time. In addition, the sparsity
 418 nature of the matrix often leads to irregular memory accesses to read the matrix nonzero
 419 values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced
 420 accesses to the global memory, which slows down even more its performances. One of the
 421 most efficient compressed storage formats\index{Compressed~storage~format} of sparse
 422 matrices on GPUs is HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}.
 423 It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
 424 a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL}
 425 format and remaining entries of exceptional rows in COO format. It combines the efficiency
 426 of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO}
 427 which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
 428 developed by Nvidia to implement the SpMV multiplication of CG and GMRES methods on GPUs.
 429 Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill
 430 the elements of the iterate vector $x$ in the cached texture memory.
 431
 432
 433 %%****************%%
 434 %%****************%%
 435 \subsection{Data communications}
 436 \label{ch12:sec:03.03}
 437 All the computing nodes of the GPU cluster execute in parallel the same iterative solver
 438 (Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their
 439 own portions of the sparse linear system\index{Sparse~linear~system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$,
 440 $0\leq i<p$. However, in order to solve the complete sparse linear system~(\ref{ch12:eq:11}),
 441 synchronizations must be performed between the local computations of the computing nodes over
 442 the cluster. In what follows, two computing nodes sharing data are called neighboring nodes\index{Neighboring~node}.
 443
 444 As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication.
 445 In the parallel implementation of the iterative methods, each computing node $i$ performs the
 446 SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it has only sub-vectors
 447 of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires
 448 the vector elements of its neighbors, corresponding to the column indices on which its sub-matrix
 449 has nonzero values (see Figure~\ref{ch12:fig:01}). So, in addition to the local vectors, each
 450 node must also manage vector elements shared with neighbors and required to compute the SpMV
 451 multiplication. Therefore, the iterate vector $x$ managed by each computing node is composed
 452 of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a sub-vector of shared elements $x^{shared}$.
 453 In the same way, the vector used to construct the orthonormal basis of the Krylov subspace (vectors
 454 $p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared
 455 sub-vector.
 456
 457 Therefore, before computing the SpMV multiplication\index{SpMV~multiplication}, the neighboring
 458 nodes\index{Neighboring~node} over the GPU cluster must exchange between them the shared vector
 459 elements necessary to compute this multiplication. First, each computing node determines, in its
 460 local sub-vector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
 461 between them these shared vector elements. The data exchanges are implemented by using the MPI
 462 point-to-point communication routines: blocking\index{MPI~subroutines!Blocking} sends with \verb+MPI_Send()+
 463 and nonblocking\index{MPI~subroutines!Nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
 464 shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}
 465 and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing
 466 nodes is that presented in Figure~\ref{ch12:fig:01}.
 467
 468 \begin{figure}
 469 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}}
 470 \caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.}
 471 \label{ch12:fig:02}
 472 \end{figure}
 473
 474 After the synchronization operation, the computing nodes receive, from their respective neighbors,
 475 the shared elements in a sub-vector stored in a compressed format. However, in order to compute the
 476 SpMV multiplication, the computing nodes operate on sparse global vectors (see Figure~\ref{ch12:fig:02}).
 477 In this case, the received vector elements must be copied to the corresponding indices in the global
 478 vector. So as not to need to perform this at each iteration, we propose to reorder the columns of
 479 each sub-matrix $\{A_i\}_{0\leq i<p}$, so that the shared sub-vectors could be used in their compressed
 480 storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse sub-matrix (sub-matrix of
 481 \textit{Node 1}).
 482
 483 \begin{figure}
 484 \centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/reorder}}
 485 \caption{Columns reordering of a sparse sub-matrix.}
 486 \label{ch12:fig:03}
 487 \end{figure}
 488
 489 A GPU cluster\index{GPU~cluster} is a parallel platform with a distributed memory. So, the synchronizations
 490 and communication data between GPU nodes are carried out by passing messages. However, GPUs can not communicate
 491 between them in direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU
 492 cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory
 493 and vice-versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
 494 communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
 495 and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
 496 to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI~subroutines!Global}
 497 \verb+MPI_Allreduce()+.
 498
 499
 500
 501 %%--------------------------%%
 502 %%       SECTION 4          %%
 503 %%--------------------------%%
 504 \section{Experimental results}
 505 \label{ch12:sec:04}
 506 In this section, we present the performances of the parallel CG and GMRES linear solvers obtained
 507 on a cluster of $12$ GPUs. Indeed, this GPU cluster of tests is composed of six machines connected
 508 by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz and
 509 providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are
 510 connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A
 511 Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global memory of $4$GB with
 512 a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster}
 513 that we used in the experimental tests.
 514
 515 \begin{figure}
 516 \centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
 517 \caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
 518 \label{ch12:fig:04}
 519 \end{figure}
 520
 521 Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding
 522 the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9}
 523 is used for programming GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations
 524 in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between
 525 CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
 526 is managed by a MPI process and it is composed of one CPU core and one GPU card.
 527
 528 All tests are made on double-precision floating point operations. The parameters of both linear
 529 solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
 530 maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
 531 initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{Iterative~method!Arnoldi~process}
 532 used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen
 533 the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily
 534 compute the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for
 535 not too ill-conditioned matrices. In the GPU computing, the size of thread blocks is fixed to $512$
 536 threads. Finally, the performance results, presented hereafter, are obtained from the mean value
 537 over $10$ executions of the same parallel linear solver and for the same input data.
 538
 539 To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's
 540 collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six
 541 symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
 542 we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
 543 which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
 544 the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
 545 the first and the last nonzero value on a matrix row.
 546
 547 \begin{figure}
 548 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
 549 \caption{Sketches of sparse matrices chosen from the Davis's collection.}
 550 \label{ch12:fig:05}
 551 \end{figure}
 552
 553 \begin{table}
 554 \centering
 555 \begin{tabular}{|c|c|c|c|c|}
 556 \hline
 557 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# rows} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
 558
 559 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $101,492$     & $1,647,264$  & $100,464$ \\
 560
 561                               & ecology2          & $999,999$     & $4,995,991$  & $2,001$   \\
 562
 563                               & finan512          & $74,752$      & $596,992$    & $74,725$  \\
 564
 565                               & G3\_circuit       & $1,585,478$   & $7,660,826$  & $1,219,059$ \\
 566
 567                               & shallow\_water2   & $81,920$      & $327,680$    & $58,710$ \\
 568
 569                               & thermal2          & $1,228,045$   & $8,580,313$  & $1,226,629$ \\ \hline \hline
 570
 571 \multirow{6}{*}{Nonsymmetric} & cage13            & $445,315$     & $7,479,343$  & $318,788$\\
 572
 573                               & crashbasis        & $160,000$     & $1,750,416$  & $120,202$ \\
 574
 575                               & FEM\_3D\_thermal2 & $147,900$     & $3,489.300$  & $117,827$ \\
 576
 577                               & language          & $399,130$     & $1,216,334$  & $398,622$\\
 578
 579                               & poli\_large       & $15,575$      & $33,074$     & $15,575$ \\
 580
 581                               & torso3            & $259,156$     & $4,429,042$  & $216,854$  \\ \hline
 582 \end{tabular}
 583 \vspace{0.5cm}
 584 \caption{Main characteristics of sparse matrices chosen from the Davis's collection.}
 585 \label{ch12:tab:01}
 586 \end{table}
 587
 588 \begin{table}
 589 \begin{center}
 590 \begin{tabular}{|c|c|c|c|c|c|c|}
 591 \hline
 592 {\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
 593
 594 2cubes\_sphere    & $0.132s$           & $0.069s$            & $1.93$        & $12$           & $1.14e$-$09$     & $3.47e$-$18$ \\
 595
 596 ecology2          & $0.026s$           & $0.017s$            & $1.52$        & $13$           & $5.06e$-$09$     & $8.33e$-$17$ \\
 597
 598 finan512          & $0.053s$           & $0.036s$            & $1.49$        & $12$           & $3.52e$-$09$     & $1.66e$-$16$ \\
 599
 600 G3\_circuit       & $0.704s$           & $0.466s$            & $1.51$        & $16$           & $4.16e$-$10$     & $4.44e$-$16$ \\
 601
 602 shallow\_water2   & $0.017s$           & $0.010s$            & $1.68$        & $5$            & $2.24e$-$14$     & $3.88e$-$26$ \\
 603
 604 thermal2          & $1.172s$           & $0.622s$            & $1.88$        & $15$           & $5.11e$-$09$     & $3.33e$-$16$ \\ \hline
 605 \end{tabular}
 606 \caption{Performances of the parallel CG method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
 607 \label{ch12:tab:02}
 608 \end{center}
 609 \end{table}
 610
 611 \begin{table}[!h]
 612 \begin{center}
 613 \begin{tabular}{|c|c|c|c|c|c|c|}
 614 \hline
 615 {\bf Matrix}     & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$  & $\mathbf{\# iter.}$ & $\mathbf{prec.}$     & $\mathbf{\Delta}$   \\ \hline \hline
 616
 617 2cubes\_sphere    & $0.234s$           & $0.124s$            & $1.88$        & $21$           & $2.10e$-$14$     & $3.47e$-$18$ \\
 618
 619 ecology2          & $0.076s$           & $0.035s$            & $2.15$        & $21$           & $4.30e$-$13$     & $4.38e$-$15$ \\
 620
 621 finan512          & $0.073s$           & $0.052s$            & $1.40$        & $17$           & $3.21e$-$12$     & $5.00e$-$16$ \\
 622
 623 G3\_circuit       & $1.016s$           & $0.649s$            & $1.56$        & $22$           & $1.04e$-$12$     & $2.00e$-$15$ \\
 624
 625 shallow\_water2   & $0.061s$           & $0.044s$            & $1.38$        & $17$           & $5.42e$-$22$     & $2.71e$-$25$ \\
 626
 627 thermal2          & $1.666s$           & $0.880s$            & $1.89$        & $21$           & $6.58e$-$12$     & $2.77e$-$16$ \\ \hline \hline
 628
 629 cage13            & $0.721s$           & $0.338s$            & $2.13$        & $26$           & $3.37e$-$11$     & $2.66e$-$15$ \\
 630
 631 crashbasis        & $1.349s$           & $0.830s$            & $1.62$        & $121$          & $9.10e$-$12$     & $6.90e$-$12$ \\
 632
 633 FEM\_3D\_thermal2 & $0.797s$           & $0.419s$            & $1.90$        & $64$           & $3.87e$-$09$     & $9.09e$-$13$ \\
 634
 635 language          & $2.252s$           & $1.204s$            & $1.87$        & $90$           & $1.18e$-$10$     & $8.00e$-$11$ \\
 636
 637 poli\_large       & $0.097s$           & $0.095s$            & $1.02$        & $69$           & $4.98e$-$11$     & $1.14e$-$12$ \\
 638
 639 torso3            & $4.242s$           & $2.030s$            & $2.09$        & $175$          & $2.69e$-$10$     & $1.78e$-$14$ \\ \hline
 640 \end{tabular}
 641 \caption{Performances of the parallel GMRES method on a cluster 24 CPU cores vs. on cluster of 12 GPUs.}
 642 \label{ch12:tab:03}
 643 \end{center}
 644 \end{table}
 645
 646 Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} shows the performances of the parallel
 647 CG and GMRES solvers, respectively, for solving linear systems associated to the sparse
 648 matrices presented in Tables~\ref{ch12:tab:01}. They allow to compare the performances
 649 obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02}
 650 shows only the performances of solving symmetric sparse linear systems, due to the inability
 651 of the CG method to solve the nonsymmetric systems. In both tables, the second and third
 652 columns give, respectively, the execution times in seconds obtained on $24$ CPU cores
 653 ($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account
 654 the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same
 655 solver implemented on the CPU cluster. The relative gains\index{Relative~gain}, presented
 656 in the fourth column, are computed as a ratio of the CPU execution time over the GPU
 657 execution time:
 658 \begin{equation}
 659 \tau = \frac{Time_{cpu}}{Time_{gpu}}.
 660 \label{ch12:eq:20}
 661 \end{equation}
 662 In addition, Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} give the number of iterations
 663 ($iter$), the precision $prec$ of the solution computed on the GPU cluster and the difference
 664 $\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster.
 665 Both parameters $prec$ and $\Delta$ allow to validate and verify the accuracy of the solution
 666 computed on the GPU cluster. We have computed them as follows:
 667 \begin{eqnarray}
 668 \Delta = max|x^{cpu}-x^{gpu}|,\\
 669 prec = max|M^{-1}r^{gpu}|,
 670 \end{eqnarray}
 671 where $\Delta$ is the maximum vector element, in absolute value, of the difference between
 672 the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively, on CPU and GPU clusters and
 673 $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$
 674 of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster
 675 were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
 676 to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$.
 677 However, we can notice from the relative gains $\tau$ that is not interesting to use multiple
 678 GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to
 679 maximize utilization of GPU cores. In addition, the communications required to synchronize the
 680 computations over the cluster increase the idle times of GPUs and slow down further the parallel
 681 computations.
 682
 683 Consequently, in order to test the performances of the parallel solvers, we developed in C programming
 684 language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10}
 685 as an initial matrix to construct large sparse matrices exceeding ten million of rows. It must be executed
 686 in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse
 687 sub-matrix. In first experimental tests, we are focused on sparse matrices having a banded structure,
 688 because they are those arise in the most of numerical problems. So to generate the global sparse matrix,
 689 each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
 690 from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix
 691 (see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
 692 diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
 693 initial matrix.
 694
 695 \begin{figure}
 696 \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
 697 \caption{Parallel generation of a large sparse matrix by four computing nodes.}
 698 \label{ch12:fig:06}
 699 \end{figure}
 700
 701 \begin{table}[!h]
 702 \centering
 703 \begin{tabular}{|c|c|c|c|}
 704 \hline
 705 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
 706
 707 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $413,703,602$ & $198,836$     \\
 708
 709                               & ecology2          & $124,948,019$ & $2,002$          \\
 710
 711                               & finan512          & $278,175,945$ & $123,900$        \\
 712
 713                               & G3\_circuit       & $125,262,292$ & $1,891,887$      \\
 714
 715                               & shallow\_water2   & $100,235,292$ & $62,806$      \\
 716
 717                               & thermal2          & $175,300,284$ & $2,421,285$ \\ \hline \hline
 718
 719 \multirow{6}{*}{Nonsymmetric} & cage13            & $435,770,480$ & $352,566$        \\
 720
 721                               & crashbasis        & $409,291,236$ & $200,203$        \\
 722
 723                               & FEM\_3D\_thermal2 & $595,266,787$ & $206,029$ \\
 724
 725                               & language          & $76,912,824$  & $398,626$ \\
 726
 727                               & poli\_large       & $53,322,580$  & $15,576$         \\
 728
 729                               & torso3            & $433,795,264$ & $328,757$        \\ \hline
 730 \end{tabular}
 731 \vspace{0.5cm}
 732 \caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
 733 \label{ch12:tab:04}
 734 \end{table}
 735
 736 We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
 737 million unknown values. The sparse matrices associated to these linear systems are generated
 738 from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
 739 Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and
 740 GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
 741 GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
 742 a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
 743 notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster,
 744 are better than those the GMRES method for solving large symmetric linear systems. In fact, the
 745 CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
 746 time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
 747 method requires more data exchanges between computing nodes compared to the parallel CG method.
 748
 749 \begin{table}[!h]
 750 \begin{center}
 751 \begin{tabular}{|c|c|c|c|c|c|c|}
 752 \hline
 753 {\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 754
 755 2cubes\_sphere  & $1.625s$             & $0.401s$              & $4.05$          & $14$                & $5.73e$-$11$     & $5.20e$-$18$ \\
 756
 757 ecology2        & $0.856s$             & $0.103s$              & $8.27$          & $15$                & $3.75e$-$10$     & $1.11e$-$16$ \\
 758
 759 finan512        & $1.210s$             & $0.354s$              & $3.42$          & $14$                & $1.04e$-$10$     & $2.77e$-$16$ \\
 760
 761 G3\_circuit     & $1.346s$             & $0.263s$              & $5.12$          & $17$                & $1.10e$-$10$     & $5.55e$-$16$ \\
 762
 763 shallow\_water2 & $0.397s$             & $0.055s$              & $7.23$          & $7$                 & $3.43e$-$15$     & $5.17e$-$26$ \\
 764
 765 thermal2        & $1.411s$             & $0.244s$              & $5.78$          & $16$                & $1.67e$-$09$     & $3.88e$-$16$ \\ \hline
 766 \end{tabular}
 767 \caption{Performances of the parallel CG method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
 768 on a cluster of 12 GPUs.}
 769 \label{ch12:tab:05}
 770 \end{center}
 771 \end{table}
 772
 773 \begin{table}[!h]
 774 \begin{center}
 775 \begin{tabular}{|c|c|c|c|c|c|c|}
 776 \hline
 777 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 778
 779 2cubes\_sphere    & $3.597s$             & $0.514s$              & $6.99$          & $21$                & $2.11e$-$14$     & $8.67e$-$18$ \\
 780
 781 ecology2          & $2.549s$             & $0.288s$              & $8.83$          & $21$                & $4.88e$-$13$     & $2.08e$-$14$ \\
 782
 783 finan512          & $2.660s$             & $0.377s$              & $7.05$          & $17$                & $3.22e$-$12$     & $8.82e$-$14$ \\
 784
 785 G3\_circuit       & $3.139s$             & $0.480s$              & $6.53$          & $22$                & $1.04e$-$12$     & $5.00e$-$15$ \\
 786
 787 shallow\_water2   & $2.195s$             & $0.253s$              & $8.68$          & $17$                & $5.54e$-$21$     & $7.92e$-$24$ \\
 788
 789 thermal2          & $3.206s$             & $0.463s$              & $6.93$          & $21$                & $8.89e$-$12$     & $3.33e$-$16$ \\ \hline \hline
 790
 791 cage13            & $5.560s$             & $0.663s$              & $8.39$          & $26$                & $3.29e$-$11$     & $1.59e$-$14$ \\
 792
 793 crashbasis        & $25.802s$            & $3.511s$              & $7.35$          & $135$               & $6.81e$-$11$     & $4.61e$-$15$ \\
 794
 795 FEM\_3D\_thermal2 & $13.281s$            & $1.572s$              & $8.45$          & $64$                & $3.88e$-$09$     & $1.82e$-$12$ \\
 796
 797 language          & $12.553s$            & $1.760s$              & $7.13$          & $89$                & $2.11e$-$10$     & $1.60e$-$10$ \\
 798
 799 poli\_large       & $8.515s$             & $1.053s$              & $8.09$          & $69$                & $5.05e$-$11$     & $6.59e$-$12$ \\
 800
 801 torso3            & $31.463s$            & $3.681s$              & $8.55$          & $175$               & $2.69e$-$10$     & $2.66e$-$14$ \\ \hline
 802 \end{tabular}
 803 \caption{Performances of the parallel GMRES method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
 804 on a cluster of 12 GPUs.}
 805 \label{ch12:tab:06}
 806 \end{center}
 807 \end{table}
 808
 809
 810 %%--------------------------%%
 811 %%       SECTION 5          %%
 812 %%--------------------------%%
 813 \section{Hypergraph partitioning}
 814 \label{ch12:sec:05}
 815 In this section, we present the performances of both parallel CG and GMRES solvers for solving linear
 816 systems associated to sparse matrices having large bandwidths. Indeed, we are interested on sparse
 817 matrices having the nonzero values distributed along their bandwidths.
 818
 819 \begin{figure}
 820 \centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}}
 821 \caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
 822 \label{ch12:fig:07}
 823 \end{figure}
 824
 825 \begin{table}[!h]
 826 \begin{center}
 827 \begin{tabular}{|c|c|c|c|}
 828 \hline
 829 {\bf Matrix type}             & {\bf Matrix name} & {\bf \# nnz}  & {\bf Bandwidth} \\ \hline \hline
 830
 831 \multirow{6}{*}{Symmetric}    & 2cubes\_sphere    & $829,082,728$ & $24,999,999$     \\
 832
 833                               & ecology2          & $254,892,056$ & $25,000,000$     \\
 834
 835                               & finan512          & $556,982,339$ & $24,999,973$     \\
 836
 837                               & G3\_circuit       & $257,982,646$ & $25,000,000$     \\
 838
 839                               & shallow\_water2   & $200,798,268$ & $25,000,000$     \\
 840
 841                               & thermal2          & $359,340,179$ & $24,999,998$     \\ \hline \hline
 842
 843 \multirow{6}{*}{Nonsymmetric} & cage13            & $879,063,379$ & $24,999,998$     \\
 844
 845                               & crashbasis        & $820,373,286$ & $24,999,803$     \\
 846
 847                               & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$     \\
 848
 849                               & language          & $155,261,826$ & $24,999,492$     \\
 850
 851                               & poli\_large       & $106,680,819$ & $25,000,000$    \\
 852
 853                               & torso3            & $872,029,998$ & $25,000,000$\\ \hline
 854 \end{tabular}
 855 \caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
 856 \label{ch12:tab:07}
 857 \end{center}
 858 \end{table}
 859
 860 We have developed in C programming language a generator of large sparse matrices
 861 having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}).
 862 The principle of this generator is equivalent to that in Section~\ref{ch12:sec:04}.
 863 However, the copies performed on the initial matrix (chosen from the Davis's collection)
 864 are placed on the main diagonal and on four off-diagonals, two on the right and two
 865 on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of a
 866 generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07}
 867 shows the main characteristics of sparse five-bands matrices generated from those
 868 presented in Table~\ref{ch12:tab:01} and associated to linear systems of $25$ million
 869 unknown values.
 870
 871 \begin{table}[!h]
 872 \begin{center}
 873 \begin{tabular}{|c|c|c|c|c|c|c|}
 874 \hline
 875 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 876
 877 2cubes\_sphere    & $6.041s$     & $3.338s$      & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\
 878
 879 ecology2          & $1.404s$     & $1.301s$      & $1.08$ & $13$     & $5.22e$-$11$ & $2.17e$-$18$ \\
 880
 881 finan512          & $1.822s$     & $1.299s$      & $1.40$ & $12$     & $3.52e$-$11$ & $3.47e$-$18$ \\
 882
 883 G3\_circuit       & $2.331s$     & $2.129s$      & $1.09$ & $15$     & $1.36e$-$11$ & $5.20e$-$18$ \\
 884
 885 shallow\_water2   & $0.541s$     & $0.504s$      & $1.07$ & $6$      & $2.12e$-$16$ & $5.05e$-$28$ \\
 886
 887 thermal2          & $2.549s$     & $1.705s$      & $1.49$ & $14$     & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline
 888 \end{tabular}
 889 \caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
 890 on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
 891 \label{ch12:tab:08}
 892 \end{center}
 893 \end{table}
 894
 895 \begin{table}
 896 \begin{center}
 897 \begin{tabular}{|c|c|c|c|c|c|c|}
 898 \hline
 899 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$   \\ \hline \hline
 900
 901 2cubes\_sphere    & $15.963s$    & $7.250s$      & $2.20$  & $58$     & $6.23e$-$16$ & $3.25e$-$19$ \\
 902
 903 ecology2          & $3.549s$     & $2.176s$      & $1.63$  & $21$     & $4.78e$-$15$ & $1.06e$-$15$ \\
 904
 905 finan512          & $3.862s$     & $1.934s$      & $1.99$  & $17$     & $3.21e$-$14$ & $8.43e$-$17$ \\
 906
 907 G3\_circuit       & $4.636s$     & $2.811s$      & $1.65$  & $22$     & $1.08e$-$14$ & $1.77e$-$16$ \\
 908
 909 shallow\_water2   & $2.738s$     & $1.539s$      & $1.78$  & $17$     & $5.54e$-$23$ & $3.82e$-$26$ \\
 910
 911 thermal2          & $5.017s$     & $2.587s$      & $1.94$  & $21$     & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline
 912
 913 cage13            & $9.315s$     & $3.227s$      & $2.89$  & $26$     & $3.38e$-$13$ & $2.08e$-$16$ \\
 914
 915 crashbasis        & $35.980s$    & $14.770s$     & $2.43$  & $127$    & $1.17e$-$12$ & $1.56e$-$17$ \\
 916
 917 FEM\_3D\_thermal2 & $24.611s$    & $7.749s$      & $3.17$  & $64$     & $3.87e$-$11$ & $2.84e$-$14$ \\
 918
 919 language          & $16.859s$    & $9.697s$      & $1.74$  & $89$     & $2.17e$-$12$ & $1.70e$-$12$ \\
 920
 921 poli\_large       & $10.200s$    & $6.534s$      & $1.56$  & $69$     & $5.14e$-$13$ & $1.63e$-$13$ \\
 922
 923 torso3            & $49.074s$    & $19.397s$     & $2.53$  & $175$    & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline
 924 \end{tabular}
 925 \caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
 926 on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
 927 \label{ch12:tab:09}
 928 \end{center}
 929 \end{table}
 930
 931 Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} shows the performances of the parallel
 932 CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a
 933 cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
 934 sparse five-bands matrices presented on Table~\ref{ch12:tab:07}. We can notice from
 935 both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not
 936 efficient for solving these kind of sparse linear systems\index{Sparse~linear~system}.
 937 We can see that the execution times obtained on the GPU cluster are almost equivalent
 938 to those obtained on the CPU cluster (see the relative gains presented in column~$4$
 939 of each table). This is due to the large number of communications necessary to synchronize
 940 the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column,
 941 of sparse matrices having large bandwidths can link a computing node to many neighbors
 942 and then generate a large number of data dependencies between these computing nodes in
 943 the cluster.
 944
 945 Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph},
 946 which is well-suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed,
 947 it can well model the communications between the computing nodes, particularly in the
 948 case of nonsymmetric and irregular matrices, and it gives good reduction of the total
 949 communication volume. In contrast, it is an expensive operation in terms of execution
 950 time and memory space.
 951
 952 The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph
 953 $\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows:
 954 \begin{itemize}
 955 \item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
 956 \item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where:
 957 \begin{equation}
 958 \forall a_{ij} \neq 0 \mbox{~is a nonzero value of matrix~} A \mbox{~:~} v_i \in pins[e_j],
 959 \end{equation}
 960 \item $w_i$ is the weight of vertex $v_i$ and,
 961 \item $c_j$ is the cost of hyperedge $e_j$.
 962 \end{itemize}
 963 A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is
 964 defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$ a set of pairwise
 965 disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each
 966 subset is attributed to a computing node. Figure~\ref{ch12:fig:08} shows an example
 967 of the hypergraph model of a  $(9\times 9)$ sparse matrix in three parts. The circles
 968 and squares correspond, respectively, to the vertices and hyperedges of the hypergraph.
 969 The solid squares define the cut hyperedges connecting at least two different parts.
 970 The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different
 971 parts spanned by $e_j$.
 972
 973 \begin{figure}
 974 \centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
 975 \caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
 976 \label{ch12:fig:08}
 977 \end{figure}
 978
 979 The cut hyperedges model the total communication volume between the different computing
 980 nodes in the cluster, necessary to perform the parallel SpMV multiplication\index{SpMV~multiplication}.
 981 Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$,
 982 $0\leq i,j<n$, of the SpMV multiplication $Ax=b$ that need the $j^{th}$ unknown value of
 983 solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix
 984 rows sharing and requiring the same unknown value $x_j$. For example in Figure~\ref{ch12:fig:08},
 985 hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the dependency of matrix
 986 rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations:
 987 $b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$.
 988 However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
 989 So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform
 990 two communications.
 991
 992 The hypergraph partitioning\index{Hypergraph} allows to reduce the total communication volume
 993 required to perform the parallel SpMV multiplication, while maintaining the load balancing between
 994 the computing nodes. In fact, it allows to minimize at best the following amount:
 995 \begin{equation}
 996 \mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
 997 \end{equation}
 998 where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning
 999 $\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively, the cost and the connectivity of cut hyperedge
1000 $e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows:
1001 \begin{equation}
1002   W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
1003 \end{equation}
1004 where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the
1005 average weight of all $K$ parts and $\epsilon$ is the maximum allowed imbalanced ratio.
1006
1007 The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed,
1008 for example: hMETIS~\cite{ch12:ref12}, PaToH~\cite{ch12:ref13} and Zoltan~\cite{ch12:ref14}. Since our
1009 objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must
1010 be performed by at least two MPI processes. It allows to accelerate the data partitioning of large sparse
1011 matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned in $p$ (number of MPI processes)
1012 sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the
1013 parallel hypergraph partitioning method using some functions of the MPI library between the $p$ processes.
1014
1015 Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} shows the performances of the parallel CG and GMRES solvers,
1016 respectively, using the hypergraph partitioning for solving large linear systems associated to the sparse
1017 five-bands matrices presented in Table~\ref{ch12:tab:07}. For these experimental tests, we have applied the
1018 parallel hypergraph partitioning~\cite{ch12:ref15} developed in Zoltan tool~\cite{ch12:ref14}. We have initialized
1019 the parameters of the partitioning operation as follows:
1020 \begin{itemize}
1021 \item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
1022 \item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
1023 \item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\
1024 \end{itemize}
1025
1026 \begin{table}
1027 \begin{center}
1028 \begin{tabular}{|c|c|c|c|c|}
1029 \hline
1030 {\bf Matrix}    & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
1031
1032 2cubes\_sphere  & $5.935s$             & $1.213s$              & $4.89$          & $63.66\%$ \\
1033
1034 ecology2        & $1.093s$             & $0.136s$              & $8.00$          & $89.55\%$ \\
1035
1036 finan512        & $1.762s$             & $0.475s$              & $3.71$          & $63.43\%$ \\
1037
1038 G3\_circuit     & $2.095s$             & $0.558s$              & $3.76$          & $73.79\%$ \\
1039
1040 shallow\_water2 & $0.498s$             & $0.068s$              & $7.31$          & $86.51\%$ \\
1041
1042 thermal2        & $1.889s$             & $0.348s$              & $5.43$          & $79.59\%$ \\ \hline
1043 \end{tabular}
1044 \caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
1045 sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
1046 \label{ch12:tab:10}
1047 \end{center}
1048 \end{table}
1049
1050 \begin{table}
1051 \begin{center}
1052 \begin{tabular}{|c|c|c|c|c|}
1053 \hline
1054 {\bf Matrix}      & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
1055
1056 2cubes\_sphere    & $16.430s$            & $2.840s$              & $5.78$          & $60.83\%$ \\
1057
1058 ecology2          & $3.152s$             & $0.367s$              & $8.59$          & $83.13\%$ \\
1059
1060 finan512          & $3.672s$             & $0.723s$              & $5.08$          & $62.62\%$ \\
1061
1062 G3\_circuit       & $4.468s$             & $0.971s$              & $4.60$          & $65.46\%$ \\
1063
1064 shallow\_water2   & $2.647s$             & $0.312s$              & $8.48$          & $79.73\%$ \\
1065
1066 thermal2          & $4.190s$             & $0.666s$              & $6.29$          & $74.25\%$ \\ \hline \hline
1067
1068 cage13            & $8.077s$             & $1.584s$              & $5.10$          & $50.91\%$ \\
1069
1070 crashbasis        & $35.173s$            & $5.546s$              & $6.34$          & $62.43\%$ \\
1071
1072 FEM\_3D\_thermal2 & $24.825s$            & $3.113s$              & $7.97$          & $59.83\%$ \\
1073
1074 language          & $16.706s$            & $2.522s$              & $6.62$          & $73.99\%$ \\
1075
1076 poli\_large       & $12.715s$            & $3.989s$              & $3.19$          & $38.95\%$ \\
1077
1078 torso3            & $48.459s$            & $6.234s$              & $7.77$          & $67.86\%$ \\ \hline
1079 \end{tabular}
1080 \caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
1081 sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
1082 \label{ch12:tab:11}
1083 \end{center}
1084 \end{table}
1085
1086 We can notice from both Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} that the
1087 hypergraph partitioning has improved the performances of both parallel CG and GMRES
1088 algorithms. The execution times on the GPU cluster of both parallel solvers are
1089 significantly improved compared to those obtained by using the partitioning row-by-row.
1090 For these examples of sparse matrices, the execution times of CG and GMRES solvers
1091 are reduced about $76\%$ and $65\%$ respectively (see column~$5$ of each table)
1092 compared to those obtained in Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09}.
1093
1094 In fact, the hypergraph partitioning\index{Hypergraph} applied to sparse matrices
1095 having large bandwidths allows to reduce the total communication volume necessary
1096 to synchronize the computations between the computing nodes in the GPU cluster.
1097 Table~\ref{ch12:tab:12} presents, for each sparse matrix, the total communication
1098 volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row
1099 (column~$2$), the total communication volume obtained by using the hypergraph partitioning
1100 (column~$3$) and the execution times in minutes of the hypergraph partitioning
1101 operation performed by $12$ MPI processes (column~$4$). The total communication
1102 volume defines the total number of the vector elements exchanged by the computing
1103 nodes. Then, Table~\ref{ch12:tab:12} shows that the hypergraph partitioning method
1104 can split the sparse matrix so as to minimize the data dependencies between the
1105 computing nodes and thus to reduce the total communication volume.
1106
1107 \begin{table}
1108 \begin{center}
1109 \begin{tabular}{|c|c|c|c|}
1110 \hline
1111 \multirow{4}{*}{\bf Matrix}  & {\bf Total comms.}      & {\bf Total comms.}      & {\bf Execution} \\
1112                              & {\bf volume without}    & {\bf volume with}       & {\bf trime}  \\
1113                              & {\bf hypergraph}        & {\bf hypergraph }       & {\bf of the parti.}  \\
1114                              & {\bf parti.}            & {\bf parti.}            & {\bf in minutes}\\ \hline \hline
1115
1116 2cubes\_sphere               & $25,360,543$            & $240,679$               & $68.98$         \\
1117
1118 ecology2                     & $26,044,002$            & $73,021$                & $4.92$          \\
1119
1120 finan512                     & $26,087,431$            & $900,729$               & $33.72$         \\
1121
1122 G3\_circuit                  & $31,912,003$            & $5,366,774$             & $11.63$         \\
1123
1124 shallow\_water2              & $25,105,108$            & $60,899$                & $5.06$          \\
1125
1126 thermal2                     & $30,012,846$            & $1,077,921$             & $17.88$         \\ \hline \hline
1127
1128 cage13                       & $28,254,282$            & $3,845,440$             & $196.45$        \\
1129
1130 crashbasis                   & $29,020,060$            & $2,401,876$             & $33.39$         \\
1131
1132 FEM\_3D\_thermal2            & $25,263,767$            & $250,105$               & $49.89$         \\
1133
1134 language                     & $27,291,486$            & $1,537,835$             & $9.07$          \\
1135
1136 poli\_large                  & $25,053,554$            & $7,388,883$             & $5.92$          \\
1137
1138 torso3                       & $25,682,514$            & $613,250$               & $61.51$         \\ \hline
1139 \end{tabular}
1140 \caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
1141 \label{ch12:tab:12}
1142 \end{center}
1143 \end{table}
1144
1145 Nevertheless, as we can see from the fourth column of Table~\ref{ch12:tab:12},
1146 the hypergraph partitioning takes longer compared to the execution times of the
1147 resolutions. As previously mentioned, the hypergraph partitioning method is less
1148 efficient in terms of memory consumption and partitioning time than its graph
1149 counterpart, but the hypergraph well models the nonsymmetric and irregular problems.
1150 So for the applications which often use the same sparse matrices, we can perform
1151 the hypergraph partitioning on these matrices only once for each and then, we save
1152 the traces of these partitionings in files to be reused several times. Therefore,
1153 this allows to avoid the partitioning of the sparse matrices at each resolution
1154 of the linear systems.
1155
1156 \begin{figure}[!h]
1157 \centering
1158   \mbox{\subfigure[Sparse band matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band}\label{ch12:fig:09.01}}}
1159 \vfill
1160   \mbox{\subfigure[Sparse five-bands matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band}\label{ch12:fig:09.02}}}
1161 \caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
1162 \label{ch12:fig:09}
1163 \end{figure}
1164
1165 However, the most important performance parameter is the scalability of the parallel
1166 CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} solvers on a GPU
1167 cluster. Particularly, we have taken into account the weak-scaling of both parallel
1168 algorithms on a cluster of one to 12 GPU computing nodes. We have performed a set of
1169 experiments on both matrix structures: band matrices and five-bands matrices. The sparse
1170 matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen
1171 from the Davis's collection. Figures~\ref{ch12:fig:09.01} and~\ref{ch12:fig:09.02}
1172 show the execution times of both parallel methods for solving large linear systems
1173 associated to band matrices and those associated to five-bands matrices, respectively.
1174 The size of a sparse sub-matrix per computing node, for each matrix structure, is fixed
1175 as follows:
1176 \begin{itemize}
1177 \item band matrix: $15$ million of rows and $105,166,557$ of nonzero values,
1178 \item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values.
1179 \end{itemize}
1180 We can see from these figures that both parallel solvers are quite scalable on a GPU
1181 cluster. Indeed, the execution times remains almost constant while the size of the
1182 sparse linear systems to be solved increases proportionally with the number of the
1183 GPU computing nodes. This means that the communication cost is relatively constant
1184 regardless of the number the computing nodes in the GPU cluster.
1185
1186
1187
1188 %%--------------------------%%
1189 %%       SECTION 6          %%
1190 %%--------------------------%%
1191 \section{Conclusion}
1192 \label{ch12:sec:06}
1193 In this chapter, we have aimed at harnessing the computing power of a
1194 cluster of GPUs for solving large sparse linear systems. For this, we
1195 have used two Krylov subspace iterative methods: the CG and GMRES methods.
1196 The first method is well-known to its efficiency for solving symmetric
1197 linear systems and the second one is used, particularly, for solving
1198 nonsymmetric linear systems.
1199
1200 We have presented the parallel implementation of both iterative methods
1201 on a GPU cluster. Particularly, the operations dealing with the vectors
1202 and/or matrices, of these methods, are parallelized between the different
1203 GPU computing nodes of the cluster. Indeed, the data-parallel vector operations
1204 are accelerated by GPUs and the communications required to synchronize the
1205 parallel computations are carried out by CPU cores. For this, we have used
1206 a heterogeneous CUDA/MPI programming to implement the parallel iterative
1207 algorithms.
1208
1209 In the experimental tests, we have shown that using a GPU cluster is efficient
1210 for solving linear systems associated to very large sparse matrices. The experimental
1211 results, obtained in the present chapter, showed that a cluster of $12$ GPUs is
1212 about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
1213 linear systems of $25$ million unknown values. This is due to the GPU ability to
1214 compute the data-parallel operations faster than the CPUs. However, we have shown
1215 that solving linear systems associated to matrices having large bandwidths uses
1216 many communications to synchronize the computations of GPUs, which slow down even
1217 more the resolution. Moreover, there are two kinds of communications: between a
1218 CPU and its GPU and between CPUs of the computing nodes, such that the first ones
1219 are the slowest communications on a GPU cluster. So, we have proposed to use the
1220 hypergraph partitioning instead of the row-by-row partitioning. This allows to
1221 minimize the data dependencies between the GPU computing nodes and thus to reduce
1222 the total communication volume. The experimental results showed that using the
1223 hypergraph partitioning technique improve the execution times on average of $76\%$
1224 to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs.
1225
1226 In the recent GPU hardware and software architectures, the GPU-Direct system with
1227 CUDA version 5.0 is used so that two GPUs located on the same node or on distant
1228 nodes can communicate between them directly without CPUs. This allows to improve
1229 the data transfers between GPUs.
1230
1231 \putbib[Chapters/chapter12/biblio12]
1232