GMRES_Journal.tex

   1 \documentclass[11pt]{article}
   2 %\documentclass{acmconf}
   3
   4 \usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
   5 \usepackage{times}
   6 \usepackage{graphicx}
   7 \usepackage{amsmath}
   8 \usepackage{amsfonts}
   9 \usepackage{amssymb}
  10 \usepackage{amsthm}
  11 \usepackage{amsopn}
  12 \usepackage{xspace}
  13 \usepackage{array}
  14 \usepackage{epsfig}
  15 \usepackage{algorithmic}
  16 \usepackage[ruled,english,boxed,linesnumbered]{algorithm2e}
  17 \usepackage[english]{algorithm2e}
  18 \usepackage{url}
  19 \usepackage{mdwlist}
  20 \usepackage{multirow}
  21
  22 \date{}
  23
  24 \title{Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters}
  25
  26 \author{
  27 \textsc{Jacques M. Bahi}
  28 \qquad
  29 \textsc{Rapha\"el Couturier}\thanks{Contact author}
  30 \qquad
  31 \textsc{Lilia Ziane Khodja}
  32 \mbox{}\\ %
  33 \\
  34 FEMTO-ST Institute, University of Franche-Comte\\
  35 IUT Belfort-Montb\'eliard\\
  36 Rue Engel Gros, BP 527, 90016 Belfort, \underline{France}\\
  37 \mbox{}\\ %
  38 \normalsize
  39 \{\texttt{jacques.bahi},~\texttt{raphael.couturier},~\texttt{lilia.ziane\_khoja}\}\texttt{@univ-fcomte.fr}
  40 }
  41
  42 \begin{document}
  43
  44 \maketitle
  45
  46 \begin{abstract}
  47 In this paper, we aim at exploiting the power computing of a GPU cluster for solving large sparse
  48 linear systems. We implement the parallel algorithm of the GMRES iterative method using the CUDA
  49 programming language and the MPI parallel environment. The experiments shows that a GPU cluster
  50 is more efficient than a CPU cluster. In order to optimize the performances, we use a compressed
  51 storage format for the sparse vectors and the hypergraph partitioning. These solutions improve
  52 the spatial and temporal localization of the shared data between the computing nodes of the GPU
  53 cluster.
  54 \end{abstract}
  55
  56
  57
  58 %%--------------------%%
  59 %%      SECTION 1     %%
  60 %%--------------------%%
  61 \section{Introduction}
  62 \label{sec:01}
  63 Large sparse linear systems arise in  most numerical scientific or industrial simulations.
  64 They model numerous complex problems in different areas of applications such as mathematics,
  65 engineering, biology or physics~\cite{ref18}. However, solving these systems of equations is
  66 often an expensive operation in terms of execution time and memory space consumption. Indeed,
  67 the linear systems arising in most applications are very large  and have many zero
  68 coefficients, and this sparse nature leads to irregular accesses to load the nonzero coefficients
  69 from the memory.
  70
  71 Parallel computing has become a key issue for solving sparse linear systems of large sizes.
  72 This is due to the computing power and the storage capacity of the current parallel computers as
  73 well as the availability of different parallel programming languages and environments such as the
  74 MPI communication standard. Nowadays, graphics processing units (GPUs) are the most commonly used
  75 hardware accelerators in high performance computing. They are equipped with a massively parallel
  76 architecture allowing them to compute faster than CPUs. However, the parallel computers equipped
  77 with GPUs introduce new programming difficulties to adapt  parallel algorithms to their architectures.
  78
  79 In this paper, we use the GMRES iterative method for solving large sparse linear systems on a cluster
  80 of GPUs. The parallel algorithm of this method is implemented using the CUDA programming language for
  81 the GPUs and the MPI parallel environment to distribute the computations between the different GPU nodes
  82 of the cluster. Particularly, we focus on improving the performances of the parallel sparse matrix-vector
  83 multiplication. Indeed, this operation is not only very time-consuming but it also requires communications
  84 between the GPU nodes. These communications are needed to build the global vector involved in
  85 the parallel sparse matrix-vector multiplication. It should be noted that a communication between two
  86 GPU nodes involves data transfers between the GPU and CPU memories in the same node and the MPI communications
  87 between the CPUs of the GPU nodes. For performance purposes, we propose to use a compressed storage
  88 format to reduce the size of the vectors to be exchanged between the GPU nodes and a hypergraph partitioning
  89 of the sparse matrix to reduce the total communication volume.
  90
  91 The present paper is organized as follows. In Section~\ref{sec:02} some previous works about solving
  92 sparse linear systems on GPUs are presented. In Section~\ref{sec:03} is given a general overview of the GPU architectures,
  93 followed by that the GMRES method in Section~\ref{sec:04}. In Section~\ref{sec:05} the main key points
  94 of the parallel implementation of the GMRES method on a GPU cluster are described. Finally, in Section~\ref{sec:06}
  95 is presented the performance improvements of the parallel GMRES algorithm on a GPU cluster.
  96
  97
  98 %%--------------------%%
  99 %%      SECTION 2     %%
 100 %%--------------------%%
 101 \section{Related work}
 102 \label{sec:02}
 103 Numerous works have shown the efficiency of GPUs for solving sparse linear systems compared
 104 to their CPUs counterpart. Different iterative methods are implemented on one GPU, for example
 105 Jacobi and Gauss-Seidel in~\cite{refa}, conjugate and biconjugate gradients in~\cite{refd,refe,reff,refj}
 106 and GMRES in~\cite{refb,refc,refg,refm}. In addition, some iterative methods are implemented on
 107 shared memory multi-GPUs machines as~\cite{refh,refi,refk,refl}. A limited set of studies are
 108 devoted to the parallel implementation of the iterative methods on distributed memory GPU clusters
 109 as~\cite{refn,refo,refp}.
 110
 111 Traditionally, the parallel iterative algorithms do not often scale well on GPU clusters due to
 112 the significant cost of the communications between the computing nodes. Some authors have already
 113 studied how to reduce these communications. In~\cite{cev10}, the authors used a hypergraph partitioning
 114 as a preprocessing to the parallel conjugate gradient algorithm in order to reduce the inter-GPU
 115 communications over a GPU cluster. The sequential hypergraph partitioning method provided by the
 116 PaToH tool~\cite{Cata99} is used because of the small sizes of the sparse symmetric linear systems
 117 to be solved. In~\cite{refq}, a compression and decompression technique is proposed to reduce the
 118 communication overheads. This technique is performed on the shared vectors to be exchanged between
 119 the computing nodes. In~\cite{refr}, the authors studied the impact of asynchronism on parallel
 120 iterative algorithms on local GPU clusters. Asynchronous communication primitives suppress some
 121 synchronization barriers and allow overlap of communication and computation. In~\cite{refs}, a
 122 communication reduction method is used for implementing finite element methods (FEM) on GPU clusters.
 123 This method firstly uses the Reverse Cuthill-McKee reordering to reduce the total communication
 124 volume. In addition, the performances of the parallel FEM algorithm are improved by overlapping
 125 the communication with computation.
 126
 127
 128 %%--------------------%%
 129 %%      SECTION 3     %%
 130 %%--------------------%%
 131 \section{{GPU} architectures}
 132 \label{sec:03}
 133 A GPU (Graphics processing unit) is a hardware accelerator for high performance computing.
 134 Its hardware architecture is composed of hundreds of cores organized in several blocks called
 135 \emph{streaming multiprocessors}. It is also equipped with a memory hierarchy. It has a set
 136 of registers and a private read-write \emph{local memory} per core, a fast \emph{shared memory},
 137 read-only \emph{constant} and \emph{texture} caches per multiprocessor and a read-write
 138 \emph{global memory} shared by all its multiprocessors. The new architectures (Fermi, Kepler,
 139 etc) have also L1 and L2 caches to improve the accesses to the global memory.
 140
 141 NVIDIA has released the CUDA platform (Compute Unified Device Architecture)~\cite{Nvi10}
 142 which provides a high level GPGPU-based programming language (General-Purpose computing
 143 on GPUs), allowing to program GPUs for general purpose computations. In CUDA programming
 144 environment, all data-parallel and compute intensive portions of an application running
 145 on the CPU are off-loaded onto the GPU. Indeed, an application developed in CUDA is a
 146 program written in C language (or Fortran) with a minimal set of extensions to define
 147 the parallel functions to be executed by the GPU, called \emph{kernels}. We define kernels,
 148 as separate functions from those of the CPU, by assigning them a function type qualifiers
 149 \verb+__global__+ or \verb+__device__+.
 150
 151 At the GPU level, the same kernel is executed by a large number of parallel CUDA threads
 152 grouped together as a grid of thread blocks. Each multiprocessor of the GPU executes one
 153 or more thread blocks in SIMD fashion (Single Instruction, Multiple Data) and in turn each
 154 core of a GPU multiprocessor runs one or more threads within a block in SIMT fashion (Single
 155 Instruction, Multiple threads). In order to maximize the occupation of the GPU cores, the
 156 number of CUDA threads to be involved in a kernel execution is computed according to the
 157 size of the problem to be solved. In contrast, the block size is restricted by the limited
 158 memory resources of a core. On current GPUs, a thread block may contain up-to $1,024$ concurrent
 159 threads. At any given clock cycle, the threads execute the same instruction of a kernel,
 160 but each of them operates on different data. Moreover, threads within a block can cooperate
 161 by sharing data through the fast shared memory and coordinate their execution through
 162 synchronization points. In contrast, within a grid of thread blocks, there is no synchronization
 163 at all between blocks.
 164
 165 GPUs  only work on data filled in their global memory and the final results of their kernel
 166 executions must be communicated to their hosts (CPUs). Hence, the data must be transferred
 167 \emph{in} and \emph{out} of the GPU. However, the speed of memory copy between the CPU and
 168 the GPU is slower than the memory copy speed of GPUs. Accordingly, it is necessary to limit
 169 the transfer of data between the GPU and its host.
 170
 171
 172 %%--------------------%%
 173 %%      SECTION 4     %%
 174 %%--------------------%%
 175 \section{{GMRES} method}
 176 \label{sec:04}
 177 The generalized minimal residual method (GMRES) is an iterative method designed by Saad
 178 and Schultz in 1986~\cite{Saa86}. It is a generalization of the minimal residual method
 179 (MNRES)~\cite{Pai75} to deal with asymmetric and non Hermitian problems and indefinite
 180 symmetric problems.
 181
 182 Let us consider the following sparse linear system of $n$ equations:
 183 \begin{equation}
 184   Ax = b,
 185 \label{eq:01}
 186 \end{equation}
 187 where $A\in\mathbb{R}^{n\times n}$ is a sparse square and nonsingular matrix, $x\in\mathbb{R}^n$
 188 is the solution vector and $b\in\mathbb{R}^n$ is the right-hand side vector. The main idea of
 189 the GMRES method is to find a sequence of solutions $\{x_k\}_{k\in\mathbb{N}}$ which minimizes at
 190 best the residual $r_k=b-Ax_k$. The solution $x_k$ is computed in a Krylov sub-space $\mathcal{K}_k(A,v_1)$:
 191 \begin{equation}
 192 \begin{array}{ll}
 193   \mathcal{K}_{k}(A,v_{1}) \equiv \text{span}\{v_{1}, Av_{1}, A^{2}v_{1},..., A^{k-1}v_{1}\}, & v_{1}=\frac{r_{0}}{\|r_{0}\|_{2}},
 194 \end{array}
 195 \end{equation}
 196 such that the Petrov-Galerkin condition is satisfied:
 197 \begin{equation}
 198   r_{k} \perp A\mathcal{K}_{k}(A, v_{1}).
 199 \end{equation}
 200 The GMRES method uses the Arnoldi method~\cite{Arn51} to build an orthonormal basis $V_k$
 201 and a Hessenberg upper matrix $\bar{H}_k$ of order $(k+1)\times k$:
 202 \begin{equation}
 203 \begin{array}{lll}
 204   V_{k}=\{v_{1},v_{2},...,v_{k}\}, & \text{such that} & \forall k>1, v_{k}=A^{k-1}v_{1},
 205 \end{array}
 206 \end{equation}
 207 and
 208 \begin{equation}
 209   AV_{k} = V_{k+1}\bar{H}_{k}.
 210 \label{eq:02}
 211 \end{equation}
 212 Then at each iteration $k$, a solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_{k}$
 213 spanned by $V_k$ as follows:
 214 \begin{equation}
 215 \begin{array}{ll}
 216   x_{k} = x_{0} + V_{k}y, & y\in\mathbb{R}^{k}.
 217 \end{array}
 218 \label{eq:03}
 219 \end{equation}
 220 From both equations~(\ref{eq:02}) and~(\ref{eq:03}), we can deduce:
 221 \begin{equation}
 222 \begin{array}{lll}
 223   r_{k} & = & b - A (x_{0} + V_{k}y) \\
 224         & = & r_{0} - AV_{k}y \\
 225         & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
 226         & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
 227 \end{array}
 228 \end{equation}
 229 where $\beta=\|r_{0}\|_{2}$ and $e_{1}=(1,0,...,0)$ is the first vector of the canonical basis of
 230 $\mathbb{R}^{k}$. The vector $y$ is computed in $\mathbb{R}^{k}$ so as to minimize at best the Euclidean
 231 norm of the residual $r_{k}$. This means that the following least-squares problem of size $k$ must
 232 be solved:
 233 \begin{equation}
 234   \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
 235 \end{equation}
 236 The solution of this problem is computed thanks to the QR factorization of the Hessenberg matrix
 237 $\bar{H}_{k}$ by using the Givens rotations~\cite{Saa03,Saa86} such that:
 238 \begin{equation}
 239 \begin{array}{lll}
 240   \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
 241 \end{array}
 242 \end{equation}
 243 where $Q_{k}Q_{k}^{T}=I_{k}$ and $R_{k}$ is a upper triangular matrix.
 244
 245 The GMRES method finds a solution at most after $n$ iterations. So, the Krylov sub-space dimension
 246 can increase up-to the large size $n$ of the linear system. Then in order to avoid a large storage
 247 of the orthonormal basis $V_{k}$, the Arnoldi process is restricted at $m\ll n$ iterations and restarted
 248 with the last iterate $x_{m}$ as an initial guess to the next iteration. Moreover, GMRES sometimes
 249 does not converge or takes too many iterations to satisfy the convergence criteria. Therefore, in
 250 most cases, GMRES must contain a preconditioning step to improve its convergence. The preconditioning
 251 technique replaces the system~(\ref{eq:01}) with the modified systems:
 252 \begin{equation}M^{-1}Ax=M^{-1}b.\end{equation}
 253
 254 Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts.
 255 The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning
 256 matrix. At each iteration $k$, the Arnoldi process is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01})
 257 to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$.
 258 Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes
 259 the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$).
 260 In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or
 261 the maximum number of iterations is reached.
 262
 263 \begin{algorithm}[!h]
 264   \SetAlgoLined
 265   \Entree{$A$ (matrix), $b$ (vector), $M$ (preconditioning matrix),
 266 $x_{0}$ (initial guess), $\varepsilon$ (tolerance threshold), $max$ (maximum number of iterations),
 267 $m$ (number of iterations of the Arnoldi process)}
 268   \Sortie{$x$ (solution vector)}
 269   \BlankLine
 270   $r_{0} \leftarrow M^{-1}(b - Ax_{0})$\;
 271   $\beta \leftarrow \|r_{0}\|_{2}$\;
 272   $\alpha \leftarrow \|M^{-1}b\|_{2}$\;
 273   $convergence \leftarrow false$\;
 274   $k \leftarrow 1$\;
 275   \BlankLine
 276   \While{$(\neg convergence)$}{
 277     $v_{1} \leftarrow r_{0} / \beta$\;
 278     \For{$j=1$ {\bf to} $m$}{
 279       $w_{j} \leftarrow M^{-1}Av_{j}$\;
 280       \For{$i=1$ {\bf to} $j$}{
 281         $h_{i,j} \leftarrow (w_{j},v_{i})$\;
 282         $w_{j} \leftarrow w_{j} - h_{i,j} \times v_{i}$\;
 283       }
 284       $h_{j+1,j} \leftarrow \|w_{j}\|_{2}$\;
 285       $v_{j+1} \leftarrow w_{j} / h_{j+1,j}$\;
 286     }
 287     \BlankLine
 288     Put $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ Hessenberg matrix of order $(m+1)\times m$\;
 289     Solve the least-squares problem of size $m$: $\underset{y\in\mathbb{R}^{m}}{min}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
 290     \BlankLine
 291     $x_{m} \leftarrow x_{0} + V_{m}y$\;
 292     $r_{m} \leftarrow M^{-1}(b-Ax_{m})$\;
 293     $\beta \leftarrow \|r_{m}\|_{2}$\;
 294     \BlankLine
 295     \eIf{$(\frac{\beta}{\alpha}<\varepsilon)$ {\bf or} $(k\geq max)$}{
 296       $convergence \leftarrow true$\;
 297     }{
 298       $x_{0} \leftarrow x_{m}$\;
 299       $r_{0} \leftarrow r_{m}$\;
 300       $k \leftarrow k + 1$\;
 301     }
 302   }
 303 \caption{Left-preconditioned GMRES algorithm with restarts}
 304 \label{alg:01}
 305 \end{algorithm}
 306
 307
 308
 309 %%--------------------%%
 310 %%      SECTION 5     %%
 311 %%--------------------%%
 312 \section{Parallel GMRES method on {GPU} clusters}
 313 \label{sec:05}
 314
 315 \subsection{Parallel implementation on a GPU cluster}
 316 \label{sec:05.01}
 317 The implementation of the GMRES algorithm on a GPU cluster is performed by using
 318 a parallel heterogeneous programming. We use the programming language CUDA for the
 319 GPUs and the parallel environment MPI for the distribution of the computations between
 320 the GPU computing nodes. In this work, a GPU computing node is composed of a GPU and
 321 a CPU core managed by a MPI process.
 322
 323 Let us consider a cluster composed of $p$ GPU computing nodes. First, the sparse linear
 324 system~(\ref{eq:01}) is split into $p$ sub-linear systems, each is attributed to a GPU
 325 computing node. We partition row-by-row the sparse matrix $A$ and both vectors $x$ and
 326 $b$ in $p$ parts (see Figure~\ref{fig:01}). The data issued from the partitioning operation
 327 are off-loaded on the GPU global memories to be proceeded by the GPUs. Then, all the
 328 computing nodes of the GPU cluster execute the same GMRES iterative algorithm but on
 329 different data. Finally, the GPU computing nodes synchronize their computations by using
 330 MPI communication routines to solve the global sparse linear system. In what follows,
 331 the computing nodes sharing data are called the neighboring nodes.
 332
 333 \begin{figure}
 334 \centering
 335   \includegraphics[width=80mm,keepaspectratio]{Figures/partition}
 336 \caption{Data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ in $4$ partitions}
 337 \label{fig:01}
 338 \end{figure}
 339
 340 In order to exploit the computing power of the GPUs, we have to execute maximum computations
 341 on GPUs to avoid the data transfers between the GPU and its host (CPU), and to maximize the
 342 GPU cores utilization to hide global memory access latency. The implementation of the GMRES
 343 algorithm is performed by executing the functions operating on vectors and matrices as kernels
 344 on GPUs. These operations are often easy to parallelize and more efficient on parallel architectures
 345 when they operate on large vectors. We use the fastest routines of the CUBLAS library
 346 (CUDA Basic Linear Algebra Subroutines) to implement the dot product (\verb+cublasDdot()+),
 347 the Euclidean norm (\verb+cublasDnrm2()+) and the AXPY operation (\verb+cublasDaxpy()+).
 348 In addition, we have coded in CUDA a kernel for the scalar-vector product (lines~$7$ and~$15$
 349 of Algorithm~\ref{alg:01}), a kernel for solving the least-squares problem (line~$18$) and a
 350 kernel for solution vector updates (line~$19$).
 351
 352 The solution of the least-squares problem in the GMRES algorithm is based on:
 353 \begin{itemize}
 354   \item a QR factorization of the Hessenberg matrix $\bar{H}$ by using plane rotations and,
 355   \item backward-substitution method to compute the vector $y$ minimizing the residual.
 356 \end{itemize}
 357 This operation is not easy to parallelize and it is not interesting to implement it on GPUs.
 358 However, the size $m$ of the linear least-squares problem to solve in the GMRES method with
 359 restarts is very small. So, this problem is solved in sequential by one GPU thread.
 360
 361 The most important operation in the GMRES method is the sparse matrix-vector multiplication.
 362 It is quite expensive for large size matrices in terms of execution time and memory space. In
 363 addition, it performs irregular memory accesses to read the nonzero values of the sparse matrix,
 364 implying non coalescent accesses to the GPU global memory which slow down the performances of
 365 the GPUs. So we use the HYB kernel developed and optimized by Nvidia~\cite{CUSP} which gives on
 366 average the best performance in sparse matrix-vector multiplications on GPUs~\cite{Bel09}. The
 367 HYB (Hybrid) storage format is the combination of two sparse storage formats: Ellpack format
 368 (ELL) and Coordinate format (COO). It stores a typical number of nonzero values per row in ELL
 369 format and remaining entries of exceptional rows in COO format. It combines the efficiency of
 370 ELL, due to the regularity of its memory accessing and the flexibility of COO which is insensitive
 371 to the matrix structure.
 372
 373 In the parallel GMRES algorithm, the GPU computing nodes must exchange between them their shared data in
 374 order to construct the global vector necessary to compute the parallel sparse matrix-vector
 375 multiplication (SpMV). In fact, each computing node has locally the vector elements corresponding
 376 to the rows of its sparse sub-matrix and, in order to compute its part of the SpMV, it also
 377 requires the vector elements of its neighboring nodes corresponding to the column indices in
 378 which its local sub-matrix has nonzero values. Consequently, each computing node manages a global
 379 vector composed of a local vector of size $\frac{n}{p}$ and a shared vector of size $S$:
 380 \begin{equation}
 381   S = bw - \frac{n}{p},
 382 \label{eq:11}
 383 \end{equation}
 384 where $\frac{n}{p}$ is the size of the local vector and $bw$ is the bandwidth of the local sparse
 385 sub-matrix which represents the number of columns between the minimum and the maximum column indices
 386 (see Figure~\ref{fig:01}). In order to improve memory accesses, we use the texture memory to
 387 cache elements of the global vector.
 388
 389 On a GPU cluster, the exchanges of the shared vectors elements between the neighboring nodes are
 390 performed as follows:
 391 \begin{itemize}
 392   \item at the level of the sending node: data transfers of the shared data from the GPU global memory
 393 to the CPU memory by using the CUBLAS communication routine \verb+cublasGetVector()+,
 394   \item data exchanges between the CPUs by the MPI communication routine \verb+MPI_Alltoallv()+ and,
 395   \item at the level of the receiving node: data transfers of the received shared data from the CPU
 396 memory to the GPU global memory by using CUBLAS communication routine \verb+cublasSetVector()+.
 397 \end{itemize}
 398
 399 \subsection{Experimentations}
 400 \label{sec:05.02}
 401 The experiments are done on a cluster composed of six machines interconnected by an Infiniband network
 402 of $20$~GB/s. Each machine is a Xeon E5530 Quad-Core running at $2.4$~GHz. It provides $12$~GB of RAM
 403 memory with a memory bandwidth of $25.6$~GB/s and it is equipped with two Tesla C1060 GPUs. Each GPU
 404 is composed of $240$ cores running at $1.3$ GHz and has $4$~GB of global memory with a memory bandwidth
 405 of $102$~GB/s. The GPU is connected to the CPU via a PCI-Express 16x Gen2.0 with a throughput of $8$~GB/s.
 406 Figure~\ref{fig:02} shows the general scheme of the GPU cluster.
 407
 408 \begin{figure}[!h]
 409 \centering
 410   \includegraphics[width=80mm,keepaspectratio]{Figures/clusterGPU}
 411 \caption{A cluster composed of six machines, each equipped with two Tesla C1060 GPUs}
 412 \label{fig:02}
 413 \end{figure}
 414
 415 Linux cluster version 2.6.18 OS is installed on the six machines. The C programming language is used for
 416 coding the GMRES algorithm on both the CPU and the GPU versions. CUDA version 4.0~\cite{ref19} is used for programming
 417 the GPUs, using CUBLAS library~\cite{ref37} to deal with the functions operating on vectors. Finally, MPI routines
 418 of OpenMPI 1.3.3 are used to carry out the communication between the CPU cores.
 419
 420 The experiments are done on linear systems associated to sparse matrices chosen from the Davis collection of the
 421 university of Florida~\cite{Dav97}. They are matrices arising in real-world applications. Table~\ref{tab:01} shows
 422 the main characteristics of these sparse matrices and Figure~\ref{fig:03} shows their sparse structures. For
 423 each matrix, we give the number of rows (column~$3$ in Table~\ref{tab:01}), the number of nonzero values (column~$4$)
 424 and the bandwidth (column~$5$).
 425
 426 \begin{table}
 427 \begin{center}
 428 \begin{tabular}{|c|c|r|r|r|}
 429 \hline
 430 Matrix type                 & Name              & \# Rows   & \# Nonzeros  & Bandwidth \\\hline \hline
 431 \multirow{6}{*}{Symmetric}  & 2cubes\_sphere    & 101 492   & 1 647 264    & 100 464        \\
 432                             & ecology2          & 999 999   & 4 995 991    & 2 001          \\
 433                             & finan512          & 74 752    & 596 992      & 74 725         \\
 434                             & G3\_circuit       & 1 585 478 & 7 660 826    & 1 219 059      \\
 435                             & shallow\_water2   & 81 920    & 327 680      & 58 710         \\
 436                             & thermal2          & 1 228 045 & 8 580 313    & 1 226 629      \\ \hline \hline
 437 \multirow{6}{*}{Asymmetric} & cage13            & 445 315   & 7 479 343    & 318 788        \\
 438                             & crashbasis        & 160 000   & 1 750 416    & 120 202        \\
 439                             & FEM\_3D\_thermal2 & 147 900   & 3 489 300    & 117 827        \\
 440                             & language          & 399 130   & 1 216 334    & 398 622        \\
 441                             & poli\_large       & 15 575    & 33 074       & 15 575         \\
 442                             & torso3            & 259 156   & 4 429 042    & 216 854        \\ \hline
 443 \end{tabular}
 444 \caption{Main characteristics of the sparse matrices chosen from the Davis collection}
 445 \label{tab:01}
 446 \end{center}
 447 \end{table}
 448
 449 \begin{figure}
 450 \centering
 451   \includegraphics[width=120mm,keepaspectratio]{Figures/matrices}
 452 \caption{Structures of the sparse matrices chosen from the Davis collection}
 453 \label{fig:03}
 454 \end{figure}
 455
 456 All the experiments are performed on double-precision data. The parameters of the parallel
 457 GMRES algorithm are as follows: the tolerance threshold $\varepsilon=10^{-12}$, the maximum
 458 number of iterations $max=500$, the Arnoldi process is limited to $m=16$ iterations, the elements
 459 of the guess solution $x_0$ is initialized to $0$ and those of the right-hand side vector are
 460 initialized to $1$. For simplicity sake, we chose the matrix preconditioning $M$ as the
 461 main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse
 462 matrix $M^{-1}$ and it provides relatively good preconditioning in most cases. Finally, we set
 463 the size of a thread-block in GPUs to $512$ threads.
 464
 465 \begin{table}[!h]
 466 \begin{center}
 467 \begin{tabular}{|c|c|c|c|c|c|c|}
 468 \hline
 469 Matrix            & $Time_{cpu}$  & $Time_{gpu}$  & $\tau$  & \# $iter$ & $prec$ & $\Delta$   \\ \hline \hline
 470 2cubes\_sphere    & 0.234 s     & 0.124 s      & 1.88  & 21     & 2.10e-14      & 3.47e-18 \\
 471 ecology2          & 0.076 s     & 0.035 s      & 2.15  & 21     & 4.30e-13      & 4.38e-15 \\
 472 finan512          & 0.073 s     & 0.052 s      & 1.40  & 17     & 3.21e-12      & 5.00e-16 \\
 473 G3\_circuit       & 1.016 s     & 0.649 s      & 1.56  & 22     & 1.04e-12      & 2.00e-15 \\
 474 shallow\_water2   & 0.061 s     & 0.044 s      & 1.38  & 17     & 5.42e-22      & 2.71e-25 \\
 475 thermal2          & 1.666 s     & 0.880 s      & 1.89  & 21     & 6.58e-12      & 2.77e-16 \\ \hline \hline
 476 cage13            & 0.721 s     & 0.338 s      & 2.13  & 26     & 3.37e-11      & 2.66e-15 \\
 477 crashbasis        & 1.349 s     & 0.830 s      & 1.62  & 121    & 9.10e-12      & 6.90e-12 \\
 478 FEM\_3D\_thermal2 & 0.797 s     & 0.419 s      & 1.90  & 64     & 3.87e-09      & 9.09e-13 \\
 479 language          & 2.252 s     & 1.204 s      & 1.87  & 90     & 1.18e-10      & 8.00e-11 \\
 480 poli\_large       & 0.097 s     & 0.095 s      & 1.02  & 69     & 4.98e-11      & 1.14e-12 \\
 481 torso3            & 4.242 s     & 2.030 s      & 2.09  & 175    & 2.69e-10      & 1.78e-14 \\ \hline
 482 \end{tabular}
 483 \caption{Performances of the parallel GMRES algorithm on a cluster of 24 CPU cores vs. a cluster of 12 GPUs}
 484 \label{tab:02}
 485 \end{center}
 486 \end{table}
 487
 488 In Table~\ref{tab:02}, we give the performances of the parallel GMRES algorithm for solving the linear
 489 systems associated to the sparse matrices shown in Table~\ref{tab:01}. The second and third columns show
 490 the execution times in seconds obtained on a cluster of 24 CPU cores and on a cluster of 12 GPUs, respectively.
 491 The fourth column shows the ratio $\tau$ between the CPU time $Time_{cpu}$ and the GPU time $Time_{gpu}$ that
 492 is computed as follows:
 493 \begin{equation}
 494   \tau = \frac{Time_{cpu}}{Time_{gpu}}.
 495 \end{equation}
 496 From these ratios, we can notice that the use of many GPUs is not interesting to solve small sparse linear
 497 systems. Solving these sparse linear systems on a cluster of 12 GPUs is as fast as on a cluster of 24 CPU
 498 cores. Indeed, the small sizes of the sparse matrices do not allow to maximize the utilization of the GPU
 499 cores of the cluster. The fifth, sixth and seventh columns show, respectively, the number of iterations performed
 500 by the parallel GMRES algorithm to converge, the precision of the solution, $prec$, computed on the GPU
 501 cluster and the difference, $\Delta$, between the solutions computed on the GPU and the GPU clusters. The
 502 last two parameters are used to validate the results obtained on the GPU cluster and they are computed as
 503 follows:
 504 \begin{equation}
 505 \begin{array}{c}
 506   prec = \|M^{-1}(b-Ax^{gpu})\|_{\infty}, \\
 507   \Delta = \|x^{cpu}-x^{gpu}\|_{\infty},
 508 \end{array}
 509 \end{equation}
 510 where $x^{cpu}$ and $x^{gpu}$ are the solutions computed, respectively, on the CPU cluster and on the GPU cluster.
 511 We can see that the precision of the solutions computed on the GPU cluster are sufficient, they are about $10^{-10}$,
 512 and the parallel GMRES algorithm computes almost the same solutions in both CPU and GPU clusters, with $\Delta$ varying
 513 from $10^{-11}$ to $10^{-25}$.
 514
 515 Afterwards, we evaluate the performances of the parallel GMRES algorithm for solving large linear systems. We have
 516 developed in C programming language a generator of large sparse matrices having a band structure which arises
 517 in  most numerical problems. This generator uses the sparse matrices of the Davis collection as the initial
 518 matrices to build the large band matrices. It is executed in parallel by all the MPI processes of the cluster
 519 so that each process constructs its own sub-matrix as a rectangular block of the global sparse matrix. Each process
 520 $i$ computes the size $n_i$ and the offset $offset_i$ of its sub-matrix in the global sparse matrix according to the
 521 size $n$ of the linear system to be solved and the number of the GPU computing nodes $p$ as follows:
 522 \begin{equation}
 523   n_i = \frac{n}{p},
 524 \end{equation}
 525 \begin{equation}
 526   offset_i = \left\{
 527   \begin{array}{l}
 528   0\mbox{~if~}i=0,\\
 529   offset_{i-1}+n_{i-1}\mbox{~otherwise.}
 530   \end{array}
 531   \right.
 532 \end{equation}
 533 So each process $i$ performs several copies of the same initial matrix chosen from the Davis collection and it
 534 puts all these copies on the main diagonal of the global matrix in order to construct a band matrix. Moreover,
 535 it fulfills the empty spaces between two successive copies by small copies, \textit{lower\_copy} and \textit{upper\_copy},
 536 of the same initial matrix. Figure~\ref{fig:04} shows a generation of a sparse bended matrix by four computing nodes.
 537
 538 \begin{figure}
 539 \centering
 540   \includegraphics[width=100mm,keepaspectratio]{Figures/generation}
 541 \caption{Example of the generation of a large sparse and band matrix by four computing nodes.}
 542 \label{fig:04}
 543 \end{figure}
 544
 545 Table~\ref{tab:03} shows the main characteristics (the number of nonzero values and the bandwidth) of the
 546 large sparse matrices generated from those of the Davis collection. These matrices are associated to the
 547 linear systems of 25 million of unknown values (each generated sparse matrix has 25 million rows). In Table~\ref{tab:04}
 548 we show the performances of the parallel GMRES algorithm for solving large linear systems associated to the
 549 sparse band matrices of Table~\ref{tab:03}. The fourth column gives the ratio between the execution time
 550 spent on a cluster of 24 CPU cores and that spent on a cluster of 12 GPUs. We can notice from these ratios
 551 that for solving large sparse matrices the GPU cluster is more efficient (about 5 times faster) than the CPU
 552 cluster. The computing power of the GPUs allows to accelerate the computation of the functions operating
 553 on large vectors of the parallel GMRES algorithm.
 554
 555 \begin{table}[!h]
 556 \begin{center}
 557 \begin{tabular}{|c|c|r|r|}
 558 \hline
 559 Matrix type                 & Name              & \# nonzeros & Bandwidth \\ \hline \hline
 560 \multirow{6}{*}{Symmetric}  & 2cubes\_sphere    & 413 703 602 & 198 836   \\
 561                             & ecology2          & 124 948 019 & 2 002     \\
 562                             & finan512          & 278 175 945 & 123 900   \\
 563                             & G3\_circuit       & 125 262 292 & 1 891 887 \\
 564                             & shallow\_water2   & 100 235 292 & 62 806    \\
 565                             & thermal2          & 175 300 284 & 2 421 285 \\ \hline \hline
 566 \multirow{6}{*}{Asymmetric} & cage13            & 435 770 480 & 352 566   \\
 567                             & crashbasis        & 409 291 236 & 200 203   \\
 568                             & FEM\_3D\_thermal2 & 595 266 787 & 206 029   \\
 569                             & language          & 76 912 824  & 398 626   \\
 570                             & poli\_large       & 53 322 580  & 15 576    \\
 571                             & torso3            & 433 795 264 & 328 757   \\ \hline
 572 \end{tabular}
 573 \caption{Main characteristics of the sparse and band matrices generated from the sparse matrices of the Davis collection.}
 574 \label{tab:03}
 575 \end{center}
 576 \end{table}
 577
 578
 579 \begin{table}[!h]
 580 \begin{center}
 581 \begin{tabular}{|c|c|c|c|c|c|c|}
 582 \hline
 583 Matrix            & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$& $prec$   & $\Delta$   \\ \hline \hline
 584 2cubes\_sphere    & 3.683 s     & 0.870 s      & 4.23   & 21       & 2.11e-14 & 8.67e-18 \\
 585 ecology2          & 2.570 s     & 0.424 s      & 6.06   & 21       & 4.88e-13 & 2.08e-14 \\
 586 finan512          & 2.727 s     & 0.533 s      & 5.11   & 17       & 3.22e-12 & 8.82e-14 \\
 587 G3\_circuit       & 4.656 s     & 1.024 s      & 4.54   & 22       & 1.04e-12 & 5.00e-15 \\
 588 shallow\_water2   & 2.268 s     & 0.384 s      & 5.91   & 17       & 5.54e-21 & 7.92e-24 \\
 589 thermal2          & 4.650 s     & 1.130 s      & 4.11   & 21       & 8.89e-12 & 3.33e-16 \\ \hline \hline
 590 cage13            & 6.068 s     & 1.054 s      & 5.75   & 26       & 3.29e-11 & 1.59e-14 \\
 591 crashbasis        & 25.906 s    & 4.569 s      & 5.67   & 135      & 6.81e-11 & 4.61e-15 \\
 592 FEM\_3D\_thermal2 & 13.555 s    & 2.654 s      & 5.11   & 64       & 3.88e-09 & 1.82e-12 \\
 593 language          & 13.538 s    & 2.621 s      & 5.16   & 89       & 2.11e-10 & 1.60e-10 \\
 594 poli\_large       & 8.619 s     & 1.474 s      & 5.85   & 69       & 5.05e-11 & 6.59e-12 \\
 595 torso3            & 35.213 s    & 6.763 s      & 5.21   & 175      & 2.69e-10 & 2.66e-14 \\ \hline
 596 \end{tabular}
 597 \caption{Performances of the parallel GMRES algorithm for solving large sparse linear systems associated
 598 to band matrices on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.}
 599 \label{tab:04}
 600 \end{center}
 601 \end{table}
 602
 603
 604 %%--------------------%%
 605 %%      SECTION 6     %%
 606 %%--------------------%%
 607 \section{Minimization of communications}
 608 \label{sec:06}
 609 The parallel sparse matrix-vector multiplication requires  data exchanges between the GPU computing nodes
 610 to construct the global vector. However, a GPU cluster requires communications between the GPU nodes and the
 611 data transfers between the GPUs and their hosts CPUs. In fact, a communication between two GPU nodes implies:
 612 a data transfer from the GPU memory to the CPU memory at the sending node, a MPI communication between the CPUs
 613 of two GPU nodes, and a data transfer from the CPU memory to the GPU memory at the receiving node. Moreover,
 614 the data transfers between a GPU and a CPU are considered as the most expensive communications on a GPU cluster.
 615 For example in our GPU cluster, the data throughput between a GPU and a CPU is of 8 GB/s which is about twice
 616 lower than the data transfer rate between CPUs (20 GB/s) and 12 times lower than the memory bandwidth of the
 617 GPU global memory (102 GB/s). In this section, we propose two solutions to improve the execution time of the
 618 parallel GMRES algorithm on GPU clusters.
 619
 620 \subsection{Compressed storage format of the sparse vectors}
 621 \label{sec:06.01}
 622 In Section~\ref{sec:05.01}, the SpMV multiplication uses a global vector having a size equivalent to the matrix
 623 bandwidth (see Formula~\ref{eq:11}). However, we can notice that a SpMV multiplication does not often need all
 624 the vector elements of the global vector composed of the local and shared sub-vectors. For example in Figure~\ref{fig:01},
 625  node 1 only needs  a single vector element from  node 0 (element 1), two elements from node 2 (elements 8
 626 and 9) and two elements from  node 3 (elements 13 and 14). Therefore to reduce the communication overheads
 627 of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
 628 to perform their local sparse matrix-vector multiplications.
 629
 630 \begin{figure}
 631 \centering
 632   \includegraphics[width=120mm,keepaspectratio]{Figures/compress}
 633 \caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
 634 \label{fig:05}
 635 \end{figure}
 636
 637 We propose to use a compressed storage format of the sparse global vector. In Figure~\ref{fig:05}, we show an
 638 example of the data exchanges between node 1 and its neighbors to construct the compressed global vector.
 639 First, the neighboring nodes 0, 2 and 3 determine the vector elements needed by node 1 and, then, they send
 640 only these elements to it. Node 1 receives these shared elements in a compressed vector. However to compute
 641 the sparse matrix-vector multiplication, it must first copy the received elements to the corresponding indices
 642 in the global vector. In order to avoid this process at each iteration, we propose to reorder the columns of the
 643 local sub-matrices so as to use the shared vectors in their compressed storage format (see Figure~\ref{fig:06}).
 644 For performance purposes, the computation of the shared data to send to the neighboring nodes is performed
 645 by the GPU as a kernel. In addition, we use the MPI point-to-point communication routines: a blocking send routine
 646 \verb+MPI_Send()+ and a nonblocking receive routine \verb+MPI_Irecv()+.
 647
 648 \begin{figure}
 649 \centering
 650   \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
 651 \caption{Reordering of the columns of a local sparse matrix.}
 652 \label{fig:06}
 653 \end{figure}
 654
 655 Table~\ref{tab:05} shows the performances of the parallel GMRES algorithm using the compressed storage format
 656 of the sparse global vector. The results are obtained from solving large linear systems associated to the sparse
 657 band matrices presented in Table~\ref{tab:03}. We can see from Table~\ref{tab:05} that the execution times
 658 of the parallel GMRES algorithm on a cluster of 12 GPUs are improved by about 38\% compared to those presented
 659 in Table~\ref{tab:04}. In addition, the ratios between the execution times spent on the cluster of 24 CPU cores
 660 and those spent on the cluster of 12 GPUs have increased. Indeed, the reordering of the sparse sub-matrices and
 661 the use of a compressed storage format for the sparse vectors minimize the communication overheads between the
 662 GPU computing nodes.
 663
 664 \begin{table}[!h]
 665 \begin{center}
 666 \begin{tabular}{|c|c|c|c|c|c|c|}
 667 \hline
 668 Matrix            & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$& \# $iter$& $prec$   & $\Delta$   \\ \hline \hline
 669 2cubes\_sphere    & 3.597 s     & 0.514 s      & 6.99  & 21       & 2.11e-14 & 8.67e-18 \\
 670 ecology2          & 2.549 s     & 0.288 s      & 8.83  & 21       & 4.88e-13 & 2.08e-14 \\
 671 finan512          & 2.660 s     & 0.377 s      & 7.05  & 17       & 3.22e-12 & 8.82e-14 \\
 672 G3\_circuit       & 3.139 s     & 0.480 s      & 6.53  & 22       & 1.04e-12 & 5.00e-15 \\
 673 shallow\_water2   & 2.195 s     & 0.253 s      & 8.68  & 17       & 5.54e-21 & 7.92e-24 \\
 674 thermal2          & 3.206 s     & 0.463 s      & 6.93  & 21       & 8.89e-12 & 3.33e-16 \\ \hline \hline
 675 cage13            & 5.560 s     & 0.663 s      & 8.39  & 26       & 3.29e-11 & 1.59e-14 \\
 676 crashbasis        & 25.802 s    & 3.511 s      & 7.35  & 135      & 6.81e-11 & 4.61e-15 \\
 677 FEM\_3D\_thermal2 & 13.281 s    & 1.572 s      & 8.45  & 64       & 3.88e-09 & 1.82e-12 \\
 678 language          & 12.553 s    & 1.760 s      & 7.13  & 89       & 2.11e-10 & 1.60e-10 \\
 679 poli\_large       & 8.515 s     & 1.053 s      & 8.09  & 69       & 5.05e-11 & 6.59e-12 \\
 680 torso3            & 31.463 s    & 3.681 s      & 8.55  & 175      & 2.69e-10 & 2.66e-14 \\ \hline
 681 \end{tabular}
 682 \caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
 683 vectors for solving large sparse linear systems associated to band matrices on a cluster of 24 CPU cores vs.
 684 a cluster of 12 GPUs.}
 685 \label{tab:05}
 686 \end{center}
 687 \end{table}
 688
 689
 690 \subsection{Hypergraph partitioning}
 691 \label{sec:06.02}
 692 In this section, we use another structure of the sparse matrices. We are interested in sparse matrices
 693 whose nonzero values are distributed along their large bandwidths. We developed in C programming
 694 language a generator of sparse matrices having five bands (see Figure~\ref{fig:07}). The principle of
 695 this generator is the same as the one presented in Section~\ref{sec:05.02}. However, the copies made from the
 696 initial sparse matrix, chosen from the Davis collection, are placed on the main diagonal and on two
 697 off-diagonals on the left and right of the main diagonal. Table~\ref{tab:06} shows the main characteristics
 698 of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can
 699 see in the fourth column that the bandwidths of these matrices are as large as their sizes.
 700
 701 \begin{figure}
 702 \centering
 703   \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
 704 \caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
 705 \label{fig:07}
 706 \end{figure}
 707
 708
 709 \begin{table}[!h]
 710 \begin{center}
 711 \begin{tabular}{|c|c|r|r|}
 712 \hline
 713 Matrix type                 & Name              & \# nonzeros   & Bandwidth \\ \hline \hline
 714 \multirow{6}{*}{Symmetric}  & 2cubes\_sphere    & 829 082 728   & 24 999 999     \\
 715                             & ecology2          & 254 892 056   & 25 000 000     \\
 716                             & finan512          & 556 982 339   & 24 999 973     \\
 717                             & G3\_circuit       & 257 982 646   & 25 000 000     \\
 718                             & shallow\_water2   & 200 798 268   & 25 000 000     \\
 719                             & thermal2          & 359 340 179   & 24 999 998     \\ \hline \hline
 720 \multirow{6}{*}{Asymmetric} & cage13            & 879 063 379   & 24 999 998     \\
 721                             & crashbasis        & 820 373 286   & 24 999 803     \\
 722                             & FEM\_3D\_thermal2 & 1 194 012 703 & 24 999 998     \\
 723                             & language          & 155 261 826   & 24 999 492     \\
 724                             & poli\_large       & 106 680 819   & 25 000 000     \\
 725                             & torso3            & 872 029 998   & 25 000 000     \\ \hline
 726 \end{tabular}
 727 \caption{Main characteristics of the sparse matrices having five band and generated from the sparse matrices of the Davis collection.}
 728 \label{tab:06}
 729 \end{center}
 730 \end{table}
 731
 732 In Table~\ref{tab:07} we give the performances of the parallel GMRES algorithm on the CPU and GPU
 733 clusters for solving large linear systems associated to the sparse matrices shown in Table~\ref{tab:06}.
 734 We can notice from the ratios given in the fourth column that solving sparse linear systems associated
 735 to matrices having large bandwidth on the GPU cluster is as fast as on the CPU cluster. This is due
 736 to the large total communication volume necessary to synchronize the computations over the cluster.
 737 In fact, the naive partitioning row-by-row or column-by-column of this type of sparse matrices links
 738 a GPU node to many neighboring nodes and produces a significant number of data dependencies between
 739 the different GPU nodes.
 740
 741 \begin{table}[!h]
 742 \begin{center}
 743 \begin{tabular}{|c|c|c|c|c|c|c|}
 744 \hline
 745 Matrix            & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$& $prec$ & $\Delta$   \\ \hline \hline
 746 2cubes\_sphere    & 15.963 s    & 7.250 s      & 2.20  & 58     & 6.23e-16 & 3.25e-19 \\
 747 ecology2          & 3.549 s     & 2.176 s      & 1.63  & 21     & 4.78e-15 & 1.06e-15 \\
 748 finan512          & 3.862 s     & 1.934 s      & 1.99  & 17     & 3.21e-14 & 8.43e-17 \\
 749 G3\_circuit       & 4.636 s     & 2.811 s      & 1.65  & 22     & 1.08e-14 & 1.77e-16 \\
 750 shallow\_water2   & 2.738 s     & 1.539 s      & 1.78  & 17     & 5.54e-23 & 3.82e-26 \\
 751 thermal2          & 5.017 s     & 2.587 s      & 1.94  & 21     & 8.25e-14 & 4.34e-18 \\ \hline \hline
 752 cage13            & 9.315 s     & 3.227 s      & 2.89  & 26     & 3.38e-13 & 2.08e-16 \\
 753 crashbasis        & 35.980 s    & 14.770 s     & 2.43  & 127    & 1.17e-12 & 1.56e-17 \\
 754 FEM\_3D\_thermal2 & 24.611 s    & 7.749 s      & 3.17  & 64     & 3.87e-11 & 2.84e-14 \\
 755 language          & 16.859 s    & 9.697 s      & 1.74  & 89     & 2.17e-12 & 1.70e-12 \\
 756 poli\_large       & 10.200 s    & 6.534 s      & 1.56  & 69     & 5.14e-13 & 1.63e-13 \\
 757 torso3            & 49.074 s    & 19.397 s     & 2.53  & 175    & 2.69e-12 & 2.77e-16 \\ \hline
 758 \end{tabular}
 759 \caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
 760 vectors for solving large sparse linear systems associated to matrices having five bands on a cluster
 761 of 24 CPU cores vs. a cluster of 12 GPUs.}
 762 \label{tab:07}
 763 \end{center}
 764 \end{table}
 765
 766 We propose to use a hypergraph partitioning method which is well adapted to numerous structures
 767 of sparse matrices~\cite{Cat99}. Indeed, it can well model the communications between the computing
 768 nodes especially for the asymmetric and rectangular matrices. It gives in most cases good reductions
 769 of the total communication volume. Nevertheless, it is more expensive in terms of execution time and
 770 memory space consumption than the partitioning method based on graphs.
 771
 772 The sparse matrix $A$ of the linear system to be solved is modelled as a hypergraph
 773 $\mathcal{H}=(\mathcal{V},\mathcal{E})$ as follows:
 774 \begin{itemize}
 775 \item each matrix row $i$ ($0\leq i<n$) corresponds to a vertex $v_i\in\mathcal{V}$,
 776 \item each matrix column $j$ ($0\leq j<n$) corresponds to a hyperedge $e_j\in\mathcal{E}$, such that:
 777 $\forall a_{ij}$ is a nonzero value of the matrix $A$, $v_i\in pins[e_j]$,
 778 \item $w_i$ is the weight of vertex $v_i$,
 779 \item $c_j$ is the cost of hyperedge $e_j$.
 780 \end{itemize}
 781 A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as a set
 782 of $K$ pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$: $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_k\}$,
 783 such that $\mathcal{V}=\displaystyle\cup_{k=1}^K\mathcal{V}_{k}$. Each computing node is in charge of
 784 a vertex subset. Figure~\ref{fig:08} shows an example of a hypergraph partitioning of a sparse matrix
 785 of size $(9\times 9)$ into three parts. The circles and squares correspond, respectively, to the vertices
 786 and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two
 787 different parts. The connectivity $\lambda_j$ denotes the number of different parts spanned by the cut
 788 hyperedge $e_j$.
 789
 790 \begin{figure}
 791 \centering
 792   \includegraphics[width=130mm,keepaspectratio]{Figures/hypergraph}
 793 \caption{A hypergraph partitioning of a sparse matrix between three computing nodes.}
 794 \label{fig:08}
 795 \end{figure}
 796
 797 The cut hyperedges model the communications between the different GPU computing nodes in the cluster,
 798 necessary to perform the SpMV multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic
 799 computations $b_i\leftarrow b_i+a_{ij}x_j$ of the SpMV multiplication which needs the $j^{th}$ element
 800 of  vector $x$. Therefore pins of hyperedge $e_j$ ($pins[e_j]$) denote the set of matrix rows requiring
 801 the same vector element $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are:
 802 $pins[e_9]=\{v_2,v_5,v_9\}$ represents matrix rows 2, 5 and 9 requiring the vector element $x_9$
 803 to compute in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$
 804 and $b_9\leftarrow b_9+a_{99}x_9$. However, $x_9$ is a vector element of the computing node 3 and it must
 805 be sent to the neighboring nodes 1 and 2.
 806
 807 The hypergraph partitioning allows to reduce the total communication volume while maintaining the computational
 808 load balance between the computing nodes. Indeed, it minimizes at best the following sum:
 809 \begin{equation}
 810 \mathcal{X}(\mathcal{P}) = \displaystyle\sum_{e_j\in\mathcal{E}_C} c_j(\lambda_j-1),
 811 \end{equation}
 812 where $\mathcal{E}_C$ is the set of the cut hyperedges issued from the partitioning $\mathcal{P}$, $c_j$
 813 and $\lambda_j$ are, respectively, the cost and the connectivity of the cut hyperedge $e_j$. In addition,
 814 the hypergraph partitioning is constrained to maintain the load balance between the $K$ parts:
 815 \begin{equation}
 816 W_k\leq (1+\epsilon)W_{avg}\mbox{,~}(1\leq k\leq K)\mbox{~and~}(0<\epsilon<1),
 817 \end{equation}
 818 where $W_k$ is the sum of the vertex weights in the subset $\mathcal{V}_k$, $W_{avg}$ is the average part's
 819 weight and $\epsilon$ is the maximum allowed imbalanced ratio.
 820
 821 The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed, for
 822 example: hMETIS~\cite{Kar98}, PaToH~\cite{Cata99} and Zoltan~\cite{Dev06}. Due to the large sizes of the
 823 linear systems to be solved, we use a parallel hypergraph partitioning which must be performed by at least
 824 two MPI processes. The hypergraph model $\mathcal{H}$ of the sparse matrix is split into $p$ (number of computing
 825 nodes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, then the parallel partitioning
 826 is applied by using the MPI communication routines.
 827
 828 Table~\ref{tab:08} shows the performances of the parallel GMRES algorithm for solving the linear systems
 829 associated to the sparse matrices presented in Table~\ref{tab:06}. In the experiments, we have used the
 830 compressed storage format of the sparse vectors and the parallel hypergraph partitioning developed in the
 831 Zoltan tool~\cite{ref20,ref21}. The parameters of the hypergraph partitioning are initialized as follows:
 832 \begin{itemize}
 833 \item The weight $w_i$ of each vertex $v_i$ is set to the number of the nonzero values on the matrix row $i$,
 834 \item For simplicity sake, the cost $c_j$ of each hyperedge $e_j$ is set to 1,
 835 \item The maximum imbalanced ratio $\epsilon$ is limited to 10\%.
 836 \end{itemize}
 837 We can notice from Table~\ref{tab:08} that the execution times on the cluster of 12 GPUs are significantly
 838 improved compared to those presented in Table~\ref{tab:07}. The hypergraph partitioning applied on the large
 839 sparse matrices having large bandwidths have improved the execution times on the GPU cluster by about 65\%.
 840
 841 \begin{table}[!h]
 842 \begin{center}
 843 \begin{tabular}{|c|c|c|c|c|c|c|}
 844 \hline
 845 Matrix            & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$  & \# iter & $prec$     & $\Delta$   \\ \hline \hline
 846 2cubes\_sphere    & 16.430 s    & 2.840 s      & 5.78 & 58     & 6.23e-16 & 3.25e-19 \\
 847 ecology2          & 3.152 s     & 0.367 s      & 8.59 & 21     & 4.78e-15 & 1.06e-15 \\
 848 finan512          & 3.672 s     & 0.723 s      & 5.08 & 17     & 3.21e-14 & 8.43e-17 \\
 849 G3\_circuit       & 4.468 s     & 0.971 s      & 4.60 & 22     & 1.08e-14 & 1.77e-16 \\
 850 shallow\_water2   & 2.647 s     & 0.312 s      & 8.48 & 17     & 5.54e-23 & 3.82e-26 \\
 851 thermal2          & 4.190 s     & 0.666 s      & 6.29 & 21     & 8.25e-14 & 4.34e-18 \\ \hline \hline
 852 cage13            & 8.077 s     & 1.584 s      & 5.10 & 26     & 3.38e-13 & 2.08e-16 \\
 853 crashbasis        & 35.173 s    & 5.546 s      & 6.34 & 127    & 1.17e-12 & 1.56e-17 \\
 854 FEM\_3D\_thermal2 & 24.825 s    & 3.113 s      & 7.97 & 64     & 3.87e-11 & 2.84e-14 \\
 855 language          & 16.706 s    & 2.522 s      & 6.62 & 89     & 2.17e-12 & 1.70e-12 \\
 856 poli\_large       & 12.715 s    & 3.989 s      & 3.19 & 69     & 5.14e-13 & 1.63e-13 \\
 857 torso3            & 48.459 s    & 6.234 s      & 7.77 & 175    & 2.69e-12 & 2.77e-16 \\ \hline
 858 \end{tabular}
 859 \caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
 860 vectors and a hypergraph partitioning method for solving large sparse linear systems associated to matrices
 861 having five bands on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.}
 862 \label{tab:08}
 863 \end{center}
 864 \end{table}
 865
 866 Table~\ref{tab:09} shows in the second and third columns the total communication volume on a cluster
 867 of 12 GPUs by using row-by-row partitioning and hypergraph partitioning, respectively. The total communication
 868 volume defines the total number of the vector elements exchanged between the 12 GPUs. This table shows
 869 that the hypergraph partitioning allows to split the large sparse matrices so as to minimize  data
 870 dependencies between the GPU computing nodes. However, we can notice in the fourth column that the hypergraph
 871 partitioning takes longer than the computation times. As we have mentioned before, the hypergraph partitioning
 872 method is less efficient in terms of memory consumption and partitioning time than its graph counterpart.
 873 So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning
 874 only once and, then, we save the traces in files to be reused several times. Therefore, this allows us to
 875 avoid the partitioning of the sparse matrices at each resolution of the linear systems.
 876
 877 \begin{table}[!h]
 878 \begin{center}
 879 \begin{tabular}{|c|c|c|c|}
 880 \hline
 881 \multirow{4}{*}{Matrix} & Total communication    & Total communication & Time of  \\
 882                         & volume using           & volume using        & hypergraph     \\
 883                         & row-by-row             & hypergraph          & partitioning      \\
 884                         & partitioning           & partitioning        & in minutes                \\ \hline \hline
 885 2cubes\_sphere          & 25 360 543             & 240 679             & 68.98         \\
 886 ecology2                & 26 044 002             & 73 021              & 4.92          \\
 887 finan512                & 26 087 431             & 900 729             & 33.72         \\
 888 G3\_circuit             & 31 912 003             & 5 366 774           & 11.63         \\
 889 shallow\_water2         & 25 105 108             & 60 899              & 5.06          \\
 890 thermal2                & 30 012 846             & 1 077 921           & 17.88         \\ \hline \hline
 891 cage13                  & 28 254 282             & 3 845 440           & 196.45        \\
 892 crashbasis              & 29 020 060             & 2 401 876           & 33.39         \\
 893 FEM\_3D\_thermal2       & 25 263 767             & 250 105             & 49.89         \\
 894 language                & 27 291 486             & 1 537 835           & 9.07          \\
 895 poli\_large             & 25 053 554             & 7 388 883           & 5.92          \\
 896 torso3                  & 25 682 514             & 613 250             & 61.51        \\ \hline
 897 \end{tabular}
 898 \caption{Total communication volume on a cluster of 12 GPUs using row-by-row or hypergraph partitioning methods}
 899 \label{tab:09}
 900 \end{center}
 901 \end{table}
 902
 903
 904 %%--------------------%%
 905 %%      SECTION 7     %%
 906 %%--------------------%%
 907 \section{Conclusion and perspectives}
 908 \label{sec:07}
 909 In this paper, we have aimed at harnessing the computing power of a GPU cluster for
 910 solving large sparse linear systems. We have implemented the parallel algorithm of the
 911 GMRES iterative method. We have used a heterogeneous parallel programming based on the
 912 CUDA language to program the GPUs and the MPI parallel environment to distribute the
 913 computations between the GPU nodes on the cluster.
 914
 915 The experiments have shown that solving large sparse linear systems is more efficient
 916 on a cluster of GPUs than on a cluster of CPUs. However, the efficiency of a GPU cluster
 917 is ensured as long as the spatial and temporal localization of the data is well managed.
 918 The data dependency scheme on a GPU cluster is related to the sparse structures of the
 919 matrices (positions of the nonzero values) and the number of the computing nodes. We have
 920 shown that a large number of communications between the GPU computing nodes affects
 921 considerably the performances of the parallel GMRES algorithm on the GPU cluster. Therefore,
 922 we have proposed to reorder the columns of the sparse local sub-matrices on each GPU node
 923 and to use a compressed storage format for the sparse vector involved in the parallel
 924 sparse matrix-vector multiplication. This solution allows to minimize the communication
 925 overheads. In addition, we have shown that it is interesting to choose a partitioning method
 926 according to the structure of the sparse matrix. This reduces the total communication
 927 volume between the GPU computing nodes.
 928
 929 In future works, it would be interesting to implement and study the scalability of the
 930 parallel GMRES algorithm on large GPU clusters (hundreds or thousands of GPUs) or on geographically
 931 distant GPU clusters. In this context, other methods might be used to reduce communication
 932 and to improve the performances of the parallel GMRES algorithm as the multisplitting methods.
 933 The recent GPU hardware and software architectures provide the GPU-Direct system which allows
 934 two GPUs, placed in the same machine or in two remote machines, to exchange data without using
 935 CPUs. This improves the data transfers between GPUs. Finally, it would be interesting to implement
 936 other iterative methods on GPU clusters for solving large sparse linear or nonlinear systems.
 937
 938 \paragraph{Acknowledgments}
 939 This paper is based upon work supported by the R\'egion de Franche-Comt\'e.
 940
 941 \bibliography{bib}
 942 \bibliographystyle{abbrv}
 943 \end{document}