GMRES_Journal.tex

   1 \documentclass[11pt]{article}
   2 \usepackage{multicol}
   3
   4 \usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
   5 \usepackage{times}
   6 \usepackage{graphicx}
   7 \usepackage{amsmath}
   8 \usepackage{amsfonts}
   9 \usepackage{amssymb}
  10 \usepackage{amsthm}
  11 \usepackage{amsopn}
  12 \usepackage{xspace}
  13 \usepackage{array}
  14 \usepackage{epsfig}
  15 \usepackage{algorithmic}
  16 \usepackage[ruled,english,boxed,linesnumbered]{algorithm2e}
  17 \usepackage[english]{algorithm2e}
  18 \usepackage{url}
  19 \usepackage{mdwlist}
  20 \usepackage{multirow}
  21
  22 \date{}
  23
  24 \title{Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters}
  25
  26 \author{
  27 \textsc{Lilia Ziane Khodja}
  28 \qquad
  29 \textsc{Rapha\"el Couturier}\thanks{Contact author}
  30 \qquad
  31 \textsc{Arnaud Giersch}
  32 \qquad
  33 \textsc{Jacques M. Bahi}
  34 \mbox{}\\ %
  35 \\
  36 FEMTO-ST Institute, University of Franche-Comte\\
  37 IUT Belfort-Montb\'eliard\\
  38 19 Av. du Maréchal Juin, BP 527, 90016 Belfort, France\\
  39 \mbox{}\\ %
  40 \normalsize
  41 \{\texttt{lilia.ziane\_khoja},~\texttt{raphael.couturier},~\texttt{arnaud.giersch},~\texttt{jacques.bahi}\}\texttt{@univ-fcomte.fr}
  42 }
  43
  44 \newcommand{\BW}{\mathit{bw}}
  45 \newcommand{\Iter}{\mathit{iter}}
  46 \newcommand{\Max}{\mathit{max}}
  47 \newcommand{\Offset}{\mathit{offset}}
  48 \newcommand{\Prec}{\mathit{prec}}
  49 \newcommand{\Ratio}{\mathit{Ratio}}
  50 \newcommand{\Time}[1]{\mathit{Time}_\mathit{#1}}
  51 \newcommand{\Wavg}{W_{\mathit{avg}}}
  52
  53 \begin{document}
  54
  55 \maketitle
  56
  57 \begin{abstract}
  58 In this paper, we aim at exploiting the power computing of a GPU cluster for solving large sparse
  59 linear systems. We implement the parallel algorithm of the GMRES iterative method using the CUDA
  60 programming language and the MPI parallel environment. The experiments show that a GPU cluster
  61 is more efficient than a CPU cluster. In order to optimize the performances, we use a compressed
  62 storage format for the sparse vectors and the hypergraph partitioning. These solutions improve
  63 the spatial and temporal localization of the shared data between the computing nodes of the GPU
  64 cluster.
  65 \end{abstract}
  66
  67
  68
  69 %%--------------------%%
  70 %%      SECTION 1     %%
  71 %%--------------------%%
  72 \section{Introduction}
  73 \label{sec:01}
  74 Large sparse linear systems arise in  most numerical scientific or industrial simulations.
  75 They model numerous complex problems in different areas of applications such as mathematics,
  76 engineering, biology or physics~\cite{ref18}. However, solving these systems of equations is
  77 often an expensive operation in terms of execution time and memory space consumption. Indeed,
  78 the linear systems arising in most applications are very large  and have many zero
  79 coefficients, and this sparse nature leads to irregular accesses to load the nonzero coefficients
  80 from the memory.
  81
  82 Parallel computing has become a key issue for solving sparse linear systems of large sizes.
  83 This is due to the computing power and the storage capacity of the current parallel computers as
  84 well as the availability of different parallel programming languages and environments such as the
  85 MPI communication standard. Nowadays, graphics processing units (GPUs) are the most commonly used
  86 hardware accelerators in high performance computing. They are equipped with a massively parallel
  87 architecture allowing them to compute faster than CPUs. However, the parallel computers equipped
  88 with GPUs introduce new programming difficulties to adapt  parallel algorithms to their architectures.
  89
  90 In this paper, we use the GMRES iterative method for solving large sparse linear systems on a cluster
  91 of GPUs. The parallel algorithm of this method is implemented using the CUDA programming language for
  92 the GPUs and the MPI parallel environment to distribute the computations between the different GPU nodes
  93 of the cluster. Particularly, we focus on improving the performances of the parallel sparse matrix-vector multiplication.
  94 Indeed, this operation is not only very time-consuming but it also requires communications
  95 between the GPU nodes. These communications are needed to build the global vector involved in
  96 the parallel sparse matrix-vector multiplication. It should be noted that a communication between two
  97 GPU nodes involves data transfers between the GPU and CPU memories in the same node and the MPI communications
  98 between the CPUs of the GPU nodes. For performance purposes, we propose to use a compressed storage
  99 format to reduce the size of the vectors to be exchanged between the GPU nodes and a hypergraph partitioning
 100 of the sparse matrix to reduce the total communication volume.
 101
 102 The present paper is organized as follows. In Section~\ref{sec:02} some previous works about solving
 103 sparse linear systems on GPUs are presented. In Section~\ref{sec:03} is given a general overview of the GPU architectures,
 104 followed by that the GMRES method in Section~\ref{sec:04}. In Section~\ref{sec:05} the main key points
 105 of the parallel implementation of the GMRES method on a GPU cluster are described. Finally, in Section~\ref{sec:06}
 106 is presented the performance improvements of the parallel GMRES algorithm on a GPU cluster.
 107
 108
 109 %%--------------------%%
 110 %%      SECTION 2     %%
 111 %%--------------------%%
 112 \section{Related work}
 113 \label{sec:02}
 114 Numerous works have shown the efficiency of GPUs for solving sparse linear systems compared
 115 to their CPUs counterpart. Different iterative methods are implemented on one GPU, for example
 116 Jacobi and Gauss-Seidel in~\cite{refa}, conjugate and biconjugate gradients in~\cite{refd,refe,reff,refj}
 117 and GMRES in~\cite{refb,refc,refg,refm}. In addition, some iterative methods are implemented on
 118 shared memory multi-GPUs machines as~\cite{refh,refi,refk,refl}. A limited set of studies are
 119 devoted to the parallel implementation of the iterative methods on distributed memory GPU clusters
 120 as~\cite{refn,refo,refp}.
 121
 122 Traditionally, the parallel iterative algorithms do not often scale well on GPU clusters due to
 123 the significant cost of the communications between the computing nodes. Some authors have already
 124 studied how to reduce these communications. In~\cite{cev10}, the authors used a hypergraph partitioning
 125 as a preprocessing to the parallel conjugate gradient algorithm in order to reduce the inter-GPU
 126 communications over a GPU cluster. The sequential hypergraph partitioning method provided by the
 127 PaToH tool~\cite{Cata99} is used because of the small sizes of the sparse symmetric linear systems
 128 to be solved. In~\cite{refq}, a compression and decompression technique is proposed to reduce the
 129 communication overheads. This technique is performed on the shared vectors to be exchanged between
 130 the computing nodes. In~\cite{refr}, the authors studied the impact of asynchronism on parallel
 131 iterative algorithms on local GPU clusters. Asynchronous communication primitives suppress some
 132 synchronization barriers and allow overlap of communication and computation. In~\cite{refs}, a
 133 communication reduction method is used for implementing finite element methods (FEM) on GPU clusters.
 134 This method firstly uses the Reverse Cuthill-McKee reordering to reduce the total communication
 135 volume. In addition, the performances of the parallel FEM algorithm are improved by overlapping
 136 the communication with computation.
 137
 138  Our main contribution in this work is to show the difficulties of implementing the GMRES method to solve sparse linear systems on a cluster of GPUs. First, we show the main key points of the parallel GMRES algorithm on a GPU cluster. Then, we discuss the improvements of the algorithm which are mainly performed on the sparse matrix-vector multiplication when the matrix is distributed on several GPUs. In fact, on a cluster of GPUs the influence of the communications is greater than on clusters of CPUs due to the CPU/GPU communications between two GPUs that are not on the same machines. We propose to perform a hypergraph partitioning on the problem to be solved, then we reorder the matrix columns according to the partitioning scheme, and we use a compressed format to store the vectors in order to minimize the communication overheads between two GPUs.
 139
 140
 141 %%--------------------%%
 142 %%      SECTION 3     %%
 143 %%--------------------%%
 144 \section{{GPU} architectures}
 145 \label{sec:03}
 146 A GPU (Graphics processing unit) is a hardware accelerator for high performance computing.
 147 Its hardware architecture is composed of hundreds of cores organized in several blocks called
 148 \emph{streaming multiprocessors}. It is also equipped with a memory hierarchy. It has a set
 149 of registers and a private read-write \emph{local memory} per core, a fast \emph{shared memory},
 150 read-only \emph{constant} and \emph{texture} caches per multiprocessor and a read-write
 151 \emph{global memory} shared by all its multiprocessors. The new architectures (Fermi, Kepler,
 152 etc) have also L1 and L2 caches to improve the accesses to the global memory.
 153
 154 NVIDIA has released the CUDA platform (Compute Unified Device Architecture)~\cite{ref19}
 155 which provides a high level GPGPU-based programming language (General-Purpose computing
 156 on GPUs), allowing to program GPUs for general purpose computations. In CUDA programming
 157 environment, all data-parallel and compute intensive portions of an application running
 158 on the CPU are off-loaded onto the GPU. Indeed, an application developed in CUDA is a
 159 program written in C language (or Fortran) with a minimal set of extensions to define
 160 the parallel functions to be executed by the GPU, called \emph{kernels}. We define kernels,
 161 as separate functions from those of the CPU, by assigning them a function type qualifiers
 162 \verb+__global__+ or \verb+__device__+.
 163
 164 At the GPU level, the same kernel is executed by a large number of parallel CUDA threads
 165 grouped together as a grid of thread blocks. Each multiprocessor of the GPU executes one
 166 or more thread blocks in SIMD fashion (Single Instruction, Multiple Data) and in turn each
 167 core of a GPU multiprocessor runs one or more threads within a block in SIMT fashion (Single
 168 Instruction, Multiple threads). In order to maximize the occupation of the GPU cores, the
 169 number of CUDA threads to be involved in a kernel execution is computed according to the
 170 size of the problem to be solved. In contrast, the block size is restricted by the limited
 171 memory resources of a core. On current GPUs, a thread block may contain up-to $1,024$ concurrent
 172 threads. At any given clock cycle, the threads execute the same instruction of a kernel,
 173 but each of them operates on different data. Moreover, threads within a block can cooperate
 174 by sharing data through the fast shared memory and coordinate their execution through
 175 synchronization points. In contrast, within a grid of thread blocks, there is no synchronization
 176 at all between blocks.
 177
 178 GPUs  only work on data filled in their global memory and the final results of their kernel
 179 executions must be communicated to their hosts (CPUs). Hence, the data must be transferred
 180 \emph{in} and \emph{out} of the GPU. However, the speed of memory copy between the CPU and
 181 the GPU is slower than the memory copy speed of GPUs. Accordingly, it is necessary to limit
 182 the transfer of data between the GPU and its host.
 183
 184
 185 %%--------------------%%
 186 %%      SECTION 4     %%
 187 %%--------------------%%
 188 \section{{GMRES} method}
 189 \label{sec:04}
 190
 191 The generalized minimal residual method (GMRES) is an iterative method designed by Saad and Schultz in 1986~\cite{Saa86}. It is a generalization of the minimal residual method (MNRES)~\cite{Pai75} to deal with asymmetric and non Hermitian problems and indefinite symmetric problems.
 192
 193 Let us consider the following sparse linear system of $n$ equations:
 194 \begin{equation}
 195   Ax = b,
 196 \label{eq:01}
 197 \end{equation}
 198 where $A\in\mathbb{R}^{n\times n}$ is a sparse square and nonsingular matrix, $x\in\mathbb{R}^n$ is the solution vector and $b\in\mathbb{R}^n$ is the right-hand side vector. The main idea of the GMRES method is to find a sequence of solutions $\{x_k\}_{k\in\mathbb{N}}$ which minimizes at best the residual $r_k=b-Ax_k$. The solution $x_k$ is computed in a Krylov sub-space $\mathcal{K}_k(A,v_1)$:
 199 \begin{equation}
 200 \begin{array}{ll}
 201   \mathcal{K}_{k}(A,v_{1}) \equiv \text{span}\{v_{1}, Av_{1}, A^{2}v_{1},..., A^{k-1}v_{1}\}, & v_{1}=\frac{r_{0}}{\|r_{0}\|_{2}},
 202 \end{array}
 203 \end{equation}
 204 such that the Petrov-Galerkin condition is satisfied:
 205 \begin{equation}
 206   r_{k} \perp A\mathcal{K}_{k}(A, v_{1}).
 207 \end{equation}
 208
 209 Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts. The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning matrix. The Arnoldi process~\cite{Arn51} is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01}) to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$ such that $m\ll n$. Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$). In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or the maximum number of iterations is reached.
 210 %%% END %%%
 211
 212 \begin{algorithm}[!ht]
 213   \newcommand{\Convergence}{\mathit{convergence}}
 214   \newcommand{\False}{\mathit{false}}
 215   \newcommand{\True}{\mathit{true}}
 216   %\SetAlgoLined
 217   \KwIn{$A$ (matrix), $b$ (vector), $M$ (preconditioning matrix),
 218 $x_{0}$ (initial guess), $\varepsilon$ (tolerance threshold), $\Max$ (maximum number of iterations),
 219 $m$ (number of iterations of the Arnoldi process)}
 220   \KwOut{$x$ (solution vector)}
 221   \BlankLine
 222   $r_{0} \leftarrow M^{-1}(b - Ax_{0})$\;
 223   $\beta \leftarrow \|r_{0}\|_{2}$\;
 224   $\alpha \leftarrow \|M^{-1}b\|_{2}$\;
 225   $\Convergence \leftarrow \False$\;
 226   $k \leftarrow 1$\;
 227   \BlankLine
 228   \While{$(\neg \Convergence)$}{
 229     $v_{1} \leftarrow r_{0} / \beta$\;
 230     \For{$j=1$ {\bf to} $m$}{
 231       $w_{j} \leftarrow M^{-1}Av_{j}$\;
 232       \For{$i=1$ {\bf to} $j$}{
 233         $h_{i,j} \leftarrow (w_{j},v_{i})$\;
 234         $w_{j} \leftarrow w_{j} - h_{i,j} \times v_{i}$\;
 235       }
 236       $h_{j+1,j} \leftarrow \|w_{j}\|_{2}$\;
 237       $v_{j+1} \leftarrow w_{j} / h_{j+1,j}$\;
 238     }
 239     \BlankLine
 240     Put $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ Hessenberg matrix of order $(m+1)\times m$\;
 241     Solve the least-squares problem of size $m$: $\displaystyle\min_{y\in\mathbb{R}^{m}} \|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
 242     \BlankLine
 243     $x_{m} \leftarrow x_{0} + V_{m}y$\;
 244     $r_{m} \leftarrow M^{-1}(b-Ax_{m})$\;
 245     $\beta \leftarrow \|r_{m}\|_{2}$\;
 246     \BlankLine
 247     \eIf{$(\frac{\beta}{\alpha}<\varepsilon)$ {\bf or} $(k\geq \Max)$}{
 248       $\Convergence \leftarrow \True$\;
 249     }{
 250       $x_{0} \leftarrow x_{m}$\;
 251       $r_{0} \leftarrow r_{m}$\;
 252       $k \leftarrow k + 1$\;
 253     }
 254   }
 255 \caption{Left-preconditioned GMRES algorithm with restarts}
 256 \label{alg:01}
 257 \end{algorithm}
 258
 259
 260
 261 %%--------------------%%
 262 %%      SECTION 5     %%
 263 %%--------------------%%
 264 \section{Parallel GMRES method on {GPU} clusters}
 265 \label{sec:05}
 266
 267 \subsection{Parallel implementation on a GPU cluster}
 268 \label{sec:05.01}
 269 The implementation of the GMRES algorithm on a GPU cluster is performed by using
 270 a parallel heterogeneous programming. We use the programming language CUDA for the
 271 GPUs and the parallel environment MPI for the distribution of the computations between
 272 the GPU computing nodes. In this work, a GPU computing node is composed of a GPU and
 273 a CPU core managed by a MPI process.
 274
 275 Let us consider a cluster composed of $p$ GPU computing nodes. First, the sparse linear
 276 system~(\ref{eq:01}) is split into $p$ sub-linear systems, each is attributed to a GPU
 277 computing node. We partition row-by-row the sparse matrix $A$ and both vectors $x$ and
 278 $b$ in $p$ parts (see Figure~\ref{fig:01}). The data issued from the partitioning operation
 279 are off-loaded on the GPU global memories to be proceeded by the GPUs. Then, all the
 280 computing nodes of the GPU cluster execute the same GMRES iterative algorithm but on
 281 different data. Finally, the GPU computing nodes synchronize their computations by using
 282 MPI communication routines to solve the global sparse linear system. In what follows,
 283 the computing nodes sharing data are called the neighboring nodes.
 284
 285 \begin{figure}
 286 \centering
 287   \includegraphics[width=80mm,keepaspectratio]{Figures/partition}
 288 \caption{Data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ in $4$ partitions}
 289 \label{fig:01}
 290 \end{figure}
 291
 292 In order to exploit the computing power of the GPUs, we have to execute maximum computations
 293 on GPUs to avoid the data transfers between the GPU and its host (CPU), and to maximize the
 294 GPU cores utilization to hide global memory access latency. The implementation of the GMRES
 295 algorithm is performed by executing the functions operating on vectors and matrices as kernels
 296 on GPUs. These operations are often easy to parallelize and more efficient on parallel architectures
 297 when they operate on large vectors. We use the fastest routines of the CUBLAS library
 298 (CUDA Basic Linear Algebra Subroutines) to implement the dot product (\verb+cublasDdot()+),
 299 the Euclidean norm (\verb+cublasDnrm2()+) and the AXPY operation (\verb+cublasDaxpy()+).
 300 In addition, we have coded in CUDA a kernel for the scalar-vector product (lines~$7$ and~$15$
 301 of Algorithm~\ref{alg:01}), a kernel for solving the least-squares problem (line~$18$) and a
 302 kernel for solution vector updates (line~$19$).
 303
 304 The solution of the least-squares problem in the GMRES algorithm is based on:
 305 \begin{itemize}
 306   \item a QR factorization of the Hessenberg matrix $\bar{H}$ by using plane rotations and,
 307   \item backward-substitution method to compute the vector $y$ minimizing the residual.
 308 \end{itemize}
 309 This operation is not easy to parallelize and it is not interesting to implement it on GPUs.
 310 However, the size $m$ of the linear least-squares problem to solve in the GMRES method with
 311 restarts is very small. So, this problem is solved in sequential by one GPU thread.
 312
 313 The most important operation in the GMRES method is the sparse matrix-vector multiplication.
 314 It is quite expensive for large size matrices in terms of execution time and memory space. In
 315 addition, it performs irregular memory accesses to read the nonzero values of the sparse matrix,
 316 implying non coalescent accesses to the GPU global memory which slow down the performances of
 317 the GPUs. So we use the HYB kernel developed and optimized by NVIDIA~\cite{CUSP} which gives on
 318 average the best performance in sparse matrix-vector multiplications on GPUs~\cite{Bel09}. The
 319 HYB (Hybrid) storage format is the combination of two sparse storage formats: Ellpack format
 320 (ELL) and Coordinate format (COO). It stores a typical number of nonzero values per row in ELL
 321 format and remaining entries of exceptional rows in COO format. It combines the efficiency of
 322 ELL, due to the regularity of its memory accessing and the flexibility of COO which is insensitive
 323 to the matrix structure.
 324
 325 In the parallel GMRES algorithm, the GPU computing nodes must exchange between them their shared data in
 326 order to construct the global vector necessary to compute the parallel sparse matrix-vector
 327 multiplication (SpMV). In fact, each computing node has locally the vector elements corresponding
 328 to the rows of its sparse sub-matrix and, in order to compute its part of the SpMV, it also
 329 requires the vector elements of its neighboring nodes corresponding to the column indices in
 330 which its local sub-matrix has nonzero values. Consequently, each computing node manages a global
 331 vector composed of a local vector of size $\frac{n}{p}$ and a shared vector of size $S$:
 332 \begin{equation}
 333   S = \BW - \frac{n}{p},
 334 \label{eq:11}
 335 \end{equation}
 336 where $\frac{n}{p}$ is the size of the local vector and $\BW$ is the bandwidth of the local sparse
 337 sub-matrix which represents the number of columns between the minimum and the maximum column indices
 338 (see Figure~\ref{fig:01}). In order to improve memory accesses, we use the texture memory to
 339 cache elements of the global vector.
 340
 341 On a GPU cluster, the exchanges of the shared vectors elements between the neighboring nodes are
 342 performed as follows:
 343 \begin{itemize}
 344   \item at the level of the sending node: data transfers of the shared data from the GPU global memory
 345 to the CPU memory by using the CUBLAS communication routine \verb+cublasGetVector()+,
 346   \item data exchanges between the CPUs by the MPI communication routine \verb+MPI_Alltoallv()+ and,
 347   \item at the level of the receiving node: data transfers of the received shared data from the CPU
 348 memory to the GPU global memory by using CUBLAS communication routine \verb+cublasSetVector()+.
 349 \end{itemize}
 350
 351 \subsection{Experimentations}
 352 \label{sec:05.02}
 353 The experiments are done on a cluster composed of six machines interconnected by an Infiniband network
 354 of $20$~GB/s. Each machine is a Xeon E5530 Quad-Core running at $2.4$~GHz. It provides $12$~GB of RAM
 355 memory with a memory bandwidth of $25.6$~GB/s and it is equipped with two Tesla C1060 GPUs. Each GPU
 356 is composed of $240$ cores running at $1.3$ GHz and has $4$~GB of global memory with a memory bandwidth
 357 of $102$~GB/s. The GPU is connected to the CPU via a PCI-Express 16x Gen2.0 with a throughput of $8$~GB/s.
 358 Figure~\ref{fig:02} shows the general scheme of the GPU cluster.
 359
 360 \begin{figure}[!ht]
 361 \centering
 362   \includegraphics[width=80mm,keepaspectratio]{Figures/clusterGPU}
 363 \caption{A cluster composed of six machines, each equipped with two Tesla C1060 GPUs}
 364 \label{fig:02}
 365 \end{figure}
 366
 367 Scientific Linux 5.10, with Linux version 2.6.18, is installed on the six machines. The C programming language is used for
 368 coding the GMRES algorithm on both the CPU and the GPU versions. CUDA version 4.0~\cite{ref19} is used for programming
 369 the GPUs, using CUBLAS library~\cite{ref37} to deal with the functions operating on vectors. Finally, MPI routines
 370 of OpenMPI 1.3.3 are used to carry out the communication between the CPU cores.
 371
 372 The experiments are done on linear systems associated to sparse matrices chosen from the Davis collection of the
 373 University of Florida~\cite{Dav97}. They are matrices arising in real-world applications. Table~\ref{tab:01} shows
 374 the main characteristics of these sparse matrices and Figure~\ref{fig:03} shows their sparse structures. For
 375 each matrix, we give the number of rows (column~$3$ in Table~\ref{tab:01}), the number of nonzero values (column~$4$)
 376 and the bandwidth (column~$5$).
 377
 378 \begin{table}
 379 \begin{center}
 380 \begin{tabular}{|c|c|r|r|r|}
 381 \hline
 382 Matrix type                 & Name              & \# Rows   & \# Nonzeros  & Bandwidth \\\hline \hline
 383 \multirow{6}{*}{Symmetric}  & 2cubes\_sphere    & 101 492   & 1 647 264    & 100 464        \\
 384                             & ecology2          & 999 999   & 4 995 991    & 2 001          \\
 385                             & finan512          & 74 752    & 596 992      & 74 725         \\
 386                             & G3\_circuit       & 1 585 478 & 7 660 826    & 1 219 059      \\
 387                             & shallow\_water2   & 81 920    & 327 680      & 58 710         \\
 388                             & thermal2          & 1 228 045 & 8 580 313    & 1 226 629      \\ \hline \hline
 389 \multirow{6}{*}{Asymmetric} & cage13            & 445 315   & 7 479 343    & 318 788        \\
 390                             & crashbasis        & 160 000   & 1 750 416    & 120 202        \\
 391                             & FEM\_3D\_thermal2 & 147 900   & 3 489 300    & 117 827        \\
 392                             & language          & 399 130   & 1 216 334    & 398 622        \\
 393                             & poli\_large       & 15 575    & 33 074       & 15 575         \\
 394                             & torso3            & 259 156   & 4 429 042    & 216 854        \\ \hline
 395 \end{tabular}
 396 \caption{Main characteristics of the sparse matrices chosen from the Davis collection}
 397 \label{tab:01}
 398 \end{center}
 399 \end{table}
 400
 401 \begin{figure}
 402 \centering
 403   \includegraphics[width=120mm,keepaspectratio]{Figures/matrices}
 404 \caption{Structures of the sparse matrices chosen from the Davis collection}
 405 \label{fig:03}
 406 \end{figure}
 407
 408 All the experiments are performed on double-precision data. The parameters of the parallel
 409 GMRES algorithm are as follows: the tolerance threshold $\varepsilon=10^{-12}$, the maximum
 410 number of iterations $\Max=500$, the Arnoldi process is limited to $m=16$ iterations, the elements
 411 of the guess solution $x_0$ is initialized to $0$ and those of the right-hand side vector are
 412 initialized to $1$. For simplicity's sake, we chose the matrix preconditioning $M$ as the
 413 main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse
 414 matrix $M^{-1}$ and it provides relatively good preconditioning in most cases. Finally, we set
 415 the size of a thread-block in GPUs to $512$ threads.
 416  It should be noted that the same optimizations are performed on the CPU version and on the GPU version of the parallel GMRES algorithm.
 417
 418 \begin{table}[!ht]
 419 \begin{center}
 420 \begin{tabular}{|c|c|c|c|c|c|c|}
 421 \hline
 422 Matrix            & $\Time{cpu}$  & $\Time{gpu}$  & $\tau$  & \# $\Iter$ & $\Prec$ & $\Delta$   \\ \hline \hline
 423 2cubes\_sphere    & 0.234 s     & 0.124 s      & 1.88  & 21     & 2.10e-14      & 3.47e-18 \\
 424 ecology2          & 0.076 s     & 0.035 s      & 2.15  & 21     & 4.30e-13      & 4.38e-15 \\
 425 finan512          & 0.073 s     & 0.052 s      & 1.40  & 17     & 3.21e-12      & 5.00e-16 \\
 426 G3\_circuit       & 1.016 s     & 0.649 s      & 1.56  & 22     & 1.04e-12      & 2.00e-15 \\
 427 shallow\_water2   & 0.061 s     & 0.044 s      & 1.38  & 17     & 5.42e-22      & 2.71e-25 \\
 428 thermal2          & 1.666 s     & 0.880 s      & 1.89  & 21     & 6.58e-12      & 2.77e-16 \\ \hline \hline
 429 cage13            & 0.721 s     & 0.338 s      & 2.13  & 26     & 3.37e-11      & 2.66e-15 \\
 430 crashbasis        & 1.349 s     & 0.830 s      & 1.62  & 121    & 9.10e-12      & 6.90e-12 \\
 431 FEM\_3D\_thermal2 & 0.797 s     & 0.419 s      & 1.90  & 64     & 3.87e-09      & 9.09e-13 \\
 432 language          & 2.252 s     & 1.204 s      & 1.87  & 90     & 1.18e-10      & 8.00e-11 \\
 433 poli\_large       & 0.097 s     & 0.095 s      & 1.02  & 69     & 4.98e-11      & 1.14e-12 \\
 434 torso3            & 4.242 s     & 2.030 s      & 2.09  & 175    & 2.69e-10      & 1.78e-14 \\ \hline
 435 \end{tabular}
 436 \caption{Performances of the parallel GMRES algorithm on a cluster of 24 CPU cores vs. a cluster of 12 GPUs}
 437 \label{tab:02}
 438 \end{center}
 439 \end{table}
 440
 441 In Table~\ref{tab:02}, we give the performances of the parallel GMRES algorithm for solving the linear
 442 systems associated to the sparse matrices shown in Table~\ref{tab:01}. The second and third columns show
 443 the execution times in seconds obtained on a cluster of 24 CPU cores and on a cluster of 12 GPUs, respectively.
 444 The fourth column shows the ratio $\tau$ between the CPU time $\Time{cpu}$ and the GPU time $\Time{gpu}$ that
 445 is computed as follows:
 446 \begin{equation}
 447   \tau = \frac{\Time{cpu}}{\Time{gpu}}.
 448 \end{equation}
 449 From these ratios, we can notice that the use of many GPUs is not interesting to solve small sparse linear
 450 systems. Solving these sparse linear systems on a cluster of 12 GPUs is as fast as on a cluster of 24 CPU
 451 cores. Indeed, the small sizes of the sparse matrices do not allow to maximize the utilization of the GPU
 452 cores of the cluster. The fifth, sixth and seventh columns show, respectively, the number of iterations performed
 453 by the parallel GMRES algorithm to converge, the precision of the solution, $\Prec$, computed on the GPU
 454 cluster and the difference, $\Delta$, between the solutions computed on the GPU and the GPU clusters. The
 455 last two parameters are used to validate the results obtained on the GPU cluster and they are computed as
 456 follows:
 457 \begin{equation}
 458   \begin{aligned}
 459     \Prec &= \|M^{-1}(b-Ax^{gpu})\|_{\infty}, \\
 460     \Delta &= \|x^{cpu}-x^{gpu}\|_{\infty},
 461   \end{aligned}
 462 \end{equation}
 463 where $x^{cpu}$ and $x^{gpu}$ are the solutions computed, respectively, on the CPU cluster and on the GPU cluster.
 464 We can see that the precision of the solutions computed on the GPU cluster are sufficient, they are about $10^{-10}$,
 465 and the parallel GMRES algorithm computes almost the same solutions in both CPU and GPU clusters, with $\Delta$ varying
 466 from $10^{-11}$ to $10^{-25}$.
 467
 468 Afterwards, we evaluate the performances of the parallel GMRES algorithm for solving large linear systems. We have
 469 developed in C programming language a generator of large sparse matrices having a band structure which arises
 470 in  most numerical problems. This generator uses the sparse matrices of the Davis collection as the initial
 471 matrices to build the large band matrices. It is executed in parallel by all the MPI processes of the cluster
 472 so that each process constructs its own sub-matrix as a rectangular block of the global sparse matrix. Each process
 473 $i$ computes the size $n_i$ and the offset $\Offset_i$ of its sub-matrix in the global sparse matrix according to the
 474 size $n$ of the linear system to be solved and the number of the GPU computing nodes $p$ as follows:
 475 \begin{equation}
 476   n_i = \frac{n}{p},
 477 \end{equation}
 478 \begin{equation}
 479   \Offset_i =
 480   \begin{cases}
 481     0 & \text{if $i=0$,}\\
 482     \Offset_{i-1}+n_{i-1} & \text{otherwise.}
 483   \end{cases}
 484 \end{equation}
 485 So each process $i$ performs several copies of the same initial matrix chosen from the Davis collection and it
 486 puts all these copies on the main diagonal of the global matrix in order to construct a band matrix. Moreover,
 487 it fulfills the empty spaces between two successive copies by small copies, \textit{lower\_copy} and \textit{upper\_copy},
 488 of the same initial matrix. Figure~\ref{fig:04} shows a generation of a sparse bended matrix by four computing nodes.
 489
 490 \begin{figure}
 491 \centering
 492   \includegraphics[width=100mm,keepaspectratio]{Figures/generation}
 493 \caption{Example of the generation of a large sparse and band matrix by four computing nodes.}
 494 \label{fig:04}
 495 \end{figure}
 496
 497 Table~\ref{tab:03} shows the main characteristics (the number of nonzero values and the bandwidth) of the
 498 large sparse matrices generated from those of the Davis collection. These matrices are associated to the
 499 linear systems of 25 million of unknown values (each generated sparse matrix has 25 million rows). In Table~\ref{tab:04}
 500 we show the performances of the parallel GMRES algorithm for solving large linear systems associated to the
 501 sparse band matrices of Table~\ref{tab:03}. The fourth column gives the ratio between the execution time
 502 spent on a cluster of 24 CPU cores and that spent on a cluster of 12 GPUs. We can notice from these ratios
 503 that for solving large sparse matrices the GPU cluster is more efficient (about 5 times faster) than the CPU
 504 cluster. The computing power of the GPUs allows to accelerate the computation of the functions operating
 505 on large vectors of the parallel GMRES algorithm.
 506
 507 \begin{table}[!ht]
 508 \begin{center}
 509 \begin{tabular}{|c|c|r|r|}
 510 \hline
 511 Matrix type                 & Name              & \# nonzeros & Bandwidth \\ \hline \hline
 512 \multirow{6}{*}{Symmetric}  & 2cubes\_sphere    & 413 703 602 & 198 836   \\
 513                             & ecology2          & 124 948 019 & 2 002     \\
 514                             & finan512          & 278 175 945 & 123 900   \\
 515                             & G3\_circuit       & 125 262 292 & 1 891 887 \\
 516                             & shallow\_water2   & 100 235 292 & 62 806    \\
 517                             & thermal2          & 175 300 284 & 2 421 285 \\ \hline \hline
 518 \multirow{6}{*}{Asymmetric} & cage13            & 435 770 480 & 352 566   \\
 519                             & crashbasis        & 409 291 236 & 200 203   \\
 520                             & FEM\_3D\_thermal2 & 595 266 787 & 206 029   \\
 521                             & language          & 76 912 824  & 398 626   \\
 522                             & poli\_large       & 53 322 580  & 15 576    \\
 523                             & torso3            & 433 795 264 & 328 757   \\ \hline
 524 \end{tabular}
 525 \caption{Main characteristics of the sparse and band matrices generated from the sparse matrices of the Davis collection.}
 526 \label{tab:03}
 527 \end{center}
 528 \end{table}
 529
 530
 531 \begin{table}[!ht]
 532 \begin{center}
 533 \begin{tabular}{|c|c|c|c|c|c|c|}
 534 \hline
 535 Matrix            & $\Time{cpu}$ & $\Time{gpu}$ & $\tau$ & \# $\Iter$& $\Prec$   & $\Delta$   \\ \hline \hline
 536 2cubes\_sphere    & 3.683 s     & 0.870 s      & 4.23   & 21       & 2.11e-14 & 8.67e-18 \\
 537 ecology2          & 2.570 s     & 0.424 s      & 6.06   & 21       & 4.88e-13 & 2.08e-14 \\
 538 finan512          & 2.727 s     & 0.533 s      & 5.11   & 17       & 3.22e-12 & 8.82e-14 \\
 539 G3\_circuit       & 4.656 s     & 1.024 s      & 4.54   & 22       & 1.04e-12 & 5.00e-15 \\
 540 shallow\_water2   & 2.268 s     & 0.384 s      & 5.91   & 17       & 5.54e-21 & 7.92e-24 \\
 541 thermal2          & 4.650 s     & 1.130 s      & 4.11   & 21       & 8.89e-12 & 3.33e-16 \\ \hline \hline
 542 cage13            & 6.068 s     & 1.054 s      & 5.75   & 26       & 3.29e-11 & 1.59e-14 \\
 543 crashbasis        & 25.906 s    & 4.569 s      & 5.67   & 135      & 6.81e-11 & 4.61e-15 \\
 544 FEM\_3D\_thermal2 & 13.555 s    & 2.654 s      & 5.11   & 64       & 3.88e-09 & 1.82e-12 \\
 545 language          & 13.538 s    & 2.621 s      & 5.16   & 89       & 2.11e-10 & 1.60e-10 \\
 546 poli\_large       & 8.619 s     & 1.474 s      & 5.85   & 69       & 5.05e-11 & 6.59e-12 \\
 547 torso3            & 35.213 s    & 6.763 s      & 5.21   & 175      & 2.69e-10 & 2.66e-14 \\ \hline
 548 \end{tabular}
 549 \caption{Performances of the parallel GMRES algorithm for solving large sparse linear systems associated
 550 to band matrices on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.}
 551 \label{tab:04}
 552 \end{center}
 553 \end{table}
 554
 555
 556 %%--------------------%%
 557 %%      SECTION 6     %%
 558 %%--------------------%%
 559 \section{Minimization of communications}
 560 \label{sec:06}
 561 The parallel sparse matrix-vector multiplication requires  data exchanges between the GPU computing nodes
 562 to construct the global vector. However, a GPU cluster requires communications between the GPU nodes and the
 563 data transfers between the GPUs and their hosts CPUs. In fact, a communication between two GPU nodes implies:
 564 a data transfer from the GPU memory to the CPU memory at the sending node, a MPI communication between the CPUs
 565 of two GPU nodes, and a data transfer from the CPU memory to the GPU memory at the receiving node. Moreover,
 566 the data transfers between a GPU and a CPU are considered as the most expensive communications on a GPU cluster.
 567 For example in our GPU cluster, the data throughput between a GPU and a CPU is of 8 GB/s which is about twice
 568 lower than the data transfer rate between CPUs (20 GB/s) and 12 times lower than the memory bandwidth of the
 569 GPU global memory (102 GB/s). In this section, we propose two solutions to improve the execution time of the
 570 parallel GMRES algorithm on GPU clusters.
 571
 572 \subsection{Compressed storage format of the sparse vectors}
 573 \label{sec:06.01}
 574 In Section~\ref{sec:05.01}, the SpMV multiplication uses a global vector having a size equivalent to the matrix
 575 bandwidth (see Formula~\ref{eq:11}). However, we can notice that a SpMV multiplication does not often need all
 576 the vector elements of the global vector composed of the local and shared sub-vectors. For example in Figure~\ref{fig:01},
 577  node 1 only needs  a single vector element from  node 0 (element 1), two elements from node 2 (elements 8
 578 and 9) and two elements from  node 3 (elements 13 and 14). Therefore to reduce the communication overheads
 579 of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
 580 to perform their local sparse matrix-vector multiplications.
 581
 582 \begin{figure}[!ht]
 583 \centering
 584   \includegraphics[width=120mm,keepaspectratio]{Figures/compress}
 585 \caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
 586 \label{fig:05}
 587 \end{figure}
 588
 589 \begin{figure}[!ht]
 590 \centering
 591   \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
 592 \caption{Reordering of the columns of a local sparse matrix.}
 593 \label{fig:06}
 594 \end{figure}
 595
 596 We propose to use a compressed storage format of the sparse global vector. In Figure~\ref{fig:05}, we show an
 597 example of the data exchanges between node 1 and its neighbors to construct the compressed global vector.
 598 First, the neighboring nodes 0, 2 and 3 determine the vector elements needed by node 1 and, then, they send
 599 only these elements to it. Node 1 receives these shared elements in a compressed vector. However to compute
 600 the sparse matrix-vector multiplication, it must first copy the received elements to the corresponding indices
 601 in the global vector. In order to avoid this process at each iteration, we propose to reorder the columns of the
 602 local sub-matrices so as to use the shared vectors in their compressed storage format (see Figure~\ref{fig:06}).
 603 For performance purposes, the computation of the shared data to send to the neighboring nodes is performed
 604 by the GPU as a kernel. In addition, we use the MPI point-to-point communication routines: a blocking send routine
 605 \verb+MPI_Send()+ and a nonblocking receive routine \verb+MPI_Irecv()+.
 606
 607 Table~\ref{tab:05} shows the performances of the parallel GMRES algorithm using the compressed storage format
 608 of the sparse global vector. The results are obtained from solving large linear systems associated to the sparse
 609 band matrices presented in Table~\ref{tab:03}. We can see from Table~\ref{tab:05} that the execution times
 610 of the parallel GMRES algorithm on a cluster of 12 GPUs are improved by about 38\% compared to those presented
 611 in Table~\ref{tab:04}. In addition, the ratios between the execution times spent on the cluster of 24 CPU cores
 612 and those spent on the cluster of 12 GPUs have increased. Indeed, the reordering of the sparse sub-matrices and
 613 the use of a compressed storage format for the sparse vectors minimize the communication overheads between the
 614 GPU computing nodes.
 615
 616 \begin{table}[!ht]
 617 \begin{center}
 618 \begin{tabular}{|c|c|c|c|c|c|c|}
 619 \hline
 620 Matrix            & $\Time{cpu}$ & $\Time{gpu}$ & $\tau$& \# $\Iter$& $\Prec$   & $\Delta$   \\ \hline \hline
 621 2cubes\_sphere    & 3.597 s     & 0.514 s      & 6.99  & 21       & 2.11e-14 & 8.67e-18 \\
 622 ecology2          & 2.549 s     & 0.288 s      & 8.83  & 21       & 4.88e-13 & 2.08e-14 \\
 623 finan512          & 2.660 s     & 0.377 s      & 7.05  & 17       & 3.22e-12 & 8.82e-14 \\
 624 G3\_circuit       & 3.139 s     & 0.480 s      & 6.53  & 22       & 1.04e-12 & 5.00e-15 \\
 625 shallow\_water2   & 2.195 s     & 0.253 s      & 8.68  & 17       & 5.54e-21 & 7.92e-24 \\
 626 thermal2          & 3.206 s     & 0.463 s      & 6.93  & 21       & 8.89e-12 & 3.33e-16 \\ \hline \hline
 627 cage13            & 5.560 s     & 0.663 s      & 8.39  & 26       & 3.29e-11 & 1.59e-14 \\
 628 crashbasis        & 25.802 s    & 3.511 s      & 7.35  & 135      & 6.81e-11 & 4.61e-15 \\
 629 FEM\_3D\_thermal2 & 13.281 s    & 1.572 s      & 8.45  & 64       & 3.88e-09 & 1.82e-12 \\
 630 language          & 12.553 s    & 1.760 s      & 7.13  & 89       & 2.11e-10 & 1.60e-10 \\
 631 poli\_large       & 8.515 s     & 1.053 s      & 8.09  & 69       & 5.05e-11 & 6.59e-12 \\
 632 torso3            & 31.463 s    & 3.681 s      & 8.55  & 175      & 2.69e-10 & 2.66e-14 \\ \hline
 633 \end{tabular}
 634 \caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
 635 vectors for solving large sparse linear systems associated to band matrices on a cluster of 24 CPU cores vs.
 636 a cluster of 12 GPUs.}
 637 \label{tab:05}
 638 \end{center}
 639 \end{table}
 640
 641 \subsection{Hypergraph partitioning}
 642 \label{sec:06.02}
 643 In this section, we use another structure of the sparse matrices. We are interested in sparse matrices
 644 whose nonzero values are distributed along their large bandwidths. We developed in C programming
 645 language a generator of sparse matrices having five bands (see Figure~\ref{fig:07}). The principle of
 646 this generator is the same as the one presented in Section~\ref{sec:05.02}. However, the copies made from the
 647 initial sparse matrix, chosen from the Davis collection, are placed on the main diagonal and on two
 648 off-diagonals on the left and right of the main diagonal. Table~\ref{tab:06} shows the main characteristics
 649 of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can
 650 see in the fourth column that the bandwidths of these matrices are as large as their sizes.
 651
 652 \begin{figure}[!ht]
 653 \centering
 654   \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
 655 \caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
 656 \label{fig:07}
 657 \end{figure}
 658
 659 \begin{table}[!ht]
 660 \begin{center}
 661 \begin{tabular}{|c|c|r|r|}
 662 \hline
 663 Matrix type                 & Name              & \# nonzeros   & Bandwidth \\ \hline \hline
 664 \multirow{6}{*}{Symmetric}  & 2cubes\_sphere    & 829 082 728   & 24 999 999     \\
 665                             & ecology2          & 254 892 056   & 25 000 000     \\
 666                             & finan512          & 556 982 339   & 24 999 973     \\
 667                             & G3\_circuit       & 257 982 646   & 25 000 000     \\
 668                             & shallow\_water2   & 200 798 268   & 25 000 000     \\
 669                             & thermal2          & 359 340 179   & 24 999 998     \\ \hline \hline
 670 \multirow{6}{*}{Asymmetric} & cage13            & 879 063 379   & 24 999 998     \\
 671                             & crashbasis        & 820 373 286   & 24 999 803     \\
 672                             & FEM\_3D\_thermal2 & 1 194 012 703 & 24 999 998     \\
 673                             & language          & 155 261 826   & 24 999 492     \\
 674                             & poli\_large       & 106 680 819   & 25 000 000     \\
 675                             & torso3            & 872 029 998   & 25 000 000     \\ \hline
 676 \end{tabular}
 677 \caption{Main characteristics of the sparse matrices having five band and generated from the sparse matrices of the Davis collection.}
 678 \label{tab:06}
 679 \end{center}
 680 \end{table}
 681
 682 In Table~\ref{tab:07} we give the performances of the parallel GMRES algorithm on the CPU and GPU
 683 clusters for solving large linear systems associated to the sparse matrices shown in Table~\ref{tab:06}.
 684 We can notice from the ratios given in the fourth column that solving sparse linear systems associated
 685 to matrices having large bandwidth on the GPU cluster is as fast as on the CPU cluster. This is due
 686 to the large total communication volume necessary to synchronize the computations over the cluster.
 687 In fact, the naive partitioning row-by-row or column-by-column of this type of sparse matrices links
 688 a GPU node to many neighboring nodes and produces a significant number of data dependencies between
 689 the different GPU nodes.
 690
 691 \begin{table}[!ht]
 692 \begin{center}
 693 \begin{tabular}{|c|c|c|c|c|c|c|}
 694 \hline
 695 Matrix            & $\Time{cpu}$ & $\Time{gpu}$ & $\tau$ & \# $\Iter$& $\Prec$ & $\Delta$   \\ \hline \hline
 696 2cubes\_sphere    & 15.963 s    & 7.250 s      & 2.20  & 58     & 6.23e-16 & 3.25e-19 \\
 697 ecology2          & 3.549 s     & 2.176 s      & 1.63  & 21     & 4.78e-15 & 1.06e-15 \\
 698 finan512          & 3.862 s     & 1.934 s      & 1.99  & 17     & 3.21e-14 & 8.43e-17 \\
 699 G3\_circuit       & 4.636 s     & 2.811 s      & 1.65  & 22     & 1.08e-14 & 1.77e-16 \\
 700 shallow\_water2   & 2.738 s     & 1.539 s      & 1.78  & 17     & 5.54e-23 & 3.82e-26 \\
 701 thermal2          & 5.017 s     & 2.587 s      & 1.94  & 21     & 8.25e-14 & 4.34e-18 \\ \hline \hline
 702 cage13            & 9.315 s     & 3.227 s      & 2.89  & 26     & 3.38e-13 & 2.08e-16 \\
 703 crashbasis        & 35.980 s    & 14.770 s     & 2.43  & 127    & 1.17e-12 & 1.56e-17 \\
 704 FEM\_3D\_thermal2 & 24.611 s    & 7.749 s      & 3.17  & 64     & 3.87e-11 & 2.84e-14 \\
 705 language          & 16.859 s    & 9.697 s      & 1.74  & 89     & 2.17e-12 & 1.70e-12 \\
 706 poli\_large       & 10.200 s    & 6.534 s      & 1.56  & 69     & 5.14e-13 & 1.63e-13 \\
 707 torso3            & 49.074 s    & 19.397 s     & 2.53  & 175    & 2.69e-12 & 2.77e-16 \\ \hline
 708 \end{tabular}
 709 \caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
 710 vectors for solving large sparse linear systems associated to matrices having five bands on a cluster
 711 of 24 CPU cores vs. a cluster of 12 GPUs.}
 712 \label{tab:07}
 713 \end{center}
 714 \end{table}
 715
 716 We propose to use a hypergraph partitioning method which is well adapted to numerous structures
 717 of sparse matrices~\cite{Cat99}. Indeed, it can well model the communications between the computing
 718 nodes especially for the asymmetric and rectangular matrices. It gives in most cases good reductions
 719 of the total communication volume. Nevertheless, it is more expensive in terms of execution time and
 720 memory space consumption than the partitioning method based on graphs.
 721
 722 The sparse matrix $A$ of the linear system to be solved is modelled as a hypergraph
 723 $\mathcal{H}=(\mathcal{V},\mathcal{E})$ as follows:
 724 \begin{itemize}
 725 \item each matrix row $i$ ($0\leq i<n$) corresponds to a vertex $v_i\in\mathcal{V}$,
 726 \item each matrix column $j$ ($0\leq j<n$) corresponds to a hyperedge $e_j\in\mathcal{E}$, such that:
 727 $\forall a_{ij}$ is a nonzero value of the matrix $A$, $v_i\in pins[e_j]$,
 728 \item $w_i$ is the weight of vertex $v_i$,
 729 \item $c_j$ is the cost of hyperedge $e_j$.
 730 \end{itemize}
 731 A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as a set
 732 of $K$ pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$: $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_k\}$,
 733 such that $\mathcal{V}=\displaystyle\cup_{k=1}^K\mathcal{V}_{k}$. Each computing node is in charge of
 734 a vertex subset. Figure~\ref{fig:08} shows an example of a hypergraph partitioning of a sparse matrix
 735 of size $(9\times 9)$ into three parts. The circles and squares correspond, respectively, to the vertices
 736 and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two
 737 different parts. The connectivity $\lambda_j$ denotes the number of different parts spanned by the cut
 738 hyperedge $e_j$.
 739
 740 \begin{figure}
 741 \centering
 742   \includegraphics[width=130mm,keepaspectratio]{Figures/hypergraph}
 743 \caption{A hypergraph partitioning of a sparse matrix between three computing nodes.}
 744 \label{fig:08}
 745 \end{figure}
 746
 747 The cut hyperedges model the communications between the different GPU computing nodes in the cluster,
 748 necessary to perform the SpMV multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic
 749 computations $b_i\leftarrow b_i+a_{ij}x_j$ of the SpMV multiplication which needs the $j^{th}$ element
 750 of  vector $x$. Therefore pins of hyperedge $e_j$ ($pins[e_j]$) denote the set of matrix rows requiring
 751 the same vector element $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are:
 752 $pins[e_9]=\{v_2,v_5,v_9\}$ represents matrix rows 2, 5 and 9 requiring the vector element $x_9$
 753 to compute in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$
 754 and $b_9\leftarrow b_9+a_{99}x_9$. However, $x_9$ is a vector element of the computing node 3 and it must
 755 be sent to the neighboring nodes 1 and 2.
 756
 757 The hypergraph partitioning allows to reduce the total communication volume while maintaining the computational
 758 load balance between the computing nodes. Indeed, it minimizes at best the following sum:
 759 \begin{equation}
 760 \mathcal{X}(\mathcal{P}) = \displaystyle\sum_{e_j\in\mathcal{E}_C} c_j(\lambda_j-1),
 761 \end{equation}
 762 where $\mathcal{E}_C$ is the set of the cut hyperedges issued from the partitioning $\mathcal{P}$, $c_j$
 763 and $\lambda_j$ are, respectively, the cost and the connectivity of the cut hyperedge $e_j$. In addition,
 764 the hypergraph partitioning is constrained to maintain the load balance between the $K$ parts:
 765 \begin{equation}
 766 W_k\leq (1+\epsilon)\Wavg\mbox{,~}(1\leq k\leq K)\mbox{~and~}(0<\epsilon<1),
 767 \end{equation}
 768 where $W_k$ is the sum of the vertex weights in the subset $\mathcal{V}_k$, $\Wavg$ is the average part's
 769 weight and $\epsilon$ is the maximum allowed imbalanced ratio.
 770
 771 The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed, for
 772 example: hMETIS~\cite{Kar98}, PaToH~\cite{Cata99} and Zoltan~\cite{Dev06}. Due to the large sizes of the
 773 linear systems to be solved, we use a parallel hypergraph partitioning which must be performed by at least
 774 two MPI processes. The hypergraph model $\mathcal{H}$ of the sparse matrix is split into $p$ (number of computing
 775 nodes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, then the parallel partitioning
 776 is applied by using the MPI communication routines.
 777
 778 Table~\ref{tab:08} shows the performances of the parallel GMRES algorithm for solving the linear systems
 779 associated to the sparse matrices presented in Table~\ref{tab:06}. In the experiments, we have used the
 780 compressed storage format of the sparse vectors and the parallel hypergraph partitioning developed in the
 781 Zoltan tool~\cite{ref20,ref21}. The parameters of the hypergraph partitioning are initialized as follows:
 782 \begin{itemize}
 783 \item The weight $w_i$ of each vertex $v_i$ is set to the number of the nonzero values on the matrix row $i$,
 784 \item For simplicity's sake, the cost $c_j$ of each hyperedge $e_j$ is set to 1,
 785 \item The maximum imbalanced ratio $\epsilon$ is limited to 10\%.
 786 \end{itemize}
 787 We can notice from Table~\ref{tab:08} that the execution times on the cluster of 12 GPUs are significantly
 788 improved compared to those presented in Table~\ref{tab:07}. The hypergraph partitioning applied on the large
 789 sparse matrices having large bandwidths have improved the execution times on the GPU cluster by about 65\%.
 790
 791 \begin{table}[!ht]
 792 \begin{center}
 793 \begin{tabular}{|c|c|c|c|c|c|c|}
 794 \hline
 795 Matrix            & $\Time{cpu}$ & $\Time{gpu}$ & $\tau$  & \# $\Iter$ & $\Prec$     & $\Delta$   \\ \hline \hline
 796 2cubes\_sphere    & 16.430 s    & 2.840 s      & 5.78 & 58     & 6.23e-16 & 3.25e-19 \\
 797 ecology2          & 3.152 s     & 0.367 s      & 8.59 & 21     & 4.78e-15 & 1.06e-15 \\
 798 finan512          & 3.672 s     & 0.723 s      & 5.08 & 17     & 3.21e-14 & 8.43e-17 \\
 799 G3\_circuit       & 4.468 s     & 0.971 s      & 4.60 & 22     & 1.08e-14 & 1.77e-16 \\
 800 shallow\_water2   & 2.647 s     & 0.312 s      & 8.48 & 17     & 5.54e-23 & 3.82e-26 \\
 801 thermal2          & 4.190 s     & 0.666 s      & 6.29 & 21     & 8.25e-14 & 4.34e-18 \\ \hline \hline
 802 cage13            & 8.077 s     & 1.584 s      & 5.10 & 26     & 3.38e-13 & 2.08e-16 \\
 803 crashbasis        & 35.173 s    & 5.546 s      & 6.34 & 127    & 1.17e-12 & 1.56e-17 \\
 804 FEM\_3D\_thermal2 & 24.825 s    & 3.113 s      & 7.97 & 64     & 3.87e-11 & 2.84e-14 \\
 805 language          & 16.706 s    & 2.522 s      & 6.62 & 89     & 2.17e-12 & 1.70e-12 \\
 806 poli\_large       & 12.715 s    & 3.989 s      & 3.19 & 69     & 5.14e-13 & 1.63e-13 \\
 807 torso3            & 48.459 s    & 6.234 s      & 7.77 & 175    & 2.69e-12 & 2.77e-16 \\ \hline
 808 \end{tabular}
 809 \caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
 810 vectors and a hypergraph partitioning method for solving large sparse linear systems associated to matrices
 811 having five bands on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.}
 812 \label{tab:08}
 813 \end{center}
 814 \end{table}
 815
 816 Table~\ref{tab:09} shows in the second, third and fourth columns the total communication volume on a cluster of 12 GPUs by using row-by-row partitioning or hypergraph partitioning and compressed format. The total communication volume defines the total number of the vector elements exchanged between the 12 GPUs. From these columns we can see that the two heuristics, compressed format for the vectors and the hypergraph partitioning, minimize the number of vector elements to be exchanged over the GPU cluster. The compressed format allows the GPUs to exchange the needed vector elements without any communication overheads. The hypergraph partitioning allows to split the large sparse matrices so as to minimize  data dependencies between the GPU computing nodes. However, we can notice in the fifth column that the hypergraph partitioning takes longer than the computation times. As we have mentioned before, the hypergraph partitioning method is less efficient in terms of memory consumption and partitioning time than its graph counterpart. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning only once and, then, we save the traces in files to be reused several times. Therefore, this allows us to avoid the partitioning of the sparse matrices at each resolution of the linear systems.
 817
 818 \begin{table}
 819 \begin{center}
 820 \begin{tabular}{|c|c|c|c|c|}
 821 \hline
 822 \multirow{3}{*}{Matrix} & Total comm. vol. & Total comm. vol. & Total comm. vol. using        & Time of hypergraph \\
 823                         & using row-by row & using compressed & hypergraph partitioning       & partitioning \\
 824                         & partitioning     & format           & and compressed format         & in minutes \\ \hline \hline
 825 2cubes\_sphere          & 182 061 791      & 25 360 543       & 240 679                       & 68.98         \\
 826 ecology2                & 181 267 000      & 26 044 002       & 73 021                        & 4.92          \\
 827 finan512                & 182 090 692      & 26 087 431       & 900 729                       & 33.72         \\
 828 G3\_circuit             & 192 244 835      & 31 912 003       & 5 366 774                     & 11.63         \\
 829 shallow\_water2         & 181 729 606      & 25 105 108       & 60 899                        & 5.06          \\
 830 thermal2                & 191 350 306      & 30 012 846       & 1 077 921                     & 17.88         \\ \hline \hline
 831 cage13                  & 183 970 606      & 28 254 282       & 3 845 440                     & 196.45        \\
 832 crashbasis              & 182 931 818      & 29 020 060       & 2 401 876                     & 33.39         \\
 833 FEM\_3D\_thermal2       & 182 503 894      & 25 263 767       & 250 105                       & 49.89         \\
 834 language                & 183 055 017      & 27 291 486       & 1 537 835                     & 9.07          \\
 835 poli\_large             & 181 381 470      & 25 053 554       & 7 388 883                     & 5.92          \\
 836 torso3                  & 183 863 292      & 25 682 514       & 613 250                       & 61.51        \\ \hline
 837 \end{tabular}
 838 \caption{Total communication volume on a cluster of 12 GPUs using row-by-row or hypergraph partitioning methods and compressed vectors. The total communication volume is defined as the total number of vector elements exchanged between all GPUs of the cluster.}
 839 \label{tab:09}
 840 \end{center}
 841 \end{table}
 842
 843
 844
 845
 846
 847
 848
 849
 850
 851
 852
 853
 854
 855
 856
 857 Hereafter, we show the influence of the communications on a GPU cluster compared to a CPU cluster. In Tables~\ref{tab:10},~\ref{tab:11} and~\ref{tab:12}, we compute the ratios between the computation time over the communication time of three versions of the parallel GMRES algorithm to solve sparse linear systems associated to matrices of Table~\ref{tab:06}. These tables show that the hypergraph partitioning and the compressed format of the vectors increase the ratios either on the GPU cluster or on the CPU cluster. That means that the two optimization techniques allow the minimization of the total communication volume between the computing nodes. However, we can notice that the ratios obtained on the GPU cluster are lower than those obtained on the CPU cluster. Indeed, GPUs compute faster than CPUs but with GPUs there are more communications due to CPU/GPU communications, so communications are more time-consuming while the computation time remains unchanged. Furthermore, we can notice that the GPU computation times on Tables~\ref{tab:11} and~\ref{tab:12} are about 10\% lower than those on Table~\ref{tab:10}. Indeed, the compression of the vectors and the reordering of matrix columns allow to perform coalesced accesses to the GPU memory and thus accelerate the sparse matrix-vector multiplication.
 858
 859 \begin{table}[p]
 860 \begin{center}
 861 \begin{tabular}{|c||c|c|c||c|c|c|}
 862 \hline
 863 \multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
 864                   & $\Time{comput}$ & $\Time{comm}$ & $\Ratio$ & $\Time{comput}$ & $\Time{comm}$ & $\Ratio$ \\ \hline \hline
 865 2cubes\_sphere    & 37.067 s       & 1434.512 s   & {\bf 0.026}   & 312.061 s      & 3453.931 s   & {\bf 0.090}\\
 866 ecology2          & 4.116 s        & 501.327 s    & {\bf 0.008}   & 60.776 s       & 1216.607 s   & {\bf 0.050}\\
 867 finan512          & 7.170 s        & 386.742 s    & {\bf 0.019}   & 72.464 s       & 932.538 s    & {\bf 0.078}\\
 868 G3\_circuit       & 4.797 s        & 537.343 s    & {\bf 0.009}   & 66.011 s       & 1407.378 s   & {\bf 0.047}\\
 869 shallow\_water2   & 3.620 s        & 411.208 s    & {\bf 0.009}   & 51.294 s       & 973.446 s    & {\bf 0.053}\\
 870 thermal2          & 6.902 s        & 511.618 s    & {\bf 0.013}   & 77.255 s       & 1281.979 s   & {\bf 0.060}\\ \hline \hline
 871 cage13            & 12.837 s       & 625.175 s    & {\bf 0.021}   & 139.178 s      & 1518.349 s   & {\bf 0.092}\\
 872 crashbasis        & 48.532 s       & 3195.183 s   & {\bf 0.015}   & 623.686 s      & 7741.777 s   & {\bf 0.081}\\
 873 FEM\_3D\_thermal2 & 37.211 s       & 1584.650 s   & {\bf 0.023}   & 370.297 s      & 3810.255 s   & {\bf 0.097}\\
 874 language          & 22.912 s       & 2242.897 s   & {\bf 0.010}   & 286.682 s      & 5348.733 s   & {\bf 0.054}\\
 875 poli\_large       & 13.618 s       & 1722.304 s   & {\bf 0.008}   & 190.302 s      & 4059.642 s   & {\bf 0.047}\\
 876 torso3            & 74.194 s       & 4454.936 s   & {\bf 0.017}   & 897.440 s      & 10800.787 s  & {\bf 0.083}\\ \hline
 877 \end{tabular}
 878 \caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using row-by-row partitioning on 12 GPUs and 24 CPUs.}
 879 \label{tab:10}
 880 \end{center}
 881 \end{table}
 882
 883
 884 \begin{table}[p]
 885 \begin{center}
 886 \begin{tabular}{|c||c|c|c||c|c|c|}
 887 \hline
 888 \multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
 889                   & $\Time{comput}$ & $\Time{comm}$ & $\Ratio$ & $\Time{comput}$ & $\Time{comm}$ & $\Ratio$ \\ \hline \hline
 890 2cubes\_sphere    & 27.386 s       & 154.861 s   & {\bf 0.177}   & 342.255 s      & 42.100 s   & {\bf 8.130}\\
 891 ecology2          & 3.822 s        & 53.131 s    & {\bf 0.072}   & 69.956 s       & 15.019 s   & {\bf 4.658}\\
 892 finan512          & 6.366 s        & 41.155 s    & {\bf 0.155}   & 79.592 s       & 8.604 s    & {\bf 9.251}\\
 893 G3\_circuit       & 4.543 s        & 63.132 s    & {\bf 0.072}   & 76.540 s       & 27.371 s   & {\bf 2.796}\\
 894 shallow\_water2   & 3.282 s        & 43.080 s    & {\bf 0.076}   & 58.348 s       & 8.088 s    & {\bf 7.214}\\
 895 thermal2          & 5.986 s        & 57.100 s    & {\bf 0.105}   & 87.682 s       & 28.544 s   & {\bf 3.072}\\ \hline \hline
 896 cage13            & 10.227 s       & 70.388 s    & {\bf 0.145}   & 152.718 s      & 30.785 s   & {\bf 4.961}\\
 897 crashbasis        & 41.527 s       & 369.071 s   & {\bf 0.113}   & 701.040 s      & 158.916 s  & {\bf 4.411}\\
 898 FEM\_3D\_thermal2 & 28.691 s       & 167.140 s   & {\bf 0.172}   & 403.510 s      & 50.935 s   & {\bf 7.922}\\
 899 language          & 22.408 s       & 242.589 s   & {\bf 0.092}   & 333.119 s      & 64.409 s   & {\bf 5.172}\\
 900 poli\_large       & 13.710 s       & 179.208 s   & {\bf 0.077}   & 215.934 s      & 30.903 s   & {\bf 6.987}\\
 901 torso3            & 58.455 s       & 480.315 s   & {\bf 0.122}   & 993.609 s      & 152.173 s  & {\bf 6.529}\\ \hline
 902 \end{tabular}
 903 \caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using row-by-row partitioning and compressed format for vectors on 12 GPUs and 24 CPUs.}
 904 \label{tab:11}
 905 \end{center}
 906 \end{table}
 907
 908
 909 \begin{table}[p]
 910 \begin{center}
 911 \begin{tabular}{|c||c|c|c||c|c|c|}
 912 \hline
 913 \multirow{2}{*}{Matrix} & \multicolumn{3}{c||}{GPU version} & \multicolumn{3}{c|}{CPU version}  \\ \cline{2-7}
 914                   & $\Time{comput}$ & $\Time{comm}$ & $\Ratio$ & $\Time{comput}$ & $\Time{comm}$ & $\Ratio$ \\ \hline \hline
 915 2cubes\_sphere    & 28.440 s       & 7.768 s      & {\bf 3.661}   & 327.109 s      & 63.788 s   & {\bf 5.128}\\
 916 ecology2          & 3.652 s        & 0.757 s      & {\bf 4.823}   & 63.632 s       & 13.520 s   & {\bf 4.707}\\
 917 finan512          & 7.579 s        & 4.569 s      & {\bf 1.659}   & 74.120 s       & 22.505 s   & {\bf 3.294}\\
 918 G3\_circuit       & 4.876 s        & 8.745 s      & {\bf 0.558}   & 72.280 s       & 28.395 s   & {\bf 2.546}\\
 919 shallow\_water2   & 3.146 s        & 0.606 s      & {\bf 5.191}   & 52.903 s       & 11.177 s   & {\bf 4.733}\\
 920 thermal2          & 6.473 s        & 4.325 s      & {\bf 1.497}   & 81.171 s       & 20.907 s   & {\bf 3.882}\\ \hline \hline
 921 cage13            & 11.676 s       & 7.723 s      & {\bf 1.512}   & 145.755 s      & 46.547 s   & {\bf 3.131}\\
 922 crashbasis        & 42.799 s       & 29.399 s     & {\bf 1.456}   & 650.386 s      & 203.918 s  & {\bf 3.189}\\
 923 FEM\_3D\_thermal2 & 29.875 s       & 8.915 s      & {\bf 3.351}   & 382.887 s      & 93.252 s   & {\bf 4.106}\\
 924 language          & 20.991 s       & 11.197 s     & {\bf 1.875}   & 310.679 s      & 82.480 s   & {\bf 3.767}\\
 925 poli\_large       & 13.817 s       & 102.760 s    & {\bf 0.134}   & 197.508 s      & 151.672 s  & {\bf 1.302}\\
 926 torso3            & 57.469 s       & 16.828 s     & {\bf 3.415}   & 926.588 s      & 242.721 s  & {\bf 3.817}\\ \hline
 927 \end{tabular}
 928 \caption{Ratios of the computation time over the communication time obtained from the parallel GMRES algorithm using hypergraph partitioning and compressed format for vectors on 12 GPUs and 24 CPUs.}
 929 \label{tab:12}
 930 \end{center}
 931 \end{table}
 932
 933 \begin{figure}
 934 \centering
 935   \includegraphics[width=120mm,keepaspectratio]{Figures/weak}
 936 \caption{Weak scaling of the parallel GMRES algorithm on a GPU cluster.}
 937 \label{fig:09}
 938 \end{figure}
 939
 940  Figure~\ref{fig:09} presents the weak scaling of four versions of the parallel GMRES algorithm on a GPU cluster. We fixed the size of a sub-matrix to 5 million of rows per GPU computing node. We used matrices having five bands generated from the symmetric matrix thermal2. This figure shows that the parallel GMRES algorithm, in its naive version or using either the compression format for vectors or the hypergraph partitioning, is not scalable on a GPU cluster due to the large amount of communications between GPUs. In contrast, we can see that the algorithm using both optimization techniques is fairly scalable. That means that in this version the cost of communications is relatively constant regardless the number of computing nodes in the cluster.
 941
 942 \bigskip
 943
 944  Finally, as far as we are concerned, the parallel solving of a linear system can be easy to optimize when the associated matrix is regular. This is unfortunately not the case for many real-world applications. When the matrix has an irregular structure, the amount of communications between processors is not the same. Another important parameter is the size of the matrix bandwidth which has a huge influence on the amount of communications. In this work, we have generated different kinds of matrices in order to analyze several difficulties. With a bandwidth as large as possible, involving communications between all processors, which is the most difficult situation, we proposed to use two heuristics. Unfortunately, there is no fast method that optimizes the communications in any situation. For systems of non linear equations, there are different algorithms but most of them consist in linearizing the system of equations. In this case, a linear system needs to be solved. The big interest is that the matrix is the same at each step of the non linear system solving, so the partitioning method which is a time consuming step is performed only once.
 945
 946
 947
 948 Another very important issue, which might be ignored by too many people, is that the communications have a greater influence on a cluster of GPUs than on a cluster of CPUs. There are two reasons for that. The first one comes from the fact that with a cluster of GPUs, the CPU/GPU data transfers slow down communications between two GPUs that are not on the same machines. The second one is due to the fact that with GPUs the ratio of the computation time over the communication time decreases since the computation time is reduced. So the impact of the communications between GPUs might be a very important issue that can limit the scalability of parallel algorithms.
 949
 950 %%--------------------%%
 951 %%      SECTION 7     %%
 952 %%--------------------%%
 953 \section{Conclusion and perspectives}
 954 \label{sec:07}
 955 In this paper, we have aimed at harnessing the computing power of a GPU cluster for
 956 solving large sparse linear systems. We have implemented the parallel algorithm of the
 957 GMRES iterative method. We have used a heterogeneous parallel programming based on the
 958 CUDA language to program the GPUs and the MPI parallel environment to distribute the
 959 computations between the GPU nodes on the cluster.
 960
 961 The experiments have shown that solving large sparse linear systems is more efficient
 962 on a cluster of GPUs than on a cluster of CPUs. However, the efficiency of a GPU cluster
 963 is ensured as long as the spatial and temporal localization of the data is well managed.
 964 The data dependency scheme on a GPU cluster is related to the sparse structures of the
 965 matrices (positions of the nonzero values) and the number of the computing nodes. We have
 966 shown that a large number of communications between the GPU computing nodes affects
 967 considerably the performances of the parallel GMRES algorithm on the GPU cluster. Therefore,
 968 we have proposed to reorder the columns of the sparse local sub-matrices on each GPU node
 969 and to use a compressed storage format for the sparse vector involved in the parallel
 970 sparse matrix-vector multiplication. This solution allows to minimize the communication
 971 overheads. In addition, we have shown that it is interesting to choose a partitioning method
 972 according to the structure of the sparse matrix. This reduces the total communication
 973 volume between the GPU computing nodes.
 974
 975 In future works, it would be interesting to implement and study the scalability of the
 976 parallel GMRES algorithm on large GPU clusters (hundreds or thousands of GPUs) or on geographically
 977 distant GPU clusters. In this context, other methods might be used to reduce communication
 978 and to improve the performances of the parallel GMRES algorithm as the multisplitting methods.
 979 The recent GPU hardware and software architectures provide the GPU-Direct system which allows
 980 two GPUs, placed in the same machine or in two remote machines, to exchange data without using
 981 CPUs. This improves the data transfers between GPUs. Finally, it would be interesting to implement
 982 other iterative methods on GPU clusters for solving large sparse linear or non linear systems.
 983
 984 \paragraph{Acknowledgments}
 985 This paper is based upon work supported by the R\'egion de Franche-Comt\'e.
 986
 987 \bibliography{bib}
 988 \bibliographystyle{abbrv}
 989 \end{document}