--- /dev/null
+\documentclass[11pt]{article}
+%\documentclass{acmconf}
+
+\usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
+\usepackage{times}
+\usepackage{graphicx}
+\usepackage{amsmath}
+\usepackage{amsfonts}
+\usepackage{amssymb}
+\usepackage{amsthm}
+\usepackage{amsopn}
+\usepackage{xspace}
+\usepackage{array}
+\usepackage{epsfig}
+\usepackage{algorithmic}
+\usepackage[ruled,english,boxed,linesnumbered]{algorithm2e}
+\usepackage{url}
+\usepackage{mdwlist}
+\usepackage{multirow}
+
+\date{}
+
+\title{Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters}
+
+\author{
+\textsc{Jacques M. Bahi}
+\qquad
+\textsc{Rapha\"el Couturier}\thanks{Contact author}
+\qquad
+\textsc{Lilia Ziane Khodja}
+\mbox{}\\ %
+\\
+FEMTO-ST Institute, University of Franche-Comte\\
+IUT Belfort-Montb\'eliard\\
+Rue Engel Gros, BP 527, 90016 Belfort, \underline{France}\\
+\mbox{}\\ %
+\normalsize
+\{\texttt{jacques.bahi},~\texttt{raphael.couturier},~\texttt{lilia.ziane\_khoja}\}\texttt{@univ-fcomte.fr}
+}
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+In this paper, we aim at exploiting the computing power of a GPU cluster for solving large sparse
+linear systems. We implement the parallel algorithm of the GMRES iterative method using the CUDA
+programming language and the MPI parallel environment. The experiments show that a GPU cluster
+is more efficient than a CPU cluster. In order to optimize the performances, we use a compressed
+storage format for the sparse vectors and a hypergraph partitioning. These solutions improve
+the spatial and temporal locality of the data shared between the computing nodes of the GPU
+cluster.
+\end{abstract}
+
+
+
+%%--------------------%%
+%% SECTION 1 %%
+%%--------------------%%
+\section{Introduction}
+\label{sec:01}
+Large sparse linear systems arise in most numerical scientific or industrial simulations.
+They model numerous complex problems in different areas of applications such as mathematics,
+engineering, biology or physics~\cite{ref18}. However, solving these systems of equations is
+often an expensive operation in terms of execution time and memory space consumption. Indeed,
+the linear systems arising in most applications are very large and have many zero
+coefficients, and this sparse nature leads to irregular accesses to load the nonzero coefficients
+from the memory.
+
+Parallel computing has become a key issue for solving sparse linear systems of large sizes.
+This is due to the computing power and the storage capacity of the current parallel computers as
+well as the availability of different parallel programming languages and environments such as the
+MPI communication standard. Nowadays, graphics processing units (GPUs) are the most commonly used
+hardware accelerators in high performance computing. They are equipped with a massively parallel
+architecture allowing them to compute faster than CPUs. However, the parallel computers equipped
+with GPUs introduce new programming difficulties to adapt parallel algorithms to their architectures.
+
+In this paper, we use the GMRES iterative method for solving large sparse linear systems on a cluster
+of GPUs. The parallel algorithm of this method is implemented using the CUDA programming language for
+the GPUs and the MPI parallel environment to distribute the computations between the different GPU nodes
+of the cluster. Particularly, we focus on improving the performances of the parallel sparse matrix-vector
+multiplication. Indeed, this operation is not only very time-consuming but it also requires communications
+between the GPU nodes. These communications are needed to build the global vector involved in
+the parallel sparse matrix-vector multiplication. It should be noted that a communication between two
+GPU nodes involves data transfers between the GPU and CPU memories in the same node and the MPI communications
+between the CPUs of the GPU nodes. For performance purposes, we propose to use a compressed storage
+format to reduce the size of the vectors to be exchanged between the GPU nodes and a hypergraph partitioning
+of the sparse matrix to reduce the total communication volume.
+
+The present paper is organized as follows. Section~\ref{sec:02} presents some previous works about solving
+sparse linear systems on GPUs. Section~\ref{sec:03} gives a general overview of GPU architectures,
+followed by the GMRES method in Section~\ref{sec:04}. Section~\ref{sec:05} describes the main key points
+of the parallel implementation of the GMRES method on a GPU cluster. Finally, Section~\ref{sec:06}
+presents the performance improvements of the parallel GMRES algorithm on a GPU cluster.
+
+
+%%--------------------%%
+%% SECTION 2 %%
+%%--------------------%%
+\section{Related work}
+\label{sec:02}
+Numerous works have shown the efficiency of GPUs for solving sparse linear systems compared
+to their CPU counterparts. Different iterative methods have been implemented on a single GPU, for example
+Jacobi and Gauss-Seidel in~\cite{refa}, conjugate and biconjugate gradients in~\cite{refd,refe,reff,refj}
+and GMRES in~\cite{refb,refc,refg,refm}. In addition, some iterative methods have been implemented on
+shared memory multi-GPU machines, as in~\cite{refh,refi,refk,refl}. A more limited set of studies is
+devoted to the parallel implementation of iterative methods on distributed memory GPU clusters,
+as in~\cite{refn,refo,refp}.
+
+Parallel iterative algorithms often do not scale well on GPU clusters due to
+the significant cost of the communications between the computing nodes. Several authors have already
+studied how to reduce these communications. In~\cite{cev10}, the authors used a hypergraph partitioning
+as a preprocessing to the parallel conjugate gradient algorithm in order to reduce the inter-GPU
+communications over a GPU cluster. The sequential hypergraph partitioning method provided by the
+PaToH tool~\cite{Cata99} is used because of the small sizes of the sparse symmetric linear systems
+to be solved. In~\cite{refq}, a compression and decompression technique is proposed to reduce the
+communication overheads. This technique is performed on the shared vectors to be exchanged between
+the computing nodes. In~\cite{refr}, the authors studied the impact of asynchronism on parallel
+iterative algorithms on local GPU clusters. Asynchronous communication primitives suppress some
+synchronization barriers and allow overlap of communication and computation. In~\cite{refs}, a
+communication reduction method is used for implementing finite element methods (FEM) on GPU clusters.
+This method firstly uses the Reverse Cuthill-McKee reordering to reduce the total communication
+volume. In addition, the performances of the parallel FEM algorithm are improved by overlapping
+the communication with computation.
+
+
+%%--------------------%%
+%% SECTION 3 %%
+%%--------------------%%
+\section{{GPU} architectures}
+\label{sec:03}
+A GPU (Graphics processing unit) is a hardware accelerator for high performance computing.
+Its hardware architecture is composed of hundreds of cores organized in several blocks called
+\emph{streaming multiprocessors}. It is also equipped with a memory hierarchy. It has a set
+of registers and a private read-write \emph{local memory} per core, a fast \emph{shared memory},
+read-only \emph{constant} and \emph{texture} caches per multiprocessor and a read-write
+\emph{global memory} shared by all its multiprocessors. The more recent architectures (Fermi, Kepler,
+etc.) also have L1 and L2 caches to improve the accesses to the global memory.
+
+NVIDIA has released the CUDA platform (Compute Unified Device Architecture)~\cite{Nvi10}
+which provides a high level programming model for GPGPU (General-Purpose computing
+on GPUs), allowing GPUs to be programmed for general purpose computations. In the CUDA programming
+environment, all data-parallel and compute intensive portions of an application running
+on the CPU are off-loaded onto the GPU. Indeed, an application developed in CUDA is a
+program written in C language (or Fortran) with a minimal set of extensions to define
+the parallel functions to be executed by the GPU, called \emph{kernels}. We define kernels,
+as separate functions from those of the CPU, by assigning them one of the function type qualifiers
+\verb+__global__+ or \verb+__device__+.
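+
+As a simple illustration (a sketch, not taken from our solver; the names \verb+scal_kernel+ and \verb+scal+ are
+chosen for the example), a kernel scaling a vector can be declared with the \verb+__global__+ qualifier and
+launched from the host as follows:
+\begin{verbatim}
+/* kernel executed by the GPU: one thread scales one vector element */
+__global__ void scal_kernel(int n, double alpha, double *x)
+{
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  if (i < n)
+    x[i] = alpha * x[i];
+}
+
+/* host side: launch enough blocks of 512 threads to cover the n elements */
+void scal(int n, double alpha, double *d_x)
+{
+  int threads = 512;
+  int blocks  = (n + threads - 1) / threads;
+  scal_kernel<<<blocks, threads>>>(n, alpha, d_x);
+}
+\end{verbatim}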
+
+At the GPU level, the same kernel is executed by a large number of parallel CUDA threads
+grouped together as a grid of thread blocks. Each multiprocessor of the GPU executes one
+or more thread blocks in SIMD fashion (Single Instruction, Multiple Data) and in turn each
+core of a GPU multiprocessor runs one or more threads within a block in SIMT fashion (Single
+Instruction, Multiple Threads). In order to maximize the occupancy of the GPU cores, the
+number of CUDA threads involved in a kernel execution is computed according to the
+size of the problem to be solved. In contrast, the block size is restricted by the limited
+memory resources of a multiprocessor. On current GPUs, a thread block may contain up to $1\,024$ concurrent
+threads. At any given clock cycle, the threads execute the same instruction of a kernel,
+but each of them operates on different data. Moreover, threads within a block can cooperate
+by sharing data through the fast shared memory and coordinate their execution through
+synchronization points. In contrast, within a grid of thread blocks, there is no synchronization
+at all between blocks.
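+
+For instance, a minimal sketch of such intra-block cooperation (hypothetical kernel, not the CUBLAS routine
+used later) computes a block-wise partial dot product in the fast shared memory:
+\begin{verbatim}
+/* partial dot product: assumes blockDim.x <= 512 and a power of two */
+__global__ void partial_dot(int n, const double *x, const double *y,
+                            double *block_sums)
+{
+  __shared__ double cache[512];          /* fast shared memory of the block */
+  int i   = blockIdx.x * blockDim.x + threadIdx.x;
+  int tid = threadIdx.x;
+
+  cache[tid] = (i < n) ? x[i] * y[i] : 0.0;
+  __syncthreads();                       /* synchronization point in the block */
+
+  /* tree reduction in shared memory */
+  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
+    if (tid < s)
+      cache[tid] += cache[tid + s];
+    __syncthreads();
+  }
+  if (tid == 0)
+    block_sums[blockIdx.x] = cache[0];   /* no synchronization between blocks */
+}
+\end{verbatim}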
+
+GPUs only work on data stored in their global memory and the final results of their kernel
+executions must be communicated to their hosts (CPUs). Hence, the data must be transferred
+\emph{in} and \emph{out} of the GPU. However, the speed of a memory copy between the CPU and
+the GPU is much lower than the bandwidth of the GPU global memory. Accordingly, it is necessary to limit
+the transfers of data between the GPU and its host.
+
+
+%%--------------------%%
+%% SECTION 4 %%
+%%--------------------%%
+\section{{GMRES} method}
+\label{sec:04}
+The generalized minimal residual method (GMRES) is an iterative method designed by Saad
+and Schultz in 1986~\cite{Saa86}. It is a generalization of the minimal residual method
+(MINRES)~\cite{Pai75} to deal with nonsymmetric and non-Hermitian problems, as well as
+symmetric indefinite problems.
+
+Let us consider the following sparse linear system of $n$ equations:
+\begin{equation}
+ Ax = b,
+\label{eq:01}
+\end{equation}
+where $A\in\mathbb{R}^{n\times n}$ is a sparse square and nonsingular matrix, $x\in\mathbb{R}^n$
+is the solution vector and $b\in\mathbb{R}^n$ is the right-hand side vector. The main idea of
+the GMRES method is to find a sequence of iterates $\{x_k\}_{k\in\mathbb{N}}$ which minimizes the
+residual $r_k=b-Ax_k$. The iterate $x_k$ is computed in a Krylov sub-space $\mathcal{K}_k(A,v_1)$:
+\begin{equation}
+\begin{array}{ll}
+ \mathcal{K}_{k}(A,v_{1}) \equiv \text{span}\{v_{1}, Av_{1}, A^{2}v_{1},..., A^{k-1}v_{1}\}, & v_{1}=\frac{r_{0}}{\|r_{0}\|_{2}},
+\end{array}
+\end{equation}
+such that the Petrov-Galerkin condition is satisfied:
+\begin{equation}
+ r_{k} \perp A\mathcal{K}_{k}(A, v_{1}).
+\end{equation}
+The GMRES method uses the Arnoldi method~\cite{Arn51} to build an orthonormal basis $V_k$
+and a Hessenberg upper matrix $\bar{H}_k$ of order $(k+1)\times k$:
+\begin{equation}
+\begin{array}{lll}
+ V_{k}=\{v_{1},v_{2},...,v_{k}\}, & \text{such that} & \text{span}\{v_{1},...,v_{k}\}=\mathcal{K}_{k}(A,v_{1}),
+\end{array}
+\end{equation}
+and
+\begin{equation}
+ AV_{k} = V_{k+1}\bar{H}_{k}.
+\label{eq:02}
+\end{equation}
+Then at each iteration $k$, a solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_{k}$
+spanned by $V_k$ as follows:
+\begin{equation}
+\begin{array}{ll}
+ x_{k} = x_{0} + V_{k}y, & y\in\mathbb{R}^{k}.
+\end{array}
+\label{eq:03}
+\end{equation}
+From both equations~(\ref{eq:02}) and~(\ref{eq:03}), we can deduce:
+\begin{equation}
+\begin{array}{lll}
+ r_{k} & = & b - A (x_{0} + V_{k}y) \\
+ & = & r_{0} - AV_{k}y \\
+ & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
+ & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
+\end{array}
+\end{equation}
+where $\beta=\|r_{0}\|_{2}$ and $e_{1}=(1,0,...,0)^{T}$ is the first vector of the canonical basis of
+$\mathbb{R}^{k+1}$. The vector $y$ is computed in $\mathbb{R}^{k}$ so as to minimize the Euclidean
+norm of the residual $r_{k}$. This means that the following least-squares problem of size $k$ must
+be solved:
+\begin{equation}
+ \underset{y\in\mathbb{R}^{k}}{\min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{\min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
+\end{equation}
+The solution of this problem is computed thanks to the QR factorization of the Hessenberg matrix
+$\bar{H}_{k}$ by using the Givens rotations~\cite{Saa03,Saa86} such that:
+\begin{equation}
+\begin{array}{lll}
+ \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
+\end{array}
+\end{equation}
+where $Q_{k}Q_{k}^{T}=I_{k+1}$ and $R_{k}$ is an upper triangular matrix.
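+
+For completeness, each Givens rotation $\Omega_{i}$ used in this factorization (standard material,
+see~\cite{Saa03}) annihilates the sub-diagonal entry $h_{i+1,i}$ of the partially triangularized matrix
+and has the form:
+\begin{equation*}
+\Omega_{i}=\begin{pmatrix}
+ I_{i-1} & & & \\
+ & c_{i} & s_{i} & \\
+ & -s_{i} & c_{i} & \\
+ & & & I_{k-i}
+\end{pmatrix},
+\quad c_{i}=\frac{h_{i,i}}{\sqrt{h_{i,i}^{2}+h_{i+1,i}^{2}}},
+\quad s_{i}=\frac{h_{i+1,i}}{\sqrt{h_{i,i}^{2}+h_{i+1,i}^{2}}},
+\end{equation*}
+where $h_{i,i}$ and $h_{i+1,i}$ denote the current entries after the previous rotations, so that
+$Q_{k}^{T}=\Omega_{k}\cdots\Omega_{1}$ and $R_{k}=Q_{k}^{T}\bar{H}_{k}$.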
+
+The GMRES method finds a solution in at most $n$ iterations (in exact arithmetic). So, the Krylov sub-space dimension
+can increase up to the large size $n$ of the linear system. Then, in order to avoid a large storage
+of the orthonormal basis $V_{k}$, the Arnoldi process is restricted to $m\ll n$ iterations and restarted
+with the last iterate $x_{m}$ as the initial guess of the next restart. Moreover, GMRES sometimes
+does not converge or takes too many iterations to satisfy the convergence criterion. Therefore, in
+most cases, GMRES must include a preconditioning step to improve its convergence. The preconditioning
+technique replaces the system~(\ref{eq:01}) with the modified system:
+\begin{equation}M^{-1}Ax=M^{-1}b.\end{equation}
+
+Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts.
+The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning
+matrix. At each iteration $k$, the Arnoldi process is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01})
+to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$.
+Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes
+the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$).
+In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or
+the maximum number of iterations is reached.
+
+\begin{algorithm}[!h]
+ \SetAlgoLined
+ \KwIn{$A$ (matrix), $b$ (vector), $M$ (preconditioning matrix),
+$x_{0}$ (initial guess), $\varepsilon$ (tolerance threshold), $max$ (maximum number of iterations),
+$m$ (number of iterations of the Arnoldi process)}
+ \KwOut{$x$ (solution vector)}
+ \BlankLine
+ $r_{0} \leftarrow M^{-1}(b - Ax_{0})$\;
+ $\beta \leftarrow \|r_{0}\|_{2}$\;
+ $\alpha \leftarrow \|M^{-1}b\|_{2}$\;
+ $convergence \leftarrow false$\;
+ $k \leftarrow 1$\;
+ \BlankLine
+ \While{$(\neg convergence)$}{
+ $v_{1} \leftarrow r_{0} / \beta$\;
+ \For{$j=1$ {\bf to} $m$}{
+ $w_{j} \leftarrow M^{-1}Av_{j}$\;
+ \For{$i=1$ {\bf to} $j$}{
+ $h_{i,j} \leftarrow (w_{j},v_{i})$\;
+ $w_{j} \leftarrow w_{j} - h_{i,j} \times v_{i}$\;
+ }
+ $h_{j+1,j} \leftarrow \|w_{j}\|_{2}$\;
+ $v_{j+1} \leftarrow w_{j} / h_{j+1,j}$\;
+ }
+ \BlankLine
+ Put $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ Hessenberg matrix of order $(m+1)\times m$\;
+ Solve the least-squares problem of size $m$: $\underset{y\in\mathbb{R}^{m}}{\min}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
+ \BlankLine
+ $x_{m} \leftarrow x_{0} + V_{m}y$\;
+ $r_{m} \leftarrow M^{-1}(b-Ax_{m})$\;
+ $\beta \leftarrow \|r_{m}\|_{2}$\;
+ \BlankLine
+ \eIf{$(\frac{\beta}{\alpha}<\varepsilon)$ {\bf or} $(k\geq max)$}{
+ $convergence \leftarrow true$\;
+ }{
+ $x_{0} \leftarrow x_{m}$\;
+ $r_{0} \leftarrow r_{m}$\;
+ $k \leftarrow k + 1$\;
+ }
+ }
+\caption{Left-preconditioned GMRES algorithm with restarts}
+\label{alg:01}
+\end{algorithm}
+
+
+
+%%--------------------%%
+%% SECTION 5 %%
+%%--------------------%%
+\section{Parallel GMRES method on {GPU} clusters}
+\label{sec:05}
+
+\subsection{Parallel implementation on a GPU cluster}
+\label{sec:05.01}
+The implementation of the GMRES algorithm on a GPU cluster is performed by using
+heterogeneous parallel programming. We use the programming language CUDA for the
+GPUs and the parallel environment MPI for the distribution of the computations between
+the GPU computing nodes. In this work, a GPU computing node is composed of a GPU and
+a CPU core managed by an MPI process.
+
+Let us consider a cluster composed of $p$ GPU computing nodes. First, the sparse linear
+system~(\ref{eq:01}) is split into $p$ sub-linear systems, each of which is assigned to a GPU
+computing node. We partition row-by-row the sparse matrix $A$ and both vectors $x$ and
+$b$ into $p$ parts (see Figure~\ref{fig:01}). The data issued from the partitioning operation
+are off-loaded to the GPU global memories to be processed by the GPUs. Then, all the
+computing nodes of the GPU cluster execute the same GMRES iterative algorithm but on
+different data. Finally, the GPU computing nodes synchronize their computations by using
+MPI communication routines to solve the global sparse linear system. In what follows,
+the computing nodes sharing data are called the neighboring nodes.
+
+\begin{figure}
+\centering
+ \includegraphics[width=80mm,keepaspectratio]{Figures/partition}
+\caption{Data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ in $4$ partitions}
+\label{fig:01}
+\end{figure}
+
+In order to exploit the computing power of the GPUs, we have to perform as many computations as possible
+on the GPUs to avoid the data transfers between the GPU and its host (CPU), and to maximize the
+utilization of the GPU cores to hide the global memory access latency. The implementation of the GMRES
+algorithm is performed by executing the functions operating on vectors and matrices as kernels
+on the GPUs. These operations are often easy to parallelize and more efficient on parallel architectures
+when they operate on large vectors. We use the fastest routines of the CUBLAS library
+(CUDA Basic Linear Algebra Subroutines) to implement the dot product (\verb+cublasDdot()+),
+the Euclidean norm (\verb+cublasDnrm2()+) and the AXPY operation (\verb+cublasDaxpy()+).
+In addition, we have coded in CUDA a kernel for the scalar-vector product (lines~$7$ and~$15$
+of Algorithm~\ref{alg:01}), a kernel for solving the least-squares problem (line~$18$) and a
+kernel for solution vector updates (line~$19$).
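+
+For illustration, a minimal sketch (using the CUBLAS v2 API, with hypothetical device pointers \verb+d_w+ and
+\verb+d_v+) of one orthogonalization step of Algorithm~\ref{alg:01} chained on the GPU could be:
+\begin{verbatim}
+#include <cublas_v2.h>
+
+/* one orthogonalization step of the Arnoldi process (sketch):
+   h <- (w, v); w <- w - h * v, with d_w and d_v resident on the GPU */
+void orthogonalize_step(cublasHandle_t handle, int n,
+                        double *d_w, const double *d_v, double *h)
+{
+  double minus_h;
+  cublasDdot(handle, n, d_w, 1, d_v, 1, h);          /* h_{i,j} <- (w_j, v_i)    */
+  minus_h = -(*h);
+  cublasDaxpy(handle, n, &minus_h, d_v, 1, d_w, 1);  /* w_j <- w_j - h_{i,j} v_i */
+}
+\end{verbatim}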
+
+The solution of the least-squares problem in the GMRES algorithm is based on:
+\begin{itemize}
+ \item a QR factorization of the Hessenberg matrix $\bar{H}$ by using plane rotations and,
+ \item a backward substitution to compute the vector $y$ minimizing the residual.
+\end{itemize}
+This operation is not easy to parallelize and is not worth parallelizing on the GPU.
+However, the size $m$ of the linear least-squares problem to be solved in the GMRES method with
+restarts is very small. So, this problem is solved sequentially by a single GPU thread.
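+
+A minimal sketch of such a single-thread kernel (hypothetical names, with the triangular matrix $R$ stored
+column-major in \verb+r+ and the rotated right-hand side in \verb+g+), launched as
+\verb+backsub<<<1,1>>>(m, d_r, d_g, d_y)+, could be:
+\begin{verbatim}
+/* Solve the m x m upper triangular system R y = g with one GPU thread.
+   r is column-major with leading dimension m. */
+__global__ void backsub(int m, const double *r, const double *g, double *y)
+{
+  for (int i = m - 1; i >= 0; i--) {
+    double s = g[i];
+    for (int j = i + 1; j < m; j++)
+      s -= r[j * m + i] * y[j];
+    y[i] = s / r[i * m + i];
+  }
+}
+\end{verbatim}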
+
+The most important operation in the GMRES method is the sparse matrix-vector multiplication.
+It is quite expensive for large size matrices in terms of execution time and memory space. In
+addition, it performs irregular memory accesses to read the nonzero values of the sparse matrix,
+implying non-coalesced accesses to the GPU global memory which slow down the performances of
+the GPUs. So we use the HYB kernel developed and optimized by Nvidia~\cite{CUSP} which gives on
+average the best performance in sparse matrix-vector multiplications on GPUs~\cite{Bel09}. The
+HYB (Hybrid) storage format is the combination of two sparse storage formats: the Ellpack format
+(ELL) and the Coordinate format (COO). It stores a typical number of nonzero values per row in ELL
+format and the remaining entries of the exceptional rows in COO format. It combines the efficiency of
+ELL, due to the regularity of its memory accesses, and the flexibility of COO, which is insensitive
+to the matrix structure.
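+
+For illustration only, a CPU reference sketch of the HYB data layout and of the corresponding matrix-vector
+product (this is not Nvidia's optimized GPU kernel) could be:
+\begin{verbatim}
+/* HYB format: a dense ELL part of fixed width plus a COO part for the
+   few rows with extra nonzeros (illustrative data layout only) */
+typedef struct {
+  int     n_rows;
+  int     ell_width;   /* typical number of nonzeros per row  */
+  int    *ell_cols;    /* n_rows * ell_width, padded with -1  */
+  double *ell_vals;    /* n_rows * ell_width, padded with 0   */
+  int     coo_nnz;     /* exceptional entries                 */
+  int    *coo_rows, *coo_cols;
+  double *coo_vals;
+} hyb_matrix;
+
+void hyb_spmv(const hyb_matrix *A, const double *x, double *y)
+{
+  for (int i = 0; i < A->n_rows; i++) {
+    double s = 0.0;
+    for (int k = 0; k < A->ell_width; k++) {     /* regular ELL part */
+      int j = A->ell_cols[i * A->ell_width + k];
+      if (j >= 0) s += A->ell_vals[i * A->ell_width + k] * x[j];
+    }
+    y[i] = s;
+  }
+  for (int k = 0; k < A->coo_nnz; k++)           /* irregular COO part */
+    y[A->coo_rows[k]] += A->coo_vals[k] * x[A->coo_cols[k]];
+}
+\end{verbatim}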
+
+In the parallel GMRES algorithm, the GPU computing nodes must exchange their shared data in
+order to construct the global vector necessary to compute the parallel sparse matrix-vector
+multiplication (SpMV). In fact, each computing node holds locally the vector elements corresponding
+to the rows of its sparse sub-matrix and, in order to compute its part of the SpMV, it also
+requires the vector elements of its neighboring nodes corresponding to the column indices in
+which its local sub-matrix has nonzero values. Consequently, each computing node manages a global
+vector composed of a local vector of size $\frac{n}{p}$ and a shared vector of size $S$:
+\begin{equation}
+ S = bw - \frac{n}{p},
+\label{eq:11}
+\end{equation}
+where $\frac{n}{p}$ is the size of the local vector and $bw$ is the bandwidth of the local sparse
+sub-matrix which represents the number of columns between the minimum and the maximum column indices
+(see Figure~\ref{fig:01}). In order to improve memory accesses, we use the texture memory to
+cache elements of the global vector.
+
+On a GPU cluster, the exchanges of the shared vector elements between the neighboring nodes are
+performed as follows (a sketch is given after this list):
+\begin{itemize}
+ \item at the level of the sending node: data transfers of the shared data from the GPU global memory
+to the CPU memory by using the CUBLAS communication routine \verb+cublasGetVector()+,
+ \item data exchanges between the CPUs by the MPI communication routine \verb+MPI_Alltoallv()+ and,
+ \item at the level of the receiving node: data transfers of the received shared data from the CPU
+memory to the GPU global memory by using the CUBLAS communication routine \verb+cublasSetVector()+.
+\end{itemize}
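+
+A minimal sketch of this three-step exchange (hypothetical buffer names, error checking omitted) could be:
+\begin{verbatim}
+#include <mpi.h>
+#include <cublas.h>
+
+/* d_send/d_recv are GPU buffers, h_send/h_recv their CPU mirrors, and
+   sendcounts/sdispls/recvcounts/rdispls the usual MPI_Alltoallv arrays */
+void exchange_shared(int n_send, int n_recv,
+                     const double *d_send, double *h_send,
+                     double *h_recv, double *d_recv,
+                     int *sendcounts, int *sdispls,
+                     int *recvcounts, int *rdispls)
+{
+  /* 1) GPU -> CPU transfer on the sending side */
+  cublasGetVector(n_send, sizeof(double), d_send, 1, h_send, 1);
+
+  /* 2) CPU <-> CPU exchanges between the MPI processes */
+  MPI_Alltoallv(h_send, sendcounts, sdispls, MPI_DOUBLE,
+                h_recv, recvcounts, rdispls, MPI_DOUBLE, MPI_COMM_WORLD);
+
+  /* 3) CPU -> GPU transfer on the receiving side */
+  cublasSetVector(n_recv, sizeof(double), h_recv, 1, d_recv, 1);
+}
+\end{verbatim}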
+
+\subsection{Experiments}
+\label{sec:05.02}
+The experiments are done on a cluster composed of six machines interconnected by a $20$~GB/s Infiniband
+network. Each machine is equipped with a quad-core Xeon E5530 running at $2.4$~GHz, provides $12$~GB of RAM
+with a memory bandwidth of $25.6$~GB/s and hosts two Tesla C1060 GPUs. Each GPU
+is composed of $240$ cores running at $1.3$~GHz and has $4$~GB of global memory with a memory bandwidth
+of $102$~GB/s. The GPU is connected to the CPU via a PCI-Express 16x Gen 2.0 link with a throughput of $8$~GB/s.
+Figure~\ref{fig:02} shows the general scheme of the GPU cluster.
+
+\begin{figure}[!h]
+\centering
+ \includegraphics[width=80mm,keepaspectratio]{Figures/clusterGPU}
+\caption{A cluster composed of six machines, each equipped with two Tesla C1060 GPUs}
+\label{fig:02}
+\end{figure}
+
+A Linux OS with kernel version 2.6.18 is installed on the six machines. Both the CPU and GPU versions of
+the GMRES algorithm are coded in the C programming language. CUDA version 4.0~\cite{ref19} is used for programming
+the GPUs, together with the CUBLAS library~\cite{ref37} for the functions operating on vectors. Finally, the MPI routines
+of OpenMPI 1.3.3 are used to carry out the communications between the CPU cores.
+
+The experiments are done on linear systems associated to sparse matrices chosen from the Davis collection of the
+University of Florida~\cite{Dav97}. They are matrices arising in real-world applications. Table~\ref{tab:01} shows
+the main characteristics of these sparse matrices and Figure~\ref{fig:03} shows their sparse structures. For
+each matrix, we give the number of rows (column~$3$ in Table~\ref{tab:01}), the number of nonzero values (column~$4$)
+and the bandwidth (column~$5$).
+
+\begin{table}
+\begin{center}
+\begin{tabular}{|c|c|r|r|r|}
+\hline
+Matrix type & Name & \# Rows & \# Nonzeros & Bandwidth \\\hline \hline
+\multirow{6}{*}{Symmetric} & 2cubes\_sphere & 101 492 & 1 647 264 & 100 464 \\
+ & ecology2 & 999 999 & 4 995 991 & 2 001 \\
+ & finan512 & 74 752 & 596 992 & 74 725 \\
+ & G3\_circuit & 1 585 478 & 7 660 826 & 1 219 059 \\
+ & shallow\_water2 & 81 920 & 327 680 & 58 710 \\
+ & thermal2 & 1 228 045 & 8 580 313 & 1 226 629 \\ \hline \hline
+\multirow{6}{*}{Asymmetric} & cage13 & 445 315 & 7 479 343 & 318 788 \\
+ & crashbasis & 160 000 & 1 750 416 & 120 202 \\
+ & FEM\_3D\_thermal2 & 147 900 & 3 489 300 & 117 827 \\
+ & language & 399 130 & 1 216 334 & 398 622 \\
+ & poli\_large & 15 575 & 33 074 & 15 575 \\
+ & torso3 & 259 156 & 4 429 042 & 216 854 \\ \hline
+\end{tabular}
+\caption{Main characteristics of the sparse matrices chosen from the Davis collection}
+\label{tab:01}
+\end{center}
+\end{table}
+
+\begin{figure}
+\centering
+ \includegraphics[width=120mm,keepaspectratio]{Figures/matrices}
+\caption{Structures of the sparse matrices chosen from the Davis collection}
+\label{fig:03}
+\end{figure}
+
+All the experiments are performed on double-precision data. The parameters of the parallel
+GMRES algorithm are as follows: the tolerance threshold $\varepsilon=10^{-12}$, the maximum
+number of iterations $max=500$, the Arnoldi process is limited to $m=16$ iterations, the elements
+of the initial guess $x_0$ are initialized to $0$ and those of the right-hand side vector are
+initialized to $1$. For simplicity's sake, we chose the preconditioning matrix $M$ as the
+main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse
+matrix $M^{-1}$ and it provides a relatively good preconditioning in most cases. Finally, we set
+the size of a thread block on the GPUs to $512$ threads.
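+
+With this choice of $M$, applying $M^{-1}$ to a vector reduces to an element-wise division by the diagonal
+of $A$; a minimal CUDA sketch (hypothetical names) is:
+\begin{verbatim}
+/* z <- M^{-1} r with M = diag(A); d_diag holds the diagonal of A on the GPU */
+__global__ void jacobi_precond(int n, const double *d_diag,
+                               const double *d_r, double *d_z)
+{
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  if (i < n)
+    d_z[i] = d_r[i] / d_diag[i];
+}
+\end{verbatim}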
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|c|c|c|}
+\hline
+Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$ & $prec$ & $\Delta$ \\ \hline \hline
+2cubes\_sphere & 0.234 s & 0.124 s & 1.88 & 21 & 2.10e-14 & 3.47e-18 \\
+ecology2 & 0.076 s & 0.035 s & 2.15 & 21 & 4.30e-13 & 4.38e-15 \\
+finan512 & 0.073 s & 0.052 s & 1.40 & 17 & 3.21e-12 & 5.00e-16 \\
+G3\_circuit & 1.016 s & 0.649 s & 1.56 & 22 & 1.04e-12 & 2.00e-15 \\
+shallow\_water2 & 0.061 s & 0.044 s & 1.38 & 17 & 5.42e-22 & 2.71e-25 \\
+thermal2 & 1.666 s & 0.880 s & 1.89 & 21 & 6.58e-12 & 2.77e-16 \\ \hline \hline
+cage13 & 0.721 s & 0.338 s & 2.13 & 26 & 3.37e-11 & 2.66e-15 \\
+crashbasis & 1.349 s & 0.830 s & 1.62 & 121 & 9.10e-12 & 6.90e-12 \\
+FEM\_3D\_thermal2 & 0.797 s & 0.419 s & 1.90 & 64 & 3.87e-09 & 9.09e-13 \\
+language & 2.252 s & 1.204 s & 1.87 & 90 & 1.18e-10 & 8.00e-11 \\
+poli\_large & 0.097 s & 0.095 s & 1.02 & 69 & 4.98e-11 & 1.14e-12 \\
+torso3 & 4.242 s & 2.030 s & 2.09 & 175 & 2.69e-10 & 1.78e-14 \\ \hline
+\end{tabular}
+\caption{Performances of the parallel GMRES algorithm on a cluster of 24 CPU cores vs. a cluster of 12 GPUs}
+\label{tab:02}
+\end{center}
+\end{table}
+
+In Table~\ref{tab:02}, we give the performances of the parallel GMRES algorithm for solving the linear
+systems associated to the sparse matrices shown in Table~\ref{tab:01}. The second and third columns show
+the execution times in seconds obtained on a cluster of 24 CPU cores and on a cluster of 12 GPUs, respectively.
+The fourth column shows the ratio $\tau$ between the CPU time $Time_{cpu}$ and the GPU time $Time_{gpu}$ that
+is computed as follows:
+\begin{equation}
+ \tau = \frac{Time_{cpu}}{Time_{gpu}}.
+\end{equation}
+From these ratios, we can notice that the use of many GPUs is not worthwhile for solving small sparse linear
+systems. Solving these sparse linear systems on a cluster of 12 GPUs is only about as fast as on a cluster of 24 CPU
+cores. Indeed, the small sizes of the sparse matrices do not allow us to maximize the utilization of the GPU
+cores of the cluster. The fifth, sixth and seventh columns show, respectively, the number of iterations performed
+by the parallel GMRES algorithm to converge, the precision of the solution, $prec$, computed on the GPU
+cluster and the difference, $\Delta$, between the solutions computed on the CPU and the GPU clusters. The
+last two parameters are used to validate the results obtained on the GPU cluster and they are computed as
+follows:
+\begin{equation}
+\begin{array}{c}
+ prec = \|M^{-1}(b-Ax^{gpu})\|_{\infty}, \\
+ \Delta = \|x^{cpu}-x^{gpu}\|_{\infty},
+\end{array}
+\end{equation}
+where $x^{cpu}$ and $x^{gpu}$ are the solutions computed, respectively, on the CPU cluster and on the GPU cluster.
+We can see that the precision of the solutions computed on the GPU cluster is sufficient, about $10^{-10}$,
+and that the parallel GMRES algorithm computes almost the same solutions on the CPU and GPU clusters, with $\Delta$ varying
+from $10^{-11}$ to $10^{-25}$.
+
+Afterwards, we evaluate the performances of the parallel GMRES algorithm for solving large linear systems. We have
+developed in the C programming language a generator of large sparse matrices having a band structure, which arises
+in most numerical problems. This generator uses the sparse matrices of the Davis collection as the initial
+matrices to build the large band matrices. It is executed in parallel by all the MPI processes of the cluster
+so that each process constructs its own sub-matrix as a rectangular block of the global sparse matrix. Each process
+$i$ computes the size $n_i$ and the offset $offset_i$ of its sub-matrix in the global sparse matrix according to the
+size $n$ of the linear system to be solved and the number of the GPU computing nodes $p$ as follows:
+\begin{equation}
+ n_i = \frac{n}{p},
+\end{equation}
+\begin{equation}
+ offset_i = \left\{
+ \begin{array}{l}
+ 0\mbox{~if~}i=0,\\
+ offset_{i-1}+n_{i-1}\mbox{~otherwise.}
+ \end{array}
+ \right.
+\end{equation}
+So each process $i$ makes several copies of the same initial matrix chosen from the Davis collection and
+puts all these copies on the main diagonal of the global matrix in order to construct a band matrix. Moreover,
+it fills the empty spaces between two successive copies with small copies, \textit{lower\_copy} and \textit{upper\_copy},
+of the same initial matrix. Figure~\ref{fig:04} shows the generation of a sparse band matrix by four computing nodes.
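+
+A minimal sketch (hypothetical function name, assuming $n$ is divisible by $p$) of this computation performed
+by each MPI process could be:
+\begin{verbatim}
+#include <mpi.h>
+
+/* each MPI process computes the size and offset of its sub-matrix,
+   assuming the global size n is divisible by the number of processes p */
+void local_block(int n, int *n_i, int *offset_i)
+{
+  int rank, p;
+  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+  MPI_Comm_size(MPI_COMM_WORLD, &p);
+  *n_i      = n / p;
+  *offset_i = rank * (*n_i);  /* equals offset_{i-1} + n_{i-1} with equal blocks */
+}
+\end{verbatim}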
+
+\begin{figure}
+\centering
+ \includegraphics[width=100mm,keepaspectratio]{Figures/generation}
+\caption{Example of the generation of a large sparse and band matrix by four computing nodes.}
+\label{fig:04}
+\end{figure}
+
+Table~\ref{tab:03} shows the main characteristics (the number of nonzero values and the bandwidth) of the
+large sparse matrices generated from those of the Davis collection. These matrices are associated to
+linear systems of 25 million unknowns (each generated sparse matrix has 25 million rows). In Table~\ref{tab:04}
+we show the performances of the parallel GMRES algorithm for solving the large linear systems associated to the
+sparse band matrices of Table~\ref{tab:03}. The fourth column gives the ratio between the execution time
+spent on a cluster of 24 CPU cores and that spent on a cluster of 12 GPUs. We can notice from these ratios
+that for solving large sparse linear systems the GPU cluster is more efficient (about 5 times faster) than the CPU
+cluster. The computing power of the GPUs allows us to accelerate the computation of the functions operating
+on large vectors in the parallel GMRES algorithm.
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|r|r|}
+\hline
+Matrix type & Name & \# nonzeros & Bandwidth \\ \hline \hline
+\multirow{6}{*}{Symmetric} & 2cubes\_sphere & 413 703 602 & 198 836 \\
+ & ecology2 & 124 948 019 & 2 002 \\
+ & finan512 & 278 175 945 & 123 900 \\
+ & G3\_circuit & 125 262 292 & 1 891 887 \\
+ & shallow\_water2 & 100 235 292 & 62 806 \\
+ & thermal2 & 175 300 284 & 2 421 285 \\ \hline \hline
+\multirow{6}{*}{Asymmetric} & cage13 & 435 770 480 & 352 566 \\
+ & crashbasis & 409 291 236 & 200 203 \\
+ & FEM\_3D\_thermal2 & 595 266 787 & 206 029 \\
+ & language & 76 912 824 & 398 626 \\
+ & poli\_large & 53 322 580 & 15 576 \\
+ & torso3 & 433 795 264 & 328 757 \\ \hline
+\end{tabular}
+\caption{Main characteristics of the sparse and band matrices generated from the sparse matrices of the Davis collection.}
+\label{tab:03}
+\end{center}
+\end{table}
+
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|c|c|c|}
+\hline
+Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$& $prec$ & $\Delta$ \\ \hline \hline
+2cubes\_sphere & 3.683 s & 0.870 s & 4.23 & 21 & 2.11e-14 & 8.67e-18 \\
+ecology2 & 2.570 s & 0.424 s & 6.06 & 21 & 4.88e-13 & 2.08e-14 \\
+finan512 & 2.727 s & 0.533 s & 5.11 & 17 & 3.22e-12 & 8.82e-14 \\
+G3\_circuit & 4.656 s & 1.024 s & 4.54 & 22 & 1.04e-12 & 5.00e-15 \\
+shallow\_water2 & 2.268 s & 0.384 s & 5.91 & 17 & 5.54e-21 & 7.92e-24 \\
+thermal2 & 4.650 s & 1.130 s & 4.11 & 21 & 8.89e-12 & 3.33e-16 \\ \hline \hline
+cage13 & 6.068 s & 1.054 s & 5.75 & 26 & 3.29e-11 & 1.59e-14 \\
+crashbasis & 25.906 s & 4.569 s & 5.67 & 135 & 6.81e-11 & 4.61e-15 \\
+FEM\_3D\_thermal2 & 13.555 s & 2.654 s & 5.11 & 64 & 3.88e-09 & 1.82e-12 \\
+language & 13.538 s & 2.621 s & 5.16 & 89 & 2.11e-10 & 1.60e-10 \\
+poli\_large & 8.619 s & 1.474 s & 5.85 & 69 & 5.05e-11 & 6.59e-12 \\
+torso3 & 35.213 s & 6.763 s & 5.21 & 175 & 2.69e-10 & 2.66e-14 \\ \hline
+\end{tabular}
+\caption{Performances of the parallel GMRES algorithm for solving large sparse linear systems associated
+to band matrices on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.}
+\label{tab:04}
+\end{center}
+\end{table}
+
+
+%%--------------------%%
+%% SECTION 6 %%
+%%--------------------%%
+\section{Minimization of communications}
+\label{sec:06}
+The parallel sparse matrix-vector multiplication requires data exchanges between the GPU computing nodes
+to construct the global vector. On a GPU cluster, these exchanges involve both communications between the GPU nodes and
+data transfers between the GPUs and their host CPUs. In fact, a communication between two GPU nodes implies:
+a data transfer from the GPU memory to the CPU memory at the sending node, an MPI communication between the CPUs
+of the two GPU nodes, and a data transfer from the CPU memory to the GPU memory at the receiving node. Moreover,
+the data transfers between a GPU and a CPU are considered as the most expensive communications on a GPU cluster.
+For example, in our GPU cluster, the data throughput between a GPU and a CPU is 8 GB/s, which is 2.5 times
+lower than the data transfer rate between CPUs (20 GB/s) and about 13 times lower than the memory bandwidth of the
+GPU global memory (102 GB/s). In this section, we propose two solutions to improve the execution time of the
+parallel GMRES algorithm on GPU clusters.
+
+\subsection{Compressed storage format of the sparse vectors}
+\label{sec:06.01}
+In Section~\ref{sec:05.01}, the SpMV multiplication uses a global vector having a size equivalent to the matrix
+bandwidth (see Formula~(\ref{eq:11})). However, we can notice that an SpMV multiplication often does not need all
+the elements of the global vector composed of the local and shared sub-vectors. For example in Figure~\ref{fig:01},
+node 1 only needs a single vector element from node 0 (element 1), two elements from node 2 (elements 8
+and 9) and two elements from node 3 (elements 13 and 14). Therefore, to avoid the communication overhead
+of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary
+to perform their local sparse matrix-vector multiplications.
+
+\begin{figure}
+\centering
+ \includegraphics[width=120mm,keepaspectratio]{Figures/compress}
+\caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.}
+\label{fig:05}
+\end{figure}
+
+We propose to use a compressed storage format of the sparse global vector. In Figure~\ref{fig:05}, we show an
+example of the data exchanges between node 1 and its neighbors to construct the compressed global vector.
+First, the neighboring nodes 0, 2 and 3 determine the vector elements needed by node 1 and, then, they send
+only these elements to it. Node 1 receives these shared elements in a compressed vector. However, to compute
+the sparse matrix-vector multiplication, it must first copy the received elements to the corresponding indices
+in the global vector. In order to avoid this process at each iteration, we propose to reorder the columns of the
+local sub-matrices so as to use the shared vectors in their compressed storage format (see Figure~\ref{fig:06}).
+For performance purposes, the computation of the shared data to send to the neighboring nodes is performed
+by the GPU as a kernel. In addition, we use the MPI point-to-point communication routines: a blocking send routine
+\verb+MPI_Send()+ and a nonblocking receive routine \verb+MPI_Irecv()+.
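+
+A minimal sketch of this exchange with one neighbor (hypothetical names; the index list \verb+send_idx+ of the
+elements required by the neighbor is assumed to be computed once during a setup phase) could be:
+\begin{verbatim}
+#include <mpi.h>
+#include <cublas.h>
+
+/* gather on the GPU the vector elements required by one neighboring node */
+__global__ void gather(int count, const int *send_idx,
+                       const double *d_x, double *d_buf)
+{
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  if (i < count)
+    d_buf[i] = d_x[send_idx[i]];
+}
+
+/* exchange of the compressed shared vectors with one neighbor (sketch) */
+void exchange_with_neighbor(int neighbor, int n_send, int n_recv,
+                            const double *d_buf, double *h_send,
+                            double *h_recv, double *d_shared)
+{
+  MPI_Request req;
+  MPI_Irecv(h_recv, n_recv, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD, &req);
+  cublasGetVector(n_send, sizeof(double), d_buf, 1, h_send, 1);
+  MPI_Send(h_send, n_send, MPI_DOUBLE, neighbor, 0, MPI_COMM_WORLD);
+  MPI_Wait(&req, MPI_STATUS_IGNORE);
+  /* copy the received compressed elements back to the GPU global memory */
+  cublasSetVector(n_recv, sizeof(double), h_recv, 1, d_shared, 1);
+}
+\end{verbatim}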
+
+\begin{figure}
+\centering
+ \includegraphics[width=100mm,keepaspectratio]{Figures/reorder}
+\caption{Reordering of the columns of a local sparse matrix.}
+\label{fig:06}
+\end{figure}
+
+Table~\ref{tab:05} shows the performances of the parallel GMRES algorithm using the compressed storage format
+of the sparse global vector. The results are obtained from solving large linear systems associated to the sparse
+band matrices presented in Table~\ref{tab:03}. We can see from Table~\ref{tab:05} that the execution times
+of the parallel GMRES algorithm on a cluster of 12 GPUs are improved by about 38\% compared to those presented
+in Table~\ref{tab:04}. In addition, the ratios between the execution times spent on the cluster of 24 CPU cores
+and those spent on the cluster of 12 GPUs have increased. Indeed, the reordering of the sparse sub-matrices and
+the use of a compressed storage format for the sparse vectors minimize the communication overheads between the
+GPU computing nodes.
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|c|c|c|}
+\hline
+Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$& \# $iter$& $prec$ & $\Delta$ \\ \hline \hline
+2cubes\_sphere & 3.597 s & 0.514 s & 6.99 & 21 & 2.11e-14 & 8.67e-18 \\
+ecology2 & 2.549 s & 0.288 s & 8.83 & 21 & 4.88e-13 & 2.08e-14 \\
+finan512 & 2.660 s & 0.377 s & 7.05 & 17 & 3.22e-12 & 8.82e-14 \\
+G3\_circuit & 3.139 s & 0.480 s & 6.53 & 22 & 1.04e-12 & 5.00e-15 \\
+shallow\_water2 & 2.195 s & 0.253 s & 8.68 & 17 & 5.54e-21 & 7.92e-24 \\
+thermal2 & 3.206 s & 0.463 s & 6.93 & 21 & 8.89e-12 & 3.33e-16 \\ \hline \hline
+cage13 & 5.560 s & 0.663 s & 8.39 & 26 & 3.29e-11 & 1.59e-14 \\
+crashbasis & 25.802 s & 3.511 s & 7.35 & 135 & 6.81e-11 & 4.61e-15 \\
+FEM\_3D\_thermal2 & 13.281 s & 1.572 s & 8.45 & 64 & 3.88e-09 & 1.82e-12 \\
+language & 12.553 s & 1.760 s & 7.13 & 89 & 2.11e-10 & 1.60e-10 \\
+poli\_large & 8.515 s & 1.053 s & 8.09 & 69 & 5.05e-11 & 6.59e-12 \\
+torso3 & 31.463 s & 3.681 s & 8.55 & 175 & 2.69e-10 & 2.66e-14 \\ \hline
+\end{tabular}
+\caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
+vectors for solving large sparse linear systems associated to band matrices on a cluster of 24 CPU cores vs.
+a cluster of 12 GPUs.}
+\label{tab:05}
+\end{center}
+\end{table}
+
+
+\subsection{Hypergraph partitioning}
+\label{sec:06.02}
+In this section, we use another structure of the sparse matrices. We are interested in sparse matrices
+whose nonzero values are distributed over their large bandwidths. We developed in the C programming
+language a generator of sparse matrices having five bands (see Figure~\ref{fig:07}). The principle of
+this generator is the same as the one presented in Section~\ref{sec:05.02}. However, the copies made from the
+initial sparse matrix, chosen from the Davis collection, are placed on the main diagonal and on two
+off-diagonals on each side of the main diagonal. Table~\ref{tab:06} shows the main characteristics
+of the sparse matrices of 25 million rows generated from those of the Davis collection. We can
+see in the fourth column that the bandwidths of these matrices are as large as their sizes.
+
+\begin{figure}
+\centering
+ \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1}
+\caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.}
+\label{fig:07}
+\end{figure}
+
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|r|r|}
+\hline
+Matrix type & Name & \# nonzeros & Bandwidth \\ \hline \hline
+\multirow{6}{*}{Symmetric} & 2cubes\_sphere & 829 082 728 & 24 999 999 \\
+ & ecology2 & 254 892 056 & 25 000 000 \\
+ & finan512 & 556 982 339 & 24 999 973 \\
+ & G3\_circuit & 257 982 646 & 25 000 000 \\
+ & shallow\_water2 & 200 798 268 & 25 000 000 \\
+ & thermal2 & 359 340 179 & 24 999 998 \\ \hline \hline
+\multirow{6}{*}{Asymmetric} & cage13 & 879 063 379 & 24 999 998 \\
+ & crashbasis & 820 373 286 & 24 999 803 \\
+ & FEM\_3D\_thermal2 & 1 194 012 703 & 24 999 998 \\
+ & language & 155 261 826 & 24 999 492 \\
+ & poli\_large & 106 680 819 & 25 000 000 \\
+ & torso3 & 872 029 998 & 25 000 000 \\ \hline
+\end{tabular}
+\caption{Main characteristics of the sparse matrices having five bands, generated from the sparse matrices of the Davis collection.}
+\label{tab:06}
+\end{center}
+\end{table}
+
+In Table~\ref{tab:07} we give the performances of the parallel GMRES algorithm on the CPU and GPU
+clusters for solving the large linear systems associated to the sparse matrices shown in Table~\ref{tab:06}.
+We can notice from the ratios given in the fourth column that solving sparse linear systems associated
+to matrices having large bandwidths on the GPU cluster is not much faster than on the CPU cluster. This is due
+to the large total communication volume necessary to synchronize the computations over the cluster.
+In fact, the naive row-by-row or column-by-column partitioning of this type of sparse matrices links
+a GPU node to many neighboring nodes and produces a significant number of data dependencies between
+the different GPU nodes.
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|c|c|c|}
+\hline
+Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$& $prec$ & $\Delta$ \\ \hline \hline
+2cubes\_sphere & 15.963 s & 7.250 s & 2.20 & 58 & 6.23e-16 & 3.25e-19 \\
+ecology2 & 3.549 s & 2.176 s & 1.63 & 21 & 4.78e-15 & 1.06e-15 \\
+finan512 & 3.862 s & 1.934 s & 1.99 & 17 & 3.21e-14 & 8.43e-17 \\
+G3\_circuit & 4.636 s & 2.811 s & 1.65 & 22 & 1.08e-14 & 1.77e-16 \\
+shallow\_water2 & 2.738 s & 1.539 s & 1.78 & 17 & 5.54e-23 & 3.82e-26 \\
+thermal2 & 5.017 s & 2.587 s & 1.94 & 21 & 8.25e-14 & 4.34e-18 \\ \hline \hline
+cage13 & 9.315 s & 3.227 s & 2.89 & 26 & 3.38e-13 & 2.08e-16 \\
+crashbasis & 35.980 s & 14.770 s & 2.43 & 127 & 1.17e-12 & 1.56e-17 \\
+FEM\_3D\_thermal2 & 24.611 s & 7.749 s & 3.17 & 64 & 3.87e-11 & 2.84e-14 \\
+language & 16.859 s & 9.697 s & 1.74 & 89 & 2.17e-12 & 1.70e-12 \\
+poli\_large & 10.200 s & 6.534 s & 1.56 & 69 & 5.14e-13 & 1.63e-13 \\
+torso3 & 49.074 s & 19.397 s & 2.53 & 175 & 2.69e-12 & 2.77e-16 \\ \hline
+\end{tabular}
+\caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
+vectors for solving large sparse linear systems associated to matrices having five bands on a cluster
+of 24 CPU cores vs. a cluster of 12 GPUs.}
+\label{tab:07}
+\end{center}
+\end{table}
+
+We propose to use a hypergraph partitioning method, which is well adapted to numerous structures
+of sparse matrices~\cite{Cat99}. Indeed, it accurately models the communications between the computing
+nodes, especially for asymmetric and rectangular matrices, and it gives in most cases good reductions
+of the total communication volume. Nevertheless, it is more expensive in terms of execution time and
+memory consumption than graph-based partitioning methods.
+
+The sparse matrix $A$ of the linear system to be solved is modelled as a hypergraph
+$\mathcal{H}=(\mathcal{V},\mathcal{E})$ as follows:
+\begin{itemize}
+\item each matrix row $i$ ($0\leq i<n$) corresponds to a vertex $v_i\in\mathcal{V}$,
+\item each matrix column $j$ ($0\leq j<n$) corresponds to a hyperedge $e_j\in\mathcal{E}$, such that
+$v_i\in pins[e_j]$ for every nonzero value $a_{ij}$ of the matrix $A$,
+\item $w_i$ is the weight of vertex $v_i$,
+\item $c_j$ is the cost of hyperedge $e_j$.
+\end{itemize}
+A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as a set
+of $K$ pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$: $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$,
+such that $\mathcal{V}=\displaystyle\bigcup_{k=1}^K\mathcal{V}_{k}$. Each computing node is in charge of
+a vertex subset. Figure~\ref{fig:08} shows an example of a hypergraph partitioning of a sparse matrix
+of size $(9\times 9)$ into three parts. The circles and squares correspond, respectively, to the vertices
+and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two
+different parts. The connectivity $\lambda_j$ denotes the number of different parts spanned by the cut
+hyperedge $e_j$.
+
+\begin{figure}
+\centering
+ \includegraphics[width=130mm,keepaspectratio]{Figures/hypergraph}
+\caption{A hypergraph partitioning of a sparse matrix between three computing nodes.}
+\label{fig:08}
+\end{figure}
+
+The cut hyperedges model the communications between the different GPU computing nodes in the cluster,
+necessary to perform the SpMV multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic
+computations $b_i\leftarrow b_i+a_{ij}x_j$ of the SpMV multiplication which needs the $j^{th}$ element
+of vector $x$. Therefore pins of hyperedge $e_j$ ($pins[e_j]$) denote the set of matrix rows requiring
+the same vector element $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are:
+$pins[e_9]=\{v_2,v_5,v_9\}$ represents matrix rows 2, 5 and 9 requiring the vector element $x_9$
+to compute in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$
+and $b_9\leftarrow b_9+a_{99}x_9$. However, $x_9$ is a vector element of the computing node 3 and it must
+be sent to the neighboring nodes 1 and 2.
+
+The hypergraph partitioning allows the total communication volume to be reduced while maintaining the computational
+load balance between the computing nodes. Indeed, it minimizes the following sum:
+\begin{equation}
+\mathcal{X}(\mathcal{P}) = \displaystyle\sum_{e_j\in\mathcal{E}_C} c_j(\lambda_j-1),
+\end{equation}
+where $\mathcal{E}_C$ is the set of the cut hyperedges issued from the partitioning $\mathcal{P}$, $c_j$
+and $\lambda_j$ are, respectively, the cost and the connectivity of the cut hyperedge $e_j$. In addition,
+the hypergraph partitioning is constrained to maintain the load balance between the $K$ parts:
+\begin{equation}
+W_k\leq (1+\epsilon)W_{avg}\mbox{,~}(1\leq k\leq K)\mbox{~and~}(0<\epsilon<1),
+\end{equation}
+where $W_k$ is the sum of the vertex weights in the subset $\mathcal{V}_k$, $W_{avg}$ is the average part
+weight and $\epsilon$ is the maximum allowed imbalance ratio.
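+
+For instance (an illustrative example, not taken from Figure~\ref{fig:08}), if a partitioning produces two cut
+hyperedges $e_{a}$ and $e_{b}$ with unit costs and connectivities $\lambda_{a}=2$ and $\lambda_{b}=3$, then:
+\begin{equation*}
+\mathcal{X}(\mathcal{P}) = c_{a}(\lambda_{a}-1)+c_{b}(\lambda_{b}-1) = 1\times 1 + 1\times 2 = 3,
+\end{equation*}
+which corresponds to the three vector elements that must be sent to other parts during the SpMV multiplication.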
+
+The hypergraph partitioning is an NP-complete problem, but software tools using heuristics have been developed, for
+example: hMETIS~\cite{Kar98}, PaToH~\cite{Cata99} and Zoltan~\cite{Dev06}. Due to the large sizes of the
+linear systems to be solved, we use a parallel hypergraph partitioning which must be performed by at least
+two MPI processes. The hypergraph model $\mathcal{H}$ of the sparse matrix is split into $p$ (number of computing
+nodes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then the parallel partitioning
+is applied by using the MPI communication routines.
+
+Table~\ref{tab:08} shows the performances of the parallel GMRES algorithm for solving the linear systems
+associated to the sparse matrices presented in Table~\ref{tab:06}. In the experiments, we have used the
+compressed storage format of the sparse vectors and the parallel hypergraph partitioning developed in the
+Zoltan tool~\cite{ref20,ref21}. The parameters of the hypergraph partitioning are initialized as follows:
+\begin{itemize}
+\item The weight $w_i$ of each vertex $v_i$ is set to the number of nonzero values in the matrix row $i$,
+\item For simplicity's sake, the cost $c_j$ of each hyperedge $e_j$ is set to 1,
+\item The maximum imbalance ratio $\epsilon$ is limited to 10\%.
+\end{itemize}
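+
+As an illustration, a minimal sketch of how such a configuration might be set up with the Zoltan library (only
+the parameter setup is shown; the exact calls should be checked against the Zoltan documentation):
+\begin{verbatim}
+#include <mpi.h>
+#include <zoltan.h>
+
+/* sketch: configuration of Zoltan's parallel hypergraph (PHG) partitioner;
+   Zoltan_Initialize() is assumed to have been called after MPI_Init() */
+struct Zoltan_Struct *setup_partitioner(void)
+{
+  struct Zoltan_Struct *zz = Zoltan_Create(MPI_COMM_WORLD);
+  Zoltan_Set_Param(zz, "LB_METHOD", "HYPERGRAPH");   /* hypergraph partitioning     */
+  Zoltan_Set_Param(zz, "HYPERGRAPH_PACKAGE", "PHG"); /* parallel hypergraph package */
+  Zoltan_Set_Param(zz, "IMBALANCE_TOL", "1.10");     /* at most 10% of imbalance    */
+  /* the hypergraph itself (vertex weights = nnz per row, unit hyperedge costs)
+     is supplied through Zoltan's query callbacks before Zoltan_LB_Partition() */
+  return zz;
+}
+\end{verbatim}
+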
+We can notice from Table~\ref{tab:08} that the execution times on the cluster of 12 GPUs are significantly
+improved compared to those presented in Table~\ref{tab:07}. The hypergraph partitioning applied to the large
+sparse matrices having large bandwidths has improved the execution times on the GPU cluster by about 65\%.
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|c|c|c|}
+\hline
+Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# iter & $prec$ & $\Delta$ \\ \hline \hline
+2cubes\_sphere & 16.430 s & 2.840 s & 5.78 & 58 & 6.23e-16 & 3.25e-19 \\
+ecology2 & 3.152 s & 0.367 s & 8.59 & 21 & 4.78e-15 & 1.06e-15 \\
+finan512 & 3.672 s & 0.723 s & 5.08 & 17 & 3.21e-14 & 8.43e-17 \\
+G3\_circuit & 4.468 s & 0.971 s & 4.60 & 22 & 1.08e-14 & 1.77e-16 \\
+shallow\_water2 & 2.647 s & 0.312 s & 8.48 & 17 & 5.54e-23 & 3.82e-26 \\
+thermal2 & 4.190 s & 0.666 s & 6.29 & 21 & 8.25e-14 & 4.34e-18 \\ \hline \hline
+cage13 & 8.077 s & 1.584 s & 5.10 & 26 & 3.38e-13 & 2.08e-16 \\
+crashbasis & 35.173 s & 5.546 s & 6.34 & 127 & 1.17e-12 & 1.56e-17 \\
+FEM\_3D\_thermal2 & 24.825 s & 3.113 s & 7.97 & 64 & 3.87e-11 & 2.84e-14 \\
+language & 16.706 s & 2.522 s & 6.62 & 89 & 2.17e-12 & 1.70e-12 \\
+poli\_large & 12.715 s & 3.989 s & 3.19 & 69 & 5.14e-13 & 1.63e-13 \\
+torso3 & 48.459 s & 6.234 s & 7.77 & 175 & 2.69e-12 & 2.77e-16 \\ \hline
+\end{tabular}
+\caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse
+vectors and a hypergraph partitioning method for solving large sparse linear systems associated to matrices
+having five bands on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.}
+\label{tab:08}
+\end{center}
+\end{table}
+
+Table~\ref{tab:09} shows in the second and third columns the total communication volume on a cluster
+of 12 GPUs by using the row-by-row partitioning and the hypergraph partitioning, respectively. The total communication
+volume is defined as the total number of vector elements exchanged between the 12 GPUs. This table shows
+that the hypergraph partitioning splits the large sparse matrices so as to minimize the data
+dependencies between the GPU computing nodes. However, we can notice in the fourth column that the hypergraph
+partitioning takes much longer than the solution of the linear systems. As we have mentioned before, the hypergraph
+partitioning method is less efficient in terms of memory consumption and partitioning time than its graph counterpart.
+So for the applications which often reuse the same sparse matrices, we can perform the hypergraph partitioning
+only once and save its result in files to be reused several times. Therefore, this allows us to
+avoid partitioning the sparse matrices at each solution of the linear systems.
+
+\begin{table}[!h]
+\begin{center}
+\begin{tabular}{|c|c|c|c|}
+\hline
+\multirow{4}{*}{Matrix} & Total communication & Total communication & Time of \\
+ & volume using & volume using & hypergraph \\
+ & row-by-row & hypergraph & partitioning \\
+ & partitioning & partitioning & in minutes \\ \hline \hline
+2cubes\_sphere & 25 360 543 & 240 679 & 68.98 \\
+ecology2 & 26 044 002 & 73 021 & 4.92 \\
+finan512 & 26 087 431 & 900 729 & 33.72 \\
+G3\_circuit & 31 912 003 & 5 366 774 & 11.63 \\
+shallow\_water2 & 25 105 108 & 60 899 & 5.06 \\
+thermal2 & 30 012 846 & 1 077 921 & 17.88 \\ \hline \hline
+cage13 & 28 254 282 & 3 845 440 & 196.45 \\
+crashbasis & 29 020 060 & 2 401 876 & 33.39 \\
+FEM\_3D\_thermal2 & 25 263 767 & 250 105 & 49.89 \\
+language & 27 291 486 & 1 537 835 & 9.07 \\
+poli\_large & 25 053 554 & 7 388 883 & 5.92 \\
+torso3 & 25 682 514 & 613 250 & 61.51 \\ \hline
+\end{tabular}
+\caption{Total communication volume on a cluster of 12 GPUs using row-by-row or hypergraph partitioning methods}
+\label{tab:09}
+\end{center}
+\end{table}
+
+
+%%--------------------%%
+%% SECTION 7 %%
+%%--------------------%%
+\section{Conclusion and perspectives}
+\label{sec:07}
+In this paper, we have aimed at harnessing the computing power of a GPU cluster for
+solving large sparse linear systems. We have implemented the parallel algorithm of the
+GMRES iterative method. We have used heterogeneous parallel programming based on the
+CUDA language to program the GPUs and the MPI parallel environment to distribute the
+computations between the GPU nodes of the cluster.
+
+The experiments have shown that solving large sparse linear systems is more efficient
+on a cluster of GPUs than on a cluster of CPUs. However, the efficiency of a GPU cluster
+is ensured only as long as the spatial and temporal locality of the data is well managed.
+The data dependency scheme on a GPU cluster is related to the sparse structure of the
+matrices (positions of the nonzero values) and to the number of computing nodes. We have
+shown that a large number of communications between the GPU computing nodes considerably
+affects the performances of the parallel GMRES algorithm on the GPU cluster. Therefore,
+we have proposed to reorder the columns of the sparse local sub-matrices on each GPU node
+and to use a compressed storage format for the sparse vectors involved in the parallel
+sparse matrix-vector multiplication. This solution allows the communication
+overheads to be minimized. In addition, we have shown that it is worthwhile to choose a partitioning method
+according to the structure of the sparse matrix. This reduces the total communication
+volume between the GPU computing nodes.
+
+In future works, it would be interesting to implement and study the scalability of the
+parallel GMRES algorithm on large GPU clusters (hundreds or thousands of GPUs) or on geographically
+distant GPU clusters. In this context, other methods, such as the multisplitting methods, might be used to
+reduce the communications and to improve the performances of the parallel GMRES algorithm.
+The recent GPU hardware and software architectures provide the GPU-Direct system, which allows
+two GPUs, placed in the same machine or in two remote machines, to exchange data without going through
+the CPUs. This improves the data transfers between GPUs. Finally, it would be interesting to implement
+other iterative methods on GPU clusters for solving large sparse linear or nonlinear systems.
+
+\paragraph{Acknowledgments}
+This paper is based upon work supported by the R\'egion de Franche-Comt\'e.
+
+\bibliography{bib}
+\bibliographystyle{abbrv}
+\end{document}