From 52171668da3276d5c04af49188a575e31974ad67 Mon Sep 17 00:00:00 2001
From: couturie
Date: Mon, 20 Jan 2014 13:57:57 +0100
Subject: [PATCH 1/1] 1st

---
 GMRES_Journal.tex | 943 ++++++++++++++++++++++++++++++++++++++++++++++
 1 file changed, 943 insertions(+)
 create mode 100644 GMRES_Journal.tex

diff --git a/GMRES_Journal.tex b/GMRES_Journal.tex
new file mode 100644
index 0000000..8382eea
--- /dev/null
+++ b/GMRES_Journal.tex
@@ -0,0 +1,943 @@
+\documentclass[11pt]{article}
+%\documentclass{acmconf}
+
+\usepackage[paper=a4paper,dvips,top=1.5cm,left=1.5cm,right=1.5cm,foot=1cm,bottom=1.5cm]{geometry}
+\usepackage{times}
+\usepackage{graphicx}
+\usepackage{amsmath}
+\usepackage{amsfonts}
+\usepackage{amssymb}
+\usepackage{amsthm}
+\usepackage{amsopn}
+\usepackage{xspace}
+\usepackage{array}
+\usepackage{epsfig}
+\usepackage{algorithmic}
+\usepackage[ruled,english,boxed,linesnumbered]{algorithm2e}
+\usepackage{url}
+\usepackage{mdwlist}
+\usepackage{multirow}
+
+\date{}
+
+\title{Parallel sparse linear solver with GMRES method using minimization techniques of communications for GPU clusters}
+
+\author{
+\textsc{Jacques M. Bahi}
+\qquad
+\textsc{Rapha\"el Couturier}\thanks{Contact author}
+\qquad
+\textsc{Lilia Ziane Khodja}
+\mbox{}\\ %
+\\
+FEMTO-ST Institute, University of Franche-Comte\\
+IUT Belfort-Montb\'eliard\\
+Rue Engel Gros, BP 527, 90016 Belfort, \underline{France}\\
+\mbox{}\\ %
+\normalsize
+\{\texttt{jacques.bahi},~\texttt{raphael.couturier},~\texttt{lilia.ziane\_khoja}\}\texttt{@univ-fcomte.fr}
+}
+
+\begin{document}
+
+\maketitle
+
+\begin{abstract}
+In this paper, we aim at exploiting the computing power of a GPU cluster for solving large sparse
+linear systems. We implement the parallel algorithm of the GMRES iterative method using the CUDA
+programming language and the MPI parallel environment. The experiments show that a GPU cluster
+is more efficient than a CPU cluster. In order to optimize performance, we use a compressed
+storage format for the sparse vectors and a hypergraph partitioning of the sparse matrix. These
+solutions improve the spatial and temporal locality of the data shared between the computing
+nodes of the GPU cluster.
+\end{abstract}
+
+
+
+%%--------------------%%
+%%     SECTION 1      %%
+%%--------------------%%
+\section{Introduction}
+\label{sec:01}
+Large sparse linear systems arise in most numerical scientific or industrial simulations.
+They model numerous complex problems in different areas of application such as mathematics,
+engineering, biology or physics~\cite{ref18}. However, solving these systems of equations is
+often expensive in terms of execution time and memory consumption. Indeed, the linear systems
+arising in most applications are very large and have many zero coefficients, and this sparse
+structure leads to irregular memory accesses when loading the nonzero coefficients.
+
+Parallel computing has become a key issue for solving large sparse linear systems.
+This is due to the computing power and the storage capacity of current parallel computers, as
+well as the availability of parallel programming languages and environments such as the
+MPI communication standard. Nowadays, graphics processing units (GPUs) are the most commonly used
+hardware accelerators in high performance computing. They are equipped with a massively parallel
+architecture that allows them to compute faster than CPUs.
+However, parallel computers equipped
+with GPUs introduce new programming difficulties to adapt parallel algorithms to their architectures.
+
+In this paper, we use the GMRES iterative method for solving large sparse linear systems on a cluster
+of GPUs. The parallel algorithm of this method is implemented using the CUDA programming language for
+the GPUs and the MPI parallel environment to distribute the computations between the different GPU nodes
+of the cluster. In particular, we focus on improving the performance of the parallel sparse matrix-vector
+multiplication. Indeed, this operation is not only very time-consuming but it also requires communications
+between the GPU nodes. These communications are needed to build the global vector involved in
+the parallel sparse matrix-vector multiplication. It should be noted that a communication between two
+GPU nodes involves data transfers between the GPU and CPU memories of the same node as well as MPI
+communications between the CPUs of the GPU nodes. For performance purposes, we propose to use a compressed
+storage format to reduce the size of the vectors to be exchanged between the GPU nodes and a hypergraph
+partitioning of the sparse matrix to reduce the total communication volume.
+
+The present paper is organized as follows. In Section~\ref{sec:02}, some previous works about solving
+sparse linear systems on GPUs are presented. Section~\ref{sec:03} gives a general overview of GPU architectures,
+followed by a presentation of the GMRES method in Section~\ref{sec:04}. In Section~\ref{sec:05}, the main key points
+of the parallel implementation of the GMRES method on a GPU cluster are described. Finally, Section~\ref{sec:06}
+presents the performance improvements of the parallel GMRES algorithm on a GPU cluster.
+
+
+%%--------------------%%
+%%     SECTION 2      %%
+%%--------------------%%
+\section{Related work}
+\label{sec:02}
+Numerous works have shown the efficiency of GPUs for solving sparse linear systems compared
+to their CPU counterparts. Different iterative methods have been implemented on a single GPU, for example
+Jacobi and Gauss-Seidel in~\cite{refa}, conjugate and biconjugate gradients in~\cite{refd,refe,reff,refj}
+and GMRES in~\cite{refb,refc,refg,refm}. In addition, some iterative methods have been implemented on
+shared memory multi-GPU machines, as in~\cite{refh,refi,refk,refl}. A more limited set of studies is
+devoted to the parallel implementation of iterative methods on distributed memory GPU clusters,
+as in~\cite{refn,refo,refp}.
+
+Traditionally, parallel iterative algorithms do not scale well on GPU clusters due to
+the significant cost of the communications between the computing nodes. Some authors have already
+studied how to reduce these communications. In~\cite{cev10}, the authors used hypergraph partitioning
+as a preprocessing step to the parallel conjugate gradient algorithm in order to reduce the inter-GPU
+communications over a GPU cluster. The sequential hypergraph partitioning method provided by the
+PaToH tool~\cite{Cata99} is used because of the small sizes of the sparse symmetric linear systems
+to be solved. In~\cite{refq}, a compression and decompression technique is proposed to reduce the
+communication overheads. This technique is applied to the shared vectors to be exchanged between
+the computing nodes. In~\cite{refr}, the authors studied the impact of asynchronism on parallel
+iterative algorithms on local GPU clusters.
Asynchronous communication primitives suppress some +synchronization barriers and allow overlap of communication and computation. In~\cite{refs}, a +communication reduction method is used for implementing finite element methods (FEM) on GPU clusters. +This method firstly uses the Reverse Cuthill-McKee reordering to reduce the total communication +volume. In addition, the performances of the parallel FEM algorithm are improved by overlapping +the communication with computation. + + +%%--------------------%% +%% SECTION 3 %% +%%--------------------%% +\section{{GPU} architectures} +\label{sec:03} +A GPU (Graphics processing unit) is a hardware accelerator for high performance computing. +Its hardware architecture is composed of hundreds of cores organized in several blocks called +\emph{streaming multiprocessors}. It is also equipped with a memory hierarchy. It has a set +of registers and a private read-write \emph{local memory} per core, a fast \emph{shared memory}, +read-only \emph{constant} and \emph{texture} caches per multiprocessor and a read-write +\emph{global memory} shared by all its multiprocessors. The new architectures (Fermi, Kepler, +etc) have also L1 and L2 caches to improve the accesses to the global memory. + +NVIDIA has released the CUDA platform (Compute Unified Device Architecture)~\cite{Nvi10} +which provides a high level GPGPU-based programming language (General-Purpose computing +on GPUs), allowing to program GPUs for general purpose computations. In CUDA programming +environment, all data-parallel and compute intensive portions of an application running +on the CPU are off-loaded onto the GPU. Indeed, an application developed in CUDA is a +program written in C language (or Fortran) with a minimal set of extensions to define +the parallel functions to be executed by the GPU, called \emph{kernels}. We define kernels, +as separate functions from those of the CPU, by assigning them a function type qualifiers +\verb+__global__+ or \verb+__device__+. + +At the GPU level, the same kernel is executed by a large number of parallel CUDA threads +grouped together as a grid of thread blocks. Each multiprocessor of the GPU executes one +or more thread blocks in SIMD fashion (Single Instruction, Multiple Data) and in turn each +core of a GPU multiprocessor runs one or more threads within a block in SIMT fashion (Single +Instruction, Multiple threads). In order to maximize the occupation of the GPU cores, the +number of CUDA threads to be involved in a kernel execution is computed according to the +size of the problem to be solved. In contrast, the block size is restricted by the limited +memory resources of a core. On current GPUs, a thread block may contain up-to $1,024$ concurrent +threads. At any given clock cycle, the threads execute the same instruction of a kernel, +but each of them operates on different data. Moreover, threads within a block can cooperate +by sharing data through the fast shared memory and coordinate their execution through +synchronization points. In contrast, within a grid of thread blocks, there is no synchronization +at all between blocks. + +GPUs only work on data filled in their global memory and the final results of their kernel +executions must be communicated to their hosts (CPUs). Hence, the data must be transferred +\emph{in} and \emph{out} of the GPU. However, the speed of memory copy between the CPU and +the GPU is slower than the memory copy speed of GPUs. 
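+
+To make these notions more concrete, the following sketch shows a minimal CUDA code that scales a
+vector on the GPU: a kernel declared with \verb+__global__+, a launch configuration computed from
+the problem size, and the explicit copies between the host and device memories. It is only an
+illustrative example (the vector size is arbitrary), not a kernel of our GMRES solver.
+\begin{verbatim}
+#include <stdlib.h>
+#include <cuda_runtime.h>
+
+/* each thread scales one element of the vector */
+__global__ void scale(double *x, double alpha, int n)
+{
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  if (i < n) x[i] = alpha * x[i];
+}
+
+int main(void)
+{
+  int n = 1 << 20;                             /* arbitrary vector size  */
+  size_t size = n * sizeof(double);
+  double *h_x = (double *) malloc(size);
+  for (int i = 0; i < n; i++) h_x[i] = 1.0;
+
+  double *d_x;
+  cudaMalloc((void **) &d_x, size);
+  cudaMemcpy(d_x, h_x, size, cudaMemcpyHostToDevice);   /* transfer in  */
+
+  int threads = 512;                           /* threads per block      */
+  int blocks = (n + threads - 1) / threads;    /* enough blocks for n    */
+  scale<<<blocks, threads>>>(d_x, 2.0, n);
+
+  cudaMemcpy(h_x, d_x, size, cudaMemcpyDeviceToHost);   /* transfer out */
+  cudaFree(d_x);
+  free(h_x);
+  return 0;
+}
+\end{verbatim}
+Both calls to \verb+cudaMemcpy()+ in this sketch cross the PCI-Express bus between the host and the device.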
+Accordingly, it is necessary to limit
+the transfer of data between the GPU and its host.
+
+
+%%--------------------%%
+%%     SECTION 4      %%
+%%--------------------%%
+\section{{GMRES} method}
+\label{sec:04}
+The generalized minimal residual method (GMRES) is an iterative method designed by Saad
+and Schultz in 1986~\cite{Saa86}. It is a generalization of the minimal residual method
+(MINRES)~\cite{Pai75}, which handles indefinite symmetric problems, to nonsymmetric and
+non-Hermitian problems.
+
+Let us consider the following sparse linear system of $n$ equations:
+\begin{equation}
+  Ax = b,
+\label{eq:01}
+\end{equation}
+where $A\in\mathbb{R}^{n\times n}$ is a sparse square and nonsingular matrix, $x\in\mathbb{R}^n$
+is the solution vector and $b\in\mathbb{R}^n$ is the right-hand side vector. The main idea of
+the GMRES method is to find a sequence of solutions $\{x_k\}_{k\in\mathbb{N}}$ which minimizes the
+residual $r_k=b-Ax_k$. The solution $x_k$ is computed in a Krylov sub-space $\mathcal{K}_k(A,v_1)$:
+\begin{equation}
+\begin{array}{ll}
+  \mathcal{K}_{k}(A,v_{1}) \equiv \text{span}\{v_{1}, Av_{1}, A^{2}v_{1},..., A^{k-1}v_{1}\}, & v_{1}=\frac{r_{0}}{\|r_{0}\|_{2}},
+\end{array}
+\end{equation}
+such that the Petrov-Galerkin condition is satisfied:
+\begin{equation}
+  r_{k} \perp A\mathcal{K}_{k}(A, v_{1}).
+\end{equation}
+The GMRES method uses the Arnoldi process~\cite{Arn51} to build an orthonormal basis $V_k$
+and an upper Hessenberg matrix $\bar{H}_k$ of order $(k+1)\times k$:
+\begin{equation}
+\begin{array}{lll}
+  V_{k}=\{v_{1},v_{2},...,v_{k}\}, & \text{such that} & \text{span}\{v_{1},...,v_{k}\}=\mathcal{K}_{k}(A,v_{1}),
+\end{array}
+\end{equation}
+and
+\begin{equation}
+  AV_{k} = V_{k+1}\bar{H}_{k}.
+\label{eq:02}
+\end{equation}
+Then at each iteration $k$, a solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_{k}$
+spanned by $V_k$ as follows:
+\begin{equation}
+\begin{array}{ll}
+  x_{k} = x_{0} + V_{k}y, & y\in\mathbb{R}^{k}.
+\end{array}
+\label{eq:03}
+\end{equation}
+From both equations~(\ref{eq:02}) and~(\ref{eq:03}), we can deduce:
+\begin{equation}
+\begin{array}{lll}
+  r_{k} & = & b - A (x_{0} + V_{k}y) \\
+        & = & r_{0} - AV_{k}y \\
+        & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
+        & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
+\end{array}
+\end{equation}
+where $\beta=\|r_{0}\|_{2}$ and $e_{1}=(1,0,...,0)$ is the first vector of the canonical basis of
+$\mathbb{R}^{k+1}$. The vector $y$ is computed in $\mathbb{R}^{k}$ so as to minimize the Euclidean
+norm of the residual $r_{k}$. This means that the following least-squares problem of size $k$ must
+be solved:
+\begin{equation}
+  \underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
+\end{equation}
+The solution of this problem is computed by means of the QR factorization of the Hessenberg matrix
+$\bar{H}_{k}$ using Givens rotations~\cite{Saa03,Saa86} such that:
+\begin{equation}
+\begin{array}{lll}
+  \bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
+\end{array}
+\end{equation}
+where $Q_{k}Q_{k}^{T}=I_{k+1}$ and $R_{k}$ is an upper triangular matrix.
+
+In exact arithmetic, the GMRES method finds the solution after at most $n$ iterations. So, the Krylov
+sub-space dimension can increase up to the size $n$ of the linear system, which may be very large.
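+To give an idea of the memory involved, storing a basis $V_{k}$ of $k$ vectors of size $n$ in double
+precision requires $8nk$ bytes; for instance, with $n=25\times 10^{6}$ (the size of the linear systems
+used in our experiments) and $k=200$, this already amounts to
+\begin{equation*}
+  8 \times 25\times 10^{6} \times 200 = 4\times 10^{10}~\text{bytes} = 40~\text{GB},
+\end{equation*}
+which is far beyond the $4$~GB of global memory of a single Tesla C1060 GPU.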
Then in order to avoid a large storage +of the orthonormal basis $V_{k}$, the Arnoldi process is restricted at $m\ll n$ iterations and restarted +with the last iterate $x_{m}$ as an initial guess to the next iteration. Moreover, GMRES sometimes +does not converge or takes too many iterations to satisfy the convergence criteria. Therefore, in +most cases, GMRES must contain a preconditioning step to improve its convergence. The preconditioning +technique replaces the system~(\ref{eq:01}) with the modified systems: +\begin{equation}M^{-1}Ax=M^{-1}b.\end{equation} + +Algorithm~\ref{alg:01} illustrates the main key points of the GMRES method with restarts. +The linear system to be solved in this algorithm is left-preconditioned where $M$ is the preconditioning +matrix. At each iteration $k$, the Arnoldi process is used (from line~$7$ to line~$17$ of algorithm~\ref{alg:01}) +to construct an orthonormal basis $V_m$ and a Hessenberg matrix $\bar{H}_m$ of order $(m+1)\times m$. +Then, the least-squares problem is solved (line~$18$) to find the vector $y\in\mathbb{R}^m$ which minimizes +the residual. Finally, the solution $x_m$ is computed in the Krylov sub-space spanned by $V_m$ (line~$19$). +In practice, the GMRES algorithm stops when the Euclidean norm of the residual is small enough and/or +the maximum number of iterations is reached. + +\begin{algorithm}[!h] + \SetAlgoLined + \Entree{$A$ (matrix), $b$ (vector), $M$ (preconditioning matrix), +$x_{0}$ (initial guess), $\varepsilon$ (tolerance threshold), $max$ (maximum number of iterations), +$m$ (number of iterations of the Arnoldi process)} + \Sortie{$x$ (solution vector)} + \BlankLine + $r_{0} \leftarrow M^{-1}(b - Ax_{0})$\; + $\beta \leftarrow \|r_{0}\|_{2}$\; + $\alpha \leftarrow \|M^{-1}b\|_{2}$\; + $convergence \leftarrow false$\; + $k \leftarrow 1$\; + \BlankLine + \While{$(\neg convergence)$}{ + $v_{1} \leftarrow r_{0} / \beta$\; + \For{$j=1$ {\bf to} $m$}{ + $w_{j} \leftarrow M^{-1}Av_{j}$\; + \For{$i=1$ {\bf to} $j$}{ + $h_{i,j} \leftarrow (w_{j},v_{i})$\; + $w_{j} \leftarrow w_{j} - h_{i,j} \times v_{i}$\; + } + $h_{j+1,j} \leftarrow \|w_{j}\|_{2}$\; + $v_{j+1} \leftarrow w_{j} / h_{j+1,j}$\; + } + \BlankLine + Put $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ Hessenberg matrix of order $(m+1)\times m$\; + Solve the least-squares problem of size $m$: $\underset{y\in\mathbb{R}^{m}}{min}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\; + \BlankLine + $x_{m} \leftarrow x_{0} + V_{m}y$\; + $r_{m} \leftarrow M^{-1}(b-Ax_{m})$\; + $\beta \leftarrow \|r_{m}\|_{2}$\; + \BlankLine + \eIf{$(\frac{\beta}{\alpha}<\varepsilon)$ {\bf or} $(k\geq max)$}{ + $convergence \leftarrow true$\; + }{ + $x_{0} \leftarrow x_{m}$\; + $r_{0} \leftarrow r_{m}$\; + $k \leftarrow k + 1$\; + } + } +\caption{Left-preconditioned GMRES algorithm with restarts} +\label{alg:01} +\end{algorithm} + + + +%%--------------------%% +%% SECTION 5 %% +%%--------------------%% +\section{Parallel GMRES method on {GPU} clusters} +\label{sec:05} + +\subsection{Parallel implementation on a GPU cluster} +\label{sec:05.01} +The implementation of the GMRES algorithm on a GPU cluster is performed by using +a parallel heterogeneous programming. We use the programming language CUDA for the +GPUs and the parallel environment MPI for the distribution of the computations between +the GPU computing nodes. In this work, a GPU computing node is composed of a GPU and +a CPU core managed by a MPI process. + +Let us consider a cluster composed of $p$ GPU computing nodes. 
+First, the sparse linear
+system~(\ref{eq:01}) is split into $p$ sub-linear systems, each of which is assigned to a GPU
+computing node. The sparse matrix $A$ and both vectors $x$ and
+$b$ are partitioned row-by-row into $p$ parts (see Figure~\ref{fig:01}). The data resulting from this
+partitioning are off-loaded to the GPU global memories to be processed by the GPUs. Then, all the
+computing nodes of the GPU cluster execute the same GMRES iterative algorithm but on
+different data. Finally, the GPU computing nodes synchronize their computations by using
+MPI communication routines to solve the global sparse linear system. In what follows,
+the computing nodes sharing data are called the neighboring nodes.
+
+\begin{figure}
+\centering
+  \includegraphics[width=80mm,keepaspectratio]{Figures/partition}
+\caption{Data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ in $4$ partitions}
+\label{fig:01}
+\end{figure}
+
+In order to exploit the computing power of the GPUs, we have to execute as many computations as
+possible on the GPUs, so as to avoid data transfers between each GPU and its host (CPU), and to
+maximize the utilization of the GPU cores in order to hide global memory access latency. The
+implementation of the GMRES algorithm is performed by executing the functions operating on vectors
+and matrices as kernels on GPUs. These operations are often easy to parallelize and more efficient
+on parallel architectures when they operate on large vectors. We use the fastest routines of the
+CUBLAS library (CUDA Basic Linear Algebra Subroutines) to implement the dot product (\verb+cublasDdot()+),
+the Euclidean norm (\verb+cublasDnrm2()+) and the AXPY operation (\verb+cublasDaxpy()+).
+In addition, we have coded in CUDA a kernel for the scalar-vector product (lines~$7$ and~$15$
+of Algorithm~\ref{alg:01}), a kernel for solving the least-squares problem (line~$18$) and a
+kernel for the solution vector updates (line~$19$).
+
+The solution of the least-squares problem in the GMRES algorithm is based on:
+\begin{itemize}
+  \item a QR factorization of the Hessenberg matrix $\bar{H}$ using plane rotations and,
+  \item a backward substitution to compute the vector $y$ minimizing the residual.
+\end{itemize}
+This operation is not easy to parallelize and would not benefit from a GPU implementation. However,
+the size $m$ of the linear least-squares problem to solve in the GMRES method with
+restarts is very small. So, this problem is solved sequentially by a single GPU thread.
+
+The most important operation in the GMRES method is the sparse matrix-vector multiplication.
+It is quite expensive for large matrices in terms of execution time and memory space. In
+addition, it performs irregular memory accesses to read the nonzero values of the sparse matrix,
+implying non-coalesced accesses to the GPU global memory which slow down the performance of
+the GPUs. So we use the HYB kernel developed and optimized by Nvidia~\cite{CUSP} which gives on
+average the best performance in sparse matrix-vector multiplications on GPUs~\cite{Bel09}. The
+HYB (Hybrid) storage format is the combination of two sparse storage formats: the Ellpack format
+(ELL) and the Coordinate format (COO). It stores a typical number of nonzero values per row in the ELL
+format and the remaining entries of exceptional rows in the COO format. It combines the efficiency of
+ELL, due to the regularity of its memory access pattern, with the flexibility of COO, which is insensitive
+to the matrix structure.
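+
+As an illustration of this storage scheme, the following C structure and sequential reference
+routine sketch how a HYB matrix can be laid out and multiplied by a vector. The field names are
+ours and the loop is a plain CPU version; the actual implementation relies on the optimized HYB
+kernel of~\cite{CUSP}, which stores the ELL part column-by-column to obtain coalesced accesses.
+\begin{verbatim}
+/* Illustrative HYB layout: an ELL part for the K "typical" nonzeros per
+   row and a COO part for the remaining entries of exceptional rows.    */
+typedef struct {
+  int     n_rows;    /* number of rows of the sparse matrix             */
+  int     ell_width; /* K: nonzeros stored per row in the ELL part      */
+  int    *ell_cols;  /* column indices, size n_rows * ell_width,
+                        padded with -1 when a row has fewer entries     */
+  double *ell_vals;  /* values, same size and padding as ell_cols       */
+  int     coo_nnz;   /* number of leftover entries stored in COO        */
+  int    *coo_rows;  /* row indices of the COO part                     */
+  int    *coo_cols;  /* column indices of the COO part                  */
+  double *coo_vals;  /* values of the COO part                          */
+} hyb_matrix;
+
+/* Sequential reference SpMV y = A*x for the layout above (the ELL part
+   is stored row-by-row here only for readability).                     */
+void hyb_spmv(const hyb_matrix *A, const double *x, double *y)
+{
+  for (int i = 0; i < A->n_rows; i++) {
+    double s = 0.0;
+    for (int k = 0; k < A->ell_width; k++) {
+      int j = A->ell_cols[i * A->ell_width + k];
+      if (j >= 0) s += A->ell_vals[i * A->ell_width + k] * x[j];
+    }
+    y[i] = s;
+  }
+  for (int k = 0; k < A->coo_nnz; k++)
+    y[A->coo_rows[k]] += A->coo_vals[k] * x[A->coo_cols[k]];
+}
+\end{verbatim}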
+ +In the parallel GMRES algorithm, the GPU computing nodes must exchange between them their shared data in +order to construct the global vector necessary to compute the parallel sparse matrix-vector +multiplication (SpMV). In fact, each computing node has locally the vector elements corresponding +to the rows of its sparse sub-matrix and, in order to compute its part of the SpMV, it also +requires the vector elements of its neighboring nodes corresponding to the column indices in +which its local sub-matrix has nonzero values. Consequently, each computing node manages a global +vector composed of a local vector of size $\frac{n}{p}$ and a shared vector of size $S$: +\begin{equation} + S = bw - \frac{n}{p}, +\label{eq:11} +\end{equation} +where $\frac{n}{p}$ is the size of the local vector and $bw$ is the bandwidth of the local sparse +sub-matrix which represents the number of columns between the minimum and the maximum column indices +(see Figure~\ref{fig:01}). In order to improve memory accesses, we use the texture memory to +cache elements of the global vector. + +On a GPU cluster, the exchanges of the shared vectors elements between the neighboring nodes are +performed as follows: +\begin{itemize} + \item at the level of the sending node: data transfers of the shared data from the GPU global memory +to the CPU memory by using the CUBLAS communication routine \verb+cublasGetVector()+, + \item data exchanges between the CPUs by the MPI communication routine \verb+MPI_Alltoallv()+ and, + \item at the level of the receiving node: data transfers of the received shared data from the CPU +memory to the GPU global memory by using CUBLAS communication routine \verb+cublasSetVector()+. +\end{itemize} + +\subsection{Experimentations} +\label{sec:05.02} +The experiments are done on a cluster composed of six machines interconnected by an Infiniband network +of $20$~GB/s. Each machine is a Xeon E5530 Quad-Core running at $2.4$~GHz. It provides $12$~GB of RAM +memory with a memory bandwidth of $25.6$~GB/s and it is equipped with two Tesla C1060 GPUs. Each GPU +is composed of $240$ cores running at $1.3$ GHz and has $4$~GB of global memory with a memory bandwidth +of $102$~GB/s. The GPU is connected to the CPU via a PCI-Express 16x Gen2.0 with a throughput of $8$~GB/s. +Figure~\ref{fig:02} shows the general scheme of the GPU cluster. + +\begin{figure}[!h] +\centering + \includegraphics[width=80mm,keepaspectratio]{Figures/clusterGPU} +\caption{A cluster composed of six machines, each equipped with two Tesla C1060 GPUs} +\label{fig:02} +\end{figure} + +Linux cluster version 2.6.18 OS is installed on the six machines. The C programming language is used for +coding the GMRES algorithm on both the CPU and the GPU versions. CUDA version 4.0~\cite{ref19} is used for programming +the GPUs, using CUBLAS library~\cite{ref37} to deal with the functions operating on vectors. Finally, MPI routines +of OpenMPI 1.3.3 are used to carry out the communication between the CPU cores. + +The experiments are done on linear systems associated to sparse matrices chosen from the Davis collection of the +university of Florida~\cite{Dav97}. They are matrices arising in real-world applications. Table~\ref{tab:01} shows +the main characteristics of these sparse matrices and Figure~\ref{fig:03} shows their sparse structures. For +each matrix, we give the number of rows (column~$3$ in Table~\ref{tab:01}), the number of nonzero values (column~$4$) +and the bandwidth (column~$5$). 
+ +\begin{table} +\begin{center} +\begin{tabular}{|c|c|r|r|r|} +\hline +Matrix type & Name & \# Rows & \# Nonzeros & Bandwidth \\\hline \hline +\multirow{6}{*}{Symmetric} & 2cubes\_sphere & 101 492 & 1 647 264 & 100 464 \\ + & ecology2 & 999 999 & 4 995 991 & 2 001 \\ + & finan512 & 74 752 & 596 992 & 74 725 \\ + & G3\_circuit & 1 585 478 & 7 660 826 & 1 219 059 \\ + & shallow\_water2 & 81 920 & 327 680 & 58 710 \\ + & thermal2 & 1 228 045 & 8 580 313 & 1 226 629 \\ \hline \hline +\multirow{6}{*}{Asymmetric} & cage13 & 445 315 & 7 479 343 & 318 788 \\ + & crashbasis & 160 000 & 1 750 416 & 120 202 \\ + & FEM\_3D\_thermal2 & 147 900 & 3 489 300 & 117 827 \\ + & language & 399 130 & 1 216 334 & 398 622 \\ + & poli\_large & 15 575 & 33 074 & 15 575 \\ + & torso3 & 259 156 & 4 429 042 & 216 854 \\ \hline +\end{tabular} +\caption{Main characteristics of the sparse matrices chosen from the Davis collection} +\label{tab:01} +\end{center} +\end{table} + +\begin{figure} +\centering + \includegraphics[width=120mm,keepaspectratio]{Figures/matrices} +\caption{Structures of the sparse matrices chosen from the Davis collection} +\label{fig:03} +\end{figure} + +All the experiments are performed on double-precision data. The parameters of the parallel +GMRES algorithm are as follows: the tolerance threshold $\varepsilon=10^{-12}$, the maximum +number of iterations $max=500$, the Arnoldi process is limited to $m=16$ iterations, the elements +of the guess solution $x_0$ is initialized to $0$ and those of the right-hand side vector are +initialized to $1$. For simplicity sake, we chose the matrix preconditioning $M$ as the +main diagonal of the sparse matrix $A$. Indeed, it allows us to easily compute the required inverse +matrix $M^{-1}$ and it provides relatively good preconditioning in most cases. Finally, we set +the size of a thread-block in GPUs to $512$ threads. + +\begin{table}[!h] +\begin{center} +\begin{tabular}{|c|c|c|c|c|c|c|} +\hline +Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$ & $prec$ & $\Delta$ \\ \hline \hline +2cubes\_sphere & 0.234 s & 0.124 s & 1.88 & 21 & 2.10e-14 & 3.47e-18 \\ +ecology2 & 0.076 s & 0.035 s & 2.15 & 21 & 4.30e-13 & 4.38e-15 \\ +finan512 & 0.073 s & 0.052 s & 1.40 & 17 & 3.21e-12 & 5.00e-16 \\ +G3\_circuit & 1.016 s & 0.649 s & 1.56 & 22 & 1.04e-12 & 2.00e-15 \\ +shallow\_water2 & 0.061 s & 0.044 s & 1.38 & 17 & 5.42e-22 & 2.71e-25 \\ +thermal2 & 1.666 s & 0.880 s & 1.89 & 21 & 6.58e-12 & 2.77e-16 \\ \hline \hline +cage13 & 0.721 s & 0.338 s & 2.13 & 26 & 3.37e-11 & 2.66e-15 \\ +crashbasis & 1.349 s & 0.830 s & 1.62 & 121 & 9.10e-12 & 6.90e-12 \\ +FEM\_3D\_thermal2 & 0.797 s & 0.419 s & 1.90 & 64 & 3.87e-09 & 9.09e-13 \\ +language & 2.252 s & 1.204 s & 1.87 & 90 & 1.18e-10 & 8.00e-11 \\ +poli\_large & 0.097 s & 0.095 s & 1.02 & 69 & 4.98e-11 & 1.14e-12 \\ +torso3 & 4.242 s & 2.030 s & 2.09 & 175 & 2.69e-10 & 1.78e-14 \\ \hline +\end{tabular} +\caption{Performances of the parallel GMRES algorithm on a cluster of 24 CPU cores vs. a cluster of 12 GPUs} +\label{tab:02} +\end{center} +\end{table} + +In Table~\ref{tab:02}, we give the performances of the parallel GMRES algorithm for solving the linear +systems associated to the sparse matrices shown in Table~\ref{tab:01}. The second and third columns show +the execution times in seconds obtained on a cluster of 24 CPU cores and on a cluster of 12 GPUs, respectively. 
+The fourth column shows the ratio $\tau$ between the CPU time $Time_{cpu}$ and the GPU time $Time_{gpu}$, which
+is computed as follows:
+\begin{equation}
+  \tau = \frac{Time_{cpu}}{Time_{gpu}}.
+\end{equation}
+From these ratios, we can notice that using several GPUs is not worthwhile for solving small sparse linear
+systems. Solving these sparse linear systems on a cluster of 12 GPUs is only about as fast as on a cluster of 24 CPU
+cores. Indeed, the small sizes of the sparse matrices do not allow the GPU cores of the cluster to be fully
+utilized. The fifth, sixth and seventh columns show, respectively, the number of iterations performed
+by the parallel GMRES algorithm to converge, the precision of the solution, $prec$, computed on the GPU
+cluster and the difference, $\Delta$, between the solutions computed on the CPU and the GPU clusters. The
+last two parameters are used to validate the results obtained on the GPU cluster and they are computed as
+follows:
+\begin{equation}
+\begin{array}{c}
+  prec = \|M^{-1}(b-Ax^{gpu})\|_{\infty}, \\
+  \Delta = \|x^{cpu}-x^{gpu}\|_{\infty},
+\end{array}
+\end{equation}
+where $x^{cpu}$ and $x^{gpu}$ are the solutions computed, respectively, on the CPU cluster and on the GPU cluster.
+We can see that the precision of the solutions computed on the GPU cluster is sufficient, about $10^{-10}$,
+and that the parallel GMRES algorithm computes almost the same solutions on the CPU and GPU clusters, with $\Delta$
+varying from $10^{-25}$ to $10^{-11}$.
+
+Next, we evaluate the performance of the parallel GMRES algorithm for solving large linear systems. We have
+developed, in the C programming language, a generator of large sparse matrices having a band structure, which arises
+in most numerical problems. This generator uses the sparse matrices of the Davis collection as the initial
+matrices to build the large band matrices. It is executed in parallel by all the MPI processes of the cluster
+so that each process constructs its own sub-matrix as a rectangular block of the global sparse matrix. Each process
+$i$ computes the size $n_i$ and the offset $offset_i$ of its sub-matrix in the global sparse matrix according to the
+size $n$ of the linear system to be solved and the number of GPU computing nodes $p$ as follows:
+\begin{equation}
+  n_i = \frac{n}{p},
+\end{equation}
+\begin{equation}
+  offset_i = \left\{
+    \begin{array}{l}
+      0\mbox{~if~}i=0,\\
+      offset_{i-1}+n_{i-1}\mbox{~otherwise.}
+    \end{array}
+  \right.
+\end{equation}
+Each process $i$ then makes several copies of the same initial matrix chosen from the Davis collection and
+places all these copies on the main diagonal of the global matrix in order to construct a band matrix. Moreover,
+it fills the empty spaces between two successive copies with smaller copies, \textit{lower\_copy} and \textit{upper\_copy},
+of the same initial matrix. Figure~\ref{fig:04} shows the generation of a sparse band matrix by four computing nodes.
+
+\begin{figure}
+\centering
+  \includegraphics[width=100mm,keepaspectratio]{Figures/generation}
+\caption{Example of the generation of a large sparse band matrix by four computing nodes.}
+\label{fig:04}
+\end{figure}
+
+Table~\ref{tab:03} shows the main characteristics (the number of nonzero values and the bandwidth) of the
+large sparse matrices generated from those of the Davis collection. These matrices are associated to
+linear systems of 25 million unknowns (each generated sparse matrix has 25 million rows).
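+
+As a side note, the block size and offset computation performed by each MPI process of this generator
+can be sketched as follows; the variable names are illustrative and do not correspond to our actual code.
+\begin{verbatim}
+#include <mpi.h>
+
+int main(int argc, char **argv)
+{
+  int rank, p;
+  long long n = 25000000LL;        /* total number of rows (25 million)  */
+
+  MPI_Init(&argc, &argv);
+  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+  MPI_Comm_size(MPI_COMM_WORLD, &p);
+
+  long long n_i = n / p;           /* rows owned by process i = rank     */
+  /* offset_i = offset_{i-1} + n_{i-1}, which reduces to rank * n_i when
+     every process owns the same number of rows                          */
+  long long offset_i = rank * n_i;
+
+  /* ... process i then stacks copies of the initial Davis matrix on the
+     diagonal block of rows [offset_i, offset_i + n_i) ...               */
+
+  MPI_Finalize();
+  return 0;
+}
+\end{verbatim}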
In Table~\ref{tab:04} +we show the performances of the parallel GMRES algorithm for solving large linear systems associated to the +sparse band matrices of Table~\ref{tab:03}. The fourth column gives the ratio between the execution time +spent on a cluster of 24 CPU cores and that spent on a cluster of 12 GPUs. We can notice from these ratios +that for solving large sparse matrices the GPU cluster is more efficient (about 5 times faster) than the CPU +cluster. The computing power of the GPUs allows to accelerate the computation of the functions operating +on large vectors of the parallel GMRES algorithm. + +\begin{table}[!h] +\begin{center} +\begin{tabular}{|c|c|r|r|} +\hline +Matrix type & Name & \# nonzeros & Bandwidth \\ \hline \hline +\multirow{6}{*}{Symmetric} & 2cubes\_sphere & 413 703 602 & 198 836 \\ + & ecology2 & 124 948 019 & 2 002 \\ + & finan512 & 278 175 945 & 123 900 \\ + & G3\_circuit & 125 262 292 & 1 891 887 \\ + & shallow\_water2 & 100 235 292 & 62 806 \\ + & thermal2 & 175 300 284 & 2 421 285 \\ \hline \hline +\multirow{6}{*}{Asymmetric} & cage13 & 435 770 480 & 352 566 \\ + & crashbasis & 409 291 236 & 200 203 \\ + & FEM\_3D\_thermal2 & 595 266 787 & 206 029 \\ + & language & 76 912 824 & 398 626 \\ + & poli\_large & 53 322 580 & 15 576 \\ + & torso3 & 433 795 264 & 328 757 \\ \hline +\end{tabular} +\caption{Main characteristics of the sparse and band matrices generated from the sparse matrices of the Davis collection.} +\label{tab:03} +\end{center} +\end{table} + + +\begin{table}[!h] +\begin{center} +\begin{tabular}{|c|c|c|c|c|c|c|} +\hline +Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$& $prec$ & $\Delta$ \\ \hline \hline +2cubes\_sphere & 3.683 s & 0.870 s & 4.23 & 21 & 2.11e-14 & 8.67e-18 \\ +ecology2 & 2.570 s & 0.424 s & 6.06 & 21 & 4.88e-13 & 2.08e-14 \\ +finan512 & 2.727 s & 0.533 s & 5.11 & 17 & 3.22e-12 & 8.82e-14 \\ +G3\_circuit & 4.656 s & 1.024 s & 4.54 & 22 & 1.04e-12 & 5.00e-15 \\ +shallow\_water2 & 2.268 s & 0.384 s & 5.91 & 17 & 5.54e-21 & 7.92e-24 \\ +thermal2 & 4.650 s & 1.130 s & 4.11 & 21 & 8.89e-12 & 3.33e-16 \\ \hline \hline +cage13 & 6.068 s & 1.054 s & 5.75 & 26 & 3.29e-11 & 1.59e-14 \\ +crashbasis & 25.906 s & 4.569 s & 5.67 & 135 & 6.81e-11 & 4.61e-15 \\ +FEM\_3D\_thermal2 & 13.555 s & 2.654 s & 5.11 & 64 & 3.88e-09 & 1.82e-12 \\ +language & 13.538 s & 2.621 s & 5.16 & 89 & 2.11e-10 & 1.60e-10 \\ +poli\_large & 8.619 s & 1.474 s & 5.85 & 69 & 5.05e-11 & 6.59e-12 \\ +torso3 & 35.213 s & 6.763 s & 5.21 & 175 & 2.69e-10 & 2.66e-14 \\ \hline +\end{tabular} +\caption{Performances of the parallel GMRES algorithm for solving large sparse linear systems associated +to band matrices on a cluster of 24 CPU cores vs. a cluster of 12 GPUs.} +\label{tab:04} +\end{center} +\end{table} + + +%%--------------------%% +%% SECTION 6 %% +%%--------------------%% +\section{Minimization of communications} +\label{sec:06} +The parallel sparse matrix-vector multiplication requires data exchanges between the GPU computing nodes +to construct the global vector. However, a GPU cluster requires communications between the GPU nodes and the +data transfers between the GPUs and their hosts CPUs. In fact, a communication between two GPU nodes implies: +a data transfer from the GPU memory to the CPU memory at the sending node, a MPI communication between the CPUs +of two GPU nodes, and a data transfer from the CPU memory to the GPU memory at the receiving node. 
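+For clarity, this three-step exchange can be sketched as follows, using the CUBLAS and MPI routines
+already mentioned in Section~\ref{sec:05.01}; the wrapper function, the buffer names and the counts
+are ours, and CUBLAS is assumed to be already initialized.
+\begin{verbatim}
+#include <mpi.h>
+#include <cublas.h>
+
+/* Exchange of the shared vector elements between the GPU nodes. The host
+   and device buffers, the counts and the displacements are assumed to
+   have been prepared by the partitioning step.                          */
+void exchange_shared_vector(const double *d_send, double *h_send,
+                            double *h_recv, double *d_recv,
+                            int n_send, int n_recv,
+                            int *send_counts, int *send_displs,
+                            int *recv_counts, int *recv_displs)
+{
+  /* (1) sending node: GPU global memory -> CPU memory */
+  cublasGetVector(n_send, sizeof(double), d_send, 1, h_send, 1);
+
+  /* (2) exchange between the CPUs of the GPU nodes */
+  MPI_Alltoallv(h_send, send_counts, send_displs, MPI_DOUBLE,
+                h_recv, recv_counts, recv_displs, MPI_DOUBLE,
+                MPI_COMM_WORLD);
+
+  /* (3) receiving node: CPU memory -> GPU global memory */
+  cublasSetVector(n_recv, sizeof(double), h_recv, 1, d_recv, 1);
+}
+\end{verbatim}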
Moreover, +the data transfers between a GPU and a CPU are considered as the most expensive communications on a GPU cluster. +For example in our GPU cluster, the data throughput between a GPU and a CPU is of 8 GB/s which is about twice +lower than the data transfer rate between CPUs (20 GB/s) and 12 times lower than the memory bandwidth of the +GPU global memory (102 GB/s). In this section, we propose two solutions to improve the execution time of the +parallel GMRES algorithm on GPU clusters. + +\subsection{Compressed storage format of the sparse vectors} +\label{sec:06.01} +In Section~\ref{sec:05.01}, the SpMV multiplication uses a global vector having a size equivalent to the matrix +bandwidth (see Formula~\ref{eq:11}). However, we can notice that a SpMV multiplication does not often need all +the vector elements of the global vector composed of the local and shared sub-vectors. For example in Figure~\ref{fig:01}, + node 1 only needs a single vector element from node 0 (element 1), two elements from node 2 (elements 8 +and 9) and two elements from node 3 (elements 13 and 14). Therefore to reduce the communication overheads +of the unused vector elements, the GPU computing nodes must exchange between them only the vector elements necessary +to perform their local sparse matrix-vector multiplications. + +\begin{figure} +\centering + \includegraphics[width=120mm,keepaspectratio]{Figures/compress} +\caption{Example of data exchanges between node 1 and its neighbors 0, 2 and 3.} +\label{fig:05} +\end{figure} + +We propose to use a compressed storage format of the sparse global vector. In Figure~\ref{fig:05}, we show an +example of the data exchanges between node 1 and its neighbors to construct the compressed global vector. +First, the neighboring nodes 0, 2 and 3 determine the vector elements needed by node 1 and, then, they send +only these elements to it. Node 1 receives these shared elements in a compressed vector. However to compute +the sparse matrix-vector multiplication, it must first copy the received elements to the corresponding indices +in the global vector. In order to avoid this process at each iteration, we propose to reorder the columns of the +local sub-matrices so as to use the shared vectors in their compressed storage format (see Figure~\ref{fig:06}). +For performance purposes, the computation of the shared data to send to the neighboring nodes is performed +by the GPU as a kernel. In addition, we use the MPI point-to-point communication routines: a blocking send routine +\verb+MPI_Send()+ and a nonblocking receive routine \verb+MPI_Irecv()+. + +\begin{figure} +\centering + \includegraphics[width=100mm,keepaspectratio]{Figures/reorder} +\caption{Reordering of the columns of a local sparse matrix.} +\label{fig:06} +\end{figure} + +Table~\ref{tab:05} shows the performances of the parallel GMRES algorithm using the compressed storage format +of the sparse global vector. The results are obtained from solving large linear systems associated to the sparse +band matrices presented in Table~\ref{tab:03}. We can see from Table~\ref{tab:05} that the execution times +of the parallel GMRES algorithm on a cluster of 12 GPUs are improved by about 38\% compared to those presented +in Table~\ref{tab:04}. In addition, the ratios between the execution times spent on the cluster of 24 CPU cores +and those spent on the cluster of 12 GPUs have increased. 
Indeed, the reordering of the sparse sub-matrices and +the use of a compressed storage format for the sparse vectors minimize the communication overheads between the +GPU computing nodes. + +\begin{table}[!h] +\begin{center} +\begin{tabular}{|c|c|c|c|c|c|c|} +\hline +Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$& \# $iter$& $prec$ & $\Delta$ \\ \hline \hline +2cubes\_sphere & 3.597 s & 0.514 s & 6.99 & 21 & 2.11e-14 & 8.67e-18 \\ +ecology2 & 2.549 s & 0.288 s & 8.83 & 21 & 4.88e-13 & 2.08e-14 \\ +finan512 & 2.660 s & 0.377 s & 7.05 & 17 & 3.22e-12 & 8.82e-14 \\ +G3\_circuit & 3.139 s & 0.480 s & 6.53 & 22 & 1.04e-12 & 5.00e-15 \\ +shallow\_water2 & 2.195 s & 0.253 s & 8.68 & 17 & 5.54e-21 & 7.92e-24 \\ +thermal2 & 3.206 s & 0.463 s & 6.93 & 21 & 8.89e-12 & 3.33e-16 \\ \hline \hline +cage13 & 5.560 s & 0.663 s & 8.39 & 26 & 3.29e-11 & 1.59e-14 \\ +crashbasis & 25.802 s & 3.511 s & 7.35 & 135 & 6.81e-11 & 4.61e-15 \\ +FEM\_3D\_thermal2 & 13.281 s & 1.572 s & 8.45 & 64 & 3.88e-09 & 1.82e-12 \\ +language & 12.553 s & 1.760 s & 7.13 & 89 & 2.11e-10 & 1.60e-10 \\ +poli\_large & 8.515 s & 1.053 s & 8.09 & 69 & 5.05e-11 & 6.59e-12 \\ +torso3 & 31.463 s & 3.681 s & 8.55 & 175 & 2.69e-10 & 2.66e-14 \\ \hline +\end{tabular} +\caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse +vectors for solving large sparse linear systems associated to band matrices on a cluster of 24 CPU cores vs. +a cluster of 12 GPUs.} +\label{tab:05} +\end{center} +\end{table} + + +\subsection{Hypergraph partitioning} +\label{sec:06.02} +In this section, we use another structure of the sparse matrices. We are interested in sparse matrices +whose nonzero values are distributed along their large bandwidths. We developed in C programming +language a generator of sparse matrices having five bands (see Figure~\ref{fig:07}). The principle of +this generator is the same as the one presented in Section~\ref{sec:05.02}. However, the copies made from the +initial sparse matrix, chosen from the Davis collection, are placed on the main diagonal and on two +off-diagonals on the left and right of the main diagonal. Table~\ref{tab:06} shows the main characteristics +of sparse matrices of size 25 million of rows and generated from those of the Davis collection. We can +see in the fourth column that the bandwidths of these matrices are as large as their sizes. 
+ +\begin{figure} +\centering + \includegraphics[width=100mm,keepaspectratio]{Figures/generation_1} +\caption{Example of the generation of a large sparse matrix having five bands by four computing nodes.} +\label{fig:07} +\end{figure} + + +\begin{table}[!h] +\begin{center} +\begin{tabular}{|c|c|r|r|} +\hline +Matrix type & Name & \# nonzeros & Bandwidth \\ \hline \hline +\multirow{6}{*}{Symmetric} & 2cubes\_sphere & 829 082 728 & 24 999 999 \\ + & ecology2 & 254 892 056 & 25 000 000 \\ + & finan512 & 556 982 339 & 24 999 973 \\ + & G3\_circuit & 257 982 646 & 25 000 000 \\ + & shallow\_water2 & 200 798 268 & 25 000 000 \\ + & thermal2 & 359 340 179 & 24 999 998 \\ \hline \hline +\multirow{6}{*}{Asymmetric} & cage13 & 879 063 379 & 24 999 998 \\ + & crashbasis & 820 373 286 & 24 999 803 \\ + & FEM\_3D\_thermal2 & 1 194 012 703 & 24 999 998 \\ + & language & 155 261 826 & 24 999 492 \\ + & poli\_large & 106 680 819 & 25 000 000 \\ + & torso3 & 872 029 998 & 25 000 000 \\ \hline +\end{tabular} +\caption{Main characteristics of the sparse matrices having five band and generated from the sparse matrices of the Davis collection.} +\label{tab:06} +\end{center} +\end{table} + +In Table~\ref{tab:07} we give the performances of the parallel GMRES algorithm on the CPU and GPU +clusters for solving large linear systems associated to the sparse matrices shown in Table~\ref{tab:06}. +We can notice from the ratios given in the fourth column that solving sparse linear systems associated +to matrices having large bandwidth on the GPU cluster is as fast as on the CPU cluster. This is due +to the large total communication volume necessary to synchronize the computations over the cluster. +In fact, the naive partitioning row-by-row or column-by-column of this type of sparse matrices links +a GPU node to many neighboring nodes and produces a significant number of data dependencies between +the different GPU nodes. + +\begin{table}[!h] +\begin{center} +\begin{tabular}{|c|c|c|c|c|c|c|} +\hline +Matrix & $Time_{cpu}$ & $Time_{gpu}$ & $\tau$ & \# $iter$& $prec$ & $\Delta$ \\ \hline \hline +2cubes\_sphere & 15.963 s & 7.250 s & 2.20 & 58 & 6.23e-16 & 3.25e-19 \\ +ecology2 & 3.549 s & 2.176 s & 1.63 & 21 & 4.78e-15 & 1.06e-15 \\ +finan512 & 3.862 s & 1.934 s & 1.99 & 17 & 3.21e-14 & 8.43e-17 \\ +G3\_circuit & 4.636 s & 2.811 s & 1.65 & 22 & 1.08e-14 & 1.77e-16 \\ +shallow\_water2 & 2.738 s & 1.539 s & 1.78 & 17 & 5.54e-23 & 3.82e-26 \\ +thermal2 & 5.017 s & 2.587 s & 1.94 & 21 & 8.25e-14 & 4.34e-18 \\ \hline \hline +cage13 & 9.315 s & 3.227 s & 2.89 & 26 & 3.38e-13 & 2.08e-16 \\ +crashbasis & 35.980 s & 14.770 s & 2.43 & 127 & 1.17e-12 & 1.56e-17 \\ +FEM\_3D\_thermal2 & 24.611 s & 7.749 s & 3.17 & 64 & 3.87e-11 & 2.84e-14 \\ +language & 16.859 s & 9.697 s & 1.74 & 89 & 2.17e-12 & 1.70e-12 \\ +poli\_large & 10.200 s & 6.534 s & 1.56 & 69 & 5.14e-13 & 1.63e-13 \\ +torso3 & 49.074 s & 19.397 s & 2.53 & 175 & 2.69e-12 & 2.77e-16 \\ \hline +\end{tabular} +\caption{Performances of the parallel GMRES algorithm using a compressed storage format of the sparse +vectors for solving large sparse linear systems associated to matrices having five bands on a cluster +of 24 CPU cores vs. a cluster of 12 GPUs.} +\label{tab:07} +\end{center} +\end{table} + +We propose to use a hypergraph partitioning method which is well adapted to numerous structures +of sparse matrices~\cite{Cat99}. Indeed, it can well model the communications between the computing +nodes especially for the asymmetric and rectangular matrices. 
+It gives, in most cases, good reductions
+of the total communication volume. Nevertheless, it is more expensive in terms of execution time and
+memory consumption than graph-based partitioning methods.
+
+The sparse matrix $A$ of the linear system to be solved is modelled as a hypergraph
+$\mathcal{H}=(\mathcal{V},\mathcal{E})$ as follows:
+\begin{itemize}
+\item each matrix row $i$ ($0\leq i