-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-%% %%
-%% CHAPTER 12 %%
-%% %%
-%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-
-%\chapterauthor{}{}
-\chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France}
-\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
-\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
-
-\chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters}
-\label{ch12}
-
-%%--------------------------%%
-%% SECTION 1 %%
-%%--------------------------%%
-\section{Introduction}
-\label{ch12:sec:01}
Sparse linear systems are used to model many scientific and industrial problems,
such as environmental simulations or the industrial processing of complex or
non-Newtonian fluids. Moreover, the resolution of these problems often requires
solving such linear systems, which is considered the most expensive step in
terms of execution time and memory space. Therefore, solving sparse linear systems
must be as efficient as possible in order to deal with problems of ever increasing
size.
-
In the jargon of numerical analysis, the methods for solving sparse linear systems
fall into two classes: direct and iterative methods. However, iterative methods are
often more suitable than their direct counterparts for solving these systems. Indeed,
they consume less memory and are easier to parallelize on parallel computers than
direct methods. Different computing platforms, sequential and parallel computers, are
used for solving sparse linear systems with iterative solutions. Nowadays, graphics
processing units (GPUs) have become attractive for solving these systems, due to their
computing power and their ability to compute faster than traditional CPUs.
-
In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
we give the key points of the parallel implementation of both methods on a cluster of
GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
CPU cluster and on a GPU cluster for solving sparse linear systems associated with matrices
of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning
technique to reduce the total communication volume between the computing nodes and, thus,
to improve the execution times of the parallel algorithms of both iterative methods.
-
-
-%%--------------------------%%
-%% SECTION 2 %%
-%%--------------------------%%
-\section{Krylov iterative methods}
-\label{ch12:sec:02}
-Let us consider the following system of $n$ linear equations\index{Sparse~linear~system}
-in $\mathbb{R}$:
-\begin{equation}
-Ax=b,
-\label{ch12:eq:01}
-\end{equation}
where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$
is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side, and $n\in\mathbb{N}$ is a
large integer.
-
Iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01})
proceed by successive iterations of the same block of elementary operations, during which a
sequence of approximate solutions $\{x_k\}_{k\geq 0}$ is computed. Indeed, from an
initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate
solution $x_k$ which gradually converges to the exact solution $x^{*}$ as follows:
-\begin{equation}
-x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b.
-\label{ch12:eq:02}
-\end{equation}
The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand
and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$
after a fixed number of iterations and/or when a given convergence criterion\index{Convergence}
is satisfied:
-\begin{equation}
-\|b-A\tilde{x}\| < \varepsilon,
-\label{ch12:eq:03}
-\end{equation}
-where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}.
-
Among the iterative methods that have proven their efficiency for solving large sparse
linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}.
In the present chapter, we describe two widely used Krylov methods: the conjugate
gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
Krylov subspace methods are usually used with preconditioners that improve their
convergence. So, in what follows, the CG and GMRES methods are used for solving the left-preconditioned\index{Sparse~linear~system!Preconditioned}
sparse linear system:
-\begin{equation}
-M^{-1}Ax=M^{-1}b,
-\label{ch12:eq:11}
-\end{equation}
-where $M$ is the preconditioning matrix.
-
-
-%%****************%%
-%%****************%%
-\subsection{CG method}
-\label{ch12:sec:02.01}
The conjugate gradient method was initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}.
It is one of the best known iterative methods for solving large sparse linear systems. In addition, it
can be adapted for solving nonlinear equations and optimization problems. However, it can only be applied
to problems with symmetric positive-definite matrices.
-
-The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate
-solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as
-follows:
-\begin{equation}
-x_k \in x_0 + \mathcal{K}_k(A,r_0),
-\label{ch12:eq:04}
-\end{equation}
-such that the Galerkin condition\index{Galerkin~condition} must be satisfied:
-\begin{equation}
-r_k \bot \mathcal{K}_k(A,r_0),
-\label{ch12:eq:05}
-\end{equation}
-where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$
-the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
-In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$
-which are pairwise $A$-conjugate ($A$-orthogonal):
-\begin{equation}
-\begin{array}{ll}
-p_i^T A p_j = 0, & i\neq j.
-\end{array}
-\label{ch12:eq:06}
-\end{equation}
-At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows:
-\begin{equation}
-\begin{array}{ll}
-x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}.
-\end{array}
-\label{ch12:eq:07}
-\end{equation}
-Consequently, the residuals $r_k$ are computed in the same way:
-\begin{equation}
-r_k = r_{k-1} - \alpha_k A p_k.
-\label{ch12:eq:08}
-\end{equation}
-In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that
-the following recurrence holds:
-\begin{equation}
-\begin{array}{lll}
-p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}.
-\end{array}
-\label{ch12:eq:09}
-\end{equation}
-Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$
-over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure
that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and
the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow us to deduce that:
-\begin{equation}
-\begin{array}{ll}
-\alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}.
-\end{array}
-\label{ch12:eq:10}
-\end{equation}
-
-\begin{algorithm}[!t]
- Choose an initial guess $x_0$\;
- $r_{0} = b - A x_{0}$\;
- $convergence$ = false\;
- $k = 1$\;
- \Repeat{convergence}{
- $z_{k} = M^{-1} r_{k-1}$\;
- $\rho_{k} = (r_{k-1},z_{k})$\;
- \eIf{$k = 1$}{
- $p_{k} = z_{k}$\;
- }{
- $\beta_{k} = \rho_{k} / \rho_{k-1}$\;
- $p_{k} = z_{k} + \beta_{k} \times p_{k-1}$\;
- }
- $q_{k} = A \times p_{k}$\;
- $\alpha_{k} = \rho_{k} / (p_{k},q_{k})$\;
- $x_{k} = x_{k-1} + \alpha_{k} \times p_{k}$\;
- $r_{k} = r_{k-1} - \alpha_{k} \times q_{k}$\;
- \eIf{$(\rho_{k} < \varepsilon)$ {\bf or} $(k \geq maxiter)$}{
- $convergence$ = true\;
- }{
- $k = k + 1$\;
- }
- }
-\caption{Left-preconditioned CG method}
-\label{ch12:alg:01}
-\end{algorithm}
-
Algorithm~\ref{ch12:alg:01} shows the key points of the preconditioned CG method. It solves
the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}).
In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$.
At every iteration, a direction vector $p_k$ is determined from the preconditioned residual $z_k$,
so that it is $A$-conjugate to the direction vectors $\{p_i\}_{i<k}$ previously determined (from line~$8$ to
line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using
formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. In exact arithmetic, the CG method converges
after at most $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{Convergence!Tolerance~threshold}
$\varepsilon$ and/or the maximum number of iterations\index{Convergence!Maximum~number~of~iterations}
$maxiter$ is reached.
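
To make the recurrences~(\ref{ch12:eq:07})--(\ref{ch12:eq:10}) and Algorithm~\ref{ch12:alg:01} more
concrete, the following sequential C sketch applies the left-preconditioned CG method to a matrix
stored in CSR format, with the main diagonal of $A$ used as preconditioner $M$ (the choice also made
in Section~\ref{ch12:sec:04}). The CSR layout and the helper names are illustrative assumptions, not
part of our actual implementation.

\begin{verbatim}
/* Sequential C sketch of Algorithm 1 (left-preconditioned CG).
   The matrix A is stored in CSR format (rowptr, colind, val) and the
   preconditioner M is the main diagonal of A, stored in diag[].
   All names are illustrative; error handling is omitted. */
#include <stdlib.h>

static void spmv_csr(int n, const int *rowptr, const int *colind,
                     const double *val, const double *x, double *y) {
  for (int i = 0; i < n; i++) {
    double s = 0.0;
    for (int k = rowptr[i]; k < rowptr[i+1]; k++) s += val[k] * x[colind[k]];
    y[i] = s;
  }
}

static double dot(int n, const double *x, const double *y) {
  double s = 0.0;
  for (int i = 0; i < n; i++) s += x[i] * y[i];
  return s;
}

/* Returns the number of iterations performed. */
int cg_solve(int n, const int *rowptr, const int *colind, const double *val,
             const double *diag, const double *b, double *x,
             double eps, int maxiter) {
  double *r = malloc(n * sizeof(double)), *z = malloc(n * sizeof(double));
  double *p = malloc(n * sizeof(double)), *q = malloc(n * sizeof(double));
  double rho = 0.0, rho_old = 0.0;
  int k = 0;

  spmv_csr(n, rowptr, colind, val, x, q);            /* q = A x_0       */
  for (int i = 0; i < n; i++) r[i] = b[i] - q[i];    /* r_0 = b - A x_0 */

  while (k < maxiter) {
    k++;
    for (int i = 0; i < n; i++) z[i] = r[i] / diag[i];   /* z = M^{-1} r */
    rho = dot(n, r, z);                                  /* rho = (r, z) */
    if (k == 1)
      for (int i = 0; i < n; i++) p[i] = z[i];
    else {
      double beta = rho / rho_old;                       /* beta_k       */
      for (int i = 0; i < n; i++) p[i] = z[i] + beta * p[i];   /* XPAY   */
    }
    spmv_csr(n, rowptr, colind, val, p, q);              /* q = A p      */
    double alpha = rho / dot(n, p, q);                   /* alpha_k      */
    for (int i = 0; i < n; i++) x[i] += alpha * p[i];    /* x_k update   */
    for (int i = 0; i < n; i++) r[i] -= alpha * q[i];    /* r_k update   */
    if (rho < eps) break;            /* same stopping test as Algorithm 1 */
    rho_old = rho;
  }
  free(r); free(z); free(p); free(q);
  return k;
}
\end{verbatim}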
-
-
-%%****************%%
-%%****************%%
-\subsection{GMRES method}
-\label{ch12:sec:02.02}
The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can
be applied for solving both symmetric and nonsymmetric linear systems.
-
The main principle of the GMRES method\index{Iterative~method!GMRES} is to find an approximation that
minimizes the residual norm. In fact, GMRES computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in
a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows:
-\begin{equation}
-\begin{array}{ll}
-x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2},
-\end{array}
-\label{ch12:eq:12}
-\end{equation}
-so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied:
-\begin{equation}
-\begin{array}{ll}
-r_k \bot A \mathcal{K}_k(A, v_1).
-\end{array}
-\label{ch12:eq:13}
-\end{equation}
GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an
orthonormal basis $V_k$ of the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix}
$\bar{H}_k$ of order $(k+1)\times k$:
\begin{equation}
\begin{array}{ll}
V_k = \{v_1, v_2,\ldots,v_k\}, & \text{span}\{v_1, v_2,\ldots,v_k\}=\mathcal{K}_k(A,v_1),
\end{array}
\label{ch12:eq:14}
\end{equation}
and
\begin{equation}
A V_k = V_{k+1} \bar{H}_k.
\label{ch12:eq:15}
\end{equation}
-
-Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov subspace $\mathcal{K}_k$
-spanned by $V_k$ as follows:
-\begin{equation}
-\begin{array}{ll}
-x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}.
-\end{array}
-\label{ch12:eq:16}
-\end{equation}
-From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that:
-\begin{equation}
-\begin{array}{lll}
- r_{k} & = & b - A (x_{0} + V_{k}y) \\
- & = & r_{0} - AV_{k}y \\
- & = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
- & = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
-\end{array}
-\label{ch12:eq:17}
-\end{equation}
such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)^T$ is the first vector of the canonical basis of
$\mathbb{R}^{k+1}$. So, the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize the Euclidean
norm of the residual $r_k$. Consequently, a linear least-squares problem of size $k$ is solved:
-\begin{equation}
\underset{y\in\mathbb{R}^{k}}{\min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{\min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
-\label{ch12:eq:18}
-\end{equation}
-The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using
-Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that:
-\begin{equation}
-\begin{array}{lll}
-\bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
-\end{array}
-\label{ch12:eq:19}
-\end{equation}
where $Q_kQ_k^T=I_{k+1}$ and $R_k$ is an upper triangular matrix.
-
The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$
iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm
must construct and store in memory an orthonormal basis $V_k$ whose size is proportional to the
number of iterations required to achieve convergence. Therefore, to avoid a huge memory storage, the
GMRES method is restarted every $m$ iterations, where $m$ is very small ($m\ll n$), with $x_m$ as the
initial guess for the next restart. This limits the size of the basis $V$ to $m$ orthogonal vectors.
-
-\begin{algorithm}[!t]
- Choose an initial guess $x_0$\;
- $convergence$ = false\;
- $k = 1$\;
- $r_{0} = M^{-1}(b-Ax_{0})$\;
- $\beta = \|r_{0}\|_{2}$\;
- \While{$\neg convergence$}{
- $v_{1} = r_{0}/\beta$\;
- \For{$j=1$ \KwTo $m$}{
- $w_{j} = M^{-1}Av_{j}$\;
- \For{$i=1$ \KwTo $j$}{
- $h_{i,j} = (w_{j},v_{i})$\;
- $w_{j} = w_{j}-h_{i,j}v_{i}$\;
- }
- $h_{j+1,j} = \|w_{j}\|_{2}$\;
- $v_{j+1} = w_{j}/h_{j+1,j}$\;
- }
 Set $V_{m}=\{v_{j}\}_{1\leq j \leq m}$ and $\bar{H}_{m}=(h_{i,j})$ an $(m+1)\times m$ upper Hessenberg matrix\;
 Solve a least-squares problem of size $m$: $\min_{y\in\mathbb{R}^{m}}\|\beta e_{1}-\bar{H}_{m}y\|_{2}$\;
- $x_{m} = x_{0}+V_{m}y_{m}$\;
- $r_{m} = M^{-1}(b-Ax_{m})$\;
- $\beta = \|r_{m}\|_{2}$\;
- \eIf{ $(\beta<\varepsilon)$ {\bf or} $(k\geq maxiter)$}{
- $convergence$ = true\;
- }{
- $x_{0} = x_{m}$\;
- $r_{0} = r_{m}$\;
- $k = k + 1$\;
- }
- }
-\caption{Left-preconditioned GMRES method with restarts}
-\label{ch12:alg:02}
-\end{algorithm}
-
Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts.
It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear
system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration
$k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from
line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper
Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it
solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$
which minimizes the residual norm (line~$18$). Finally, it computes an approximate
solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is
stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the
maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$)
is reached.
-
-
-%%--------------------------%%
-%% SECTION 3 %%
-%%--------------------------%%
-\section{Parallel implementation on a GPU cluster}
-\label{ch12:sec:03}
In this section, we present the parallel algorithms of both the CG\index{Iterative~method!CG}
and GMRES\index{Iterative~method!GMRES} iterative methods for GPU clusters. The implementation is performed on
a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by an
MPI process and equipped with a GPU card. The parallelization of these algorithms is carried out by
using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the
CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods
are called iterative solvers.
-
-
-%%****************%%
-%%****************%%
-\subsection{Data partitioning}
-\label{ch12:sec:03.01}
The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning
between the computing nodes of the GPU cluster. Let $p$ denote the number of computing nodes on the
GPU cluster. The partitioning operation consists of decomposing the vectors and matrices involved
in the iterative solver into $p$ portions. Indeed, this operation assigns to each computing node
$i$:
\begin{itemize}
\item a portion of size $\frac{n}{p}$ elements of each vector,
\item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and,
\item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$,
\end{itemize}
where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive
row-wise partitioning (row-by-row decomposition) of the data of the sparse linear systems to be solved.
Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes
of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of $16$
unknown values.
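
As an illustration of this naive row-wise partitioning, the following C sketch computes the first row
and the number of rows assigned to a computing node $i$; the way the remainder rows are handled, when
$n$ is not divisible by $p$, is an assumption of this sketch.

\begin{verbatim}
/* Row-wise partitioning sketch: node i receives the rows
   [first, first+local) of the matrix and of each vector.
   The remainder rows (when p does not divide n) are spread
   over the first n%p nodes; this choice is an assumption. */
void row_range(int n, int p, int i, int *first, int *local) {
  int base = n / p, rem = n % p;
  *local = base + (i < rem ? 1 : 0);
  *first = i * base + (i < rem ? i : rem);
}
\end{verbatim}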
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}}
-\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.}
-\label{ch12:fig:01}
-\end{figure}
-
-
-%%****************%%
-%%****************%%
-\subsection{GPU computing}
-\label{ch12:sec:03.02}
After the partitioning operation, all the data resulting from this operation must be
transferred from the CPU memories to the GPU memories, in order to be processed by
the GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear
Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+
for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory
copies from the CPUs to the GPUs.
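
The following C fragment is a minimal sketch of these transfers with the legacy CUBLAS API; the names
\verb+local_n+, \verb+h_x+ and \verb+h_b+ denote the $\frac{n}{p}$ portion held by a node and are
illustrative, and error checking is omitted.

\begin{verbatim}
#include "cublas.h"

/* Allocate the device copies of the local sub-vectors and upload them
   with the legacy CUBLAS API. local_n, h_x and h_b are illustrative
   names for the n/p portion held by this node; cublasInit() is assumed
   to have been called once at program start-up. */
void upload_vectors(int local_n, const double *h_x, const double *h_b,
                    double **d_x, double **d_b) {
  cublasAlloc(local_n, sizeof(double), (void **)d_x);
  cublasAlloc(local_n, sizeof(double), (void **)d_b);
  cublasSetVector(local_n, sizeof(double), h_x, 1, *d_x, 1);
  cublasSetVector(local_n, sizeof(double), h_b, 1, *d_b, 1);
}
\end{verbatim}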
-
An efficient implementation of the CG and GMRES solvers on a GPU cluster requires
determining all parts of their codes that can be executed in parallel and, thus, take
advantage of the GPU acceleration. Like many Krylov subspace methods, the CG and GMRES
methods are mainly based on arithmetic operations dealing with vectors or matrices:
sparse matrix-vector multiplications, scalar-vector multiplications, dot products,
Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors
and $a$ is a scalar) and so on. These vector operations are often easy to parallelize
and they are more efficient on parallel computers when they work on large vectors.
Therefore, all the vector operations used in the CG and GMRES solvers must be executed
by the GPUs as kernels.
-
We use the kernels of the CUBLAS library to compute some vector operations of the CG and
GMRES solvers. The following CUBLAS kernels (operating on double-precision floating point
data) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the
Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of
the data-parallel operations, we code their kernels in CUDA. In the CG solver, we
develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in
Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector
multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel for
solving the least-squares problem and a kernel for updating the elements of the solution
vector $x$.
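
As an example, a CUDA kernel for the XPAY operation, which is not provided by CUBLAS, can be sketched
as follows; the kernel and wrapper names are ours and a thread block size of $512$ is assumed.

\begin{verbatim}
/* CUDA kernel for the XPAY operation y <- x + a*y (not provided by
   CUBLAS); the kernel and wrapper names are ours. */
__global__ void xpay_kernel(int n, double a, const double *x, double *y) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) y[i] = x[i] + a * y[i];
}

void xpay(int n, double a, const double *d_x, double *d_y) {
  int threads = 512;                       /* thread block size (assumed) */
  int blocks  = (n + threads - 1) / threads;
  xpay_kernel<<<blocks, threads>>>(n, a, d_x, d_y);
}
\end{verbatim}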
-
The least-squares problem in the GMRES method is solved by performing a QR factorization
of the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and,
then, solving the triangular system by backward substitutions to compute $y$. Consequently,
solving the least-squares problem on the GPU is not worthwhile. Indeed, the triangular
solves are not easy to parallelize and are inefficient on GPUs. However, the least-squares
problem to be solved in the GMRES method with restarts generally has a very small size $m$.
Therefore, we develop an inexpensive kernel which is executed sequentially by a
single CUDA thread.
-
The most important operation in the CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES}
methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication},
because it is often an expensive operation in terms of execution time and memory space.
Moreover, it requires taking care of the storage format of the sparse matrix in
memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
can cause a significant waste of memory space and execution time. In addition, the sparse
nature of the matrix often leads to irregular memory accesses to read the matrix nonzero
values. So, the computation of the SpMV multiplication on GPUs can involve noncoalesced
accesses to the global memory, which slows down its performance even more. One of the
most efficient compressed storage formats\index{Compressed~storage~format} of sparse
matrices on GPUs is the HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}.
It is a combination of the ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL}
format and the remaining entries of exceptional rows in COO format. It combines the efficiency
of ELL, due to the regularity of its memory accesses, and the flexibility of COO\index{Compressed~storage~format!COO},
which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
developed by Nvidia to implement the SpMV multiplication of the CG and GMRES methods on GPUs.
Moreover, to avoid noncoalesced accesses to the high-latency global memory, we bind
the elements of the iterate vector $x$ to the cached texture memory.
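
For illustration only (we actually use the HYB kernel of Nvidia), the following simplified CUDA kernel
sketches an SpMV multiplication in pure ELL format; since the ELL arrays are assumed to be stored
column-major, the threads of a warp, which handle consecutive rows, read consecutive memory locations,
so the accesses to the matrix data are coalesced.

\begin{verbatim}
/* Simplified SpMV kernel in pure ELL format (for illustration only; we
   actually use the Nvidia HYB kernel). The arrays indices[] and data[]
   are stored column-major, so the threads of a warp, which handle
   consecutive rows, read consecutive memory locations (coalesced). */
__global__ void spmv_ell(int num_rows, int max_nnz_per_row,
                         const int *indices, const double *data,
                         const double *x, double *y) {
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < num_rows) {
    double sum = 0.0;
    for (int c = 0; c < max_nnz_per_row; c++) {
      double val = data[c * num_rows + row];
      int    col = indices[c * num_rows + row];
      if (val != 0.0) sum += val * x[col];   /* padded entries hold 0.0 */
    }
    y[row] = sum;
  }
}
\end{verbatim}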
-
-
-%%****************%%
-%%****************%%
-\subsection{Data communications}
-\label{ch12:sec:03.03}
-All the computing nodes of the GPU cluster execute in parallel the same iterative solver
-(Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their
-own portions of the sparse linear system\index{Sparse~linear~system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$,
-$0\leq i<p$. However, in order to solve the complete sparse linear system~(\ref{ch12:eq:11}),
-synchronizations must be performed between the local computations of the computing nodes over
-the cluster. In what follows, two computing nodes sharing data are called neighboring nodes\index{Neighboring~node}.
-
-As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication.
-In the parallel implementation of the iterative methods, each computing node $i$ performs the
-SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it has only sub-vectors
-of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires
-the vector elements of its neighbors, corresponding to the column indices on which its sub-matrix
-has nonzero values (see Figure~\ref{ch12:fig:01}). So, in addition to the local vectors, each
-node must also manage vector elements shared with neighbors and required to compute the SpMV
-multiplication. Therefore, the iterate vector $x$ managed by each computing node is composed
-of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a sub-vector of shared elements $x^{shared}$.
-In the same way, the vector used to construct the orthonormal basis of the Krylov subspace (vectors
-$p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared
-sub-vector.
-
-Therefore, before computing the SpMV multiplication\index{SpMV~multiplication}, the neighboring
-nodes\index{Neighboring~node} over the GPU cluster must exchange between them the shared vector
-elements necessary to compute this multiplication. First, each computing node determines, in its
-local sub-vector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
-between them these shared vector elements. The data exchanges are implemented by using the MPI
-point-to-point communication routines: blocking\index{MPI~subroutines!Blocking} sends with \verb+MPI_Send()+
-and nonblocking\index{MPI~subroutines!Nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
-shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}
and \textit{Node 3}. In this example, the sparse matrix $A$ split between these four computing
nodes is the one presented in Figure~\ref{ch12:fig:01}.
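
The following C sketch illustrates this exchange; the neighbor list, the element counts and the packed
send/receive buffers are hypothetical names, and we assume that each send buffer has already been filled
with the local vector elements required by the corresponding neighbor.

\begin{verbatim}
#include <mpi.h>

/* Sketch of the exchange of shared vector elements before an SpMV.
   The neighbor list, the element counts and the packed send/receive
   buffers are illustrative names; at most 64 neighbors are assumed. */
void exchange_shared(int nb_neighbors, const int *neighbors,
                     const int *send_counts, double **send_bufs,
                     const int *recv_counts, double **recv_bufs) {
  MPI_Request reqs[64];
  /* post the nonblocking receives first */
  for (int i = 0; i < nb_neighbors; i++)
    MPI_Irecv(recv_bufs[i], recv_counts[i], MPI_DOUBLE, neighbors[i],
              0, MPI_COMM_WORLD, &reqs[i]);
  /* then perform the blocking sends */
  for (int i = 0; i < nb_neighbors; i++)
    MPI_Send(send_bufs[i], send_counts[i], MPI_DOUBLE, neighbors[i],
             0, MPI_COMM_WORLD);
  MPI_Waitall(nb_neighbors, reqs, MPI_STATUSES_IGNORE);
}
\end{verbatim}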
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}}
-\caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.}
-\label{ch12:fig:02}
-\end{figure}
-
-After the synchronization operation, the computing nodes receive, from their respective neighbors,
-the shared elements in a sub-vector stored in a compressed format. However, in order to compute the
-SpMV multiplication, the computing nodes operate on sparse global vectors (see Figure~\ref{ch12:fig:02}).
-In this case, the received vector elements must be copied to the corresponding indices in the global
vector. To avoid performing this copy at each iteration, we propose to reorder the columns of
each sub-matrix $\{A_i\}_{0\leq i<p}$, so that the shared sub-vectors can be used directly in their
compressed storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse sub-matrix (the
sub-matrix of \textit{Node 1}).
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/reorder}}
-\caption{Columns reordering of a sparse sub-matrix.}
-\label{ch12:fig:03}
-\end{figure}
-
A GPU cluster\index{GPU~cluster} is a parallel platform with distributed memory. So, the synchronizations
and data communications between GPU nodes are carried out by message passing. However, GPUs cannot communicate
with each other directly. Therefore, the CPUs, via MPI processes, are in charge of the synchronizations within the GPU
cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory
and vice-versa, before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, the GPU nodes perform reduction operations
to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI~subroutines!Global}
routine \verb+MPI_Allreduce()+.
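
For instance, a distributed dot product can be sketched as follows: each node computes the partial dot
product of its local sub-vectors on its GPU with \verb+cublasDdot()+, then the partial results are summed
over the cluster with \verb+MPI_Allreduce()+ (the wrapper name and the absence of error handling are
assumptions of this sketch).

\begin{verbatim}
#include <mpi.h>
#include "cublas.h"

/* Distributed dot product sketch: the partial dot product of the local
   sub-vectors is computed on the GPU, then the partial results of all
   the computing nodes are summed with MPI_Allreduce(). */
double parallel_dot(int local_n, const double *d_x, const double *d_y) {
  double local_dot  = cublasDdot(local_n, d_x, 1, d_y, 1);
  double global_dot = 0.0;
  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM,
                MPI_COMM_WORLD);
  return global_dot;
}
\end{verbatim}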
-
-
-
-%%--------------------------%%
-%% SECTION 4 %%
-%%--------------------------%%
-\section{Experimental results}
-\label{ch12:sec:04}
In this section, we present the performances of the parallel CG and GMRES linear solvers obtained
on a cluster of $12$ GPUs. This test GPU cluster is composed of six machines connected
by a $20$Gbps InfiniBand network. Each machine has a quad-core Xeon E5530 CPU running at $2.4$GHz and
providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are
connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A
Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and provides a global memory of $4$GB with
a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster}
that we used in the experimental tests.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
-\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
-\label{ch12:fig:04}
-\end{figure}
-
Linux, with kernel version 2.6.39, is installed on the CPUs. The parallel algorithms of both methods
are written in the C programming language. CUDA version 4.0~\cite{ch12:ref9}
is used for programming the GPUs, with the CUBLAS library~\cite{ch12:ref6} to deal with the vector operations
on the GPUs and, finally, the MPI routines of OpenMPI 1.3.3 are used to carry out the communications between
the CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
is managed by an MPI process and is composed of one CPU core and one GPU card.
-
All tests are performed on double-precision floating point data. The parameters of both linear
solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{Iterative~method!Arnoldi~process}
used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen
the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows us to easily
compute the required inverse matrix $M^{-1}$, and it provides relatively good preconditioning for
matrices that are not too ill-conditioned. On the GPUs, the size of the thread blocks is fixed at $512$
threads. Finally, the performance results presented hereafter are obtained from the mean value
over $10$ executions of the same parallel linear solver and for the same input data.
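
Since $M$ is the main diagonal of $A$, applying the preconditioner $z=M^{-1}r$ reduces to an element-wise
division, which can be sketched by the following CUDA kernel (the kernel name is illustrative):

\begin{verbatim}
/* With M chosen as the main diagonal of A, applying the preconditioner
   z = M^{-1} r is an element-wise division (the kernel name is ours). */
__global__ void jacobi_precond(int n, const double *diag,
                               const double *r, double *z) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n) z[i] = r[i] / diag[i];
}
\end{verbatim}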
-
To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices from the Davis's
collection~\cite{ch12:ref10}, which arise in a wide spectrum of real-world applications. We chose six
symmetric and six nonsymmetric sparse matrices from this collection. In Figure~\ref{ch12:fig:05},
we show the structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics,
which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
the first and the last nonzero value on a matrix row.
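
For a matrix stored in CSR format with sorted column indices (an assumption of this sketch), the maximal
bandwidth as defined above can be computed as follows:

\begin{verbatim}
/* Maximal bandwidth of a CSR matrix whose column indices are sorted
   within each row (assumption); the bandwidth of a row is counted as
   the number of columns from its first to its last nonzero, inclusive. */
int max_bandwidth(int n, const int *rowptr, const int *colind) {
  int bw = 0;
  for (int i = 0; i < n; i++) {
    if (rowptr[i+1] > rowptr[i]) {                   /* non-empty row */
      int width = colind[rowptr[i+1] - 1] - colind[rowptr[i]] + 1;
      if (width > bw) bw = width;
    }
  }
  return bw;
}
\end{verbatim}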
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
-\caption{Sketches of sparse matrices chosen from the Davis's collection.}
-\label{ch12:fig:05}
-\end{figure}
-
-\begin{table}
-\centering
-\begin{tabular}{|c|c|c|c|c|}
-\hline
-{\bf Matrix type} & {\bf Matrix name} & {\bf \# rows} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
-
-\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $101,492$ & $1,647,264$ & $100,464$ \\
-
- & ecology2 & $999,999$ & $4,995,991$ & $2,001$ \\
-
- & finan512 & $74,752$ & $596,992$ & $74,725$ \\
-
- & G3\_circuit & $1,585,478$ & $7,660,826$ & $1,219,059$ \\
-
- & shallow\_water2 & $81,920$ & $327,680$ & $58,710$ \\
-
- & thermal2 & $1,228,045$ & $8,580,313$ & $1,226,629$ \\ \hline \hline
-
-\multirow{6}{*}{Nonsymmetric} & cage13 & $445,315$ & $7,479,343$ & $318,788$\\
-
- & crashbasis & $160,000$ & $1,750,416$ & $120,202$ \\
-
 & FEM\_3D\_thermal2 & $147,900$ & $3,489,300$ & $117,827$ \\
-
- & language & $399,130$ & $1,216,334$ & $398,622$\\
-
- & poli\_large & $15,575$ & $33,074$ & $15,575$ \\
-
- & torso3 & $259,156$ & $4,429,042$ & $216,854$ \\ \hline
-\end{tabular}
-\vspace{0.5cm}
-\caption{Main characteristics of sparse matrices chosen from the Davis's collection.}
-\label{ch12:tab:01}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $0.132s$ & $0.069s$ & $1.93$ & $12$ & $1.14e$-$09$ & $3.47e$-$18$ \\
-
-ecology2 & $0.026s$ & $0.017s$ & $1.52$ & $13$ & $5.06e$-$09$ & $8.33e$-$17$ \\
-
-finan512 & $0.053s$ & $0.036s$ & $1.49$ & $12$ & $3.52e$-$09$ & $1.66e$-$16$ \\
-
-G3\_circuit & $0.704s$ & $0.466s$ & $1.51$ & $16$ & $4.16e$-$10$ & $4.44e$-$16$ \\
-
-shallow\_water2 & $0.017s$ & $0.010s$ & $1.68$ & $5$ & $2.24e$-$14$ & $3.88e$-$26$ \\
-
-thermal2 & $1.172s$ & $0.622s$ & $1.88$ & $15$ & $5.11e$-$09$ & $3.33e$-$16$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel CG method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
-\label{ch12:tab:02}
-\end{center}
-\end{table}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $0.234s$ & $0.124s$ & $1.88$ & $21$ & $2.10e$-$14$ & $3.47e$-$18$ \\
-
-ecology2 & $0.076s$ & $0.035s$ & $2.15$ & $21$ & $4.30e$-$13$ & $4.38e$-$15$ \\
-
-finan512 & $0.073s$ & $0.052s$ & $1.40$ & $17$ & $3.21e$-$12$ & $5.00e$-$16$ \\
-
-G3\_circuit & $1.016s$ & $0.649s$ & $1.56$ & $22$ & $1.04e$-$12$ & $2.00e$-$15$ \\
-
-shallow\_water2 & $0.061s$ & $0.044s$ & $1.38$ & $17$ & $5.42e$-$22$ & $2.71e$-$25$ \\
-
-thermal2 & $1.666s$ & $0.880s$ & $1.89$ & $21$ & $6.58e$-$12$ & $2.77e$-$16$ \\ \hline \hline
-
-cage13 & $0.721s$ & $0.338s$ & $2.13$ & $26$ & $3.37e$-$11$ & $2.66e$-$15$ \\
-
-crashbasis & $1.349s$ & $0.830s$ & $1.62$ & $121$ & $9.10e$-$12$ & $6.90e$-$12$ \\
-
-FEM\_3D\_thermal2 & $0.797s$ & $0.419s$ & $1.90$ & $64$ & $3.87e$-$09$ & $9.09e$-$13$ \\
-
-language & $2.252s$ & $1.204s$ & $1.87$ & $90$ & $1.18e$-$10$ & $8.00e$-$11$ \\
-
-poli\_large & $0.097s$ & $0.095s$ & $1.02$ & $69$ & $4.98e$-$11$ & $1.14e$-$12$ \\
-
-torso3 & $4.242s$ & $2.030s$ & $2.09$ & $175$ & $2.69e$-$10$ & $1.78e$-$14$ \\ \hline
-\end{tabular}
\caption{Performances of the parallel GMRES method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
-\label{ch12:tab:03}
-\end{center}
-\end{table}
-
Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} show the performances of the parallel
CG and GMRES solvers, respectively, for solving linear systems associated with the sparse
matrices presented in Table~\ref{ch12:tab:01}. They allow us to compare the performances
obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02}
shows only the performances for solving symmetric sparse linear systems, due to the inability
of the CG method to solve nonsymmetric systems. In both tables, the second and third
columns give, respectively, the execution times in seconds obtained on $24$ CPU cores
($Time_{cpu}$) and those obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account
the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same
solver implemented on the CPU cluster. The relative gains\index{Relative~gain}, presented
in the fourth column, are computed as the ratio of the CPU execution time to the GPU
execution time:
-\begin{equation}
-\tau = \frac{Time_{cpu}}{Time_{gpu}}.
-\label{ch12:eq:20}
-\end{equation}
In addition, Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} give the number of iterations
($iter$), the precision $prec$ of the solution computed on the GPU cluster and the difference
$\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster.
Both parameters $prec$ and $\Delta$ allow us to validate and verify the accuracy of the solution
computed on the GPU cluster. We have computed them as follows:
\begin{eqnarray}
\Delta = \max|x^{cpu}-x^{gpu}|,\\
prec = \max|M^{-1}r^{gpu}|,
\end{eqnarray}
where $\Delta$ is the maximum vector element, in absolute value, of the difference between
the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively, on the CPU and GPU clusters, and
$prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$
of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster
were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
to those computed on the CPU cluster, with a small difference ranging from $10^{-10}$ to $10^{-26}$.
However, we can notice from the relative gains $\tau$ that it is not worthwhile to use multiple
GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow the
GPU cores to be fully utilized. In addition, the communications required to synchronize the
computations over the cluster increase the idle times of the GPUs and further slow down the parallel
computations.
-
Consequently, in order to test the performances of the parallel solvers, we developed in the C programming
language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10}
as an initial matrix to construct large sparse matrices exceeding ten million rows. It must be executed
in parallel by the MPI processes of the computing nodes, so that each process can construct its own sparse
sub-matrix. In these first experimental tests, we focused on sparse matrices having a banded structure,
because they are those which arise in most numerical problems. To generate the global sparse matrix,
each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix
(see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies on the main
diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
initial matrix.
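
The placement of the copies along the main diagonal can be sketched in C as follows; the CSR arrays of the
initial matrix and the \verb+insert()+ callback used to write an entry of the global sub-matrix are
illustrative assumptions, and the left and right sub-copies filling the gaps are omitted.

\begin{verbatim}
/* Sketch of the diagonal placement performed by the generator: the rows
   assigned to an MPI process are filled with consecutive copies of the
   initial matrix (n0 rows, CSR arrays rowptr0/colind0/val0), each copy
   being shifted along the main diagonal. The insert() callback and the
   omission of the left/right sub-copies are assumptions of this sketch. */
void copy_on_diagonal(int first_row, int ncopies, int n0,
                      const int *rowptr0, const int *colind0,
                      const double *val0,
                      void (*insert)(int row, int col, double v)) {
  for (int c = 0; c < ncopies; c++) {
    int offset = first_row + c * n0;        /* diagonal offset of copy c */
    for (int i = 0; i < n0; i++)
      for (int k = rowptr0[i]; k < rowptr0[i+1]; k++)
        insert(offset + i, offset + colind0[k], val0[k]);
  }
}
\end{verbatim}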
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
-\caption{Parallel generation of a large sparse matrix by four computing nodes.}
-\label{ch12:fig:06}
-\end{figure}
-
-\begin{table}[!h]
-\centering
-\begin{tabular}{|c|c|c|c|}
-\hline
-{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
-
-\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $413,703,602$ & $198,836$ \\
-
- & ecology2 & $124,948,019$ & $2,002$ \\
-
- & finan512 & $278,175,945$ & $123,900$ \\
-
- & G3\_circuit & $125,262,292$ & $1,891,887$ \\
-
- & shallow\_water2 & $100,235,292$ & $62,806$ \\
-
- & thermal2 & $175,300,284$ & $2,421,285$ \\ \hline \hline
-
-\multirow{6}{*}{Nonsymmetric} & cage13 & $435,770,480$ & $352,566$ \\
-
- & crashbasis & $409,291,236$ & $200,203$ \\
-
- & FEM\_3D\_thermal2 & $595,266,787$ & $206,029$ \\
-
- & language & $76,912,824$ & $398,626$ \\
-
- & poli\_large & $53,322,580$ & $15,576$ \\
-
- & torso3 & $433,795,264$ & $328,757$ \\ \hline
-\end{tabular}
-\vspace{0.5cm}
-\caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
-\label{ch12:tab:04}
-\end{table}
-
We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
million unknown values. The sparse matrices associated with these linear systems are generated
from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} show the performances of the parallel CG and
GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
a GPU cluster is more efficient than on a CPU cluster (see the relative gains $\tau$). We can also
notice that the execution times of the CG method, whether on a CPU cluster or on a GPU cluster,
are better than those of the GMRES method for solving large symmetric linear systems. In fact, the
CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
time per iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
method requires more data exchanges between computing nodes than the parallel CG method.
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $1.625s$ & $0.401s$ & $4.05$ & $14$ & $5.73e$-$11$ & $5.20e$-$18$ \\
-
-ecology2 & $0.856s$ & $0.103s$ & $8.27$ & $15$ & $3.75e$-$10$ & $1.11e$-$16$ \\
-
-finan512 & $1.210s$ & $0.354s$ & $3.42$ & $14$ & $1.04e$-$10$ & $2.77e$-$16$ \\
-
-G3\_circuit & $1.346s$ & $0.263s$ & $5.12$ & $17$ & $1.10e$-$10$ & $5.55e$-$16$ \\
-
-shallow\_water2 & $0.397s$ & $0.055s$ & $7.23$ & $7$ & $3.43e$-$15$ & $5.17e$-$26$ \\
-
-thermal2 & $1.411s$ & $0.244s$ & $5.78$ & $16$ & $1.67e$-$09$ & $3.88e$-$16$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel CG method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
-on a cluster of 12 GPUs.}
-\label{ch12:tab:05}
-\end{center}
-\end{table}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $3.597s$ & $0.514s$ & $6.99$ & $21$ & $2.11e$-$14$ & $8.67e$-$18$ \\
-
-ecology2 & $2.549s$ & $0.288s$ & $8.83$ & $21$ & $4.88e$-$13$ & $2.08e$-$14$ \\
-
-finan512 & $2.660s$ & $0.377s$ & $7.05$ & $17$ & $3.22e$-$12$ & $8.82e$-$14$ \\
-
-G3\_circuit & $3.139s$ & $0.480s$ & $6.53$ & $22$ & $1.04e$-$12$ & $5.00e$-$15$ \\
-
-shallow\_water2 & $2.195s$ & $0.253s$ & $8.68$ & $17$ & $5.54e$-$21$ & $7.92e$-$24$ \\
-
-thermal2 & $3.206s$ & $0.463s$ & $6.93$ & $21$ & $8.89e$-$12$ & $3.33e$-$16$ \\ \hline \hline
-
-cage13 & $5.560s$ & $0.663s$ & $8.39$ & $26$ & $3.29e$-$11$ & $1.59e$-$14$ \\
-
-crashbasis & $25.802s$ & $3.511s$ & $7.35$ & $135$ & $6.81e$-$11$ & $4.61e$-$15$ \\
-
-FEM\_3D\_thermal2 & $13.281s$ & $1.572s$ & $8.45$ & $64$ & $3.88e$-$09$ & $1.82e$-$12$ \\
-
-language & $12.553s$ & $1.760s$ & $7.13$ & $89$ & $2.11e$-$10$ & $1.60e$-$10$ \\
-
-poli\_large & $8.515s$ & $1.053s$ & $8.09$ & $69$ & $5.05e$-$11$ & $6.59e$-$12$ \\
-
-torso3 & $31.463s$ & $3.681s$ & $8.55$ & $175$ & $2.69e$-$10$ & $2.66e$-$14$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel GMRES method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
-on a cluster of 12 GPUs.}
-\label{ch12:tab:06}
-\end{center}
-\end{table}
-
-
-%%--------------------------%%
-%% SECTION 5 %%
-%%--------------------------%%
-\section{Hypergraph partitioning}
-\label{ch12:sec:05}
In this section, we present the performances of both parallel CG and GMRES solvers for solving linear
systems associated with sparse matrices having large bandwidths. Indeed, we are interested in sparse
matrices whose nonzero values are distributed along their bandwidths.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}}
-\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
-\label{ch12:fig:07}
-\end{figure}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|}
-\hline
-{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
-
-\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $829,082,728$ & $24,999,999$ \\
-
- & ecology2 & $254,892,056$ & $25,000,000$ \\
-
- & finan512 & $556,982,339$ & $24,999,973$ \\
-
- & G3\_circuit & $257,982,646$ & $25,000,000$ \\
-
- & shallow\_water2 & $200,798,268$ & $25,000,000$ \\
-
- & thermal2 & $359,340,179$ & $24,999,998$ \\ \hline \hline
-
-\multirow{6}{*}{Nonsymmetric} & cage13 & $879,063,379$ & $24,999,998$ \\
-
- & crashbasis & $820,373,286$ & $24,999,803$ \\
-
- & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$ \\
-
- & language & $155,261,826$ & $24,999,492$ \\
-
- & poli\_large & $106,680,819$ & $25,000,000$ \\
-
- & torso3 & $872,029,998$ & $25,000,000$\\ \hline
-\end{tabular}
-\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
-\label{ch12:tab:07}
-\end{center}
-\end{table}
-
We have developed in the C programming language a generator of large sparse matrices
having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}).
The principle of this generator is the same as that in Section~\ref{ch12:sec:04}.
However, the copies performed on the initial matrix (chosen from the Davis's collection)
are placed on the main diagonal and on four off-diagonals, two on the right and two
on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of the
generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07}
shows the main characteristics of the sparse five-bands matrices generated from those
presented in Table~\ref{ch12:tab:01} and associated with linear systems of $25$ million
unknown values.
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $6.041s$ & $3.338s$ & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\
-
-ecology2 & $1.404s$ & $1.301s$ & $1.08$ & $13$ & $5.22e$-$11$ & $2.17e$-$18$ \\
-
-finan512 & $1.822s$ & $1.299s$ & $1.40$ & $12$ & $3.52e$-$11$ & $3.47e$-$18$ \\
-
-G3\_circuit & $2.331s$ & $2.129s$ & $1.09$ & $15$ & $1.36e$-$11$ & $5.20e$-$18$ \\
-
-shallow\_water2 & $0.541s$ & $0.504s$ & $1.07$ & $6$ & $2.12e$-$16$ & $5.05e$-$28$ \\
-
-thermal2 & $2.549s$ & $1.705s$ & $1.49$ & $14$ & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline
-\end{tabular}
-\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{ch12:tab:08}
-\end{center}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $15.963s$ & $7.250s$ & $2.20$ & $58$ & $6.23e$-$16$ & $3.25e$-$19$ \\
-
-ecology2 & $3.549s$ & $2.176s$ & $1.63$ & $21$ & $4.78e$-$15$ & $1.06e$-$15$ \\
-
-finan512 & $3.862s$ & $1.934s$ & $1.99$ & $17$ & $3.21e$-$14$ & $8.43e$-$17$ \\
-
-G3\_circuit & $4.636s$ & $2.811s$ & $1.65$ & $22$ & $1.08e$-$14$ & $1.77e$-$16$ \\
-
-shallow\_water2 & $2.738s$ & $1.539s$ & $1.78$ & $17$ & $5.54e$-$23$ & $3.82e$-$26$ \\
-
-thermal2 & $5.017s$ & $2.587s$ & $1.94$ & $21$ & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline
-
-cage13 & $9.315s$ & $3.227s$ & $2.89$ & $26$ & $3.38e$-$13$ & $2.08e$-$16$ \\
-
-crashbasis & $35.980s$ & $14.770s$ & $2.43$ & $127$ & $1.17e$-$12$ & $1.56e$-$17$ \\
-
-FEM\_3D\_thermal2 & $24.611s$ & $7.749s$ & $3.17$ & $64$ & $3.87e$-$11$ & $2.84e$-$14$ \\
-
-language & $16.859s$ & $9.697s$ & $1.74$ & $89$ & $2.17e$-$12$ & $1.70e$-$12$ \\
-
-poli\_large & $10.200s$ & $6.534s$ & $1.56$ & $69$ & $5.14e$-$13$ & $1.63e$-$13$ \\
-
-torso3 & $49.074s$ & $19.397s$ & $2.53$ & $175$ & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline
-\end{tabular}
-\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{ch12:tab:09}
-\end{center}
-\end{table}
-
Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} show the performances of the parallel
CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a
cluster of $12$ GPUs. The linear systems solved in these tables are associated with the
sparse five-bands matrices presented in Table~\ref{ch12:tab:07}. We can notice from
both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not
efficient for solving this kind of sparse linear system\index{Sparse~linear~system}.
We can see that the execution times obtained on the GPU cluster are almost equivalent
to those obtained on the CPU cluster (see the relative gains presented in column~$4$
of each table). This is due to the large number of communications necessary to synchronize
the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column,
of sparse matrices having large bandwidths can link a computing node to many neighbors
and thus generate a large number of data dependencies between the computing nodes in
the cluster.
-
Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph},
which is well suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed,
it can accurately model the communications between the computing nodes, particularly in the
case of nonsymmetric and irregular matrices, and it gives a good reduction of the total
communication volume. On the other hand, it is an expensive operation in terms of execution
time and memory space.
-
-The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph
-$\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows:
-\begin{itemize}
-\item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
\item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$ whose pins are the vertices of the rows holding a nonzero value in that column:
\begin{equation}
pins[e_j]=\{v_i \mid a_{ij} \neq 0\},
\end{equation}
-\item $w_i$ is the weight of vertex $v_i$ and,
-\item $c_j$ is the cost of hyperedge $e_j$.
-\end{itemize}
A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is
defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$, a set of pairwise
disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each
subset is attributed to a computing node. Figure~\ref{ch12:fig:08} shows an example
of the hypergraph model of a $(9\times 9)$ sparse matrix partitioned into three parts. The circles
and squares correspond, respectively, to the vertices and hyperedges of the hypergraph.
The solid squares define the cut hyperedges connecting at least two different parts.
The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different
parts spanned by $e_j$.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
-\caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
-\label{ch12:fig:08}
-\end{figure}
-
-The cut hyperedges model the total communication volume between the different computing
-nodes in the cluster, necessary to perform the parallel SpMV multiplication\index{SpMV~multiplication}.
-Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$,
-$0\leq i,j<n$, of the SpMV multiplication $Ax=b$ that need the $j^{th}$ unknown value of
-solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix
-rows sharing and requiring the same unknown value $x_j$. For example in Figure~\ref{ch12:fig:08},
-hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the dependency of matrix
-rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations:
-$b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$.
However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
So computing node $3$ must exchange this value with nodes $1$ and $2$, which requires
two communications.
-
The hypergraph partitioning\index{Hypergraph} reduces the total communication volume
required to perform the parallel SpMV multiplication, while maintaining the load balancing between
the computing nodes. In fact, it minimizes the following quantity:
\begin{equation}
\mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
\end{equation}
where $\mathcal{E}_{C}$ denotes the set of cut hyperedges resulting from the hypergraph partitioning
$\mathcal{P}$, and $c_j$ and $\lambda_j$ are, respectively, the cost and the connectivity of cut hyperedge
$e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows:
\begin{equation}
 W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
\end{equation}
where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the
average weight over all $K$ parts and $\epsilon$ is the maximum allowed load imbalance ratio.
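
For instance, taking unit hyperedge costs as in our experiments, the cut hyperedge $e_9$ of
Figure~\ref{ch12:fig:08} spans three parts, so its contribution to $\mathcal{X}(\mathcal{P})$ is
\[
c_{9}(\lambda_{9}-1)=1\times(3-1)=2,
\]
which corresponds to the two communications needed to send the unknown $x_9$ from node $3$ to
nodes $1$ and $2$.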
-
Hypergraph partitioning is an NP-complete problem, but software tools based on heuristics have been developed,
for example hMETIS~\cite{ch12:ref12}, PaToH~\cite{ch12:ref13} and Zoltan~\cite{ch12:ref14}. Since our
objective is to solve large sparse linear systems, we use parallel hypergraph partitioning, which must
be performed by at least two MPI processes. It accelerates the data partitioning of large sparse
matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned into $p$ (the number of MPI processes)
sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then the parallel
hypergraph partitioning method is performed by the $p$ processes using some functions of the MPI library.
-
Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} show the performances of the parallel CG and GMRES solvers,
respectively, using the hypergraph partitioning for solving large linear systems associated with the sparse
five-bands matrices presented in Table~\ref{ch12:tab:07}. For these experimental tests, we have applied the
parallel hypergraph partitioning~\cite{ch12:ref15} developed in the Zoltan tool~\cite{ch12:ref14}. We have initialized
the parameters of the partitioning operation as follows:
\begin{itemize}
\item the weight $w_{i}$ of each vertex $v_{i}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
\item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
\item the maximum load imbalance ratio $\epsilon$ is limited to $10\%$.
\end{itemize}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere & $5.935s$ & $1.213s$ & $4.89$ & $63.66\%$ \\
-
-ecology2 & $1.093s$ & $0.136s$ & $8.00$ & $89.55\%$ \\
-
-finan512 & $1.762s$ & $0.475s$ & $3.71$ & $63.43\%$ \\
-
-G3\_circuit & $2.095s$ & $0.558s$ & $3.76$ & $73.79\%$ \\
-
-shallow\_water2 & $0.498s$ & $0.068s$ & $7.31$ & $86.51\%$ \\
-
-thermal2 & $1.889s$ & $0.348s$ & $5.43$ & $79.59\%$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
-\label{ch12:tab:10}
-\end{center}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere & $16.430s$ & $2.840s$ & $5.78$ & $60.83\%$ \\
-
-ecology2 & $3.152s$ & $0.367s$ & $8.59$ & $83.13\%$ \\
-
-finan512 & $3.672s$ & $0.723s$ & $5.08$ & $62.62\%$ \\
-
-G3\_circuit & $4.468s$ & $0.971s$ & $4.60$ & $65.46\%$ \\
-
-shallow\_water2 & $2.647s$ & $0.312s$ & $8.48$ & $79.73\%$ \\
-
-thermal2 & $4.190s$ & $0.666s$ & $6.29$ & $74.25\%$ \\ \hline \hline
-
-cage13 & $8.077s$ & $1.584s$ & $5.10$ & $50.91\%$ \\
-
-crashbasis & $35.173s$ & $5.546s$ & $6.34$ & $62.43\%$ \\
-
-FEM\_3D\_thermal2 & $24.825s$ & $3.113s$ & $7.97$ & $59.83\%$ \\
-
-language & $16.706s$ & $2.522s$ & $6.62$ & $73.99\%$ \\
-
-poli\_large & $12.715s$ & $3.989s$ & $3.19$ & $38.95\%$ \\
-
-torso3 & $48.459s$ & $6.234s$ & $7.77$ & $67.86\%$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
-\label{ch12:tab:11}
-\end{center}
-\end{table}
-
We can notice from both Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} that the
hypergraph partitioning has improved the performances of both the parallel CG and GMRES
algorithms. The execution times of both parallel solvers on the GPU cluster are
significantly improved compared to those obtained by using the row-by-row partitioning.
For these examples of sparse matrices, the execution times of the CG and GMRES solvers
are reduced by about $76\%$ and $65\%$, respectively (see column~$5$ of each table),
compared to those obtained in Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09}.
-
-In fact, applying the hypergraph partitioning\index{Hypergraph} to sparse matrices
-with large bandwidths reduces the total communication volume required to synchronize
-the computations between the computing nodes of the GPU cluster.
-Table~\ref{ch12:tab:12} presents, for each sparse matrix, the total communication
-volume between the $12$ GPU computing nodes obtained with the row-by-row partitioning
-(column~$2$), the total communication volume obtained with the hypergraph partitioning
-(column~$3$), and the execution time in minutes of the hypergraph partitioning
-operation performed by $12$ MPI processes (column~$4$). The total communication
-volume is defined as the total number of vector elements exchanged between the computing
-nodes. Table~\ref{ch12:tab:12} thus shows that the hypergraph partitioning method
-splits the sparse matrix so as to minimize the data dependencies between the
-computing nodes and, consequently, to reduce the total communication volume.
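-
-To make this metric concrete, the following sequential sketch computes the total communication volume of a sparse
-matrix-vector multiplication for a given row partitioning. The CSR arrays and the array \texttt{part}, mapping each
-row (and vector element) to its owner node, are illustrative assumptions and do not correspond to the chapter's code.
-
-\begin{verbatim}
-/* Sketch: total communication volume of a sparse matrix-vector
-   multiply for a given row partitioning. Inputs (assumed): CSR
-   arrays row_ptr[n+1], col_ind[nnz], and part[i] = owner node of
-   row/vector element i. */
-#include <stdlib.h>
-#include <string.h>
-
-long total_comm_volume(int n, const int *row_ptr, const int *col_ind,
-                       const int *part, int nb_nodes)
-{
-  long volume = 0;
-  char *needed = (char *) malloc(n);     /* columns needed by node p */
-
-  for (int p = 0; p < nb_nodes; p++) {
-    memset(needed, 0, n);
-    for (int i = 0; i < n; i++) {
-      if (part[i] != p) continue;        /* only rows owned by node p */
-      for (int k = row_ptr[i]; k < row_ptr[i+1]; k++) {
-        int j = col_ind[k];
-        /* element x[j] must be received from another node */
-        if (part[j] != p && !needed[j]) { needed[j] = 1; volume++; }
-      }
-    }
-  }
-  free(needed);
-  return volume;                 /* number of vector elements exchanged */
-}
-\end{verbatim}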
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|}
-\hline
-\multirow{4}{*}{\bf Matrix} & {\bf Total comms.} & {\bf Total comms.} & {\bf Execution} \\
- & {\bf volume without} & {\bf volume with} & {\bf time of the} \\
- & {\bf hypergraph} & {\bf hypergraph} & {\bf partitioning} \\
- & {\bf parti.} & {\bf parti.} & {\bf in minutes}\\ \hline \hline
-
-2cubes\_sphere & $25,360,543$ & $240,679$ & $68.98$ \\
-
-ecology2 & $26,044,002$ & $73,021$ & $4.92$ \\
-
-finan512 & $26,087,431$ & $900,729$ & $33.72$ \\
-
-G3\_circuit & $31,912,003$ & $5,366,774$ & $11.63$ \\
-
-shallow\_water2 & $25,105,108$ & $60,899$ & $5.06$ \\
-
-thermal2 & $30,012,846$ & $1,077,921$ & $17.88$ \\ \hline \hline
-
-cage13 & $28,254,282$ & $3,845,440$ & $196.45$ \\
-
-crashbasis & $29,020,060$ & $2,401,876$ & $33.39$ \\
-
-FEM\_3D\_thermal2 & $25,263,767$ & $250,105$ & $49.89$ \\
-
-language & $27,291,486$ & $1,537,835$ & $9.07$ \\
-
-poli\_large & $25,053,554$ & $7,388,883$ & $5.92$ \\
-
-torso3 & $25,682,514$ & $613,250$ & $61.51$ \\ \hline
-\end{tabular}
-\caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
-\label{ch12:tab:12}
-\end{center}
-\end{table}
-
-Nevertheless, as we can see from the fourth column of Table~\ref{ch12:tab:12},
-the hypergraph partitioning takes much longer than the resolution of the linear
-systems themselves. As previously mentioned, the hypergraph partitioning method
-is less efficient in terms of memory consumption and partitioning time than its
-graph counterpart, but it models nonsymmetric and irregular problems more accurately.
-So, for applications that repeatedly use the same sparse matrices, the hypergraph
-partitioning can be performed only once per matrix and its result saved to a file,
-to be reused in subsequent resolutions. This avoids repartitioning the sparse
-matrices at each resolution of the linear systems.
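-
-As an illustration, a partitioning result can be saved and reloaded with a few lines of standard C. The file format,
-the array \texttt{part} holding the owner node of each row, and the function names are illustrative assumptions,
-not the format actually used in our experiments.
-
-\begin{verbatim}
-/* Sketch: save/reload the result of the hypergraph partitioning
-   (the owner node of each matrix row) for reuse by later resolutions. */
-#include <stdio.h>
-
-int save_partition(const char *filename, const int *part, int n)
-{
-  FILE *f = fopen(filename, "wb");
-  if (!f) return -1;
-  fwrite(&n, sizeof(int), 1, f);            /* matrix size            */
-  fwrite(part, sizeof(int), n, f);          /* owner node of each row */
-  fclose(f);
-  return 0;
-}
-
-int load_partition(const char *filename, int *part, int n)
-{
-  FILE *f = fopen(filename, "rb");
-  int m;
-  if (!f) return -1;
-  if (fread(&m, sizeof(int), 1, f) != 1 || m != n) { fclose(f); return -1; }
-  if (fread(part, sizeof(int), n, f) != (size_t) n) { fclose(f); return -1; }
-  fclose(f);
-  return 0;
-}
-\end{verbatim}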
-
-\begin{figure}[!h]
-\centering
- \mbox{\subfigure[Sparse band matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band}\label{ch12:fig:09.01}}}
-\vfill
- \mbox{\subfigure[Sparse five-bands matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band}\label{ch12:fig:09.02}}}
-\caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
-\label{ch12:fig:09}
-\end{figure}
-
-However, the most important performance parameter is the scalability of the parallel
-CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} solvers on a GPU
-cluster. In particular, we have studied the weak-scaling of both parallel
-algorithms on a cluster of one up to $12$ GPU computing nodes. We have performed a set of
-experiments on both matrix structures: band matrices and five-bands matrices. The test
-matrices are generated from the symmetric sparse matrix {\it thermal2} chosen
-from Davis's collection. Figures~\ref{ch12:fig:09.01} and~\ref{ch12:fig:09.02}
-show the execution times of both parallel methods for solving large linear systems
-associated to band matrices and to five-bands matrices, respectively.
-The size of the sparse sub-matrix per computing node, for each matrix structure, is fixed
-as follows:
-\begin{itemize}
-\item band matrix: $15$ million rows and $105,166,557$ nonzero values,
-\item five-bands matrix: $5$ million rows and $78,714,492$ nonzero values.
-\end{itemize}
-We can see from these figures that both parallel solvers are quite scalable on a GPU
-cluster. Indeed, the execution times remain almost constant while the size of the
-sparse linear systems to be solved increases proportionally with the number of
-GPU computing nodes. This means that the communication cost is relatively constant
-regardless of the number of computing nodes in the GPU cluster.
-
-
-
-%%--------------------------%%
-%% SECTION 6 %%
-%%--------------------------%%
-\section{Conclusion}
-\label{ch12:sec:06}
-In this chapter, we have aimed at harnessing the computing power of a
-cluster of GPUs for solving large sparse linear systems. For this, we
-have used two Krylov subspace iterative methods: the CG and GMRES methods.
-The first method is well known for its efficiency in solving symmetric
-linear systems, and the second one is used, in particular, for solving
-nonsymmetric linear systems.
-
-We have presented the parallel implementation of both iterative methods
-on a GPU cluster. In particular, the operations of these methods dealing
-with vectors and/or matrices are parallelized between the different
-GPU computing nodes of the cluster: the data-parallel vector operations
-are accelerated by the GPUs, and the communications required to synchronize the
-parallel computations are carried out by the CPU cores. For this, we have used
-heterogeneous CUDA/MPI programming to implement the parallel iterative
-algorithms.
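-
-As a small illustration of this CUDA/MPI scheme, the sketch below computes a distributed dot product: the local part
-is computed on the GPU with CUBLAS and the partial results are reduced with MPI on the CPU side. The function name
-and its arguments are illustrative assumptions, not the chapter's actual routines.
-
-\begin{verbatim}
-/* Sketch: distributed dot product; the GPU computes the local,
-   data-parallel part and the CPUs synchronize via MPI. */
-#include <mpi.h>
-#include <cublas_v2.h>
-
-double distributed_dot(cublasHandle_t handle, const double *d_x,
-                       const double *d_y, int local_n, MPI_Comm comm)
-{
-  double local_dot = 0.0, global_dot = 0.0;
-
-  /* local dot product on the GPU (device vectors d_x and d_y) */
-  cublasDdot(handle, local_n, d_x, 1, d_y, 1, &local_dot);
-
-  /* reduction of the partial results over all computing nodes */
-  MPI_Allreduce(&local_dot, &global_dot, 1, MPI_DOUBLE, MPI_SUM, comm);
-  return global_dot;
-}
-\end{verbatim}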
-
-In the experimental tests, we have shown that using a GPU cluster is efficient
-for solving linear systems associated to very large sparse matrices. The experimental
-results obtained in the present chapter showed that a cluster of $12$ GPUs is
-about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
-linear systems of $25$ million unknowns. This is due to the ability of GPUs to
-compute data-parallel operations faster than CPUs. However, we have shown
-that solving linear systems associated to matrices with large bandwidths requires
-many communications to synchronize the computations of the GPUs, which further
-slows down the resolution. Moreover, there are two kinds of communications, between a
-CPU and its GPU and between the CPUs of the computing nodes, and the former
-are the slowest on a GPU cluster. So, we have proposed to use the
-hypergraph partitioning instead of the row-by-row partitioning. This makes it possible to
-minimize the data dependencies between the GPU computing nodes and thus to reduce
-the total communication volume. The experimental results showed that the
-hypergraph partitioning technique improves the execution times by about $76\%$
-on average for the CG method and $65\%$ for the GMRES method on a cluster of $12$ GPUs.
-
-In recent GPU hardware and software architectures, the GPUDirect technology
-available with CUDA version 5.0 allows two GPUs, located on the same node or on distant
-nodes, to communicate directly with each other without involving the CPUs. This improves
-the data transfers between GPUs.
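-
-For illustration, with a CUDA-aware MPI implementation built on GPUDirect, device pointers can be passed directly to
-MPI calls, as in the following sketch; the buffer names, sizes, and the neighbor rank are illustrative assumptions.
-
-\begin{verbatim}
-/* Sketch: GPU-to-GPU exchange with a CUDA-aware MPI library; device
-   pointers are passed directly to MPI, avoiding the intermediate
-   copies to host memory that a non CUDA-aware MPI would require. */
-#include <mpi.h>
-
-void exchange_shared_elements(double *d_send, double *d_recv, int count,
-                              int neighbor, MPI_Comm comm)
-{
-  MPI_Sendrecv(d_send, count, MPI_DOUBLE, neighbor, 0,
-               d_recv, count, MPI_DOUBLE, neighbor, 0,
-               comm, MPI_STATUS_IGNORE);
-}
-\end{verbatim}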
-
-\putbib[Chapters/chapter12/biblio12]
-