X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/bc21a4e60ae5e4ef525ce6933905824c6597401d..26cebc7573a80ca26562fdc140322ce3df3976b1:/BookGPU/Chapters/chapter12/ch12.tex?ds=inline diff --git a/BookGPU/Chapters/chapter12/ch12.tex b/BookGPU/Chapters/chapter12/ch12.tex index eaa4f9c..0b743eb 100755 --- a/BookGPU/Chapters/chapter12/ch12.tex +++ b/BookGPU/Chapters/chapter12/ch12.tex @@ -5,11 +5,11 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\chapterauthor{}{} -\chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} -\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} +%\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} -\chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters} +\chapter[Solving linear systems with GMRES and CG methods on GPU clusters]{Solving sparse linear systems with GMRES and CG methods on GPU clusters} \label{ch12} %%--------------------------%% @@ -17,10 +17,10 @@ %%--------------------------%% \section{Introduction} \label{ch12:sec:01} -The sparse linear systems are used to model many scientific and industrial problems, +Sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or the industrial processing of the complex or non-Newtonian fluids. Moreover, the resolution of these problems often involves the -solving of such linear systems which is considered as the most expensive process in +solving of such linear systems which are considered as the most expensive process in terms of execution time and memory space. Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing size. @@ -28,21 +28,18 @@ size. There are, in the jargon of numerical analysis, different methods of solving sparse linear systems that can be classified in two classes: the direct and iterative methods. However, the iterative methods are often more suitable than their counterpart, direct -methods, for solving these systems. Indeed, they are less memory consuming and easier +methods, to solve these systems. Indeed, they are less memory consuming and easier to parallelize on parallel computers than direct methods. Different computing platforms, -sequential and parallel computers, are used for solving sparse linear systems with iterative -solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving +sequential and parallel computers, are used to solve sparse linear systems with iterative +solutions. Nowadays, graphics processing units (GPUs) have become attractive to solve these systems, due to their computing power and their ability to compute faster than traditional CPUs. In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03}, we give the main key points of the parallel implementation of both methods on a cluster of -GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a -CPU cluster and on a GPU cluster, for solving sparse linear systems associated to matrices -of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning -technique to reduce the total communication volume between the computing nodes and, thus, -to improve the execution times of the parallel algorithms of both iterative methods. +GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a +CPU cluster and on a GPU cluster, to solve large sparse linear systems. %%--------------------------%% @@ -84,7 +81,7 @@ linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1 In the present chapter, we describe two Krylov methods which are widely used: the conjugate gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the Krylov subspace methods are usually used with preconditioners that allow to improve their -convergence. So, in what follows, the CG and GMRES methods are used for solving the left-preconditioned\index{Sparse~linear~system!Preconditioned} +convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system: \begin{equation} M^{-1}Ax=M^{-1}b, @@ -97,9 +94,9 @@ where $M$ is the preconditioning matrix. %%****************%% \subsection{CG method} \label{ch12:sec:02.01} -The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}. -It is one of the well known iterative method for solving large sparse linear systems. In addition, it -can be adapted for solving nonlinear equations and optimization problems. However, it can only be applied +The conjugate gradient method was initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}. +It is one of the well known iterative method to solve large sparse linear systems. In addition, it +can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices. The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate @@ -200,7 +197,7 @@ $maxiter$ are reached. %%****************%% \subsection{GMRES method} \label{ch12:sec:02.02} -The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization +The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can be applied for solving symmetric or nonsymmetric linear systems. @@ -231,7 +228,7 @@ V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1, \end{equation} and \begin{equation} -V_k A = V_{k+1} \bar{H}_k. +A V_k = V_{k+1} \bar{H}_k. \label{ch12:eq:15} \end{equation} @@ -312,7 +309,7 @@ $V$ to $m$ orthogonal vectors. \label{ch12:alg:02} \end{algorithm} -Algorithm~\ref{ch12:alg:02} shows the main key points of the GMRES method with restarts. +Algorithm~\ref{ch12:alg:02} shows the key points of the GMRES method with restarts. It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration $k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from @@ -355,7 +352,7 @@ $i$: \item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$, \end{itemize} where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive -row-wise partitioning (decomposition row-by-row) on the data of the sparse linear systems to be solved. +row-wise partitioning (row-by-row decomposition) on the data of the sparse linear systems to be solved. Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$ unknown values. @@ -394,10 +391,10 @@ GMRES solvers. The following kernels of CUBLAS (dealing with double floating poi are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of the data-parallel operations, we code their kernels in CUDA. In the CG solver, we -develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in +develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used line~$12$ in Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector -multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel for -solving the least-squares problem and a kernel for the elements updates of the solution +multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel to +solve the least-squares problem and a kernel to update the elements of the solution vector $x$. The least-squares problem in the GMRES method is solved by performing a QR factorization @@ -414,12 +411,12 @@ methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplicat because it is often an expensive operation in terms of execution time and memory space. Moreover, it requires to take care of the storage format of the sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix -can cause a significant waste of memory space and execution time. In addition, the sparsity +can cause a significant waste of memory space and execution time. In addition, the sparse nature of the matrix often leads to irregular memory accesses to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced -accesses to the global memory, which slows down even more its performances. One of the +accesses to the global memory, which slows down its performances even more. One of the most efficient compressed storage formats\index{Compressed~storage~format} of sparse -matrices on GPUs is HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}. +matrices on GPUs is the HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL} format and remaining entries of exceptional rows in COO format. It combines the efficiency @@ -488,7 +485,7 @@ storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse sub-mat A GPU cluster\index{GPU~cluster} is a parallel platform with a distributed memory. So, the synchronizations and communication data between GPU nodes are carried out by passing messages. However, GPUs can not communicate -between them in direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU +between them in a direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory and vice-versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS} communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+ @@ -512,19 +509,19 @@ Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster} that we used in the experimental tests. -\begin{figure} -\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}} -\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.} -\label{ch12:fig:04} -\end{figure} - -Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding +Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used to code the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9} -is used for programming GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations +is used to program GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node is managed by a MPI process and it is composed of one CPU core and one GPU card. +\begin{figure}[!h] +\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}} +\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.} +\label{ch12:fig:04} +\end{figure} + All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the @@ -536,17 +533,9 @@ not too ill-conditioned matrices. In the GPU computing, the size of thread block threads. Finally, the performance results, presented hereafter, are obtained from the mean value over $10$ executions of the same parallel linear solver and for the same input data. -To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's -collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six -symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05}, -we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics -which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In -the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating -the first and the last nonzero value on a matrix row. - \begin{figure} \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}} -\caption{Sketches of sparse matrices chosen from the Davis's collection.} +\caption{Sketches of sparse matrices chosen from the Davis collection.} \label{ch12:fig:05} \end{figure} @@ -580,11 +569,18 @@ the first and the last nonzero value on a matrix row. & torso3 & $259,156$ & $4,429,042$ & $216,854$ \\ \hline \end{tabular} -\vspace{0.5cm} -\caption{Main characteristics of sparse matrices chosen from the Davis's collection.} +\caption{Main characteristics of sparse matrices chosen from the Davis collection.} \label{ch12:tab:01} \end{table} +To get more realistic results, we have tested the CG and GMRES algorithms on sparse matrices of the Davis +collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We have chosen six +symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05}, +we show the structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics +which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In +the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating +the first and the last nonzero value on a matrix row. + \begin{table} \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} @@ -608,7 +604,7 @@ thermal2 & $1.172s$ & $0.622s$ & $1.88$ & $ \end{center} \end{table} -\begin{table}[!h] +\begin{table} \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline @@ -643,11 +639,11 @@ torso3 & $4.242s$ & $2.030s$ & $2.09$ & $ \end{center} \end{table} -Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} shows the performances of the parallel +Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} show the performances of the parallel CG and GMRES solvers, respectively, for solving linear systems associated to the sparse matrices presented in Tables~\ref{ch12:tab:01}. They allow to compare the performances obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02} -shows only the performances of solving symmetric sparse linear systems, due to the inability +only shows the performances of solving symmetric sparse linear systems, due to the inability of the CG method to solve the nonsymmetric systems. In both tables, the second and third columns give, respectively, the execution times in seconds obtained on $24$ CPU cores ($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account @@ -674,31 +670,31 @@ $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$. -However, we can notice from the relative gains $\tau$ that is not interesting to use multiple -GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to +However, we can notice from the relative gains $\tau$ that it is not interesting to use multiple +GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow to maximize utilization of GPU cores. In addition, the communications required to synchronize the -computations over the cluster increase the idle times of GPUs and slow down further the parallel -computations. +computations over the cluster increase the idle times of GPUs and slow down the parallel +computations further. Consequently, in order to test the performances of the parallel solvers, we developed in C programming -language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10} -as an initial matrix to construct large sparse matrices exceeding ten million of rows. It must be executed -in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse -sub-matrix. In first experimental tests, we are focused on sparse matrices having a banded structure, -because they are those arise in the most of numerical problems. So to generate the global sparse matrix, +language a generator of large sparse matrices. This generator takes a matrix from the Davis collection~\cite{ch12:ref10} +as an initial matrix to build large sparse matrices exceeding ten million of rows. It must be executed +in parallel by the MPI processes of the computing nodes, so that each process could build its sparse +sub-matrix. In the first experimental tests, we focused on sparse matrices having a banded structure, +because they are those arising the most in the majority of numerical problems. So to generate the global sparse matrix, each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen -from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix +from the Davis collection. Then, it puts all these copies on the main diagonal of the global matrix (see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same initial matrix. -\begin{figure} +\begin{figure}[htbp] \centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}} \caption{Parallel generation of a large sparse matrix by four computing nodes.} \label{ch12:fig:06} \end{figure} -\begin{table}[!h] +\begin{table}[htbp] \centering \begin{tabular}{|c|c|c|c|} \hline @@ -729,24 +725,11 @@ initial matrix. & torso3 & $433,795,264$ & $328,757$ \\ \hline \end{tabular} \vspace{0.5cm} -\caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.} +\caption{Main characteristics of sparse banded matrices generated from those of the Davis collection.} \label{ch12:tab:04} \end{table} -We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$ -million unknown values. The sparse matrices associated to these linear systems are generated -from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}. -Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and -GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$ -GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on -a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also -notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster, -are better than those the GMRES method for solving large symmetric linear systems. In fact, the -CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution -time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES -method requires more data exchanges between computing nodes compared to the parallel CG method. - -\begin{table}[!h] +\begin{table}[htbp] \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline @@ -770,7 +753,7 @@ on a cluster of 12 GPUs.} \end{center} \end{table} -\begin{table}[!h] +\begin{table} \begin{center} \begin{tabular}{|c|c|c|c|c|c|c|} \hline @@ -807,394 +790,30 @@ on a cluster of 12 GPUs.} \end{table} -%%--------------------------%% -%% SECTION 5 %% -%%--------------------------%% -\section{Hypergraph partitioning} -\label{ch12:sec:05} -In this section, we present the performances of both parallel CG and GMRES solvers for solving linear -systems associated to sparse matrices having large bandwidths. Indeed, we are interested on sparse -matrices having the nonzero values distributed along their bandwidths. - -\begin{figure} -\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}} -\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.} -\label{ch12:fig:07} -\end{figure} - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|} -\hline -{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline - -\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $829,082,728$ & $24,999,999$ \\ - - & ecology2 & $254,892,056$ & $25,000,000$ \\ - - & finan512 & $556,982,339$ & $24,999,973$ \\ - - & G3\_circuit & $257,982,646$ & $25,000,000$ \\ - - & shallow\_water2 & $200,798,268$ & $25,000,000$ \\ - - & thermal2 & $359,340,179$ & $24,999,998$ \\ \hline \hline - -\multirow{6}{*}{Nonsymmetric} & cage13 & $879,063,379$ & $24,999,998$ \\ - - & crashbasis & $820,373,286$ & $24,999,803$ \\ - - & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$ \\ - - & language & $155,261,826$ & $24,999,492$ \\ - - & poli\_large & $106,680,819$ & $25,000,000$ \\ - - & torso3 & $872,029,998$ & $25,000,000$\\ \hline -\end{tabular} -\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.} -\label{ch12:tab:07} -\end{center} -\end{table} - -We have developed in C programming language a generator of large sparse matrices -having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}). -The principle of this generator is equivalent to that in Section~\ref{ch12:sec:04}. -However, the copies performed on the initial matrix (chosen from the Davis's collection) -are placed on the main diagonal and on four off-diagonals, two on the right and two -on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of a -generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07} -shows the main characteristics of sparse five-bands matrices generated from those -presented in Table~\ref{ch12:tab:01} and associated to linear systems of $25$ million -unknown values. - -\begin{table}[!h] -\begin{center} -\begin{tabular}{|c|c|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline - -2cubes\_sphere & $6.041s$ & $3.338s$ & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\ - -ecology2 & $1.404s$ & $1.301s$ & $1.08$ & $13$ & $5.22e$-$11$ & $2.17e$-$18$ \\ - -finan512 & $1.822s$ & $1.299s$ & $1.40$ & $12$ & $3.52e$-$11$ & $3.47e$-$18$ \\ - -G3\_circuit & $2.331s$ & $2.129s$ & $1.09$ & $15$ & $1.36e$-$11$ & $5.20e$-$18$ \\ - -shallow\_water2 & $0.541s$ & $0.504s$ & $1.07$ & $6$ & $2.12e$-$16$ & $5.05e$-$28$ \\ - -thermal2 & $2.549s$ & $1.705s$ & $1.49$ & $14$ & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline -\end{tabular} -\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices -on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs} -\label{ch12:tab:08} -\end{center} -\end{table} - -\begin{table} -\begin{center} -\begin{tabular}{|c|c|c|c|c|c|c|} -\hline -{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline - -2cubes\_sphere & $15.963s$ & $7.250s$ & $2.20$ & $58$ & $6.23e$-$16$ & $3.25e$-$19$ \\ - -ecology2 & $3.549s$ & $2.176s$ & $1.63$ & $21$ & $4.78e$-$15$ & $1.06e$-$15$ \\ - -finan512 & $3.862s$ & $1.934s$ & $1.99$ & $17$ & $3.21e$-$14$ & $8.43e$-$17$ \\ - -G3\_circuit & $4.636s$ & $2.811s$ & $1.65$ & $22$ & $1.08e$-$14$ & $1.77e$-$16$ \\ - -shallow\_water2 & $2.738s$ & $1.539s$ & $1.78$ & $17$ & $5.54e$-$23$ & $3.82e$-$26$ \\ - -thermal2 & $5.017s$ & $2.587s$ & $1.94$ & $21$ & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline - -cage13 & $9.315s$ & $3.227s$ & $2.89$ & $26$ & $3.38e$-$13$ & $2.08e$-$16$ \\ - -crashbasis & $35.980s$ & $14.770s$ & $2.43$ & $127$ & $1.17e$-$12$ & $1.56e$-$17$ \\ - -FEM\_3D\_thermal2 & $24.611s$ & $7.749s$ & $3.17$ & $64$ & $3.87e$-$11$ & $2.84e$-$14$ \\ - -language & $16.859s$ & $9.697s$ & $1.74$ & $89$ & $2.17e$-$12$ & $1.70e$-$12$ \\ - -poli\_large & $10.200s$ & $6.534s$ & $1.56$ & $69$ & $5.14e$-$13$ & $1.63e$-$13$ \\ - -torso3 & $49.074s$ & $19.397s$ & $2.53$ & $175$ & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline -\end{tabular} -\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices -on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs} -\label{ch12:tab:09} -\end{center} -\end{table} - -Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} shows the performances of the parallel -CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a -cluster of $12$ GPUs. The linear systems solved in these tables are associated to the -sparse five-bands matrices presented on Table~\ref{ch12:tab:07}. We can notice from -both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not -efficient for solving these kind of sparse linear systems\index{Sparse~linear~system}. -We can see that the execution times obtained on the GPU cluster are almost equivalent -to those obtained on the CPU cluster (see the relative gains presented in column~$4$ -of each table). This is due to the large number of communications necessary to synchronize -the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column, -of sparse matrices having large bandwidths can link a computing node to many neighbors -and then generate a large number of data dependencies between these computing nodes in -the cluster. - -Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph}, -which is well-suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed, -it can well model the communications between the computing nodes, particularly in the -case of nonsymmetric and irregular matrices, and it gives good reduction of the total -communication volume. In contrast, it is an expensive operation in terms of execution -time and memory space. - -The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph -$\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows: -\begin{itemize} -\item each matrix row $\{i\}_{0\leq i