In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
we give the main key points of the parallel implementation of both methods on a cluster of
-GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
-CPU cluster and on a GPU cluster, for solving sparse linear systems associated to matrices
-of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning
-technique to reduce the total communication volume between the computing nodes and, thus,
-to improve the execution times of the parallel algorithms of both iterative methods.
+GPUs. Finally, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
+CPU cluster and on a GPU cluster, for solving large sparse linear systems.
%%--------------------------%%
%%****************%%
\subsection{GMRES method}
\label{ch12:sec:02.02}
-The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
+The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can
be applied for solving symmetric or nonsymmetric linear systems.
\end{equation}
and
\begin{equation}
-V_k A = V_{k+1} \bar{H}_k.
+A V_k = V_{k+1} \bar{H}_k.
\label{ch12:eq:15}
\end{equation}
a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster}
that we used in the experimental tests.
-\begin{figure}
-\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
-\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
-\label{ch12:fig:04}
-\end{figure}
-
Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding
the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9}
is used for programming GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations
CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
is managed by a MPI process and it is composed of one CPU core and one GPU card.
+\begin{figure}[!h]
+\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
+\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
+\label{ch12:fig:04}
+\end{figure}
+
All tests are made on double-precision floating point operations. The parameters of both linear
solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
threads. Finally, the performance results, presented hereafter, are obtained from the mean value
over $10$ executions of the same parallel linear solver and for the same input data.
-To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's
-collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six
-symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
-we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
-which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
-the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
-the first and the last nonzero value on a matrix row.
-
\begin{figure}
\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
-\caption{Sketches of sparse matrices chosen from the Davis's collection.}
+\caption{Sketches of sparse matrices chosen from the Davis collection.}
\label{ch12:fig:05}
\end{figure}
& torso3 & $259,156$ & $4,429,042$ & $216,854$ \\ \hline
\end{tabular}
-\vspace{0.5cm}
-\caption{Main characteristics of sparse matrices chosen from the Davis's collection.}
+\caption{Main characteristics of sparse matrices chosen from the Davis collection.}
\label{ch12:tab:01}
\end{table}
+To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis
+collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six
+symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
+we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
+which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
+the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
+the first and the last nonzero value on a matrix row.
+
\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\end{center}
\end{table}
-\begin{table}[!h]
+\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$.
However, we can notice from the relative gains $\tau$ that is not interesting to use multiple
-GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to
+GPUs for solving small sparse linear systems. In fact, a small sparse matrix does not allow to
maximize utilization of GPU cores. In addition, the communications required to synchronize the
computations over the cluster increase the idle times of GPUs and slow down further the parallel
computations.
Consequently, in order to test the performances of the parallel solvers, we developed in C programming
-language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10}
+language a generator of large sparse matrices. This generator takes a matrix from the Davis collection~\cite{ch12:ref10}
as an initial matrix to construct large sparse matrices exceeding ten million of rows. It must be executed
in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse
sub-matrix. In first experimental tests, we are focused on sparse matrices having a banded structure,
because they are those arise in the most of numerical problems. So to generate the global sparse matrix,
each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
-from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix
+from the Davis collection. Then, it puts all these copies on the main diagonal of the global matrix
(see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
initial matrix.
& torso3 & $433,795,264$ & $328,757$ \\ \hline
\end{tabular}
\vspace{0.5cm}
-\caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
+\caption{Main characteristics of sparse banded matrices generated from those of the Davis collection.}
\label{ch12:tab:04}
\end{table}
time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
method requires more data exchanges between computing nodes compared to the parallel CG method.
-\begin{table}[!h]
+\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
\end{center}
\end{table}
-\begin{table}[!h]
+\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
\end{center}
\end{table}
-
%%--------------------------%%
%% SECTION 5 %%
%%--------------------------%%
-\section{Hypergraph partitioning}
-\label{ch12:sec:05}
-In this section, we present the performances of both parallel CG and GMRES solvers for solving linear
-systems associated to sparse matrices having large bandwidths. Indeed, we are interested on sparse
-matrices having the nonzero values distributed along their bandwidths.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}}
-\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
-\label{ch12:fig:07}
-\end{figure}
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|}
-\hline
-{\bf Matrix type} & {\bf Matrix name} & {\bf \# nnz} & {\bf Bandwidth} \\ \hline \hline
-
-\multirow{6}{*}{Symmetric} & 2cubes\_sphere & $829,082,728$ & $24,999,999$ \\
-
- & ecology2 & $254,892,056$ & $25,000,000$ \\
-
- & finan512 & $556,982,339$ & $24,999,973$ \\
-
- & G3\_circuit & $257,982,646$ & $25,000,000$ \\
-
- & shallow\_water2 & $200,798,268$ & $25,000,000$ \\
-
- & thermal2 & $359,340,179$ & $24,999,998$ \\ \hline \hline
-
-\multirow{6}{*}{Nonsymmetric} & cage13 & $879,063,379$ & $24,999,998$ \\
-
- & crashbasis & $820,373,286$ & $24,999,803$ \\
-
- & FEM\_3D\_thermal2 & $1,194,012,703$ & $24,999,998$ \\
-
- & language & $155,261,826$ & $24,999,492$ \\
-
- & poli\_large & $106,680,819$ & $25,000,000$ \\
-
- & torso3 & $872,029,998$ & $25,000,000$\\ \hline
-\end{tabular}
-\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
-\label{ch12:tab:07}
-\end{center}
-\end{table}
-
-We have developed in C programming language a generator of large sparse matrices
-having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}).
-The principle of this generator is equivalent to that in Section~\ref{ch12:sec:04}.
-However, the copies performed on the initial matrix (chosen from the Davis's collection)
-are placed on the main diagonal and on four off-diagonals, two on the right and two
-on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of a
-generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07}
-shows the main characteristics of sparse five-bands matrices generated from those
-presented in Table~\ref{ch12:tab:01} and associated to linear systems of $25$ million
-unknown values.
-
-\begin{table}[!h]
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $6.041s$ & $3.338s$ & $1.81$ & $30$ & $6.77e$-$11$ & $3.25e$-$19$ \\
-
-ecology2 & $1.404s$ & $1.301s$ & $1.08$ & $13$ & $5.22e$-$11$ & $2.17e$-$18$ \\
-
-finan512 & $1.822s$ & $1.299s$ & $1.40$ & $12$ & $3.52e$-$11$ & $3.47e$-$18$ \\
-
-G3\_circuit & $2.331s$ & $2.129s$ & $1.09$ & $15$ & $1.36e$-$11$ & $5.20e$-$18$ \\
-
-shallow\_water2 & $0.541s$ & $0.504s$ & $1.07$ & $6$ & $2.12e$-$16$ & $5.05e$-$28$ \\
-
-thermal2 & $2.549s$ & $1.705s$ & $1.49$ & $14$ & $2.36e$-$10$ & $5.20e$-$18$ \\ \hline
-\end{tabular}
-\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{ch12:tab:08}
-\end{center}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\# iter.}$ & $\mathbf{prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
-
-2cubes\_sphere & $15.963s$ & $7.250s$ & $2.20$ & $58$ & $6.23e$-$16$ & $3.25e$-$19$ \\
-
-ecology2 & $3.549s$ & $2.176s$ & $1.63$ & $21$ & $4.78e$-$15$ & $1.06e$-$15$ \\
-
-finan512 & $3.862s$ & $1.934s$ & $1.99$ & $17$ & $3.21e$-$14$ & $8.43e$-$17$ \\
-
-G3\_circuit & $4.636s$ & $2.811s$ & $1.65$ & $22$ & $1.08e$-$14$ & $1.77e$-$16$ \\
-
-shallow\_water2 & $2.738s$ & $1.539s$ & $1.78$ & $17$ & $5.54e$-$23$ & $3.82e$-$26$ \\
-
-thermal2 & $5.017s$ & $2.587s$ & $1.94$ & $21$ & $8.25e$-$14$ & $4.34e$-$18$ \\ \hline \hline
-
-cage13 & $9.315s$ & $3.227s$ & $2.89$ & $26$ & $3.38e$-$13$ & $2.08e$-$16$ \\
-
-crashbasis & $35.980s$ & $14.770s$ & $2.43$ & $127$ & $1.17e$-$12$ & $1.56e$-$17$ \\
-
-FEM\_3D\_thermal2 & $24.611s$ & $7.749s$ & $3.17$ & $64$ & $3.87e$-$11$ & $2.84e$-$14$ \\
-
-language & $16.859s$ & $9.697s$ & $1.74$ & $89$ & $2.17e$-$12$ & $1.70e$-$12$ \\
-
-poli\_large & $10.200s$ & $6.534s$ & $1.56$ & $69$ & $5.14e$-$13$ & $1.63e$-$13$ \\
-
-torso3 & $49.074s$ & $19.397s$ & $2.53$ & $175$ & $2.69e$-$12$ & $2.77e$-$16$ \\ \hline
-\end{tabular}
-\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
-on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{ch12:tab:09}
-\end{center}
-\end{table}
-
-Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} shows the performances of the parallel
-CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a
-cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
-sparse five-bands matrices presented on Table~\ref{ch12:tab:07}. We can notice from
-both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not
-efficient for solving these kind of sparse linear systems\index{Sparse~linear~system}.
-We can see that the execution times obtained on the GPU cluster are almost equivalent
-to those obtained on the CPU cluster (see the relative gains presented in column~$4$
-of each table). This is due to the large number of communications necessary to synchronize
-the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column,
-of sparse matrices having large bandwidths can link a computing node to many neighbors
-and then generate a large number of data dependencies between these computing nodes in
-the cluster.
-
-Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph},
-which is well-suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed,
-it can well model the communications between the computing nodes, particularly in the
-case of nonsymmetric and irregular matrices, and it gives good reduction of the total
-communication volume. In contrast, it is an expensive operation in terms of execution
-time and memory space.
-
-The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph
-$\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows:
-\begin{itemize}
-\item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
-\item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where:
-\begin{equation}
-\forall a_{ij} \neq 0 \mbox{~is a nonzero value of matrix~} A \mbox{~:~} v_i \in pins[e_j],
-\end{equation}
-\item $w_i$ is the weight of vertex $v_i$ and,
-\item $c_j$ is the cost of hyperedge $e_j$.
-\end{itemize}
-A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is
-defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$ a set of pairwise
-disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each
-subset is attributed to a computing node. Figure~\ref{ch12:fig:08} shows an example
-of the hypergraph model of a $(9\times 9)$ sparse matrix in three parts. The circles
-and squares correspond, respectively, to the vertices and hyperedges of the hypergraph.
-The solid squares define the cut hyperedges connecting at least two different parts.
-The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different
-parts spanned by $e_j$.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
-\caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
-\label{ch12:fig:08}
-\end{figure}
-
-The cut hyperedges model the total communication volume between the different computing
-nodes in the cluster, necessary to perform the parallel SpMV multiplication\index{SpMV~multiplication}.
-Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$,
-$0\leq i,j<n$, of the SpMV multiplication $Ax=b$ that need the $j^{th}$ unknown value of
-solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix
-rows sharing and requiring the same unknown value $x_j$. For example in Figure~\ref{ch12:fig:08},
-hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the dependency of matrix
-rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations:
-$b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$.
-However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
-So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform
-two communications.
-
-The hypergraph partitioning\index{Hypergraph} allows to reduce the total communication volume
-required to perform the parallel SpMV multiplication, while maintaining the load balancing between
-the computing nodes. In fact, it allows to minimize at best the following amount:
-\begin{equation}
-\mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
-\end{equation}
-where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning
-$\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively, the cost and the connectivity of cut hyperedge
-$e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows:
-\begin{equation}
- W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
-\end{equation}
-where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the
-average weight of all $K$ parts and $\epsilon$ is the maximum allowed imbalanced ratio.
-
-The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed,
-for example: hMETIS~\cite{ch12:ref12}, PaToH~\cite{ch12:ref13} and Zoltan~\cite{ch12:ref14}. Since our
-objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must
-be performed by at least two MPI processes. It allows to accelerate the data partitioning of large sparse
-matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned in $p$ (number of MPI processes)
-sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the
-parallel hypergraph partitioning method using some functions of the MPI library between the $p$ processes.
-
-Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} shows the performances of the parallel CG and GMRES solvers,
-respectively, using the hypergraph partitioning for solving large linear systems associated to the sparse
-five-bands matrices presented in Table~\ref{ch12:tab:07}. For these experimental tests, we have applied the
-parallel hypergraph partitioning~\cite{ch12:ref15} developed in Zoltan tool~\cite{ch12:ref14}. We have initialized
-the parameters of the partitioning operation as follows:
-\begin{itemize}
-\item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
-\item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
-\item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\
-\end{itemize}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere & $5.935s$ & $1.213s$ & $4.89$ & $63.66\%$ \\
-
-ecology2 & $1.093s$ & $0.136s$ & $8.00$ & $89.55\%$ \\
-
-finan512 & $1.762s$ & $0.475s$ & $3.71$ & $63.43\%$ \\
-
-G3\_circuit & $2.095s$ & $0.558s$ & $3.76$ & $73.79\%$ \\
-
-shallow\_water2 & $0.498s$ & $0.068s$ & $7.31$ & $86.51\%$ \\
-
-thermal2 & $1.889s$ & $0.348s$ & $5.43$ & $79.59\%$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
-sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{ch12:tab:10}
-\end{center}
-\end{table}
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|c|}
-\hline
-{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{Gains \%}$ \\ \hline \hline
-
-2cubes\_sphere & $16.430s$ & $2.840s$ & $5.78$ & $60.83\%$ \\
-
-ecology2 & $3.152s$ & $0.367s$ & $8.59$ & $83.13\%$ \\
-
-finan512 & $3.672s$ & $0.723s$ & $5.08$ & $62.62\%$ \\
-
-G3\_circuit & $4.468s$ & $0.971s$ & $4.60$ & $65.46\%$ \\
-
-shallow\_water2 & $2.647s$ & $0.312s$ & $8.48$ & $79.73\%$ \\
-
-thermal2 & $4.190s$ & $0.666s$ & $6.29$ & $74.25\%$ \\ \hline \hline
-
-cage13 & $8.077s$ & $1.584s$ & $5.10$ & $50.91\%$ \\
-
-crashbasis & $35.173s$ & $5.546s$ & $6.34$ & $62.43\%$ \\
-
-FEM\_3D\_thermal2 & $24.825s$ & $3.113s$ & $7.97$ & $59.83\%$ \\
-
-language & $16.706s$ & $2.522s$ & $6.62$ & $73.99\%$ \\
-
-poli\_large & $12.715s$ & $3.989s$ & $3.19$ & $38.95\%$ \\
-
-torso3 & $48.459s$ & $6.234s$ & $7.77$ & $67.86\%$ \\ \hline
-\end{tabular}
-\caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
-sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{ch12:tab:11}
-\end{center}
-\end{table}
-
-We can notice from both Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} that the
-hypergraph partitioning has improved the performances of both parallel CG and GMRES
-algorithms. The execution times on the GPU cluster of both parallel solvers are
-significantly improved compared to those obtained by using the partitioning row-by-row.
-For these examples of sparse matrices, the execution times of CG and GMRES solvers
-are reduced about $76\%$ and $65\%$ respectively (see column~$5$ of each table)
-compared to those obtained in Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09}.
-
-In fact, the hypergraph partitioning\index{Hypergraph} applied to sparse matrices
-having large bandwidths allows to reduce the total communication volume necessary
-to synchronize the computations between the computing nodes in the GPU cluster.
-Table~\ref{ch12:tab:12} presents, for each sparse matrix, the total communication
-volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row
-(column~$2$), the total communication volume obtained by using the hypergraph partitioning
-(column~$3$) and the execution times in minutes of the hypergraph partitioning
-operation performed by $12$ MPI processes (column~$4$). The total communication
-volume defines the total number of the vector elements exchanged by the computing
-nodes. Then, Table~\ref{ch12:tab:12} shows that the hypergraph partitioning method
-can split the sparse matrix so as to minimize the data dependencies between the
-computing nodes and thus to reduce the total communication volume.
-
-\begin{table}
-\begin{center}
-\begin{tabular}{|c|c|c|c|}
-\hline
-\multirow{4}{*}{\bf Matrix} & {\bf Total comms.} & {\bf Total comms.} & {\bf Execution} \\
- & {\bf volume without} & {\bf volume with} & {\bf trime} \\
- & {\bf hypergraph} & {\bf hypergraph } & {\bf of the parti.} \\
- & {\bf parti.} & {\bf parti.} & {\bf in minutes}\\ \hline \hline
-
-2cubes\_sphere & $25,360,543$ & $240,679$ & $68.98$ \\
-
-ecology2 & $26,044,002$ & $73,021$ & $4.92$ \\
-
-finan512 & $26,087,431$ & $900,729$ & $33.72$ \\
-
-G3\_circuit & $31,912,003$ & $5,366,774$ & $11.63$ \\
-
-shallow\_water2 & $25,105,108$ & $60,899$ & $5.06$ \\
-
-thermal2 & $30,012,846$ & $1,077,921$ & $17.88$ \\ \hline \hline
-
-cage13 & $28,254,282$ & $3,845,440$ & $196.45$ \\
-
-crashbasis & $29,020,060$ & $2,401,876$ & $33.39$ \\
-
-FEM\_3D\_thermal2 & $25,263,767$ & $250,105$ & $49.89$ \\
-
-language & $27,291,486$ & $1,537,835$ & $9.07$ \\
-
-poli\_large & $25,053,554$ & $7,388,883$ & $5.92$ \\
-
-torso3 & $25,682,514$ & $613,250$ & $61.51$ \\ \hline
-\end{tabular}
-\caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
-\label{ch12:tab:12}
-\end{center}
-\end{table}
-
-Nevertheless, as we can see from the fourth column of Table~\ref{ch12:tab:12},
-the hypergraph partitioning takes longer compared to the execution times of the
-resolutions. As previously mentioned, the hypergraph partitioning method is less
-efficient in terms of memory consumption and partitioning time than its graph
-counterpart, but the hypergraph well models the nonsymmetric and irregular problems.
-So for the applications which often use the same sparse matrices, we can perform
-the hypergraph partitioning on these matrices only once for each and then, we save
-the traces of these partitionings in files to be reused several times. Therefore,
-this allows to avoid the partitioning of the sparse matrices at each resolution
-of the linear systems.
-
-\begin{figure}[!h]
-\centering
- \mbox{\subfigure[Sparse band matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band}\label{ch12:fig:09.01}}}
-\vfill
- \mbox{\subfigure[Sparse five-bands matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band}\label{ch12:fig:09.02}}}
-\caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
-\label{ch12:fig:09}
-\end{figure}
-
-However, the most important performance parameter is the scalability of the parallel
-CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} solvers on a GPU
-cluster. Particularly, we have taken into account the weak-scaling of both parallel
-algorithms on a cluster of one to 12 GPU computing nodes. We have performed a set of
-experiments on both matrix structures: band matrices and five-bands matrices. The sparse
-matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen
-from the Davis's collection. Figures~\ref{ch12:fig:09.01} and~\ref{ch12:fig:09.02}
-show the execution times of both parallel methods for solving large linear systems
-associated to band matrices and those associated to five-bands matrices, respectively.
-The size of a sparse sub-matrix per computing node, for each matrix structure, is fixed
-as follows:
-\begin{itemize}
-\item band matrix: $15$ million of rows and $105,166,557$ of nonzero values,
-\item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values.
-\end{itemize}
-We can see from these figures that both parallel solvers are quite scalable on a GPU
-cluster. Indeed, the execution times remains almost constant while the size of the
-sparse linear systems to be solved increases proportionally with the number of the
-GPU computing nodes. This means that the communication cost is relatively constant
-regardless of the number the computing nodes in the GPU cluster.
-
-
-
-%%--------------------------%%
-%% SECTION 6 %%
-%%--------------------------%%
\section{Conclusion}
-\label{ch12:sec:06}
+\label{ch12:sec:05}
In this chapter, we have aimed at harnessing the computing power of a
cluster of GPUs for solving large sparse linear systems. For this, we
have used two Krylov subspace iterative methods: the CG and GMRES methods.
results, obtained in the present chapter, showed that a cluster of $12$ GPUs is
about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
linear systems of $25$ million unknown values. This is due to the GPU ability to
-compute the data-parallel operations faster than the CPUs. However, we have shown
-that solving linear systems associated to matrices having large bandwidths uses
-many communications to synchronize the computations of GPUs, which slow down even
-more the resolution. Moreover, there are two kinds of communications: between a
-CPU and its GPU and between CPUs of the computing nodes, such that the first ones
-are the slowest communications on a GPU cluster. So, we have proposed to use the
-hypergraph partitioning instead of the row-by-row partitioning. This allows to
-minimize the data dependencies between the GPU computing nodes and thus to reduce
-the total communication volume. The experimental results showed that using the
-hypergraph partitioning technique improve the execution times on average of $76\%$
-to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs.
-
-In the recent GPU hardware and software architectures, the GPU-Direct system with
-CUDA version 5.0 is used so that two GPUs located on the same node or on distant
-nodes can communicate between them directly without CPUs. This allows to improve
-the data transfers between GPUs.
+compute the data-parallel operations faster than the CPUs.
+
+As future works, we plan to test the parallel algorithms of CG and GMRES methods, adapted
+to GPUs, for solving large linear systems associated to sparse matrices of different structures.
+For example, the matrices having large bandwidths, which can lead to many data dependencies
+between the computing nodes and, thus, degrade the performances of both algorithms. So in
+this case, it would be interesting to study the different data partitioning techniques, in
+order to minimize the dependencies between the computing nodes and thus to reduce the total
+communication volume. This may improve the performances of both algorithms implemented on
+a GPU cluster. Moreover, in the recent GPU hardware and software architectures, the GPU-Direct
+system with CUDA version 5.0 is used so that two GPUs located on the same node or on distant
+nodes can communicate between them directly without CPUs. This allows to improve the data
+transfers between GPUs.
+
+
\putbib[Chapters/chapter12/biblio12]
%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\chapterauthor{}{}
+%\chapterauthor{}{}
+\chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
+
\chapter{Solving sparse linear systems with GMRES and CG methods on GPU clusters}
+\label{ch12}
%%--------------------------%%
%% SECTION 1 %%
%%--------------------------%%
\section{Introduction}
-\label{sec:01}
-The sparse linear systems are used to model many scientific and industrial problems, such as the environmental simulations or
-the industrial processing of the complex or non-Newtonian fluids. Moreover, the resolution of these problems often involves the
-solving of such linear systems which is considered as the most expensive process in terms of time execution and memory space.
-Therefore, solving sparse linear systems must be as efficient as possible in order to deal with problems of ever increasing size.
-
-There are, in the jargon of numerical analysis, different methods of solving sparse linear systems that we can classify in two
-classes: the direct and iterative methods. However, the iterative methods are often more suitable than their counterpart, direct
-methods, for solving large sparse linear systems. Indeed, they are less memory consuming and easier to parallelize on parallel
-computers than direct methods. Different computing platforms, sequential and parallel computers, are used for solving sparse
-linear systems with iterative solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving these
-linear systems, due to their computing power and their ability to compute faster than traditional CPUs.
-
-In Section~\ref{sec:02}, we describe the general principle of two well-known iterative methods: the conjugate gradient method and
-the generalized minimal residual method. In Section~\ref{sec:03}, we give the main key points of the parallel implementation of both
-methods on a cluster of GPUs. Then, in Section~\ref{sec:04}, we present the experimental results obtained on a CPU cluster and on
-a GPU cluster, for solving sparse linear systems associated to matrices of different structures. Finally, in Section~\ref{sec:05},
-we apply the hypergraph partitioning technique to reduce the total communication volume between the computing nodes and, thus, to
-improve the execution times of the parallel algorithms of both iterative methods.
+\label{ch12:sec:01}
+The sparse linear systems are used to model many scientific and industrial problems,
+such as the environmental simulations or the industrial processing of the complex or
+non-Newtonian fluids. Moreover, the resolution of these problems often involves the
+solving of such linear systems which is considered as the most expensive process in
+terms of execution time and memory space. Therefore, solving sparse linear systems
+must be as efficient as possible in order to deal with problems of ever increasing
+size.
+
+There are, in the jargon of numerical analysis, different methods of solving sparse
+linear systems that can be classified in two classes: the direct and iterative methods.
+However, the iterative methods are often more suitable than their counterpart, direct
+methods, for solving these systems. Indeed, they are less memory consuming and easier
+to parallelize on parallel computers than direct methods. Different computing platforms,
+sequential and parallel computers, are used for solving sparse linear systems with iterative
+solutions. Nowadays, graphics processing units (GPUs) have become attractive for solving
+these systems, due to their computing power and their ability to compute faster than
+traditional CPUs.
+
+In Section~\ref{ch12:sec:02}, we describe the general principle of two well-known iterative
+methods: the conjugate gradient method and the generalized minimal residual method. In Section~\ref{ch12:sec:03},
+we give the main key points of the parallel implementation of both methods on a cluster of
+GPUs. Then, in Section~\ref{ch12:sec:04}, we present the experimental results obtained on a
+CPU cluster and on a GPU cluster, for solving sparse linear systems associated to matrices
+of different structures. Finally, in Section~\ref{ch12:sec:05}, we apply the hypergraph partitioning
+technique to reduce the total communication volume between the computing nodes and, thus,
+to improve the execution times of the parallel algorithms of both iterative methods.
%%--------------------------%%
%% SECTION 2 %%
%%--------------------------%%
\section{Krylov iterative methods}
-\label{sec:02}
-Let us consider the following system of $n$ linear equations in $\mathbb{R}$:
+\label{ch12:sec:02}
+Let us consider the following system of $n$ linear equations\index{Sparse~linear~system}
+in $\mathbb{R}$:
\begin{equation}
Ax=b,
-\label{eq:01}
+\label{ch12:eq:01}
\end{equation}
-where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$ is the solution vector,
-$b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a large integer number.
-
-The iterative methods for solving the large sparse linear system~(\ref{eq:01}) proceed by successive iterations of a same
-block of elementary operations, during which an infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed.
-Indeed, from an initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate solution $x_k$
-which, gradually, converges to the exact solution $x^{*}$ as follows:
+where $A\in\mathbb{R}^{n\times n}$ is a sparse nonsingular square matrix, $x\in\mathbb{R}^{n}$
+is the solution vector, $b\in\mathbb{R}^{n}$ is the right-hand side and $n\in\mathbb{N}$ is a
+large integer number.
+
+The iterative methods\index{Iterative~method} for solving the large sparse linear system~(\ref{ch12:eq:01})
+proceed by successive iterations of a same block of elementary operations, during which an
+infinite number of approximate solutions $\{x_k\}_{k\geq 0}$ are computed. Indeed, from an
+initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate
+solution $x_k$ which, gradually, converges to the exact solution $x^{*}$ as follows:
\begin{equation}
x^{*}=\lim\limits_{k\to\infty}x_{k}=A^{-1}b.
-\label{eq:02}
+\label{ch12:eq:02}
\end{equation}
-The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand and can be infinite. In
-practice, an iterative method often finds an approximate solution $\tilde{x}$ after a fixed number of iterations and/or
-when a given convergence criterion is satisfied as follows:
+The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand
+and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$
+after a fixed number of iterations and/or when a given convergence criterion\index{Convergence}
+is satisfied as follows:
\begin{equation}
\|b-A\tilde{x}\| < \varepsilon,
-\label{eq:03}
+\label{ch12:eq:03}
\end{equation}
-where $\varepsilon<1$ is the required convergence tolerance threshold.
-
-Some of the most iterative methods that have proven their efficiency for solving large sparse linear systems are those
-called \textit{Krylov sub-space methods}~\cite{ref1}. In the present chapter, we describe two Krylov methods which are
-widely used: the conjugate gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
-Krylov sub-space methods are usually used with preconditioners that allow to improve their convergence. So, in what
-follows, the CG and GMRES methods are used for solving the left-preconditioned sparse linear system:
+where $\varepsilon<1$ is the required convergence tolerance threshold\index{Convergence!Tolerance~threshold}.
+
+Some of the most iterative methods that have proven their efficiency for solving large sparse
+linear systems are those called \textit{Krylov subspace methods}~\cite{ch12:ref1}\index{Iterative~method!Krylov~subspace}.
+In the present chapter, we describe two Krylov methods which are widely used: the conjugate
+gradient method (CG) and the generalized minimal residual method (GMRES). In practice, the
+Krylov subspace methods are usually used with preconditioners that allow to improve their
+convergence. So, in what follows, the CG and GMRES methods are used for solving the left-preconditioned\index{Sparse~linear~system!Preconditioned}
+sparse linear system:
\begin{equation}
M^{-1}Ax=M^{-1}b,
-\label{eq:11}
+\label{ch12:eq:11}
\end{equation}
where $M$ is the preconditioning matrix.
+
%%****************%%
%%****************%%
\subsection{CG method}
-\label{sec:02.01}
-The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ref2}. It is one of the well
-known iterative method for solving large sparse linear systems. In addition, it can be adapted for solving nonlinear
-equations and optimization problems. However, it can only be applied to problems with positive definite symmetric matrices.
-
-The main idea of the CG method is the computation of a sequence of approximate solutions $\{x_k\}_{k\geq 0}$ in a Krylov
-sub-space of order $k$ as follows:
+\label{ch12:sec:02.01}
+The conjugate gradient method is initially developed by Hestenes and Stiefel in 1952~\cite{ch12:ref2}.
+It is one of the well known iterative method for solving large sparse linear systems. In addition, it
+can be adapted for solving nonlinear equations and optimization problems. However, it can only be applied
+to problems with positive definite symmetric matrices.
+
+The main idea of the CG method\index{Iterative~method!CG} is the computation of a sequence of approximate
+solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{Iterative~method!Krylov~subspace} of order $k$ as
+follows:
\begin{equation}
x_k \in x_0 + \mathcal{K}_k(A,r_0),
-\label{eq:04}
+\label{ch12:eq:04}
\end{equation}
-such that the Galerkin condition must be satisfied:
+such that the Galerkin condition\index{Galerkin~condition} must be satisfied:
\begin{equation}
r_k \bot \mathcal{K}_k(A,r_0),
-\label{eq:05}
+\label{ch12:eq:05}
\end{equation}
-where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$ the Krylov
-sub-space of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
+where $x_0$ is the initial guess, $r_k=b-Ax_k$ is the residual of the computed solution $x_k$ and $\mathcal{K}_k$
+the Krylov subspace of order $k$: \[\mathcal{K}_k(A,r_0) \equiv\text{span}\{r_0, Ar_0, A^2r_0,\ldots, A^{k-1}r_0\}.\]
In fact, CG is based on the construction of a sequence $\{p_k\}_{k\in\mathbb{N}}$ of direction vectors in $\mathcal{K}_k$
which are pairwise $A$-conjugate ($A$-orthogonal):
\begin{equation}
\begin{array}{ll}
p_i^T A p_j = 0, & i\neq j.
\end{array}
-\label{eq:06}
+\label{ch12:eq:06}
\end{equation}
At each iteration $k$, an approximate solution $x_k$ is computed by recurrence as follows:
\begin{equation}
\begin{array}{ll}
x_k = x_{k-1} + \alpha_k p_k, & \alpha_k\in\mathbb{R}.
\end{array}
-\label{eq:07}
+\label{ch12:eq:07}
\end{equation}
Consequently, the residuals $r_k$ are computed in the same way:
\begin{equation}
r_k = r_{k-1} - \alpha_k A p_k.
-\label{eq:08}
+\label{ch12:eq:08}
\end{equation}
-In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that the following recurrence
-holds:
+In the case where all residuals are nonzero, the direction vectors $p_k$ can be determined so that
+the following recurrence holds:
\begin{equation}
\begin{array}{lll}
p_0=r_0, & p_k=r_k+\beta_k p_{k-1}, & \beta_k\in\mathbb{R}.
\end{array}
-\label{eq:09}
+\label{ch12:eq:09}
\end{equation}
-Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$ over the Krylov
-sub-space $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure that the direction vectors are
-pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and the recurrences~(\ref{eq:08}) and~(\ref{eq:09})
-allow to deduce that:
+Moreover, the scalars $\{\alpha_k\}_{k>0}$ are chosen so as to minimize the $A$-norm error $\|x^{*}-x_k\|_A$
+over the Krylov subspace $\mathcal{K}_{k}$ and the scalars $\{\beta_k\}_{k>0}$ are chosen so as to ensure
+that the direction vectors are pairwise $A$-conjugate. So, the assumption that matrix $A$ is symmetric and
+the recurrences~(\ref{ch12:eq:08}) and~(\ref{ch12:eq:09}) allow to deduce that:
\begin{equation}
\begin{array}{ll}
\alpha_{k}=\frac{r^{T}_{k-1}r_{k-1}}{p_{k}^{T}Ap_{k}}, & \beta_{k}=\frac{r_{k}^{T}r_{k}}{r_{k-1}^{T}r_{k-1}}.
\end{array}
-\label{eq:10}
+\label{ch12:eq:10}
\end{equation}
\begin{algorithm}[!t]
- \SetLine
- \linesnumbered
Choose an initial guess $x_0$\;
$r_{0} = b - A x_{0}$\;
$convergence$ = false\;
}
}
\caption{Left-preconditioned CG method}
-\label{alg:01}
+\label{ch12:alg:01}
\end{algorithm}
-Algorithm~\ref{alg:01} shows the main key points of the preconditioned CG method. It allows to solve the left-preconditioned
-sparse linear system~(\ref{eq:11}). In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
-number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$. At every iteration, a direction
-vector $p_k$ is determined, so that it is orthogonal to the preconditioned residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$
-previously determined (from line~$8$ to line~$13$). Then, at lines~$16$ and~$17$ , the iterate $x_k$ and the residual $r_k$ are computed
-using formulas~(\ref{eq:07}) and~(\ref{eq:08}), respectively. The CG method converges after, at most, $n$ iterations. In practice, the CG
-algorithm stops when the tolerance threshold $\varepsilon$ and/or the maximum number of iterations $maxiter$ are reached.
+Algorithm~\ref{ch12:alg:01} shows the main key points of the preconditioned CG method. It allows
+to solve the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear system~(\ref{ch12:eq:11}).
+In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
+number of iterations and $(\cdot,\cdot)$ defines the dot product between two vectors in $\mathbb{R}^{n}$.
+At every iteration, a direction vector $p_k$ is determined, so that it is orthogonal to the preconditioned
+residual $z_k$ and to the direction vectors $\{p_i\}_{i<k}$ previously determined (from line~$8$ to
+line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using
+formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. The CG method converges after, at
+most, $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{Convergence!Tolerance~threshold}
+$\varepsilon$ and/or the maximum number of iterations\index{Convergence!Maximum~number~of~iterations}
+$maxiter$ are reached.
+
%%****************%%
%%****************%%
\subsection{GMRES method}
-\label{sec:02.02}
-The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ref3} as a generalization of the minimum residual method
-MINRES~\cite{ref4}. Indeed, GMRES can be applied for solving symmetric or nonsymmetric linear systems.
-
-The main principle of the GMRES method is to find an approximation minimizing at best the residual norm. In fact, GMRES
-computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in a Krylov sub-space $\mathcal{K}_k$ as follows:
+\label{ch12:sec:02.02}
+The iterative GMRES method is developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
+of the minimum residual method MINRES~\cite{ch12:ref4}\index{Iterative~method!MINRES}. Indeed, GMRES can
+be applied for solving symmetric or nonsymmetric linear systems.
+
+The main principle of the GMRES method\index{Iterative~method!GMRES} is to find an approximation minimizing
+at best the residual norm. In fact, GMRES computes a sequence of approximate solutions $\{x_k\}_{k>0}$ in
+a Krylov subspace\index{Iterative~method!Krylov~subspace} $\mathcal{K}_k$ as follows:
\begin{equation}
\begin{array}{ll}
x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2},
\end{array}
-\label{eq:12}
+\label{ch12:eq:12}
\end{equation}
-so that the Petrov-Galerkin condition is satisfied:
+so that the Petrov-Galerkin condition\index{Petrov-Galerkin~condition} is satisfied:
\begin{equation}
\begin{array}{ll}
r_k \bot A \mathcal{K}_k(A, v_1).
\end{array}
-\label{eq:13}
+\label{ch12:eq:13}
\end{equation}
-GMRES uses the Arnoldi process~\cite{ref5} to construct an orthonormal basis $V_k$ for the Krylov sub-space $\mathcal{K}_k$
-and an upper Hessenberg matrix $\bar{H}_k$ of order $(k+1)\times k$:
+GMRES uses the Arnoldi process~\cite{ch12:ref5}\index{Iterative~method!Arnoldi~process} to construct an
+orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg~matrix}
+$\bar{H}_k$ of order $(k+1)\times k$:
\begin{equation}
\begin{array}{ll}
V_k = \{v_1, v_2,\ldots,v_k\}, & \forall k>1, v_k=A^{k-1}v_1,
\end{array}
-\label{eq:14}
+\label{ch12:eq:14}
\end{equation}
and
\begin{equation}
V_k A = V_{k+1} \bar{H}_k.
-\label{eq:15}
+\label{ch12:eq:15}
\end{equation}
-Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov sub-space $\mathcal{K}_k$ spanned by $V_k$
-as follows:
+Then, at each iteration $k$, an approximate solution $x_k$ is computed in the Krylov subspace $\mathcal{K}_k$
+spanned by $V_k$ as follows:
\begin{equation}
\begin{array}{ll}
x_k = x_0 + V_k y, & y\in\mathbb{R}^{k}.
\end{array}
-\label{eq:16}
+\label{ch12:eq:16}
\end{equation}
-From both formulas~(\ref{eq:15}) and~(\ref{eq:16}) and $r_k=b-Ax_k$, we can deduce that:
+From both formulas~(\ref{ch12:eq:15}) and~(\ref{ch12:eq:16}) and $r_k=b-Ax_k$, we can deduce that:
\begin{equation}
\begin{array}{lll}
r_{k} & = & b - A (x_{0} + V_{k}y) \\
& = & \beta v_{1} - V_{k+1}\bar{H}_{k}y \\
& = & V_{k+1}(\beta e_{1} - \bar{H}_{k}y),
\end{array}
-\label{eq:17}
+\label{ch12:eq:17}
\end{equation}
-such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of $\mathbb{R}^k$. So,
-the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean norm of the residual $r_k$. Consequently,
-a linear least-squares problem of size $k$ is solved:
+such that $\beta=\|r_0\|_2$ and $e_1=(1,0,\cdots,0)$ is the first vector of the canonical basis of
+$\mathbb{R}^k$. So, the vector $y$ is chosen in $\mathbb{R}^k$ so as to minimize at best the Euclidean
+norm of the residual $r_k$. Consequently, a linear least-squares problem of size $k$ is solved:
\begin{equation}
\underset{y\in\mathbb{R}^{k}}{min}\|r_{k}\|_{2}=\underset{y\in\mathbb{R}^{k}}{min}\|\beta e_{1}-\bar{H}_{k}y\|_{2}.
-\label{eq:18}
+\label{ch12:eq:18}
\end{equation}
-The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using Givens rotations~\cite{ref1,ref3},
-such that:
+The QR factorization of matrix $\bar{H}_k$ is used to compute the solution of this problem by using
+Givens rotations~\cite{ch12:ref1,ch12:ref3}, such that:
\begin{equation}
\begin{array}{lll}
\bar{H}_{k}=Q_{k}R_{k}, & Q_{k}\in\mathbb{R}^{(k+1)\times (k+1)}, & R_{k}\in\mathbb{R}^{(k+1)\times k},
\end{array}
-\label{eq:19}
+\label{ch12:eq:19}
\end{equation}
where $Q_kQ_k^T=I_k$ and $R_k$ is an upper triangular matrix.
-The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$ iterations ($n$ is the size of the
-sparse linear system to be solved). However, the GMRES algorithm must construct and store in the memory an orthonormal basis $V_k$ whose
-size is proportional to the number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the GMRES
-method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and with $x_m$ as the initial guess to the
-next iteration. This allows to limit the size of the basis $V$ to $m$ orthogonal vectors.
+The GMRES method computes an approximate solution with a sufficient precision after, at most, $n$
+iterations ($n$ is the size of the sparse linear system to be solved). However, the GMRES algorithm
+must construct and store in the memory an orthonormal basis $V_k$ whose size is proportional to the
+number of iterations required to achieve the convergence. Then, to avoid a huge memory storage, the
+GMRES method must be restarted at each $m$ iterations, such that $m$ is very small ($m\ll n$), and
+with $x_m$ as the initial guess to the next iteration. This allows to limit the size of the basis
+$V$ to $m$ orthogonal vectors.
\begin{algorithm}[!t]
- \SetLine
- \linesnumbered
Choose an initial guess $x_0$\;
$convergence$ = false\;
$k = 1$\;
}
}
\caption{Left-preconditioned GMRES method with restarts}
-\label{alg:02}
+\label{ch12:alg:02}
\end{algorithm}
-Algorithm~\ref{alg:02} shows the main key points of the GMRES method with restarts. It solves the left-preconditioned sparse linear
-system~(\ref{eq:11}), such that $M$ is the preconditioning matrix. At each iteration $k$, GMRES uses the Arnoldi process (defined
-from line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper Hessenberg matrix $\bar{H}_m$ of size
-$(m+1)\times m$. Then, it solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$ which minimizes
-at best the residual norm (line~$18$). Finally, it computes an approximate solution $x_m$ in the Krylov sub-space spanned by $V_m$
-(line~$19$). The GMRES algorithm is stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the maximum
-number of iterations ($maxiter$) is reached.
+Algorithm~\ref{ch12:alg:02} shows the main key points of the GMRES method with restarts.
+It solves the left-preconditioned\index{Sparse~linear~system!Preconditioned} sparse linear
+system~(\ref{ch12:eq:11}), such that $M$ is the preconditioning matrix. At each iteration
+$k$, GMRES uses the Arnoldi process\index{Iterative~method!Arnoldi~process} (defined from
+line~$7$ to line~$17$) to construct a basis $V_m$ of $m$ orthogonal vectors and an upper
+Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ of size $(m+1)\times m$. Then, it
+solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$
+which minimizes at best the residual norm (line~$18$). Finally, it computes an approximate
+solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is
+stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the
+maximum number of iterations\index{Convergence!Maximum~number~of~iterations} ($maxiter$)
+is reached.
+
%%--------------------------%%
%% SECTION 3 %%
%%--------------------------%%
\section{Parallel implementation on a GPU cluster}
-\label{sec:03}
-In this section, we present the parallel algorithms of both iterative CG and GMRES methods for GPU clusters.
-The implementation is performed on a GPU cluster composed of different computing nodes, such that each node
-is a CPU core managed by a MPI process and equipped with a GPU card. The parallelization of these algorithms
-is carried out by using the MPI communication routines between the GPU computing nodes and the CUDA programming
-environment inside each node. In what follows, the algorithms of the iterative methods are called iterative
-solvers.
+\label{ch12:sec:03}
+In this section, we present the parallel algorithms of both iterative CG\index{Iterative~method!CG}
+and GMRES\index{Iterative~method!GMRES} methods for GPU clusters. The implementation is performed on
+a GPU cluster composed of different computing nodes, such that each node is a CPU core managed by a
+MPI process and equipped with a GPU card. The parallelization of these algorithms is carried out by
+using the MPI communication routines between the GPU computing nodes\index{Computing~node} and the
+CUDA programming environment inside each node. In what follows, the algorithms of the iterative methods
+are called iterative solvers.
+
%%****************%%
%%****************%%
\subsection{Data partitioning}
-\label{sec:03.01}
-The parallel solving of the large sparse linear system~(\ref{eq:11}) requires a data partitioning between the computing
-nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the GPU cluster. The partitioning operation
-consists in the decomposition of the vectors and matrices, involved in the iterative solver, in $p$ portions. Indeed, this
-operation allows to assign to each computing node $i$:
-\begin{itemize*}
+\label{ch12:sec:03.01}
+The parallel solving of the large sparse linear system~(\ref{ch12:eq:11}) requires a data partitioning
+between the computing nodes of the GPU cluster. Let $p$ denotes the number of the computing nodes on the
+GPU cluster. The partitioning operation consists in the decomposition of the vectors and matrices, involved
+in the iterative solver, in $p$ portions. Indeed, this operation allows to assign to each computing node
+$i$:
+\begin{itemize}
\item a portion of size $\frac{n}{p}$ elements of each vector,
\item a sparse rectangular sub-matrix $A_i$ of size $(\frac{n}{p},n)$ and,
\item a square preconditioning sub-matrix $M_i$ of size $(\frac{n}{p},\frac{n}{p})$,
-\end{itemize*}
-where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive row-wise partitioning
-(decomposition row-by-row) on the data of the sparse linear systems to be solved. Figure~\ref{fig:01} shows an example of a row-wise
-data partitioning between four computing nodes of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand
-side $b$) of size $16$ unknown values.
+\end{itemize}
+where $n$ is the size of the sparse linear system to be solved. In the first instance, we perform a naive
+row-wise partitioning (decomposition row-by-row) on the data of the sparse linear systems to be solved.
+Figure~\ref{ch12:fig:01} shows an example of a row-wise data partitioning between four computing nodes
+of a sparse linear system (sparse matrix $A$, solution vector $x$ and right-hand side $b$) of size $16$
+unknown values.
\begin{figure}
\centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/partition}}
\caption{A data partitioning of the sparse matrix $A$, the solution vector $x$ and the right-hand side $b$ into four portions.}
-\label{fig:01}
+\label{ch12:fig:01}
\end{figure}
+
%%****************%%
%%****************%%
\subsection{GPU computing}
-\label{sec:03.02}
-After the partitioning operation, all the data involved from this operation must be transferred from the CPU memories to the GPU
-memories, in order to be processed by GPUs. We use two functions of the CUBLAS library (CUDA Basic Linear Algebra Subroutines),
-developed by Nvidia~\cite{ref6}: \verb+cublasAlloc()+ for the memory allocations on GPUs and \verb+cublasSetVector()+ for the
-memory copies from the CPUs to the GPUs.
-
-An efficient implementation of CG and GMRES solvers on a GPU cluster requires to determine all parts of their codes that can be
-executed in parallel and, thus, take advantage of the GPU acceleration. As many Krylov sub-space methods, the CG and GMRES methods
-are mainly based on arithmetic operations dealing with vectors or matrices: sparse matrix-vector multiplications, scalar-vector
-multiplications, dot products, Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors and $a$ is a
-scalar) and so on. These vector operations are often easy to parallelize and they are more efficient on parallel computers when
-they work on large vectors. Therefore, all the vector operations used in CG and GMRES solvers must be executed by the GPUs as kernels.
-
-We use the kernels of the CUBLAS library to compute some vector operations of CG and GMRES solvers. The following kernels of CUBLAS
-(dealing with double floating point) are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the Euclidean
-norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of the data-parallel operations, we code their kernels in CUDA.
-In the CG solver, we develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in Algorithm~\ref{alg:01}. In the
-GMRES solver, we program a kernel for the scalar-vector multiplication (lines~$7$ and~$15$ in Algorithm~\ref{alg:02}), a kernel for
-solving the least-squares problem and a kernel for the elements updates of the solution vector $x$.
-
-The least-squares problem in the GMRES method is solved by performing a QR factorization on the Hessenberg matrix $\bar{H}_m$ with
-plane rotations and, then, solving the triangular system by backward substitutions to compute $y$. Consequently, solving the least-squares
-problem on the GPU is not interesting. Indeed, the triangular solves are not easy to parallelize and inefficient on GPUs. However,
-the least-squares problem to solve in the GMRES method with restarts has, generally, a very small size $m$. Therefore, we develop
-an inexpensive kernel which must be executed in sequential by a single CUDA thread.
-
-The most important operation in CG and GMRES methods is the sparse matrix-vector multiplication (SpMV), because it is often an
-expensive operation in terms of execution time and memory space. Moreover, it requires to take care of the storage format of the
-sparse matrix in the memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix can cause a significant
-waste of memory space and execution time. In addition, the sparsity nature of the matrix often leads to irregular memory accesses
-to read the matrix nonzero values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced accesses to
-the global memory, which slows down even more its performances. One of the most efficient compressed storage formats of sparse
-matrices on GPUs is HYB format~\cite{ref7}. It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
-a typical number of nonzero values per row in ELL format and remaining entries of exceptional rows in COO format. It combines
-the efficiency of ELL due to the regularity of its memory accesses and the flexibility of COO which is insensitive to the matrix
-structure. Consequently, we use the HYB kernel~\cite{ref8} developed by Nvidia to implement the SpMV multiplication of CG and
-GMRES methods on GPUs. Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill the elements of
-the iterate vector $x$ in the cached texture memory.
+\label{ch12:sec:03.02}
+After the partitioning operation, all the data involved from this operation must be
+transferred from the CPU memories to the GPU memories, in order to be processed by
+GPUs. We use two functions of the CUBLAS\index{CUBLAS} library (CUDA Basic Linear
+Algebra Subroutines), developed by Nvidia~\cite{ch12:ref6}: \verb+cublasAlloc()+
+for the memory allocations on GPUs and \verb+cublasSetVector()+ for the memory
+copies from the CPUs to the GPUs.
+
+An efficient implementation of CG and GMRES solvers on a GPU cluster requires to
+determine all parts of their codes that can be executed in parallel and, thus, take
+advantage of the GPU acceleration. As many Krylov subspace methods, the CG and GMRES
+methods are mainly based on arithmetic operations dealing with vectors or matrices:
+sparse matrix-vector multiplications, scalar-vector multiplications, dot products,
+Euclidean norms, AXPY operations ($y\leftarrow ax+y$ where $x$ and $y$ are vectors
+and $a$ is a scalar) and so on. These vector operations are often easy to parallelize
+and they are more efficient on parallel computers when they work on large vectors.
+Therefore, all the vector operations used in CG and GMRES solvers must be executed
+by the GPUs as kernels.
+
+We use the kernels of the CUBLAS library to compute some vector operations of CG and
+GMRES solvers. The following kernels of CUBLAS (dealing with double floating point)
+are used: \verb+cublasDdot()+ for the dot products, \verb+cublasDnrm2()+ for the
+Euclidean norms and \verb+cublasDaxpy()+ for the AXPY operations. For the rest of
+the data-parallel operations, we code their kernels in CUDA. In the CG solver, we
+develop a kernel for the XPAY operation ($y\leftarrow x+ay$) used at line~$12$ in
+Algorithm~\ref{ch12:alg:01}. In the GMRES solver, we program a kernel for the scalar-vector
+multiplication (lines~$7$ and~$15$ in Algorithm~\ref{ch12:alg:02}), a kernel for
+solving the least-squares problem and a kernel for the elements updates of the solution
+vector $x$.
+
+The least-squares problem in the GMRES method is solved by performing a QR factorization
+on the Hessenberg matrix\index{Hessenberg~matrix} $\bar{H}_m$ with plane rotations and,
+then, solving the triangular system by backward substitutions to compute $y$. Consequently,
+solving the least-squares problem on the GPU is not interesting. Indeed, the triangular
+solves are not easy to parallelize and inefficient on GPUs. However, the least-squares
+problem to solve in the GMRES method with restarts has, generally, a very small size $m$.
+Therefore, we develop an inexpensive kernel which must be executed in sequential by a
+single CUDA thread.
+
+The most important operation in CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES}
+methods is the sparse matrix-vector multiplication (SpMV)\index{SpMV~multiplication},
+because it is often an expensive operation in terms of execution time and memory space.
+Moreover, it requires to take care of the storage format of the sparse matrix in the
+memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
+can cause a significant waste of memory space and execution time. In addition, the sparsity
+nature of the matrix often leads to irregular memory accesses to read the matrix nonzero
+values. So, the computation of the SpMV multiplication on GPUs can involve non coalesced
+accesses to the global memory, which slows down even more its performances. One of the
+most efficient compressed storage formats\index{Compressed~storage~format} of sparse
+matrices on GPUs is HYB\index{Compressed~storage~format!HYB} format~\cite{ch12:ref7}.
+It is a combination of ELLpack (ELL) and Coordinate (COO) formats. Indeed, it stores
+a typical number of nonzero values per row in ELL\index{Compressed~storage~format!ELL}
+format and remaining entries of exceptional rows in COO format. It combines the efficiency
+of ELL due to the regularity of its memory accesses and the flexibility of COO\index{Compressed~storage~format!COO}
+which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
+developed by Nvidia to implement the SpMV multiplication of CG and GMRES methods on GPUs.
+Moreover, to avoid the non coalesced accesses to the high-latency global memory, we fill
+the elements of the iterate vector $x$ in the cached texture memory.
+
%%****************%%
%%****************%%
\subsection{Data communications}
-\label{sec:03.03}
-All the computing nodes of the GPU cluster execute in parallel the same iterative solver (Algorithm~\ref{alg:01} or Algorithm~\ref{alg:02})
-adapted to GPUs, but on their own portions of the sparse linear system: $M^{-1}_iA_ix_i=M^{-1}_ib_i$, $0\leq i<p$. However in order to solve
-the complete sparse linear system~(\ref{eq:11}), synchronizations must be performed between the local computations of the computing nodes
-over the cluster. In what follows, two computing nodes sharing data are called neighboring nodes.
-
-As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication. In the parallel implementation of
-the iterative methods, each computing node $i$ performs the SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it
-has only sub-vectors of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires the vector elements
-of its neighbors, corresponding to the column indices on which its sub-matrix has nonzero values (see Figure~\ref{fig:01}). So, in addition
-to the local vectors, each node must also manage vector elements shared with neighbors and required to compute the SpMV multiplication.
-Therefore, the iterate vector $x$ managed by each computing node is composed of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a
-sub-vector of shared elements $x^{shared}$. In the same way, the vector used to construct the orthonormal basis of the Krylov sub-space
-(vectors $p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared sub-vector.
-
-Therefore, before computing the SpMV multiplication, the neighboring nodes over the GPU cluster must exchange between them the shared
-vector elements necessary to compute this multiplication. First, each computing node determines, in its local sub-vector, the vector
-elements needed by other nodes. Then, the neighboring nodes exchange between them these shared vector elements. The data exchanges
-are implemented by using the MPI point-to-point communication routines: blocking sends with \verb+MPI_Send()+ and nonblocking receives
-with \verb+MPI_Irecv()+. Figure~\ref{fig:02} shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0},
-\textit{Node 2} and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing nodes is that presented
-in Figure~\ref{fig:01}.
+\label{ch12:sec:03.03}
+All the computing nodes of the GPU cluster execute in parallel the same iterative solver
+(Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their
+own portions of the sparse linear system\index{Sparse~linear~system}: $M^{-1}_iA_ix_i=M^{-1}_ib_i$,
+$0\leq i<p$. However, in order to solve the complete sparse linear system~(\ref{ch12:eq:11}),
+synchronizations must be performed between the local computations of the computing nodes over
+the cluster. In what follows, two computing nodes sharing data are called neighboring nodes\index{Neighboring~node}.
+
+As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication.
+In the parallel implementation of the iterative methods, each computing node $i$ performs the
+SpMV multiplication on its own sparse rectangular sub-matrix $A_i$. Locally, it has only sub-vectors
+of size $\frac{n}{p}$ corresponding to rows of its sub-matrix $A_i$. However, it also requires
+the vector elements of its neighbors, corresponding to the column indices on which its sub-matrix
+has nonzero values (see Figure~\ref{ch12:fig:01}). So, in addition to the local vectors, each
+node must also manage vector elements shared with neighbors and required to compute the SpMV
+multiplication. Therefore, the iterate vector $x$ managed by each computing node is composed
+of a local sub-vector $x^{local}$ of size $\frac{n}{p}$ and a sub-vector of shared elements $x^{shared}$.
+In the same way, the vector used to construct the orthonormal basis of the Krylov subspace (vectors
+$p$ and $v$ in CG and GMRES methods, respectively) is composed of a local sub-vector and a shared
+sub-vector.
+
+Therefore, before computing the SpMV multiplication\index{SpMV~multiplication}, the neighboring
+nodes\index{Neighboring~node} over the GPU cluster must exchange between them the shared vector
+elements necessary to compute this multiplication. First, each computing node determines, in its
+local sub-vector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
+between them these shared vector elements. The data exchanges are implemented by using the MPI
+point-to-point communication routines: blocking\index{MPI~subroutines!Blocking} sends with \verb+MPI_Send()+
+and nonblocking\index{MPI~subroutines!Nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
+shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2}
+and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing
+nodes is that presented in Figure~\ref{ch12:fig:01}.
\begin{figure}
\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/compress}}
\caption{Data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2} and \textit{Node 3}.}
-\label{fig:02}
+\label{ch12:fig:02}
\end{figure}
-After the synchronization operation, the computing nodes receive, from their respective neighbors, the shared elements in a sub-vector
-stored in a compressed format. However, in order to compute the SpMV multiplication, the computing nodes operate on sparse global vectors
-(see Figure~\ref{fig:02}). In this case, the received vector elements must be copied to the corresponding indices in the global vector.
-So as not to need to perform this at each iteration, we propose to reorder the columns of each sub-matrix $\{A_i\}_{0\leq i<p}$, so that
-the shared sub-vectors could be used in their compressed storage formats. Figure~\ref{fig:03} shows a reordering of a sparse sub-matrix
-(sub-matrix of \textit{Node 1}).
+After the synchronization operation, the computing nodes receive, from their respective neighbors,
+the shared elements in a sub-vector stored in a compressed format. However, in order to compute the
+SpMV multiplication, the computing nodes operate on sparse global vectors (see Figure~\ref{ch12:fig:02}).
+In this case, the received vector elements must be copied to the corresponding indices in the global
+vector. So as not to need to perform this at each iteration, we propose to reorder the columns of
+each sub-matrix $\{A_i\}_{0\leq i<p}$, so that the shared sub-vectors could be used in their compressed
+storage formats. Figure~\ref{ch12:fig:03} shows a reordering of a sparse sub-matrix (sub-matrix of
+\textit{Node 1}).
\begin{figure}
-\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/reorder}}
+\centerline{\includegraphics[scale=0.35]{Chapters/chapter12/figures/reorder}}
\caption{Columns reordering of a sparse sub-matrix.}
-\label{fig:03}
+\label{ch12:fig:03}
\end{figure}
-A GPU cluster is a parallel platform with a distributed memory. So, the synchronizations and communication data between GPU nodes are
-carried out by passing messages. However, GPUs can not communicate between them in direct way. Then, CPUs via MPI processes are in charge
-of the synchronizations within the GPU cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory
-to the CPU memory and vice-versa before and after the synchronization operation between CPUs. We have used the CBLAS communication subroutines
-to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+ and \verb+cublasSetVector()+. Finally, in addition
-to the data exchanges, GPU nodes perform reduction operations to compute in parallel the dot products and Euclidean norms. This is implemented
-by using the MPI global communication \verb+MPI_Allreduce()+.
+A GPU cluster\index{GPU~cluster} is a parallel platform with a distributed memory. So, the synchronizations
+and communication data between GPU nodes are carried out by passing messages. However, GPUs can not communicate
+between them in direct way. Then, CPUs via MPI processes are in charge of the synchronizations within the GPU
+cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory
+and vice-versa before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
+communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
+and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
+to compute in parallel the dot products and Euclidean norms. This is implemented by using the MPI global communication\index{MPI~subroutines!Global}
+\verb+MPI_Allreduce()+.
+
%%--------------------------%%
%% SECTION 4 %%
%%--------------------------%%
\section{Experimental results}
-\label{sec:04}
-In this section, we present the performances of the parallel CG and GMRES linear solvers obtained on a cluster of $12$ GPUs. Indeed, this GPU
-cluster of tests is composed of six machines connected by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at
-$2.4$GHz and providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are connected to each machine via
-a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a
-global memory of $4$GB with a memory bandwidth of $102$GB/s. Figure~\ref{fig:04} shows the general scheme of the GPU cluster that we used in
-the experimental tests.
+\label{ch12:sec:04}
+In this section, we present the performances of the parallel CG and GMRES linear solvers obtained
+on a cluster of $12$ GPUs. Indeed, this GPU cluster of tests is composed of six machines connected
+by $20$Gbps InfiniBand network. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz and
+providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are
+connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A
+Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and providing a global memory of $4$GB with
+a memory bandwidth of $102$GB/s. Figure~\ref{ch12:fig:04} shows the general scheme of the GPU cluster\index{GPU~cluster}
+that we used in the experimental tests.
\begin{figure}
\centerline{\includegraphics[scale=0.25]{Chapters/chapter12/figures/cluster}}
\caption{General scheme of the GPU cluster of tests composed of six machines, each with two GPUs.}
-\label{fig:04}
+\label{ch12:fig:04}
\end{figure}
-Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of both methods on the
-GPU cluster. CUDA version 4.0~\cite{ref9} is used for programming GPUs, using CUBLAS library~\cite{ref6} to deal with vector operations in GPUs
-and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between CPU cores. Indeed, the experiments are done on a
-cluster of $12$ computing nodes, where each node is managed by a MPI process and it is composed of one CPU core and one GPU card.
-
-All tests are made on double-precision floating point operations. The parameters of both linear solvers are initialized as follows: the residual
-tolerance threshold $\varepsilon=10^{-12}$, the maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
-initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process used in the GMRES method to $16$ iterations ($m=16$). For
-the sake of simplicity, we have chosen the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily compute
-the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for not too ill-conditioned matrices. In the GPU computing,
-the size of thread blocks is fixed to $512$ threads. Finally, the performance results, presented hereafter, are obtained from the mean value over
-$10$ executions of the same parallel linear solver and for the same input data.
-
-To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's collection~\cite{ref10}, that arise in a wide
-spectrum of real-world applications. We chose six symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{fig:05},
-we show structures of these matrices and in Table~\ref{tab:01} we present their main characteristics which are the number of rows, the total number
-of nonzero values (nnz) and the maximal bandwidth. In the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns
-separating the first and the last nonzero value on a matrix row.
+Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding
+the parallel algorithms of both methods on the GPU cluster. CUDA version 4.0~\cite{ch12:ref9}
+is used for programming GPUs, using CUBLAS library~\cite{ch12:ref6} to deal with vector operations
+in GPUs and, finally, MPI routines of OpenMPI 1.3.3 are used to carry out the communications between
+CPU cores. Indeed, the experiments are done on a cluster of $12$ computing nodes, where each node
+is managed by a MPI process and it is composed of one CPU core and one GPU card.
+
+All tests are made on double-precision floating point operations. The parameters of both linear
+solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
+maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$ and the
+initial guess $x_0$ is filled with $0.0$. In addition, we limited the Arnoldi process\index{Iterative~method!Arnoldi~process}
+used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen
+the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows to easily
+compute the required inverse matrix $M^{-1}$ and it provides a relatively good preconditioning for
+not too ill-conditioned matrices. In the GPU computing, the size of thread blocks is fixed to $512$
+threads. Finally, the performance results, presented hereafter, are obtained from the mean value
+over $10$ executions of the same parallel linear solver and for the same input data.
+
+To get more realistic results, we tested the CG and GMRES algorithms on sparse matrices of the Davis's
+collection~\cite{ch12:ref10}, that arise in a wide spectrum of real-world applications. We chose six
+symmetric sparse matrices and six nonsymmetric ones from this collection. In Figure~\ref{ch12:fig:05},
+we show structures of these matrices and in Table~\ref{ch12:tab:01} we present their main characteristics
+which are the number of rows, the total number of nonzero values (nnz) and the maximal bandwidth. In
+the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
+the first and the last nonzero value on a matrix row.
\begin{figure}
\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/matrices}}
\caption{Sketches of sparse matrices chosen from the Davis's collection.}
-\label{fig:05}
+\label{ch12:fig:05}
\end{figure}
\begin{table}
\end{tabular}
\vspace{0.5cm}
\caption{Main characteristics of sparse matrices chosen from the Davis's collection.}
-\label{tab:01}
+\label{ch12:tab:01}
\end{table}
\begin{table}
thermal2 & $1.172s$ & $0.622s$ & $1.88$ & $15$ & $5.11e$-$09$ & $3.33e$-$16$ \\ \hline
\end{tabular}
\caption{Performances of the parallel CG method on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs.}
-\label{tab:02}
+\label{ch12:tab:02}
\end{center}
\end{table}
torso3 & $4.242s$ & $2.030s$ & $2.09$ & $175$ & $2.69e$-$10$ & $1.78e$-$14$ \\ \hline
\end{tabular}
\caption{Performances of the parallel GMRES method on a cluster 24 CPU cores vs. on cluster of 12 GPUs.}
-\label{tab:03}
+\label{ch12:tab:03}
\end{center}
\end{table}
-Tables~\ref{tab:02} and~\ref{tab:03} shows the performances of the parallel CG and GMRES solvers, respectively, for solving linear systems associated to
-the sparse matrices presented in Tables~\ref{tab:01}. They allow to compare the performances obtained on a cluster of $24$ CPU cores and on a cluster
-of $12$ GPUs. However, Table~\ref{tab:02} shows only the performances of solving symmetric sparse linear systems, due to the inability of the CG method
-to solve the nonsymmetric systems. In both tables, the second and third columns give, respectively, the execution times in seconds obtained on $24$ CPU
-cores ($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account the relative gains $\tau$ of a solver implemented on the
-GPU cluster compared to the same solver implemented on the CPU cluster. The relative gains, presented in the fourth column, are computed as a ratio of
-the CPU execution time over the GPU execution time:
+Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} shows the performances of the parallel
+CG and GMRES solvers, respectively, for solving linear systems associated to the sparse
+matrices presented in Tables~\ref{ch12:tab:01}. They allow to compare the performances
+obtained on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. However, Table~\ref{ch12:tab:02}
+shows only the performances of solving symmetric sparse linear systems, due to the inability
+of the CG method to solve the nonsymmetric systems. In both tables, the second and third
+columns give, respectively, the execution times in seconds obtained on $24$ CPU cores
+($Time_{gpu}$) and that obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we take into account
+the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same
+solver implemented on the CPU cluster. The relative gains\index{Relative~gain}, presented
+in the fourth column, are computed as a ratio of the CPU execution time over the GPU
+execution time:
\begin{equation}
\tau = \frac{Time_{cpu}}{Time_{gpu}}.
-\label{eq:20}
+\label{ch12:eq:20}
\end{equation}
-In addition, Tables~\ref{tab:02} and~\ref{tab:03} give the number of iterations ($iter$), the precision $prec$ of the solution computed on the GPU cluster
-and the difference $\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster. Both parameters $prec$ and $\Delta$
-allow to validate and verify the accuracy of the solution computed on the GPU cluster. We have computed them as follows:
+In addition, Tables~\ref{ch12:tab:02} and~\ref{ch12:tab:03} give the number of iterations
+($iter$), the precision $prec$ of the solution computed on the GPU cluster and the difference
+$\Delta$ between the solution computed on the CPU cluster and that computed on the GPU cluster.
+Both parameters $prec$ and $\Delta$ allow to validate and verify the accuracy of the solution
+computed on the GPU cluster. We have computed them as follows:
\begin{eqnarray}
\Delta = max|x^{cpu}-x^{gpu}|,\\
prec = max|M^{-1}r^{gpu}|,
\end{eqnarray}
-where $\Delta$ is the maximum vector element, in absolute value, of the difference between the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively,
-on CPU and GPU clusters and $prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$ of the solution $x^{gpu}$.
-Thus, we can see that the solutions obtained on the GPU cluster were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less,
-equivalent to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$. However, we can notice from the relative
-gains $\tau$ that is not interesting to use multiple GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to maximize
-utilization of GPU cores. In addition, the communications required to synchronize the computations over the cluster increase the idle times of GPUs and
-slow down further the parallel computations.
-
-Consequently, in order to test the performances of the parallel solvers, we developed in C programming language a generator of large sparse matrices.
-This generator takes a matrix from the Davis's collection~\cite{ref10} as an initial matrix to construct large sparse matrices exceeding ten million
-of rows. It must be executed in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse sub-matrix. In
-first experimental tests, we are focused on sparse matrices having a banded structure, because they are those arise in the most of numerical problems.
-So to generate the global sparse matrix, each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
-from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix (see Figure~\ref{fig:06}). Moreover, the empty
-spaces between two successive copies in the main diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{fig:06}) of the same
+where $\Delta$ is the maximum vector element, in absolute value, of the difference between
+the two solutions $x^{cpu}$ and $x^{gpu}$ computed, respectively, on CPU and GPU clusters and
+$prec$ is the maximum element, in absolute value, of the residual vector $r^{gpu}\in\mathbb{R}^{n}$
+of the solution $x^{gpu}$. Thus, we can see that the solutions obtained on the GPU cluster
+were computed with a sufficient accuracy (about $10^{-10}$) and they are, more or less, equivalent
+to those computed on the CPU cluster with a small difference ranging from $10^{-10}$ to $10^{-26}$.
+However, we can notice from the relative gains $\tau$ that is not interesting to use multiple
+GPUs for solving small sparse linear systems. in fact, a small sparse matrix does not allow to
+maximize utilization of GPU cores. In addition, the communications required to synchronize the
+computations over the cluster increase the idle times of GPUs and slow down further the parallel
+computations.
+
+Consequently, in order to test the performances of the parallel solvers, we developed in C programming
+language a generator of large sparse matrices. This generator takes a matrix from the Davis's collection~\cite{ch12:ref10}
+as an initial matrix to construct large sparse matrices exceeding ten million of rows. It must be executed
+in parallel by the MPI processes of the computing nodes, so that each process could construct its sparse
+sub-matrix. In first experimental tests, we are focused on sparse matrices having a banded structure,
+because they are those arise in the most of numerical problems. So to generate the global sparse matrix,
+each MPI process constructs its sub-matrix by performing several copies of an initial sparse matrix chosen
+from the Davis's collection. Then, it puts all these copies on the main diagonal of the global matrix
+(see Figure~\ref{ch12:fig:06}). Moreover, the empty spaces between two successive copies in the main
+diagonal are filled with sub-copies (left-copy and right-copy in Figure~\ref{ch12:fig:06}) of the same
initial matrix.
\begin{figure}
\centerline{\includegraphics[scale=0.30]{Chapters/chapter12/figures/generation}}
\caption{Parallel generation of a large sparse matrix by four computing nodes.}
-\label{fig:06}
+\label{ch12:fig:06}
\end{figure}
\begin{table}[!h]
\end{tabular}
\vspace{0.5cm}
\caption{Main characteristics of sparse banded matrices generated from those of the Davis's collection.}
-\label{tab:04}
+\label{ch12:tab:04}
\end{table}
-We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$ million unknown values. The sparse matrices associated
-to these linear systems are generated from those presented in Table~\ref{tab:01}. Their main characteristics are given in Table~\ref{tab:04}. Tables~\ref{tab:05}
-and~\ref{tab:06} shows the performances of the parallel CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster
-of $12$ GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on a GPU cluster is more efficient than on a CPU
-cluster (see relative gains $\tau$). We can also notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster, are
-better than those the GMRES method for solving large symmetric linear systems. In fact, the CG method is characterized by a better convergence rate
-and a shorter execution time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES method requires more data
-exchanges between computing nodes compared to the parallel CG method.
+We have used the parallel CG and GMRES algorithms for solving sparse linear systems of $25$
+million unknown values. The sparse matrices associated to these linear systems are generated
+from those presented in Table~\ref{ch12:tab:01}. Their main characteristics are given in Table~\ref{ch12:tab:04}.
+Tables~\ref{ch12:tab:05} and~\ref{ch12:tab:06} shows the performances of the parallel CG and
+GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a cluster of $12$
+GPUs. Obviously, we can notice from these tables that solving large sparse linear systems on
+a GPU cluster is more efficient than on a CPU cluster (see relative gains $\tau$). We can also
+notice that the execution times of the CG method, whether in a CPU cluster or on a GPU cluster,
+are better than those the GMRES method for solving large symmetric linear systems. In fact, the
+CG method is characterized by a better convergence\index{Convergence} rate and a shorter execution
+time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
+method requires more data exchanges between computing nodes compared to the parallel CG method.
\begin{table}[!h]
\begin{center}
\end{tabular}
\caption{Performances of the parallel CG method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
on a cluster of 12 GPUs.}
-\label{tab:05}
+\label{ch12:tab:05}
\end{center}
\end{table}
\end{tabular}
\caption{Performances of the parallel GMRES method for solving linear systems associated to sparse banded matrices on a cluster of 24 CPU cores vs.
on a cluster of 12 GPUs.}
-\label{tab:06}
+\label{ch12:tab:06}
\end{center}
\end{table}
%% SECTION 5 %%
%%--------------------------%%
\section{Hypergraph partitioning}
-\label{sec:05}
-In this section, we present the performances of both parallel CG and GMRES solvers for solving linear systems associated to sparse matrices having
-large bandwidths. Indeed, we are interested on sparse matrices having the nonzero values distributed along their bandwidths.
+\label{ch12:sec:05}
+In this section, we present the performances of both parallel CG and GMRES solvers for solving linear
+systems associated to sparse matrices having large bandwidths. Indeed, we are interested on sparse
+matrices having the nonzero values distributed along their bandwidths.
\begin{figure}
\centerline{\includegraphics[scale=0.22]{Chapters/chapter12/figures/generation_1}}
\caption{Parallel generation of a large sparse five-bands matrix by four computing nodes.}
-\label{fig:07}
+\label{ch12:fig:07}
\end{figure}
\begin{table}[!h]
& torso3 & $872,029,998$ & $25,000,000$\\ \hline
\end{tabular}
\caption{Main characteristics of sparse five-bands matrices generated from those of the Davis's collection.}
-\label{tab:07}
+\label{ch12:tab:07}
\end{center}
\end{table}
-We have developed in C programming language a generator of large sparse matrices having five bands distributed along their bandwidths (see Figure~\ref{fig:07}).
-The principle of this generator is equivalent to that in Section~\ref{sec:04}. However, the copies performed on the initial matrix (chosen from the
-Davis's collection) are placed on the main diagonal and on four off-diagonals, two on the right and two on the left of the main diagonal. Figure~\ref{fig:07}
-shows an example of a generation of a sparse five-bands matrix by four computing nodes. Table~\ref{tab:07} shows the main characteristics of sparse
-five-bands matrices generated from those presented in Table~\ref{tab:01} and associated to linear systems of $25$ million unknown values.
+We have developed in C programming language a generator of large sparse matrices
+having five bands distributed along their bandwidths (see Figure~\ref{ch12:fig:07}).
+The principle of this generator is equivalent to that in Section~\ref{ch12:sec:04}.
+However, the copies performed on the initial matrix (chosen from the Davis's collection)
+are placed on the main diagonal and on four off-diagonals, two on the right and two
+on the left of the main diagonal. Figure~\ref{ch12:fig:07} shows an example of a
+generation of a sparse five-bands matrix by four computing nodes. Table~\ref{ch12:tab:07}
+shows the main characteristics of sparse five-bands matrices generated from those
+presented in Table~\ref{ch12:tab:01} and associated to linear systems of $25$ million
+unknown values.
\begin{table}[!h]
\begin{center}
\end{tabular}
\caption{Performances of parallel CG solver for solving linear systems associated to sparse five-bands matrices
on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{tab:08}
+\label{ch12:tab:08}
\end{center}
\end{table}
\end{tabular}
\caption{Performances of parallel GMRES solver for solving linear systems associated to sparse five-bands matrices
on a cluster of 24 CPU cores vs. on a cluster of 12 GPUs}
-\label{tab:09}
+\label{ch12:tab:09}
\end{center}
\end{table}
-Tables~\ref{tab:08} and~\ref{tab:09} shows the performaces of the parallel CG and GMRES solvers, respectively, obtained on
-a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
-sparse five-bands matrices presented on Table~\ref{tab:07}. We can notice from both Tables~\ref{tab:08} and~\ref{tab:09} that
-using a GPU cluster is not efficient for solving these kind of sparse linear systems. We can see that the execution times obtained
-on the GPU cluster are almost equivalent to those obtained on the CPU cluster (see the relative gains presented in column~$4$
-of each table). This is due to the large number of communications necessary to synchronize the computations over the cluster.
-Indeed, the naive partitioning, row-by-row or column-by-column, of sparse matrices having large bandwidths can link a computing
-node to many neighbors and then generate a large number of data dependencies between these computing nodes in the cluster.
-
-Therefore, we have chosen to use a hypergraph partitioning method, which is well-suited to numerous kinds of sparse matrices~\cite{ref11}.
-Indeed, it can well model the communications between the computing nodes, particularly in the case of nonsymmetric and irregular
-matrices, and it gives good reduction of the total communication volume. In contrast, it is an expensive operation in terms of
-execution time and memory space.
-
-The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ as
-follows:
-\begin{itemize*}
+Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} shows the performances of the parallel
+CG and GMRES solvers, respectively, obtained on a cluster of $24$ CPU cores and on a
+cluster of $12$ GPUs. The linear systems solved in these tables are associated to the
+sparse five-bands matrices presented on Table~\ref{ch12:tab:07}. We can notice from
+both Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09} that using a GPU cluster is not
+efficient for solving these kind of sparse linear systems\index{Sparse~linear~system}.
+We can see that the execution times obtained on the GPU cluster are almost equivalent
+to those obtained on the CPU cluster (see the relative gains presented in column~$4$
+of each table). This is due to the large number of communications necessary to synchronize
+the computations over the cluster. Indeed, the naive partitioning, row-by-row or column-by-column,
+of sparse matrices having large bandwidths can link a computing node to many neighbors
+and then generate a large number of data dependencies between these computing nodes in
+the cluster.
+
+Therefore, we have chosen to use a hypergraph partitioning method\index{Hypergraph},
+which is well-suited to numerous kinds of sparse matrices~\cite{ch12:ref11}. Indeed,
+it can well model the communications between the computing nodes, particularly in the
+case of nonsymmetric and irregular matrices, and it gives good reduction of the total
+communication volume. In contrast, it is an expensive operation in terms of execution
+time and memory space.
+
+The sparse matrix $A$ of the linear system to be solved is modeled as a hypergraph
+$\mathcal{H}=(\mathcal{V},\mathcal{E})$\index{Hypergraph} as follows:
+\begin{itemize}
\item each matrix row $\{i\}_{0\leq i<n}$ corresponds to a vertex $v_i\in\mathcal{V}$ and,
\item each matrix column $\{j\}_{0\leq j<n}$ corresponds to a hyperedge $e_j\in\mathcal{E}$, where:
\begin{equation}
\end{equation}
\item $w_i$ is the weight of vertex $v_i$ and,
\item $c_j$ is the cost of hyperedge $e_j$.
-\end{itemize*}
-A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$
-a set of pairwise disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each subset is attributed to a computing node.
-Figure~\ref{fig:08} shows an example of the hypergraph model of a $(9\times 9)$ sparse matrix in three parts. The circles and squares correspond,
-respectively, to the vertices and hyperedges of the hypergraph. The solid squares define the cut hyperedges connecting at least two different parts.
-The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different parts spanned by $e_j$.
+\end{itemize}
+A $K$-way partitioning of a hypergraph $\mathcal{H}=(\mathcal{V},\mathcal{E})$ is
+defined as $\mathcal{P}=\{\mathcal{V}_1,\ldots,\mathcal{V}_K\}$ a set of pairwise
+disjoint non-empty subsets (or parts) of the vertex set $\mathcal{V}$, so that each
+subset is attributed to a computing node. Figure~\ref{ch12:fig:08} shows an example
+of the hypergraph model of a $(9\times 9)$ sparse matrix in three parts. The circles
+and squares correspond, respectively, to the vertices and hyperedges of the hypergraph.
+The solid squares define the cut hyperedges connecting at least two different parts.
+The connectivity $\lambda_j$ of a cut hyperedge $e_j$ denotes the number of different
+parts spanned by $e_j$.
\begin{figure}
\centerline{\includegraphics[scale=0.5]{Chapters/chapter12/figures/hypergraph}}
\caption{An example of the hypergraph partitioning of a sparse matrix decomposed between three computing nodes.}
-\label{fig:08}
+\label{ch12:fig:08}
\end{figure}
-The cut hyperedges model the total communication volume between the different computing nodes in the cluster, necessary to perform the parallel SpMV
-multiplication. Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$, $0\leq i,j<n$, of the SpMV multiplication
-$Ax=b$ that need the $j^{th}$ unknown value of solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix rows sharing
-and requiring the same unknown value $x_j$. For example in Figure~\ref{fig:08}, hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the
-dependency of matrix rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations: $b_2\leftarrow b_2+a_{29}x_9$,
-$b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$. However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
-So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform two communications.
-
-The hypergraph partitioning allows to reduce the total communication volume required to perform the parallel SpMV multiplication, while maintaining the
-load balancing between the computing nodes. In fact, it allows to minimize at best the following amount:
+The cut hyperedges model the total communication volume between the different computing
+nodes in the cluster, necessary to perform the parallel SpMV multiplication\index{SpMV~multiplication}.
+Indeed, each hyperedge $e_j$ defines a set of atomic computations $b_i\leftarrow b_i+a_{ij}x_j$,
+$0\leq i,j<n$, of the SpMV multiplication $Ax=b$ that need the $j^{th}$ unknown value of
+solution vector $x$. Therefore, pins of hyperedge $e_j$, $pins[e_j]$, are the set of matrix
+rows sharing and requiring the same unknown value $x_j$. For example in Figure~\ref{ch12:fig:08},
+hyperedge $e_9$ whose pins are: $pins[e_9]=\{v_2,v_5,v_9\}$ represents the dependency of matrix
+rows $2$, $5$ and $9$ to unknown $x_9$ needed to perform in parallel the atomic operations:
+$b_2\leftarrow b_2+a_{29}x_9$, $b_5\leftarrow b_5+a_{59}x_9$ and $b_9\leftarrow b_9+a_{99}x_9$.
+However, unknown $x_9$ is the third entry of the sub-solution vector $x$ of part (or node) $3$.
+So the computing node $3$ must exchange this value with nodes $1$ and $2$, which leads to perform
+two communications.
+
+The hypergraph partitioning\index{Hypergraph} allows to reduce the total communication volume
+required to perform the parallel SpMV multiplication, while maintaining the load balancing between
+the computing nodes. In fact, it allows to minimize at best the following amount:
\begin{equation}
\mathcal{X}(\mathcal{P})=\sum_{e_{j}\in\mathcal{E}_{C}}c_{j}(\lambda_{j}-1),
\end{equation}
-where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning $\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively,
-the cost and the connectivity of cut hyperedge $e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows:
+where $\mathcal{E}_{C}$ denotes the set of the cut hyperedges coming from the hypergraph partitioning
+$\mathcal{P}$ and $c_j$ and $\lambda_j$ are, respectively, the cost and the connectivity of cut hyperedge
+$e_j$. Moreover, it also ensures the load balancing between the $K$ parts as follows:
\begin{equation}
W_{k}\leq (1+\epsilon)W_{avg}, \hspace{0.2cm} (1\leq k\leq K) \hspace{0.2cm} \text{and} \hspace{0.2cm} (0<\epsilon<1),
\end{equation}
-where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the average weight of all $K$ parts and $\epsilon$ is the
-maximum allowed imbalanced ratio.
-
-The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed, for example: hMETIS~\cite{ref12}, PaToH~\cite{ref13}
-and Zoltan~\cite{ref14}. Since our objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must be performed by
-at least two MPI processes. It allows to accelerate the data partitioning of large sparse matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned
-in $p$ (number of MPI processes) sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the parallel hypergraph
-partitioning method using some functions of the MPI library between the $p$ processes.
-
-Tables~\ref{tab:10} and~\ref{tab:11} shows the performances of the parallel CG and GMRES solvers, respectively, using the hypergraph partitioning for solving
-large linear systems associated to the sparse five-bands matrices presented in Table~\ref{tab:07}. For these experimental tests, we have applied the parallel
-hypergraph partitioning~\cite{ref15} developed in Zoltan tool~\cite{ref14}. We have initialized the parameters of the partitioning operation as follows:
-\begin{itemize*}
+where $W_{k}$ is the sum of all vertex weights ($w_{i}$) in part $\mathcal{V}_{k}$, $W_{avg}$ is the
+average weight of all $K$ parts and $\epsilon$ is the maximum allowed imbalanced ratio.
+
+The hypergraph partitioning is a NP-complete problem but software tools using heuristics are developed,
+for example: hMETIS~\cite{ch12:ref12}, PaToH~\cite{ch12:ref13} and Zoltan~\cite{ch12:ref14}. Since our
+objective is solving large sparse linear systems, we use the parallel hypergraph partitioning which must
+be performed by at least two MPI processes. It allows to accelerate the data partitioning of large sparse
+matrices. For this, the hypergraph $\mathcal{H}$ must be partitioned in $p$ (number of MPI processes)
+sub-hypergraphs $\mathcal{H}_k=(\mathcal{V}_k,\mathcal{E}_k)$, $0\leq k<p$, and then we performed the
+parallel hypergraph partitioning method using some functions of the MPI library between the $p$ processes.
+
+Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} shows the performances of the parallel CG and GMRES solvers,
+respectively, using the hypergraph partitioning for solving large linear systems associated to the sparse
+five-bands matrices presented in Table~\ref{ch12:tab:07}. For these experimental tests, we have applied the
+parallel hypergraph partitioning~\cite{ch12:ref15} developed in Zoltan tool~\cite{ch12:ref14}. We have initialized
+the parameters of the partitioning operation as follows:
+\begin{itemize}
\item the weight $w_{i}$ of each vertex $v_{j}\in\mathcal{V}$ is set to the number of nonzero values on matrix row $i$,
\item for the sake of simplicity, the cost $c_{j}$ of each hyperedge $e_{j}\in\mathcal{E}$ is fixed to $1$,
\item the maximum imbalanced load ratio $\epsilon$ is limited to $10\%$.\\
-\end{itemize*}
+\end{itemize}
-\begin{table}[!h]
+\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\end{tabular}
\caption{Performances of the parallel CG solver using hypergraph partitioning for solving linear systems associated to
sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{tab:10}
+\label{ch12:tab:10}
\end{center}
\end{table}
-\begin{table}[!h]
+\begin{table}
\begin{center}
\begin{tabular}{|c|c|c|c|c|}
\hline
\end{tabular}
\caption{Performances of the parallel GMRES solver using hypergraph partitioning for solving linear systems associated to
sparse five-bands matrices on a cluster of 24 CPU cores vs. on a cluster of 12 GPU.}
-\label{tab:11}
+\label{ch12:tab:11}
\end{center}
\end{table}
-We can notice from both Tables~\ref{tab:10} and~\ref{tab:11} that the hypergraph partitioning has improved the performances of both
-parallel CG and GMRES algorithms. The execution times
-on the GPU cluster of both parallel solvers are significantly improved compared to those obtained by using the partitioning row-by-row.
-For these examples of sparse matrices, the execution times of CG and GMRES solvers are reduced about $76\%$ and $65\%$ respectively
-(see column~$5$ of each table) compared to those obtained in Tables~\ref{tab:08} and~\ref{tab:09}.
-
-In fact, the hypergraph partitioning applied to sparse matrices having large bandwidths allows to reduce the total communication volume
-necessary to synchronize the computations between the computing nodes in the GPU cluster. Table~\ref{tab:12} presents, for each sparse
-matrix, the total communication volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row (column~$2$), the
-total communication volume obtained by using the hypergraph partitioning (column~$3$) and the execution times in minutes of the hypergraph
-partitioning operation performed by $12$ MPI processes (column~$4$). The total communication volume defines the total number of the vector
-elements exchanged by the computing nodes. Then, Table~\ref{tab:12} shows that the hypergraph partitioning method can split the sparse
-matrix so as to minimize the data dependencies between the computing nodes and thus to reduce the total communication volume.
+We can notice from both Tables~\ref{ch12:tab:10} and~\ref{ch12:tab:11} that the
+hypergraph partitioning has improved the performances of both parallel CG and GMRES
+algorithms. The execution times on the GPU cluster of both parallel solvers are
+significantly improved compared to those obtained by using the partitioning row-by-row.
+For these examples of sparse matrices, the execution times of CG and GMRES solvers
+are reduced about $76\%$ and $65\%$ respectively (see column~$5$ of each table)
+compared to those obtained in Tables~\ref{ch12:tab:08} and~\ref{ch12:tab:09}.
+
+In fact, the hypergraph partitioning\index{Hypergraph} applied to sparse matrices
+having large bandwidths allows to reduce the total communication volume necessary
+to synchronize the computations between the computing nodes in the GPU cluster.
+Table~\ref{ch12:tab:12} presents, for each sparse matrix, the total communication
+volume between $12$ GPU computing nodes obtained by using the partitioning row-by-row
+(column~$2$), the total communication volume obtained by using the hypergraph partitioning
+(column~$3$) and the execution times in minutes of the hypergraph partitioning
+operation performed by $12$ MPI processes (column~$4$). The total communication
+volume defines the total number of the vector elements exchanged by the computing
+nodes. Then, Table~\ref{ch12:tab:12} shows that the hypergraph partitioning method
+can split the sparse matrix so as to minimize the data dependencies between the
+computing nodes and thus to reduce the total communication volume.
\begin{table}
\begin{center}
torso3 & $25,682,514$ & $613,250$ & $61.51$ \\ \hline
\end{tabular}
\caption{The total communication volume between 12 GPU computing nodes without and with the hypergraph partitioning method.}
-\label{tab:12}
+\label{ch12:tab:12}
\end{center}
\end{table}
-Nevertheless, as we can see from the fourth column of Table~\ref{tab:12}, the hypergraph partitioning takes longer compared
-to the execution times of the resolutions. As previously mentioned, the hypergraph partitioning method is less efficient in
-terms of memory consumption and partitioning time than its graph counterpart, but the hypergraph well models the nonsymmetric
-and irregular problems. So for the applications which often use the same sparse matrices, we can perform the hypergraph partitioning
-on these matrices only once for each and then, we save the traces of these partitionings in files to be reused several times.
-Therefore, this allows to avoid the partitioning of the sparse matrices at each resolution of the linear systems.
-
-\begin{figure}[!t]
+Nevertheless, as we can see from the fourth column of Table~\ref{ch12:tab:12},
+the hypergraph partitioning takes longer compared to the execution times of the
+resolutions. As previously mentioned, the hypergraph partitioning method is less
+efficient in terms of memory consumption and partitioning time than its graph
+counterpart, but the hypergraph well models the nonsymmetric and irregular problems.
+So for the applications which often use the same sparse matrices, we can perform
+the hypergraph partitioning on these matrices only once for each and then, we save
+the traces of these partitionings in files to be reused several times. Therefore,
+this allows to avoid the partitioning of the sparse matrices at each resolution
+of the linear systems.
+
+\begin{figure}[!h]
\centering
-\begin{tabular}{c}
-\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band} \\
-\small{(a) Sparse band matrices} \\
-\\
-\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band} \\
-\small{(b) Sparse five-bands matrices}
-\end{tabular}
+ \mbox{\subfigure[Sparse band matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_band}\label{ch12:fig:09.01}}}
+\vfill
+ \mbox{\subfigure[Sparse five-bands matrices]{\includegraphics[scale=0.7]{Chapters/chapter12/figures/scale_5band}\label{ch12:fig:09.02}}}
\caption{Weak-scaling of the parallel CG and GMRES solvers on a GPU cluster for solving large sparse linear systems.}
-\label{fig:09}
+\label{ch12:fig:09}
\end{figure}
-However, the most important performance parameter is the scalability of the parallel CG and GMRES solvers on a GPU cluster.
-Particularly, we have taken into account the weak-scaling of both parallel algorithms on a cluster of one to 12 GPU computing
-nodes. We have performed a set of experiments on both matrix structures: band matrices and five-bands matrices. The
-sparse matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen from the Davis's collection.
-Figures~\ref{fig:09}-$(a)$ and~\ref{fig:09}-$(b)$ show the execution times of both parallel methods for solving large linear
-systems associated to band matrices and those associated to five-bands matrices, respectively. The size of a sparse sub-matrix
-per computing node, for each matrix structure, is fixed as follows:
-\begin{itemize*}
+However, the most important performance parameter is the scalability of the parallel
+CG\index{Iterative~method!CG} and GMRES\index{Iterative~method!GMRES} solvers on a GPU
+cluster. Particularly, we have taken into account the weak-scaling of both parallel
+algorithms on a cluster of one to 12 GPU computing nodes. We have performed a set of
+experiments on both matrix structures: band matrices and five-bands matrices. The sparse
+matrices of tests are generated from the symmetric sparse matrix {\it thermal2} chosen
+from the Davis's collection. Figures~\ref{ch12:fig:09.01} and~\ref{ch12:fig:09.02}
+show the execution times of both parallel methods for solving large linear systems
+associated to band matrices and those associated to five-bands matrices, respectively.
+The size of a sparse sub-matrix per computing node, for each matrix structure, is fixed
+as follows:
+\begin{itemize}
\item band matrix: $15$ million of rows and $105,166,557$ of nonzero values,
\item five-bands matrix: $5$ million of rows and $78,714,492$ of nonzero values.
-\end{itemize*}
-We can see from these figures that both parallel solvers are quite scalable on a GPU cluster. Indeed, the execution times remains
-almost constant while the size of the sparse linear systems to be solved increases proportionally with the number of
-the GPU computing nodes. This means that the communication cost is relatively constant regardless of the number the computing nodes
-in the GPU cluster.
+\end{itemize}
+We can see from these figures that both parallel solvers are quite scalable on a GPU
+cluster. Indeed, the execution times remains almost constant while the size of the
+sparse linear systems to be solved increases proportionally with the number of the
+GPU computing nodes. This means that the communication cost is relatively constant
+regardless of the number the computing nodes in the GPU cluster.
%% SECTION 6 %%
%%--------------------------%%
\section{Conclusion}
-\label{sec:06}
-In this chapter, we have aimed at harnessing the computing power of a cluster of GPUs for solving large sparse linear systems.
-For this, we have used two Krylov sub-space iterative methods: the CG and GMRES methods. The first method is well-known to its
-efficiency for solving symmetric linear systems and the second one is used, particularly, for solving nonsymmetric linear systems.
-
-We have presentend the parallel implementation of both iterative methods on a GPU cluster. Particularly, the operations dealing with
-the vectors and/or matrices, of these methods, are parallelized between the different GPU computing nodes of the cluster. Indeed,
-the data-parallel vector operations are accelerated by GPUs and the communications required to synchronize the parallel computations
-are carried out by CPU cores. For this, we have used a heterogeneous CUDA/MPI programming to implement the parallel iterative
+\label{ch12:sec:06}
+In this chapter, we have aimed at harnessing the computing power of a
+cluster of GPUs for solving large sparse linear systems. For this, we
+have used two Krylov subspace iterative methods: the CG and GMRES methods.
+The first method is well-known to its efficiency for solving symmetric
+linear systems and the second one is used, particularly, for solving
+nonsymmetric linear systems.
+
+We have presented the parallel implementation of both iterative methods
+on a GPU cluster. Particularly, the operations dealing with the vectors
+and/or matrices, of these methods, are parallelized between the different
+GPU computing nodes of the cluster. Indeed, the data-parallel vector operations
+are accelerated by GPUs and the communications required to synchronize the
+parallel computations are carried out by CPU cores. For this, we have used
+a heterogeneous CUDA/MPI programming to implement the parallel iterative
algorithms.
-In the experimental tests, we have shown that using a GPU cluster is efficient for solving linear systems associated to very large
-sparse matrices. The experimental results, obtained in the present chapter, showed that a cluster of $12$ GPUs is about $7$
-times faster than a cluster of $24$ CPU cores for solving large sparse linear systems of $25$ million unknown values. This is due
-to the GPU ability to compute the data-parallel operations faster than the CPUs. However, we have shown that solving linear systems
-associated to matrices having large bandwidths uses many communications to synchronize the computations of GPUs, which slow down
-even more the resolution. Moreover, there are two kinds of communications: between a CPU and its GPU and between CPUs of the computing
-nodes, such that the first ones are the slowest communications on a GPU cluster. So, we have proposed to use the hypergraph partitioning
-instead of the row-by-row partitioning. This allows to minimize the data dependencies between the GPU computing nodes and thus to
-reduce the total communication volume. The experimental results showed that using the hypergraph partitioning technique improve the
-execution times on average of $76\%$ to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs.
-
-In the recent GPU hardware and software architectures, the GPU-Direct system with CUDA version 5.0 is used so that two GPUs located on
-the same node or on distant nodes can communicate between them directly without CPUs. This allows to improve the data transfers between
-GPUs.
+In the experimental tests, we have shown that using a GPU cluster is efficient
+for solving linear systems associated to very large sparse matrices. The experimental
+results, obtained in the present chapter, showed that a cluster of $12$ GPUs is
+about $7$ times faster than a cluster of $24$ CPU cores for solving large sparse
+linear systems of $25$ million unknown values. This is due to the GPU ability to
+compute the data-parallel operations faster than the CPUs. However, we have shown
+that solving linear systems associated to matrices having large bandwidths uses
+many communications to synchronize the computations of GPUs, which slow down even
+more the resolution. Moreover, there are two kinds of communications: between a
+CPU and its GPU and between CPUs of the computing nodes, such that the first ones
+are the slowest communications on a GPU cluster. So, we have proposed to use the
+hypergraph partitioning instead of the row-by-row partitioning. This allows to
+minimize the data dependencies between the GPU computing nodes and thus to reduce
+the total communication volume. The experimental results showed that using the
+hypergraph partitioning technique improve the execution times on average of $76\%$
+to the CG method and of $65\%$ to the GMRES method on a cluster of $12$ GPUs.
+
+In the recent GPU hardware and software architectures, the GPU-Direct system with
+CUDA version 5.0 is used so that two GPUs located on the same node or on distant
+nodes can communicate between them directly without CPUs. This allows to improve
+the data transfers between GPUs.
\putbib[Chapters/chapter12/biblio12]