proceed by successive iterations of the same block of elementary operations, during which a
sequence of approximate solutions $\{x_k\}_{k\geq 0}$ is computed. Indeed, from an
initial guess $x_0$, an iterative method determines at each iteration $k>0$ an approximate
\end{equation}
The number of iterations necessary to reach the exact solution $x^{*}$ is not known beforehand
and can be infinite. In practice, an iterative method often finds an approximate solution $\tilde{x}$
In the present chapter, we describe two widely used Krylov methods: the CG method (conjugate
gradient) and the GMRES method (generalized minimal residual). In practice,
Krylov subspace methods are usually combined with preconditioners that improve their
convergence. So, in what follows, the CG and GMRES methods are used to solve the left-preconditioned\index{sparse linear system!preconditioned}
can be adapted to solve nonlinear equations and optimization problems. However, it can only be applied
to problems with symmetric positive definite matrices.
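For completeness, we recall the standard definition used here: a matrix $A\in\mathbb{R}^{n\times n}$ is
symmetric positive definite if
\[
A = A^{T} \quad\mbox{and}\quad x^{T}Ax > 0 \quad \mbox{for all } x\in\mathbb{R}^{n},\ x\neq 0.
\]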
The main idea of the CG method\index{iterative method!CG} is the computation of a sequence of approximate
solutions $\{x_k\}_{k\geq 0}$ in a Krylov subspace\index{iterative method!Krylov subspace} of order $k$ as
In this algorithm, $\varepsilon$ is the convergence tolerance threshold, $maxiter$ is the maximum
number of iterations, and $(\cdot,\cdot)$ denotes the dot product between two vectors in $\mathbb{R}^{n}$.
At every iteration, a direction vector $p_k$ is determined such that it is orthogonal to the preconditioned
residual $z_k$ and to the previously determined direction vectors $\{p_i\}_{i<k}$ (from line~$8$ to
line~$13$). Then, at lines~$16$ and~$17$, the iterate $x_k$ and the residual $r_k$ are computed using
formulas~(\ref{ch12:eq:07}) and~(\ref{ch12:eq:08}), respectively. The CG method converges after at
most $n$ iterations. In practice, the CG algorithm stops when the tolerance threshold\index{convergence!tolerance threshold}
$\varepsilon$ and/or the maximum number of iterations\index{convergence!maximum number of iterations}
\subsection{GMRES method}
\label{ch12:sec:02.02}
The iterative GMRES method was developed by Saad and Schultz in 1986~\cite{ch12:ref3} as a generalization
\begin{equation}
\begin{array}{ll}
x_k \in x_0 + \mathcal{K}_k(A, v_1),& v_1=\frac{r_0}{\|r_0\|_2},
\end{array}
\label{ch12:eq:12}
\end{equation}
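Here, $\mathcal{K}_k(A, v_1)$ denotes, as usual, the Krylov subspace of order $k$ generated by $A$ and the
normalized initial residual $v_1$:
\[
\mathcal{K}_k(A, v_1) = \mbox{span}\{v_1, Av_1, A^{2}v_1, \ldots, A^{k-1}v_1\}.
\]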
GMRES uses the Arnoldi iterations~\cite{ch12:ref5}\index{iterative method!Arnoldi iterations} to construct an
orthonormal basis $V_k$ for the Krylov subspace $\mathcal{K}_k$ and an upper Hessenberg matrix\index{Hessenberg matrix}
solves the linear least-squares problem of size $m$ to find the vector $y\in\mathbb{R}^{m}$
which minimizes the residual norm (line~$18$). Finally, it computes an approximate
solution $x_m$ in the Krylov subspace spanned by $V_m$ (line~$19$). The GMRES algorithm is
stopped when the residual norm is sufficiently small ($\|r_m\|_2<\varepsilon$) and/or the
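In standard GMRES notation, writing $\bar{H}_m$ for the $(m+1)\times m$ upper Hessenberg matrix produced by
the Arnoldi iterations (the symbols may differ slightly from those of the algorithm listing), the
least-squares problem of line~$18$ and the update of line~$19$ take the form
\[
y_m = \arg\min_{y\in\mathbb{R}^{m}} \big\| \beta e_1 - \bar{H}_m y \big\|_2, \qquad
x_m = x_0 + V_m y_m, \qquad \beta = \|r_0\|_2,
\]
where $e_1$ is the first canonical basis vector of $\mathbb{R}^{m+1}$.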
In this section, we present the parallel algorithms of both the CG\index{iterative method!CG}
and GMRES\index{iterative method!GMRES} iterative methods for GPU clusters. The implementation is performed on
a GPU cluster composed of several computing nodes, where each node is a CPU core managed by one
MPI (message passing interface) process and equipped with a GPU card. The parallelization of these algorithms is carried out by
CUDA (compute unified device architecture) programming environment inside each node. In what follows, the algorithms of the iterative methods
are called iterative solvers.
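The per-node organization can be sketched as follows; this is only a schematic view with illustrative
names, in which each MPI process selects the GPU it manages before running the solver on its local data.
\begin{verbatim}
#include <mpi.h>
#include <cuda_runtime.h>

/* Schematic per-node setup (illustrative names): one MPI process per
   CPU core, each process driving one GPU of its machine. */
int main(int argc, char **argv)
{
  int rank, size, ndev;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  cudaGetDeviceCount(&ndev);
  cudaSetDevice(rank % ndev);   /* attach this MPI process to one GPU */

  /* ... load the local part of A and b, then run the CG or GMRES solver ... */

  MPI_Finalize();
  return 0;
}
\end{verbatim}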
then, solving the triangular system by backward substitutions to compute $y$. Consequently,
solving the least-squares problem on the GPU is not efficient. Indeed, the triangular
solves are not easy to parallelize and are inefficient on GPUs. However, the least-squares
problem to be solved in the GMRES method with restarts generally has a very small size $m$.
Therefore, we develop an inexpensive kernel executed by a single CUDA thread.
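A minimal sketch of such a single-thread kernel is given below (the kernel name and data layout are
illustrative only): it performs the backward substitutions on the small $m\times m$ upper triangular
system entirely within one CUDA thread.
\begin{verbatim}
/* Illustrative sketch: backward substitution for the small m-by-m upper
   triangular system H y = g, executed by a single CUDA thread.
   H is stored row-major; only its upper triangular part is read. */
__global__ void backward_substitution(const double *H, const double *g,
                                      double *y, int m)
{
  if (blockIdx.x == 0 && threadIdx.x == 0) {
    for (int i = m - 1; i >= 0; i--) {
      double s = g[i];
      for (int j = i + 1; j < m; j++)
        s -= H[i * m + j] * y[j];
      y[i] = s / H[i * m + i];
    }
  }
}
/* launched with a single thread: backward_substitution<<<1,1>>>(dH, dg, dy, m); */
\end{verbatim}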
The most important operation in CG\index{iterative method!CG} and GMRES\index{iterative method!GMRES}
methods is the SpMV multiplication (sparse matrix-vector multiplication)\index{SpMV multiplication},
because it is often an expensive operation in terms of execution time and memory space.
Moreover, it requires taking care of the storage format of the sparse matrix in
memory. Indeed, the naive storage, row-by-row or column-by-column, of a sparse matrix
nature of the matrix often leads to irregular memory accesses to read the matrix nonzero
values. So, the computation of the SpMV multiplication on GPUs can involve noncoalesced
accesses to the global memory, which slows down its performance even further. One of the
most efficient compressed storage formats\index{compressed storage format} of sparse
matrices on GPUs is the HYB (hybrid)\index{compressed storage format!HYB} format~\cite{ch12:ref7}.
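For illustration only (this is not the kernel used in this chapter), a naive scalar CSR kernel with one
thread per row shows why irregular row lengths lead to the noncoalesced accesses mentioned above: the
threads of a same warp traverse rows of different lengths and read scattered entries of the vector.
\begin{verbatim}
/* Illustrative scalar CSR SpMV kernel, one thread per row: y = A*x.
   Threads of a warp read nonzeros of different rows, so the accesses
   to val[], col_ind[], and x[] are generally noncoalesced. */
__global__ void spmv_csr_scalar(int n, const int *row_ptr, const int *col_ind,
                                const double *val, const double *x, double *y)
{
  int row = blockIdx.x * blockDim.x + threadIdx.x;
  if (row < n) {
    double dot = 0.0;
    for (int k = row_ptr[row]; k < row_ptr[row + 1]; k++)
      dot += val[k] * x[col_ind[k]];
    y[row] = dot;
  }
}
\end{verbatim}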
which is insensitive to the matrix structure. Consequently, we use the HYB kernel~\cite{ch12:ref8}
developed by NVIDIA to implement the SpMV multiplication of CG and GMRES methods on GPUs.
Moreover, to avoid the noncoalesced accesses to the high-latency global memory, we fill
\label{ch12:sec:03.03}
All the computing nodes of the GPU cluster execute in parallel the same iterative solver
(Algorithm~\ref{ch12:alg:01} or Algorithm~\ref{ch12:alg:02}) adapted to GPUs, but on their
$0\leq i<p$. However, in order to solve the complete sparse linear system~(\ref{ch12:eq:11}),
synchronizations must be performed between the local computations of the computing nodes over
As already mentioned, the most important operation of CG and GMRES methods is the SpMV multiplication.
In the parallel implementation of the iterative methods, each computing node $i$ performs the
Therefore, before computing the SpMV multiplication\index{SpMV multiplication}, the neighboring
nodes\index{neighboring node} over the GPU cluster must exchange between them the shared vector
elements necessary to compute this multiplication. First, each computing node determines, in its
local subvector, the vector elements needed by other nodes. Then, the neighboring nodes exchange
these shared vector elements. The data exchanges are implemented by using the MPI
point-to-point communication routines: blocking\index{MPI!blocking} sends with \verb+MPI_Send()+
and nonblocking\index{MPI!nonblocking} receives with \verb+MPI_Irecv()+. Figure~\ref{ch12:fig:02}
shows an example of data exchanges between \textit{Node 1} and its neighbors \textit{Node 0}, \textit{Node 2},
and \textit{Node 3}. In this example, the iterate matrix $A$ split between these four computing
nodes is that presented in Figure~\ref{ch12:fig:01}.
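A sketch of such an exchange is given below, with illustrative variable names: each node first posts
the nonblocking receives from all its neighbors, then sends its own shared elements.
\begin{verbatim}
#include <stdlib.h>
#include <mpi.h>

/* Illustrative sketch of the shared-element exchange: nonblocking receives
   are posted first with MPI_Irecv(), then the local shared elements are
   sent to each neighbor with blocking MPI_Send(). */
void exchange_shared_elements(int nb_neighbors, const int *neighbor,
                              double **send_buf, const int *send_count,
                              double **recv_buf, const int *recv_count)
{
  const int tag = 0;   /* arbitrary message tag */
  MPI_Request *reqs = (MPI_Request *) malloc(nb_neighbors * sizeof(MPI_Request));

  for (int j = 0; j < nb_neighbors; j++)   /* post all receives first */
    MPI_Irecv(recv_buf[j], recv_count[j], MPI_DOUBLE, neighbor[j],
              tag, MPI_COMM_WORLD, &reqs[j]);
  for (int j = 0; j < nb_neighbors; j++)   /* send the shared elements */
    MPI_Send(send_buf[j], send_count[j], MPI_DOUBLE, neighbor[j],
             tag, MPI_COMM_WORLD);

  MPI_Waitall(nb_neighbors, reqs, MPI_STATUSES_IGNORE);
  free(reqs);
}
\end{verbatim}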
and communications of data between GPU nodes are carried out by message passing. However, a GPU cannot exchange data
with other GPUs directly. Instead, the CPUs, via their MPI processes, are in charge of the synchronizations within the GPU
cluster. Consequently, the vector elements to be exchanged must be copied from the GPU memory to the CPU memory
and vice versa, before and after the synchronization operation between CPUs. We have used the CUBLAS\index{CUBLAS}
communication subroutines to perform the data transfers between a CPU core and its GPU: \verb+cublasGetVector()+
and \verb+cublasSetVector()+. Finally, in addition to the data exchanges, GPU nodes perform reduction operations
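The GPU--CPU transfers surrounding an exchange can be sketched as follows (function and variable names
are illustrative); they rely only on the two CUBLAS copy subroutines mentioned above.
\begin{verbatim}
#include <cublas.h>   /* legacy CUBLAS API: cublasGetVector(), cublasSetVector() */

/* Illustrative sketch: copy the shared elements from the GPU to the CPU
   before the MPI exchange, and copy the received elements back afterwards. */
void copy_shared_to_host(int n, const double *d_shared, double *h_shared)
{
  cublasGetVector(n, sizeof(double), d_shared, 1, h_shared, 1);
}

void copy_shared_to_device(int n, const double *h_shared, double *d_shared)
{
  cublasSetVector(n, sizeof(double), h_shared, 1, d_shared, 1);
}
\end{verbatim}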
providing $12$GB of RAM with a memory bandwidth of $25.6$GB/s. In addition, two Tesla C1060 GPUs are
connected to each machine via a PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. A
Tesla C1060 GPU contains $240$ cores running at $1.3$GHz and provides a global memory of $4$GB with
that we used in the experimental tests.
Linux version 2.6.39 is installed on the CPUs. The C programming language is used to code
All tests are performed with double-precision floating-point operations. The parameters of both linear
solvers are initialized as follows: the residual tolerance threshold $\varepsilon=10^{-12}$, the
maximum number of iterations $maxiter=500$, the right-hand side $b$ is filled with $1.0$, and the
used in the GMRES method to $16$ iterations ($m=16$). For the sake of simplicity, we have chosen
the preconditioner $M$ as the main diagonal of the sparse matrix $A$. Indeed, it allows us to easily
compute the required inverse matrix $M^{-1}$, and it provides a relatively good preconditioning for
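With this choice of $M$, applying the preconditioner reduces to an element-wise division of the residual
by the diagonal of $A$; a minimal kernel sketch (with illustrative names) is given below.
\begin{verbatim}
/* Illustrative sketch: apply the diagonal (Jacobi) preconditioner,
   z = M^{-1} r, where d_diag holds the main diagonal of A. */
__global__ void apply_diag_precond(int n, const double *d_diag,
                                   const double *d_r, double *d_z)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    d_z[i] = d_r[i] / d_diag[i];
}
\end{verbatim}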
the present chapter, the bandwidth of a sparse matrix is defined as the number of matrix columns separating
the first and the last nonzero value on a matrix row.
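For a matrix stored in CSR format with sorted column indices, this bandwidth can be computed as
sketched below (an illustrative helper, assuming the value reported for a matrix is the maximum over
all of its rows).
\begin{verbatim}
/* Illustrative helper: bandwidth of a CSR matrix, taken as the maximum,
   over all rows, of the distance in columns between the first and the
   last nonzero of the row (column indices assumed sorted per row). */
int matrix_bandwidth(int n, const int *row_ptr, const int *col_ind)
{
  int bw = 0;
  for (int i = 0; i < n; i++) {
    if (row_ptr[i] < row_ptr[i + 1]) {           /* nonempty row */
      int width = col_ind[row_ptr[i + 1] - 1] - col_ind[row_ptr[i]];
      if (width > bw)
        bw = width;
    }
  }
  return bw;
}
\end{verbatim}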
\begin{tabular}{|c|c|c|c|c|}
\hline
{\bf Matrix Type} & {\bf Matrix Name} & {\bf \# Rows} & {\bf \# Nonzeros} & {\bf Bandwidth} \\ \hline \hline
& torso3 & $259,156$ & $4,429,042$ & $216,854$ \\ \hline
\end{tabular}
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
columns give, respectively, the execution times in seconds obtained on $24$ CPU cores
($Time_{cpu}$) and those obtained on $12$ GPUs ($Time_{gpu}$). Moreover, we report
the relative gains $\tau$ of a solver implemented on the GPU cluster compared to the same
a GPU cluster is more efficient than on a CPU cluster (see the relative gains $\tau$). We can also
notice that the execution times of the CG method, whether on a CPU cluster or on a GPU cluster,
are lower than those of the GMRES method for solving large symmetric linear systems. In fact, the
time of an iteration than those of the GMRES method. Moreover, an iteration of the parallel GMRES
method requires more data exchanges between computing nodes compared to the parallel CG method.
\begin{tabular}{|c|c|c|c|c|c|c|}
\hline
{\bf Matrix} & $\mathbf{Time_{cpu}}$ & $\mathbf{Time_{gpu}}$ & $\mathbf{\tau}$ & $\mathbf{\#~Iter.}$ & $\mathbf{Prec.}$ & $\mathbf{\Delta}$ \\ \hline \hline
\caption{Performances of the parallel GMRES method for solving linear systems associated with sparse banded matrices on a cluster of 24 CPU cores vs.
a cluster of 12 GPUs.}
\label{ch12:tab:06}