%% %%
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\chapterauthor{}{}
-\newcommand{\scalprod}[2]%
-{\ensuremath{\langle #1 \, , #2 \rangle}}
+%\chapterauthor{}{}
+\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier, and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Ming Chau}{Advanced Solutions Accelerator, Castelnau Le Lez, France}
+%\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
+\chapterauthor{Pierre Spitéri}{ENSEEIHT-IRIT, Toulouse, France}
+%\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
+
+
\chapter{Solving sparse nonlinear systems of obstacle problems on GPU clusters}
+\label{ch13}
+
+
+
%%--------------------------%%
%% SECTION 1 %%
%%--------------------------%%
\section{Introduction}
-\label{sec:01}
-The obstacle problem is one kind of free boundary problems. It allows to model, for example, an elastic membrane covering a solid obstacle.
-In this case, the objective is to find an equilibrium position of this membrane constrained to be above the obstacle and which tends to minimize
-its surface and/or its energy. The study of such problems occurs in many applications, for example: fluid mechanics, bio-mathematics (tumour growth
-process) or financial mathematics (American or European option pricing).
-
-In this chapter, we focus on solutions of large obstacle problems defined in a three-dimensional domain. Particularly, the present study consists
-in solving large nonlinear systems derived from the spatial discretization of these problems. Owing to the great size of such systems, in order to
-reduce computation times, we proceed by solving them by parallel synchronous or asynchronous iterative algorithms. Moreover, we aim at harnessing
-the computing power of GPUs to accelerate computations of these parallel algorithms. For this, we use an iterative method involving a projection
-on a convex set, which is: the projected Richardson method. We choose this method among other iterative methods because it is easy to implement on
-parallel computers and easy to adapt to GPU architectures.
-
-In Section~\ref{sec:02}, we present the mathematical model of obstacle problems then, in Section~\ref{sec:03}, we describe the general principle of
-the parallel projected Richardson method. Next, in Section~\ref{sec:04}, we give the main key points of the parallel implementation of both synchronous
-and asynchronous algorithms of the projected Richardson method on a GPU cluster. In Section~\ref{sec:05}, we present the performances of both parallel
-algorithms obtained from simulations carried out on a CPU and GPU clusters. Finally, in Section~\ref{sec:06}, we use the read-black ordering technique
-to improve the convergence and, thus, the execution times of the parallel projected Richardson algorithms on the GPU cluster.
+\label{ch13:sec:01}
+The obstacle problem is one kind of free boundary problem. It allows us to model,
+for example, an elastic membrane covering a solid obstacle. In this case, the
+objective is to find an equilibrium position of this membrane constrained to be
+above the obstacle and which tends to minimize its surface and/or its energy.
+The study of such problems occurs in many applications, for example, fluid mechanics,
+biomathematics (tumor growth process), or financial mathematics (American or
+European option pricing).
+
+In this chapter, we focus on solutions of large obstacle problems defined in a
+three-dimensional domain. Particularly, the present study consists of solving
+large nonlinear systems derived from the spatial discretization of these problems.
+Owing to the great size of such systems, in order to reduce computation times,
+we proceed by solving them by parallel synchronous or asynchronous iterative
+algorithms. Moreover, we aim at harnessing the computing power of GPUs to accelerate
+computations of these parallel algorithms. For this, we use an iterative method
+involving a projection on a convex set, which is the projected Richardson method.
+We chose this method among other iterative methods because it is easy to implement
+on parallel computers and easy to adapt to GPU architectures.
+
+In Section~\ref{ch13:sec:02}, we present the mathematical model of obstacle problems
+then, in Section~\ref{ch13:sec:03}, we describe the general principle of the parallel
+projected Richardson method. Next, in Section~\ref{ch13:sec:04}, we give the main key
+points of the parallel implementation of both synchronous and asynchronous algorithms
+of the projected Richardson method on a GPU cluster. In Section~\ref{ch13:sec:05}, we
+present the performances of both parallel algorithms obtained from simulations carried
+out on GPU clusters. Finally, in Section~\ref{ch13:sec:06}, we use the read-black
+ordering technique to improve the convergence and, thus, the execution times of the parallel
+projected Richardson algorithms on the GPU cluster.
+
%%--------------------------%%
%% SECTION 2 %%
%%--------------------------%%
\section{Obstacle problems}
-\label{sec:02}
-In this section, we present the mathematical model of obstacle problems defined in a three-dimensional domain.
-This model is based on that presented in~\cite{ref1}.
+\label{ch13:sec:02}
+In this section, we present the mathematical model of obstacle problems defined in a
+three-dimensional domain. This model is based on that presented in~\cite{ch13:ref1}.
+
%%*******************
%%*******************
\subsection{Mathematical model}
-\label{sec:02.01}
-An obstacle problem, arising for example in mechanics or financial derivatives, consists in solving a time dependent
-nonlinear equation:
+\label{ch13:sec:02.01}
+An obstacle problem\index{obstacle problem}, arising for example in mechanics or financial
+derivatives, consists of solving a time-dependent nonlinear equation\index{nonlinear}:
\begin{equation}
\left\{
\begin{array}{l}
\mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega,
\end{array}
\right.
-\label{eq:01}
+\label{ch13:eq:01}
\end{equation}
-where $u_0$ is the initial condition, $c\geq 0$, $b$ and $\eta$ are physical parameters, $T$ is the final time, $u=u(t,x,y,z)$
-is an element of the solution vector $U$ to compute, $f$ is the right-hand right that could represent, for example, the external
-forces, B.C. describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$, $\phi$ models a constraint
-imposed to $u$, $\Delta$ is the Laplacian operator, $\nabla$ is the gradient operator, a.e.w. means almost every where and ``.''
-defines the products between two scalars, a scalar and a vector or a matrix and a vector. In practice the boundary condition,
-generally considered, is the Dirichlet condition (where $u$ is fixed on $\partial\Omega$) or the Neumann condition (where the
-normal derivative of $u$ is fixed on $\partial\Omega$).
-
-The time dependent equation~(\ref{eq:01}) is numerically solved by considering an implicit or a semi-implicit time marching,
-where at each time step $k$ a stationary nonlinear problem is solved:
+where $u_0$ is the initial condition; $c\geq 0$, $b$, and $\eta$ are physical parameters;
+$T$ is the final time; $u=u(t,x,y,z)$ is an element of the solution vector $U$ to compute;
+$f$ is the right-hand side that could represent, for example, the external forces; B.C.
+describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$;
+$\phi$ models a constraint imposed to $u$; $\Delta$ is the Laplacian operator; $\nabla$
+is the gradient operator; a.e.w. means almost everywhere, and ``.'' defines the products
+between two scalars, a scalar and a vector, or a matrix and a vector. In practice the boundary
+condition, generally considered, is the Dirichlet condition (where $u$ is fixed on $\partial\Omega$)
+or the Neumann condition (where the normal derivative of $u$ is fixed on $\partial\Omega$).
+
+The time-dependent equation~(\ref{ch13:eq:01}) is numerically solved by considering an
+implicit or a semi-implicit time marching, where at each time step $k$ a stationary nonlinear
+problem\index{nonlinear} is solved:
\begin{equation}
\left\{
\begin{array}{l}
\mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega,
\end{array}
\right.
-\label{eq:02}
+\label{ch13:eq:02}
\end{equation}
-where $\delta=\frac{1}{k}$ is the inverse of the time step $k$, $g=f+\delta u^{prev}$ and $u^{prev}$ is the solution computed at the
-previous time step.
+where $\delta=\frac{1}{k}$ is the inverse of the time step $k$, $g=f+\delta u^{prev}$ and
+$u^{prev}$ is the solution computed at the previous time step.
%%*******************
%%*******************
\subsection{Discretization}
-\label{sec:02.02}
-First, we note that the spatial discretization of the previous stationary problem~(\ref{eq:02}) does not provide a symmetric matrix,
-because the convection-diffusion operator is not self-adjoint. Moreover, the fact that the operator is self-adjoint or not plays an
-important role in the choice of the appropriate algorithm for solving nonlinear systems derived from the discretization of the obstacle
-problem. Nevertheless, since the convection coefficients arising in the operator~(\ref{eq:02}) are constant, we can formulate the same
-problem by an self-adjoint operator by performing a classical change of variables. Then, we can replace the stationary convection-diffusion
-problem:
+\label{ch13:sec:02.02}
+First, we note that the spatial discretization of the previous stationary
+problem~(\ref{ch13:eq:02}) does not provide a symmetric matrix, because the
+convection-diffusion operator is not self-adjoint. Moreover, the fact that
+the operator is self-adjoint or not plays an important role in the choice
+of the appropriate algorithm for solving nonlinear systems derived from the
+discretization of the obstacle problem\index{obstacle problem}. Nevertheless,
+since the convection coefficients arising in the operator~(\ref{ch13:eq:02})
+are constant, we can formulate the same problem by self-adjoint operator
+by performing a classical change of variables. Then, we can replace the stationary
+convection-diffusion problem:
\begin{equation}
b^{t}.\nabla v-\eta.\Delta v+(c+\delta).v=g\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}c\geq 0\mbox{,~}\delta\geq~0,
-\label{eq:03}
+\label{ch13:eq:03}
\end{equation}
by the following stationary diffusion operator:
\begin{equation}
-\eta.\Delta u+(\frac{\|b\|^{2}_{2}}{4\eta}+c+\delta).u=e^{-a}g=f,
-\label{eq:04}
+\label{ch13:eq:04}
\end{equation}
-where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$ and $v=e^{-a}.u$ represents the general change of variables
-such that $a=\frac{b^{t}(x,y,z)}{2\eta}$. Consequently, the numerical resolution of the diffusion problem (the self-adjoint operator~(\ref{eq:04}))
-is done by optimization algorithms, in contrast, that of the convection-diffusion problem (non self-adjoint operator~(\ref{eq:03})) is
-done by relaxation algorithms. In the case of our studied algorithm, the convergence is ensured by M-matrix property then, the performance
-is linked to the magnitude of the spectral radius of the iteration matrix, which is independent of the condition number.
-
-Next, the three-dimensional domain $\Omega\subset\mathbb{R}^{3}$ is set to $\Omega=\lbrack 0,1\rbrack^{3}$ and discretized with an uniform
-Cartesian mesh constituted by $M=m^3$ discretization points, where $m$ related to the spatial discretization step by $h=\frac{1}{m+1}$. This
-is carried out by using a classical order 2 finite difference approximation of the Laplacian. So, the complete discretization of both stationary
-boundary value problems~(\ref{eq:03}) and~(\ref{eq:04}) leads to the solution of a large discrete complementary problem of the following
-form, when both Dirichlet or Neumann boundary conditions are used:
+where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$, and
+$v=e^{-a}.u$ represents the general change of variables such that $a=\frac{b^{t}(x,y,z)}{2\eta}$.
+Consequently, the numerical resolution of the diffusion problem (the self-adjoint
+operator~(\ref{ch13:eq:04})) is done by optimization algorithms, in contrast to that
+of the convection-diffusion problem (nonself-adjoint operator~(\ref{ch13:eq:03}))
+which is done by relaxation algorithms. In the case of our studied algorithm, the convergence\index{convergence}
+is ensured by M-matrix property; then, the performance is linked to the magnitude of
+the spectral radius of the iteration matrix, which is independent of the condition
+number.
+
+Next, the three-dimensional domain $\Omega\subset\mathbb{R}^{3}$ is set to $\Omega=\lbrack 0,1\rbrack^{3}$
+and discretized with a uniform Cartesian mesh constituted by $M=m^3$ discretization
+points, where $m$ is related to the spatial discretization step by $h=\frac{1}{m+1}$. This
+is carried out by using a classical order 2 finite difference approximation of the Laplacian.
+So, the complete discretization of both stationary boundary value problems~(\ref{ch13:eq:03})
+and~(\ref{ch13:eq:04}) leads to the solution of a large discrete complementary problem
+of the following form, when both Dirichlet or Neumann boundary conditions are used:
\begin{equation}
\left\{
\begin{array}{l}
((A+\delta I)U^{*}-G)^{T}(U^{*}-\bar{\Phi})=0,\\
\end{array}
\right.
-\label{eq:05}
+\label{ch13:eq:05}
\end{equation}
-where $A$ is a matrix obtained after the spatial discretization by a finite difference method, $G$ is derived from the Euler first order implicit time
-marching scheme and from the discretized right-hand side of the obstacle problem, $\delta$ is the inverse of the time step $k$ and $I$ is the identity
-matrix. The matrix $A$ is symmetric when the self-adjoint operator is considered and nonsymmetric otherwise.
+where $A$ is a matrix obtained after the spatial discretization by a finite difference
+method, $G$ is derived from the Euler first order implicit time marching scheme and from
+the discretized right-hand side of the obstacle problem, $\delta$ is the inverse of the
+time step $k$, and $I$ is the identity matrix. The matrix $A$ is symmetric when the self-adjoint
+operator is considered and nonsymmetric otherwise.
-According to the chosen discretization scheme of the Laplacian, $A$ is an M-matrix (irreducibly diagonal dominant, see~\cite{ref2}) and, consequently,
-the matrix $(A+\delta I)$ is also an M-matrix. This property is important to the convergence of iterative methods.
+According to the chosen discretization scheme of the Laplacian, $A$ is an M-matrix (irreducibly
+diagonal dominant, see~\cite{ch13:ref2}) and, consequently, the matrix $(A+\delta I)$ is also
+an M-matrix. This property is important to the convergence of iterative methods.
%%--------------------------%%
%% SECTION 3 %%
%%--------------------------%%
\section{Parallel iterative method}
-\label{sec:03}
-Owing to the large size of the previous discrete complementary problem~(\ref{eq:05}), we will solve it by parallel synchronous or asynchronous iterative
-algorithms (see~\cite{ref3,ref4,ref5}). In this chapter, we aim at harnessing the computing power of GPU clusters for solving these large nonlinear systems.
-Then, we choose to use the projected Richardson iterative method for solving the diffusion problem~(\ref{eq:04}). Indeed, this method is based on the iterations
-of the Jacobi method which are easy to parallelize on parallel computers and easy to adapt to GPU architectures. Then, according to the boundary value problem
-formulation with a self-adjoint operator~(\ref{eq:04}), we can consider here the equivalent optimization problem and the fixed point mapping associated to
-its solution.
-
-Assume that $E=\mathbb{R}^{M}$ is a Hilbert space, in which $\scalprod{.}{.}$ is the scalar product and $\|.\|$ its associated norm. So, the general fixed
-point problem to be solved is defined as follows:
+\label{ch13:sec:03}
+Owing to the large size of the previous discrete complementary problem~(\ref{ch13:eq:05}),
+we will solve it by parallel synchronous or asynchronous iterative algorithms (see~\cite{ch13:ref3,ch13:ref4,ch13:ref5}).
+In this chapter, we aim at harnessing the computing power of GPU clusters for solving these
+large nonlinear systems\index{nonlinear}. Then, we choose to use the projected Richardson
+iterative method\index{iterative method!projected Richardson} for solving the diffusion
+problem~(\ref{ch13:eq:04}). Indeed, this method is based on the iterations of the Jacobi
+method\index{iterative method!Jacobi}, which are easy to parallelize on parallel computers
+and easy to adapt to GPU architectures. Then, according to the boundary value problem
+formulation with a self-adjoint operator~(\ref{ch13:eq:04}), we can consider here the
+equivalent optimization problem and the fixed point mapping associated to its solution.
+
+Assume that $E=\mathbb{R}^{M}$ is a Hilbert space\index{Hilbert space}, in which $\scalprod{.}{.}$
+is the scalar product and $\|.\|$ its associated norm. So, the general fixed point problem
+to be solved is defined as follows:
\begin{equation}
\left\{
\begin{array}{l}
U^{*} = F(U^{*}), \\
\end{array}
\right.
-\label{eq:06}
+\label{ch13:eq:06}
\end{equation}
where $U\mapsto F(U)$ is an application from $E$ to $E$.
-Let $K$ be a closed convex set defined by:
+Let $K$ be a closed convex set defined by
\begin{equation}
K = \{U | U \geq \Phi \mbox{~everywhere in~} E\},
-\label{eq:07}
+\label{ch13:eq:07}
\end{equation}
-where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{eq:05}) is formulated as the following constrained optimization problem:
+where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{ch13:eq:05})\index{obstacle problem}
+is formulated as the following constrained optimization problem:
\begin{equation}
\left\{
\begin{array}{l}
\forall V \in K, J(U^{*}) \leq J(V), \\
\end{array}
\right.
-\label{eq:08}
+\label{ch13:eq:08}
\end{equation}
-where the cost function is given by:
+where the cost function is given by
\begin{equation}
J(U) = \frac{1}{2}\scalprod{\mathcal{A}.U}{U} - \scalprod{G}{U},
-\label{eq:09}
+\label{ch13:eq:09}
\end{equation}
-in which $\scalprod{.}{.}$ denotes the scalar product in $E$, $\mathcal{A}=A+\delta I$ is a symmetric positive definite, $A$ is the discretization matrix
-associated with the self-adjoint operator~(\ref{eq:04}) after change of variables.
+in which $\scalprod{.}{.}$ denotes the scalar product in $E$, $\mathcal{A}=A+\delta I$
+is a symmetric positive definite, and $A$ is the discretization matrix associated with the
+self-adjoint operator~(\ref{ch13:eq:04}) after change of variables.
-For any $U\in E$, let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$, $\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected
-Richardson method is defined as follows:
+For any $U\in E$; let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$,
+$\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected Richardson method\index{iterative method!projected Richardson}
+is defined as follows:
\begin{equation}
U^{*} = F_{\gamma}(U^{*}) = P_K(U^{*} - \gamma(\mathcal{A}.U^{*} - G)).
-\label{eq:10}
+\label{ch13:eq:10}
\end{equation}
-In order to reduce the computation time, the large optimization problem is solved in a numerical way by using a parallel asynchronous algorithm of the projected
-Richardson method on the convex set $K$. Particularly, we will consider an asynchronous parallel adaptation of the projected Richardson method~\cite{ref6}.
-
-Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$ is a product of $\alpha$ subspaces $E_i$
-where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$, where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space in which $\scalprod{.}{.}_i$
-denotes the scalar product and $|.|_i$ the associated norm, for all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$
+In order to reduce the computation time, the large optimization problem is solved in a
+numerical way by using a parallel asynchronous algorithm of the projected Richardson method
+on the convex set $K$. Particularly, we will consider an asynchronous parallel adaptation
+of the projected Richardson method~\cite{ch13:ref6}.
+
+Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$
+is a product of $\alpha$ subspaces $E_i$ where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$,
+where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space\index{Hilbert space}
+in which $\scalprod{.}{.}_i$ denotes the scalar product and $|.|_i$ the associated norm, for
+all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$
is the scalar product on $E$.
-Let $U\in E$, we consider the following decomposition of $U$ and the corresponding decomposition of $F_\gamma$ into $\alpha$ blocks:
+Let $U\in E$, we consider the following decomposition of $U$ and the corresponding decomposition
+of $F_\gamma$ into $\alpha$ blocks:
\begin{equation}
\begin{array}{rcl}
U & = & (U_1,\ldots,U_{\alpha}), \\
F_{\gamma}(U) & = & (F_{1,\gamma}(U),\ldots,F_{\alpha,\gamma}(U)). \\
\end{array}
-\label{eq:11}
+\label{ch13:eq:11}
\end{equation}
-Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$ and $K_i$ is a closed convex set.
-Let also $G=(G_1,\ldots,G_{\alpha})\in E$ and, for any $U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$ on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$
-is the projector from $E_i$ onto $K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{eq:10}) can be written in the following way:
+Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$,
+and $K_i$ is a closed convex set. Let also $G=(G_1,\ldots,G_{\alpha})\in E$; for any
+$U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$
+on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$ is the projector from $E_i$ onto
+$K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{ch13:eq:10})\index{iterative method!projected Richardson}
+can be written in the following way:
\begin{equation}
\forall U\in E\mbox{,~}\forall i\in\{1,\ldots,\alpha\}\mbox{,~}F_{i,\gamma}(U) = P_{K_i}(U_i - \gamma(\mathcal{A}_i.U - G_i)).
-\label{eq:12}
+\label{ch13:eq:12}
\end{equation}
-Note that $\displaystyle\mathcal{A}_i.U= \sum_{j=1}^{\alpha}\mathcal{A}_{i,j}.U_j$, where $\mathcal{A}_{i,j}$ denote block matrices of $\mathcal{A}$.
+Note that $\displaystyle\mathcal{A}_i.U= \sum_{j=1}^{\alpha}\mathcal{A}_{i,j}.U_j$, where
+$\mathcal{A}_{i,j}$ denote block matrices of $\mathcal{A}$.
-The parallel asynchronous iterations of the projected Richardson method for solving the obstacle problem~(\ref{eq:08}) are defined as follows: let $U^0\in E,U^0=(U^0_1,\ldots,U^0_\alpha)$ be
-the initial solution, then for all $p\in\mathbb{N}$, the iterate $U^{p+1}=(U^{p+1}_1,\ldots,U^{p+1}_{\alpha})$ is recursively defined by:
+The parallel asynchronous iterations of the projected Richardson method for solving the
+obstacle problem~(\ref{ch13:eq:08}) are defined as follows: let $U^0\in E,U^0=(U^0_1,\ldots,U^0_\alpha)$
+be the initial solution, then for all $p\in\mathbb{N}$, the iterate $U^{p+1}=(U^{p+1}_1,\ldots,U^{p+1}_{\alpha})$
+is recursively defined by
\begin{equation}
U_i^{p+1} =
\left\{
U_i^p \mbox{~otherwise}, \\
\end{array}
\right.
-\label{eq:13}
+\label{ch13:eq:13}
\end{equation}
where
\begin{equation}
\left\{
\begin{array}{l}
\forall p\in\mathbb{N}, s(p)\subset\{1,\ldots,\alpha\}\mbox{~and~} s(p)\ne\emptyset, \\
-\forall i\in\{1,\ldots,\alpha\},\{p \ | \ i \in s(p)\}\mbox{~is denombrable},
+\forall i\in\{1,\ldots,\alpha\},\{p \ | \ i \in s(p)\}\mbox{~is enumerable},
\end{array}
\right.
-\label{eq:14}
+\label{ch13:eq:14}
\end{equation}
and $\forall j\in\{1,\ldots,\alpha\}$,
\begin{equation}
\displaystyle\lim_{p\to\infty}\rho_j(p) = +\infty.\\
\end{array}
\right.
-\label{eq:15}
+\label{ch13:eq:15}
\end{equation}
-The previous asynchronous scheme of the projected Richardson method models computations that are carried out in parallel
-without order nor synchronization (according to the behavior of the parallel iterative method) and describes a subdomain
-method without overlapping. It is a general model that takes into account all possible situations of parallel computations
-and non-blocking message passing. So, the synchronous iterative scheme is defined by:
+The previous asynchronous scheme\index{asynchronous iterations} of the projected Richardson
+method models computations that are carried out in parallel without order or
+synchronization (according to the behavior of the parallel iterative method) and
+describes a subdomain method without overlapping. It is a general model that takes
+into account all possible situations of parallel computations and nonblocking message
+passing. So, the synchronous iterative scheme\index{synchronous iterations} is defined by
\begin{equation}
\forall j\in\{1,\ldots,\alpha\} \mbox{,~} \forall p\in\mathbb{N} \mbox{,~} \rho_j(p)=p.
-\label{eq:16}
+\label{ch13:eq:16}
\end{equation}
-The values of $s(p)$ and $\rho_j(p)$ are defined dynamically and not explicitly by the parallel asynchronous or synchronous
-execution of the algorithm. Particularly, it enables one to consider distributed computations whereby processors compute at
-their own pace according to their intrinsic characteristics and computational load. The parallelism between the processors is
-well described by the set $s(p)$ which contains at each step $p$ the index of the components relaxed by each processor on a
-parallel way while the use of delayed components in~(\ref{eq:13}) permits one to model nondeterministic behavior and does not
-imply inefficiency of the considered distributed scheme of computation. Note that, according to~\cite{ref7}, theoretically,
-each component of the vector must be relaxed an infinity of time. The choice of the relaxed components to be used in the
-computational process may be guided by any criterion and, in particular, a natural criterion is to pick-up the most recently
-available values of the components computed by the processors. Furthermore, the asynchronous iterations are implemented by
-means of non-blocking MPI communication subroutines (asynchronous communications).
-
-The important property ensuring the convergence of the parallel projected Richardson method, both synchronous and asynchronous
-algorithms, is the fact that $\mathcal{A}$ is an M-matrix. Moreover, the convergence proceeds from a result of~\cite{ref6}.
-Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$, the parallel iterations~(\ref{eq:13}),
-(\ref{eq:14}) and~(\ref{eq:15}), associated to the fixed point mapping $F_\gamma$~(\ref{eq:12}), converge to the unique solution
-$U^{*}$ of the discretized problem.
+The values of $s(p)$ and $\rho_j(p)$ are defined dynamically and not explicitly by
+the parallel asynchronous or synchronous execution of the algorithm. Particularly,
+They allow us to consider distributed computations whereby processors compute at
+their own pace according to their intrinsic characteristics and computational load.
+The parallelism between the processors is well described by the set $s(p)$ which
+contains at each step $p$ the index of the components relaxed by each processor on
+a parallel way while the use of delayed components in~(\ref{ch13:eq:13}) permits one
+to model nondeterministic behavior and does not imply inefficiency of the considered
+distributed scheme of computation. Note that, according to~\cite{ch13:ref7}, theoretically,
+each component of the vector must be relaxed an infinite number of times. The choice of the
+relaxed components to be used in the computational process may be guided by any criterion,
+and in particular, a natural criterion is to pickup the most recently available
+values of the components computed by the processors. Furthermore, the asynchronous
+iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI!nonblocking}
+(asynchronous communications).
+
+The important property ensuring the convergence of the parallel projected Richardson
+method, both synchronous and asynchronous algorithms, is the fact that $\mathcal{A}$
+is an M-matrix. Moreover, the convergence\index{convergence} proceeds from a result
+of~\cite{ch13:ref6}. Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$,
+the parallel iterations~(\ref{ch13:eq:13}), (\ref{ch13:eq:14}), and~(\ref{ch13:eq:15}),
+associated to the fixed point mapping $F_\gamma$~(\ref{ch13:eq:12}), converge to the
+unique solution $U^{*}$ of the discretized problem.
%%--------------------------%%
%% SECTION 4 %%
%%--------------------------%%
\section{Parallel implementation on a GPU cluster}
-\label{sec:04}
-In this section, we give the main key points of the parallel implementation of the projected Richardson method, both synchronous
-and asynchronous versions, on a GPU cluster, for solving the nonlinear systems derived from the discretization of large obstacle
-problems. More precisely, each nonlinear system is solved iteratively using the whole cluster. We use a heteregeneous CUDA/MPI
-programming. Indeed, the communication of data, at each iteration between the GPU computing nodes, can be either synchronous
-or asynchronous using the MPI communication subroutines, whereas inside each GPU node, a CUDA parallelization is performed.
-
-\begin{figure}[!h]
+\label{ch13:sec:04}
+In this section, we give the main key points of the parallel implementation of the
+projected Richardson method, both synchronous and asynchronous versions, on a GPU
+cluster, for solving the nonlinear systems derived from the discretization of large
+obstacle problems. More precisely, each nonlinear system is solved iteratively using
+the whole cluster. We use a heterogeneous CUDA and MPI programming. Indeed, the communication
+of data, at each iteration between the GPU computing nodes, can be either synchronous
+or asynchronous using the MPI communication subroutines, whereas inside each GPU node,
+a CUDA parallelization is performed.
+
+Let $S$ denote the number of computing nodes\index{computing node} on the GPU cluster,
+where a computing node is composed of CPU core holding one MPI process and a GPU card.
+So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$
+is split into $S$ parallelepipedic subproblems, each for a node (MPI process, GPU), as
+is shown in Figure~\ref{ch13:fig:01}. Indeed, the $NY$ and $NZ$ dimensions (according
+to the $y$ and $z$ axises) of the three-dimensional problem are split, respectively,
+into $Sy$ and $Sz$ parts, such that $S=Sy\times Sz$. In this case, each computing node
+has at most four neighboring nodes. This kind of data partitioning reduces the data
+exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning.
+
+\begin{figure}
\centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitCPU}}
\caption{Data partitioning of a problem to be solved among $S=3\times 4$ computing nodes.}
-\label{fig:01}
+\label{ch13:fig:01}
\end{figure}
-Let $S$ denote the number of computing nodes on the GPU cluster, where a computing node is composed of CPU core holding one MPI
-process and a GPU card. So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$ is split into $S$
-parallelepipedic sub-problems, each for a node (MPI process, GPU), as is shown in Figure~\ref{fig:01}. Indeed, the $NY$ and $NZ$
-dimensions (according to the $y$ and $z$ axises) of the three-dimensional problem are, respectively, split into $Sy$ and $Sz$ parts,
-such that $S=Sy\times Sz$. In this case, each computing node has at most four neighboring nodes. This kind of the data partitioning
-reduces the data exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning.
+All the computing nodes of the GPU cluster execute in parallel the Algorithm~\ref{ch13:alg:01}
+on a three-dimensional subproblems of size $(NX\times ny\times nz)$.
+This algorithm gives the main key points for solving an obstacle problem\index{obstacle problem}
+defined in a three-dimensional domain, where $A$ is the discretization matrix, $G$
+is the right-hand side, and $U$ is the solution vector. After the initialization step,
+all the data generated from the partitioning operation are copied from the CPU memories
+to the GPU global memories to be processed on the GPUs. Next, the algorithm uses $NbSteps$
+time steps to solve the global obstacle problem. In fact, it uses a parallel algorithm
+adapted to GPUs from the projected Richardson iterative method for solving the nonlinear
+systems\index{nonlinear} of the obstacle problem. This function is defined by {\it Solve()}
+in Algorithm~\ref{ch13:alg:01}. At every time step, the initial guess $U^0$ for the iterative
+algorithm is set to the solution found at the previous time step. Moreover, the right-hand
+side $G$ is computed as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step,
+$U^{prev}$ is the solution computed in the previous time step, and each element $f(x, y, z)$
+of the vector $F$ is computed as follows:
+\begin{equation}
+f(x,y,z)=\cos(2\pi x)\cdot\cos(4\pi y)\cdot\cos(6\pi z).
+\label{ch13:eq:18}
+\end{equation}
+Finally, the solution $U$ of the obstacle problem is copied back from the GPU global
+memories to the CPU memories. We use the communication subroutines of the CUBLAS
+(CUDA Basic Linear Algebra Subroutines) library~\cite{ch13:ref8}\index{CUBLAS} for the memory allocations in the GPU (\verb+cublasAlloc()+)
+and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and \verb+cublasGetVector()+.
-\begin{algorithm}[!t]
-%\SetLine
-%\linesnumbered
-Initialization of the parameters of the sub-problem\;
+\begin{algorithm}[t]
+Initialization of the parameters of the subproblem\;
Allocate and fill the data in the global memory GPU\;
\For{$i=1$ {\bf to} $NbSteps$}{
$G = \frac{1}{k}.U^0 + F$\;
$U^0 = U$\;
}
Copy the solution $U$ back from GPU memory\;
-\caption{Parallel solving of the obstacle problem on a GPU cluster}
-\label{alg:01}
+\caption{parallel solving of the obstacle problem on a GPU cluster}
+\label{ch13:alg:01}
\end{algorithm}
-All the computing nodes of the GPU cluster execute in parallel the same Algorithm~\ref{alg:01} but on different three-dimensional
-sub-problems of size $(NX\times ny\times nz)$. This algorithm gives the main key points for solving an obstacle problem defined in
-a three-dimensional domain, where $A$ is the discretization matrix, $G$ is the right-hand side and $U$ is the solution vector. After
-the initialization step, all the data generated from the partitioning operation are copied from the CPU memories to the GPU global
-memories, to be processed on the GPUs. Next, the algorithm uses $NbSteps$ time steps to solve the global obstacle problem. In fact,
-it uses a parallel algorithm adapted to GPUs of the projected Richardson iterative method for solving the nonlinear systems of the
-obstacle problem. This function is defined by {\it Solve()} in Algorithm~\ref{alg:01}. At every time step, the initial guess $U^0$
-for the iterative algorithm is set to the solution found at the previous time step. Moreover, the right-hand side $G$ is computed
-as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step, $U^{prev}$ is the solution computed in the previous time
-step and each element $f(x, y, z)$ of the vector $F$ is computed as follows:
-\begin{equation}
-f(x,y,z)=\cos(2\pi x)\cdot\cos(4\pi y)\cdot\cos(6\pi z).
-\label{eq:18}
-\end{equation}
-Finally, the solution $U$ of the obstacle problem is copied back from the GPU global memories to the CPU memories. We use the
-communication subroutines of the CUBLAS library~\cite{ref8} (CUDA Basic Linear Algebra Subroutines) for the memory allocations in
-the GPU (\verb+cublasAlloc()+) and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and \verb+cublasGetVector()+.
-
\begin{algorithm}[!t]
-% \SetLine
-% \linesnumbered
$p = 0$\;
$conv = false$\;
$U = U^{0}$\;
$p = p + 1$\;
$conv$ = Convergence($error$, $p$, $\varepsilon$, $MaxRelax$)\;
}
-\caption{Parallel iterative solving of the nonlinear systems on a GPU cluster ($Solve()$ function)}
-\label{alg:02}
+\caption{parallel iterative solving of the nonlinear systems on a GPU cluster ($Solve()$ function)}
+\label{ch13:alg:02}
\end{algorithm}
-As many other iterative methods, the algorithm of the projected Richardson method is based on algebraic functions operating on vectors
-and/or matrices, which are more efficient on parallel computers when they work on large vectors. Its parallel implementation on the GPU
-cluster is carried out so that the GPUs execute the vector operations as kernels and the CPUs execute the serial codes, supervise the
-kernel executions and the data exchanges with the neighboring nodes and supply the GPUs with data. Algorithm~\ref{alg:02} shows the
-main key points of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{alg:01}). All the vector operations inside
-the main loop ({\bf repeat} ... {\bf until}) are executed by the GPU. We use the following functions of the CUBLAS library:
+As are many other iterative methods, the algorithm of the projected Richardson
+method\index{iterative method!projected Richardson} is based on algebraic
+functions operating on vectors and/or matrices, which are more efficient on
+parallel computers when they work on large vectors. Its parallel implementation
+on the GPU cluster is carried out so that the GPUs execute the vector operations
+as kernels and the CPUs execute the serial codes, supervise the kernel executions
+and the data exchanges with the neighboring nodes\index{neighboring node}, and
+supply the GPUs with data. Algorithm~\ref{ch13:alg:02} shows the main key points
+of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{ch13:alg:01}).
+All the vector operations inside the main loop ({\bf repeat} ... {\bf until})
+are executed by the GPU. We use the following functions of the CUBLAS library\index{CUBLAS}:
\begin{itemize}
\item \verb+cublasDaxpy()+ to compute the difference between the solution vectors $U^{p}$ and $U^{p+1}$ computed in two successive relaxations
-$p$ and $p+1$ (line~$7$ in Algorithm~\ref{alg:02}),
-\item \verb+cublasDnrm2()+ to perform the Euclidean norm (line~$8$) and,
+$p$ and $p+1$ (line~$7$ in Algorithm~\ref{ch13:alg:02}),
+\item \verb+cublasDnrm2()+ to perform the Euclidean norm (line~$8$), and
\item \verb+cublasDcpy()+ for the data copy of a vector to another one in the GPU memory (lines~$3$ and~$9$).
\end{itemize}
-The dimensions of the grid and blocks of threads that execute a given kernel depend on the resources of the GPU multiprocessor and the
-resource requirements of the kernel. So, if $block$ defines the size of a thread block, which must not exceed the maximum size of a thread
-block, then the number of thread blocks in the grid, denoted by $grid$, can be computed according to the size of the local sub-problem
-as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\] However, when solving very large problems, the size of the thread
-grid can exceed the maximum number of thread blocks that can be executed on the GPUs (up-to $65.535$ thread blocks) and, thus, the kernel
-will fail to launch. Therefore, for each kernel, we decompose the three-dimensional sub-problem into $nz$ two-dimensional slices of size
-($NX\times ny$), as is shown in Figure~\ref{fig:02}. All slices of the same kernel are executed using {\bf for} loop by $NX\times ny$ parallel
-threads organized in a two-dimensional grid of two-dimensional thread blocks, as is shown in Listing~\ref{list:01}. Each thread is in charge
-of $nz$ discretization points (one from each slice), accessed in the GPU memory with a constant stride $(NX\times ny)$.
+The dimensions of the grid and blocks of threads that execute a given kernel
+depend on the resources of the GPU multiprocessor and the resource requirements
+of the kernel. So, if $block$ defines the size of a thread block, which must
+not exceed the maximum size of a thread block, then the number of thread blocks
+in the grid, denoted by $grid$, can be computed according to the size of the
+local subproblem as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\]
+However, when solving very large problems, the size of the thread grid can exceed
+the maximum number of thread blocks that can be executed on the GPUs (upto $65.535$
+thread blocks), and thus, the kernel will fail to launch. Therefore, for each kernel,
+we decompose the three-dimensional subproblem into $nz$ two-dimensional slices of size
+($NX\times ny$), as is shown in Figure~\ref{ch13:fig:02}. All slices of the same kernel
+are executed using a {\bf for} loop by $NX\times ny$ parallel threads organized in a
+two-dimensional grid of two-dimensional thread blocks, as is shown in Listing~\ref{ch13:list:01}.
+Each thread is in charge of $nz$ discretization points (one from each slice), accessed
+in the GPU memory with a constant stride $(NX\times ny)$.
\begin{figure}
\centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitGPU}}
-\caption{Decomposition of a sub-problem in a GPU into $nz$ slices.}
-\label{fig:02}
+\caption{Decomposition of a subproblem in a GPU into $nz$ slices.}
+\label{ch13:fig:02}
\end{figure}
\begin{center}
-\lstinputlisting[label=list:01,caption=Skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu}
+\lstinputlisting[label=ch13:list:01,caption=skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu}
\end{center}
-The function $Determine\_Bordering\_Vector\_Elements()$ (line~$5$ in Algorithm~\ref{alg:02}) determines the values of the vector
-elements shared at the boundaries with neighboring computing nodes. Its main operations are defined as follows:
+The function $Determine\_Bordering\_Vector\_Elements()$ (line~$5$ in Algorithm~\ref{ch13:alg:02})
+determines the values of the vector elements shared at the boundaries with neighboring computing
+nodes. Its main operations are as follows:
\begin{enumerate}
\item define the values associated to the bordering points needed by the neighbors,
\item copy the values associated to the bordering points from the GPU to the CPU,
-\item exchange the values associated to the bordering points between the neighboring CPUs,
-\item copy the received values associated to the bordering points from the CPU to the GPU,
+\item exchange the values associated to the bordering points between the neighboring CPUs, and
+\item copy the received values associated to the bordering points from the CPU to the GPU.
\end{enumerate}
The first operation of this function is implemented as kernels to be performed by the GPU:
\begin{itemize}
-\item a kernel executed by $NX\times nz$ threads to define the values associated to the bordering vector elements along $y$-axis and,
-\item a kernel executed by $NX\times ny$ threads to define the values associated to the bordering vector elements along $z$-axis.
+\item a kernel executed by $NX\times nz$ threads to define the values associated to the bordering vector elements along the $y$-axis, and
+\item a kernel executed by $NX\times ny$ threads to define the values associated to the bordering vector elements along the $z$-axis.
\end{itemize}
-As mentioned before, we develop the \emph{synchronous} and \emph{asynchronous} algorithms of the projected Richardson method. Obviously,
-in this scope, the synchronous or asynchronous communications refer to the communications between the CPU cores (MPI processes) on the
-GPU cluster, in order to exchange the vector elements associated to subdomain boundaries. For the memory copies between a CPU core and
-its GPU, we use the synchronous communication routines of the CUBLAS library: \verb+cublasSetVector()+ and \verb+cublasGetVector()+
-in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsync()+ and \verb+cublasGetVectorAsync()+ in the
-asynchronous algorithm. Moreover, we use the communication routines of the MPI library to carry out the data exchanges between the neighboring
-nodes. We use the following communication routines: \verb+MPI_Isend()+ and \verb+MPI_Irecv()+ to perform non-blocking sends and receptions,
-respectively. For the synchronous algorithm, we use the MPI routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node
-in blocking status until all data exchanges with neighboring nodes (sends and receptions) are completed. In contrast, for the asynchronous
-algorithms, we use the MPI routine \verb+MPI_Test()+ which tests the completion of a data exchange (send or reception) without putting the
-MPI process in blocking status.
-
-The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{alg:02}) computes, at each iteration, the new elements
-of the iterate vector $U$. Its general code is presented in Listing~\ref{list:01} (CPU function). The iterations of the projected
-Richardson method, based on those of the Jacobi method, are defined as follows:
+As mentioned previously, we develop the \emph{synchronous} and \emph{asynchronous}
+algorithms of the projected Richardson method. Obviously, in this scope, the
+synchronous\index{synchronous iterations} or asynchronous\index{asynchronous iterations} communications
+refer to the communications between the CPU cores (MPI processes) on the GPU cluster,
+in order to exchange the vector elements associated to subdomain boundaries. For
+the memory copies between a CPU core and its GPU, we use the synchronous communication
+routines of the CUBLAS library\index{CUBLAS}: \verb+cublasSetVector()+ and \verb+cublasGetVector()+
+in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsync()+
+and \verb+cublasGetVectorAsync()+ in the asynchronous algorithm. Moreover, we
+use the communication routines of the MPI library to carry out the data exchanges
+between the neighboring nodes. We use the following communication routines: \verb+MPI_Isend()+
+and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI!nonblocking}
+sends and receives, respectively. For the synchronous algorithm, we use the MPI
+routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node in
+blocking status until all data exchanges with neighboring nodes (sends and receives)
+are completed. In contrast, for the asynchronous algorithms, we use the MPI routine
+\verb+MPI_Test()+ which tests the completion of a data exchange (send or receives)
+without putting the MPI process in blocking status\index{MPI!blocking}.
+
+The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{ch13:alg:02})
+computes, at each iteration, the new elements of the iterate vector $U$. Its general code
+is presented in Listing~\ref{ch13:list:01} (CPU function). The iterations of the projected
+Richardson method\index{iterative method!projected Richardson}, based on those of the Jacobi
+method\index{iterative method!Jacobi}, are defined as follows:
\begin{equation}
\begin{array}{ll}
u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\
& South\cdot u^{p}(x,y-h,z) + North\cdot u^{p}(x,y+h,z) + \\
& Rear\cdot u^{p}(x,y,z-h) + Front\cdot u^{p}(x,y,z+h))),
\end{array}
-\label{eq:17}
+\label{ch13:eq:17}
\end{equation}
-where $u^{p}(x,y,z)$ is an element of the iterate vector $U$ computed at the iteration $p$ and $g(x,y,z)$ is a vector element of the
-right-hand side $G$. The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$ and $Front$ define constant coefficients of the
-block matrix $A$. Figure~\ref{fig:03} shows the positions of these coefficients in a three-dimensional domain.
+where $u^{p}(x,y,z)$ is an element of the iterate vector $U$ computed at the
+iteration $p$ and $g(x,y,z)$ is a vector element of the right-hand side $G$.
+The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$, and $Front$
+define constant coefficients of the block matrix $A$. Figure~\ref{ch13:fig:03}
+shows the positions of these coefficients in a three-dimensional domain.
\begin{figure}
\centerline{\includegraphics[scale=0.35]{Chapters/chapter13/figures/matrix}}
\caption{Matrix constant coefficients in a three-dimensional domain.}
-\label{fig:03}
+\label{ch13:fig:03}
\end{figure}
-The kernel implementations of the projected Richardson method on GPUs uses a perfect fine-grain multithreading parallelism. Since the
-projected Richardson algorithm is implemented as a fixed point method, each kernel is executed by a large number of GPU threads such
-that each thread is in charge of the computation of one element of the iterate vector $U$. Moreover, this method uses the vector elements
-updates of the Jacobi method, which means that each thread $i$ computes the new value of its element $u_{i}^{p+1}$ independently of the
-new values $u_{j}^{p+1}$, where $j\neq i$, of those computed in parallel by other threads at the same iteration $p+1$. Listing~\ref{list:02}
-shows the GPU implementations of the main kernels of the projected Richardson method, which are: the matrix-vector multiplication
-(\verb+MV_Multiplication()+) and the vector elements updates (\verb+Vector_Updates()+). The codes of these kernels are based on
-that presented in Listing~\ref{list:01}.
-
-\lstinputlisting[label=list:02,caption=GPU kernels of the projected Richardson method]{Chapters/chapter13/ex2.cu}
+The kernel implementations of the projected Richardson method on GPUs uses a
+perfect fine-grain multithreading parallelism. Since the projected Richardson
+algorithm is implemented as a fixed point method, each kernel is executed by
+a large number of GPU threads such that each thread is in charge of the computation
+of one element of the iterate vector $U$. Moreover, this method uses the vector
+elements updates of the Jacobi method, which means that each thread $i$ computes
+the new value of its element $u_{i}^{p+1}$ independently of the new values $u_{j}^{p+1}$,
+where $j\neq i$, of those computed in parallel by other threads at the same iteration
+$p+1$. Listing~\ref{ch13:list:02} shows the GPU implementations of the main kernels
+of the projected Richardson method, which are: the matrix-vector multiplication
+(\verb+MV_Multiplication()+) and the vector elements updates (\verb+Vector_Updates()+).
+The codes of these kernels are based on that presented in Listing~\ref{ch13:list:01}.
+
+\pagebreak
+\lstinputlisting[label=ch13:list:02,caption=GPU kernels of the projected Richardson method]{Chapters/chapter13/ex2.cu}
\begin{figure}
\centerline{\includegraphics[scale=0.3]{Chapters/chapter13/figures/points3D}}
\caption{Computation of a vector element with the projected Richardson method.}
-\label{fig:04}
+\label{ch13:fig:04}
\end{figure}
-Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices of $(NX\times ny)$ vector elements are computed in
-a {\bf for} loop. In this case, each thread is in charge of one vector element from each slice (in total $nz$ vector elements
-along $z$-axis). We can notice from the formula~(\ref{eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$, by
-a thread at iteration $p+1$, requires seven vector elements computed at the previous iteration $p$: two vector elements in
-each dimension plus the vector element at the intersection of the three axises $x$, $y$ and $z$ (see Figure~\ref{fig:04}).
-So, to reduce the memory accesses to the high-latency global memory, the vector elements of the current slice can be stored
-in the low-latency shared memories of thread blocks, as is described in~\cite{ref9}. Nevertheless, the fact that the computation
-of a vector element requires only two elements in each dimension does not allow to maximize the data reuse from the shared memories.
-The computation of a slice involves in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block, to fill the
-required vector elements in the shared memory where $bx$ and $by$ are the dimensions of a thread block. Then, in order to optimize
-the memory accesses on GPUs, the elements of the iterate vector $U$ are filled in the cache texture memory (see~\cite{ref10}).
-In new GPU generations as Fermi or Kepler, the global memory accesses are always cached in L1 and L2 caches. For example, for
-a given kernel, we can favour the use of the L1 cache to that of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+.
-So, the initial access to the global memory loads the vector elements required by the threads of a block into the cache memory
-(texture or L1/L2 caches). Then, all the following memory accesses read from this cache memory. In Listing~\ref{list:02}, the
-function \verb+fetch_double(v,i)+ is used to read from the texture memory the $i^{th}$ element of the double-precision vector
-\verb+v+ (see Listing~\ref{list:03}). Moreover, the seven constant coefficients of matrix $A$ can be stored in the constant memory
-but, since they are reused $nz$ times by each thread, it is more interesting to fill them on the low-latency registers of each thread.
-
-\lstinputlisting[label=list:03,caption=Memory access to the cache texture memory]{Chapters/chapter13/ex3.cu}
-
-The function $Convergence()$ (line~$11$ in Algorithm~\ref{alg:02}) allows to detect the convergence of the parallel iterative algorithm
-and is based on the tolerance threshold $\varepsilon$ and the maximum number of relaxations $MaxRelax$. We take into account the number
-of relaxations since that of iterations cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{eq:13}) of
-a local iterate vector $U_i$ according to $F_i$. Then, counting the number of relaxations is possible in both synchronous and asynchronous
-cases. On the other hand, an iteration is the update of at least all vector components with $F_i$.
-
-In the synchronous algorithm, the global convergence is detected when the maximal value of the absolute error, $error$, is sufficiently small
-and/or the maximum number of relaxations, $MaxRelax$, is reached, as follows:
+Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices
+of $(NX\times ny)$ vector elements are computed in a {\bf for} loop. In
+this case, each thread is in charge of one vector element from each slice
+(in total $nz$ vector elements along the $z$-axis). We can notice from the
+formula~(\ref{ch13:eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$,
+by a thread at iteration $p+1$, requires seven vector elements computed
+at the previous iteration $p$: two vector elements in each dimension plus
+the vector element at the intersection of the three axes $x$, $y$, and $z$
+(see Figure~\ref{ch13:fig:04}). So, to reduce the memory accesses to the
+high-latency global memory, the vector elements of the current slice can
+be stored in the low-latency shared memories of thread blocks, as is described
+in~\cite{ch13:ref9}. Nevertheless, the fact that the computation of a vector
+element requires only two elements in each dimension does not allow us to maximize
+the data reuse from the shared memories. The computation of a slice involves
+in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block,
+to fill the required vector elements in the shared memory where $bx$ and $by$
+are the dimensions of a thread block. Then, in order to optimize the memory
+accesses on GPUs, the elements of the iterate vector $U$ are filled in the
+cache texture memory (see~\cite{ch13:ref10}). In new GPU hardware and software as Fermi
+or Kepler, the global memory accesses are always cached in L1 and L2 caches.
+For example, for a given kernel, we can favor the use of the L1 cache to that
+of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+.
+So, the initial access to the global memory loads the vector elements required
+by the threads of a block into the cache memory (texture or L1/L2 caches). Then,
+all the following memory accesses read from this cache memory. In Listing~\ref{ch13:list:02},
+the function \verb+fetch_double(v,i)+ is used to read from the texture memory
+the $ith$ element of the double-precision vector \verb+v+ (see Listing~\ref{ch13:list:03}).
+Moreover, the seven constant coefficients of matrix $A$ can be stored in the
+constant memory but, since they are reused $nz$ times by each thread, it is more
+efficient to fill them on the low-latency registers of each thread.
+
+\pagebreak
+\lstinputlisting[label=ch13:list:03,caption=memory access to the cache texture memory]{Chapters/chapter13/ex3.cu}
+
+The function $Convergence()$ (line~$11$ in Algorithm~\ref{ch13:alg:02}) allows us
+to detect the convergence of the parallel iterative algorithm and is based on
+the tolerance threshold\index{convergence!tolerance threshold} $\varepsilon$
+and the maximum number of relaxations\index{convergence!maximum number of relaxations}
+$MaxRelax$. We take into account the number of relaxations since that of iterations
+cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{ch13:eq:13})
+of a local iterate vector $U_i$ according to $F_i$. Then, counting the number
+of relaxations is possible in both synchronous and asynchronous cases. On the
+other hand, an iteration is the update of at least all vector components with
+$F_i$.
+
+In the synchronous\index{synchronous iterations} algorithm, the global convergence is detected
+when the maximal value of the absolute error, $error$, is sufficiently small and/or
+the maximum number of relaxations, $MaxRelax$, is reached, as follows:
$$
\begin{array}{l}
error=\|U^{p}-U^{p+1}\|_{2}; \\
conv \leftarrow true;
\end{array}
$$
-where the function $AllReduce()$ uses the MPI reduction subroutine \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the
-local absolute errors, $error$, of all computing nodes and $p$ (in Algorithm~\ref{alg:02}) is used as a counter of the local relaxations carried
-out by a computing node. In the asynchronous algorithms, the global convergence is detected when all computing nodes locally converge. For this,
-we use a token ring architecture around which a boolean token travels, in one direction, from a computing node to another. Starting from node $0$,
-the boolean token is set to $true$ by node $i$ if the local convergence is reached or to $false$ otherwise and, then, it is sent to node $i+1$.
-Finally, the global convergence is detected when node $0$ receives from its neighbor node $S-1$, in the ring architecture, a token set to $true$.
-In this case, node $0$ sends a stop message (end of parallel solving) to all computing nodes in the cluster.
+where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI!global}
+\verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local
+absolute errors, $error$, of all computing nodes, and $p$ (in Algorithm~\ref{ch13:alg:02})
+is used as a counter of the local relaxations carried out by a computing node. In
+the asynchronous\index{asynchronous iterations} algorithms, the global convergence is detected
+when all computing nodes locally converge. For this, we use a token ring architecture
+around which a boolean token travels, in one direction, from a computing node to another.
+Starting from node $0$, the boolean token is set to $true$ by node $i$ if the local
+convergence is reached or to $false$ otherwise, and then, it is sent to node $i+1$.
+Finally, the global convergence is detected when node $0$ receives from its neighbor
+node $S-1$, in the ring architecture, a token set to $true$. In this case, node $0$
+sends a stop message (end of parallel solving) to all computing nodes in the cluster.
%%--------------------------%%
%% SECTION 5 %%
%%--------------------------%%
\section{Experimental tests on a GPU cluster}
-\label{sec:05}
-The GPU cluster of tests, that we used in this chapter, is an $20Gbps$ Infiniband network of six machines. Each machine is a Quad-Core Xeon
-E5530 CPU running at $2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it is equipped with two Nvidia
-Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores running at $1.3$GHz. It provides $4$GB of global memory with a memory bandwidth
-of $102$GB/s, accessible by all its cores and also by the CPU through the PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s.
-Hence, the memory copy operations between the GPU and the CPU are about $12$ times slower than those of the Tesla GPU memory. We have performed
-our simulations on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{fig:05} describes the components of the GPU cluster
-of tests.
-
-\begin{figure}
-\centerline{\includegraphics[scale=0.25]{Chapters/chapter13/figures/cluster}}
-\caption{GPU cluster of tests composed of 12 computing nodes (six machines, each with two GPUs.}
-\label{fig:05}
-\end{figure}
-
-Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of the methods on both
-GPU cluster and CPU cluster. CUDA version 4.0~\cite{ref12} is used for programming GPUs, using CUBLAS library~\cite{ref8} to deal with vector
-operations in GPUs and, finally, MPI functions of OpenMPI 1.3.3 are used to carry out the synchronous and asynchronous communications between
-CPU cores. Indeed, in our experiments, a computing node is managed by a MPI process and it is composed of one CPU core and one GPU card.
+\label{ch13:sec:05}
+The GPU cluster\index{GPU!cluster} of tests that we used in this chapter is an $20GB/s$
+Infiniband network of six machines. Each machine is a Quad-Core Xeon E5530 CPU running at
+$2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it
+is equipped with two NVIDIA Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores
+running at $1.3$GHz. It provides $4$GB of global memory with a memory bandwidth of $102$GB/s,
+accessible by all its cores and also by the CPU through the PCI-Express 16x Gen 2.0 interface
+with a throughput of $8$GB/s. Hence, the memory copy operations between the GPU and the CPU
+are about $12$ times slower than those of the Tesla GPU memory. We have performed our simulations
+on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{ch12:fig:04} describes
+the components of the GPU cluster of tests.
+
+Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for
+coding the parallel algorithms of the methods on both GPU cluster and CPU cluster. CUDA
+version 4.0~\cite{ch13:ref12} is used for programming GPUs, using CUBLAS library~\cite{ch13:ref8}
+to deal with vector operations in GPUs, and finally, MPI functions of OpenMPI 1.3.3 are
+used to carry out the synchronous and asynchronous communications between CPU cores. Indeed,
+in our experiments, a computing node is managed by one MPI process and it is composed of
+one CPU core and one GPU card.
-All experimental results of the parallel projected Richardson algorithms are obtained from simulations made in double precision data. The obstacle
-problems to be solved are defined in constant three-dimensional domain $\Omega\subset\mathbb{R}^{3}$. The numerical values of the parameters of the
-obstacle problems are: $\eta=0.2$, $c=1.1$, $f$ is computed by formula~(\ref{eq:18}) and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$)
-are computed with $k=0.0066$. As the discretization matrix is constant along the time steps, the convergence properties of the iterative algorithms
-do not change. Thus, the performance characteristics obtained with three time steps will still be valid for more time steps. The initial function
-$u(0,x,y,z)$ of the obstacle problem~(\ref{eq:01}) is set to $0$, with a constraint $u\geq\phi=0$. The relaxation parameter $\gamma$ used by the
-projected Richardson method is computed automatically thanks to the diagonal entries of the discretization matrix. The formula and its proof can be
-found in~\cite{ref11}, Section~2.3. The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the maximum number of relaxations is
-limited to $10^{6}$ relaxations. Finally, the number of threads per block is set to $256$ threads, which gives, in general, good performances for
-most GPU applications. We have performed some tests for the execution configurations and we have noticed that the best configuration of the $256$
-threads per block is an organization into two dimensions of sizes $(64,4)$.
-
-\begin{table}[!h]
+All experimental results of the parallel projected Richardson algorithms are obtained
+from simulations made in double precision data. The obstacle problems to be solved are
+defined in constant three-dimensional domain $\Omega\subset\mathbb{R}^{3}$. The numerical
+values of the parameters of the obstacle problems are $\eta=0.2$, $c=1.1$, $f$ is computed
+by formula~(\ref{ch13:eq:18}), and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$)
+are computed with $k=0.0066$. As the discretization matrix is constant along the time
+steps, the convergence properties of the iterative algorithms do not change. Thus, the
+performance characteristics obtained with three time steps will still be valid for more
+time steps. The initial function $u(0,x,y,z)$ of the obstacle problem~(\ref{ch13:eq:01})
+is set to $0$, with a constraint $u\geq\phi=0$. The relaxation parameter $\gamma$ used
+by the projected Richardson method is computed automatically thanks to the diagonal entries
+of the discretization matrix. The formula and its proof can be found in~\cite{ch13:ref11}.
+The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the
+maximum number of relaxations is limited to $10^{6}$ relaxations. Finally, the number of
+threads per block is set to $256$ threads, which gives, in general, good performances for
+most GPU applications. We have performed some tests for the execution configurations and
+have noticed that the best configuration of the $256$ threads per block is an organization
+into two dimensions of sizes $(64,4)$.
+
+The performance measures that we took into account are the execution times and the number
+of relaxations performed by the parallel iterative algorithms, both synchronous and asynchronous
+versions, on the GPU and CPU clusters. These algorithms are used for solving nonlinear systems
+derived from the discretization of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$,
+and $800^{3}$. In Table~\ref{ch13:tab:01} and Table~\ref{ch13:tab:02}, we show the performances
+of the parallel synchronous and asynchronous algorithms of the projected Richardson method
+implemented, respectively, on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. In
+these tables, the execution time defines the time spent by the slowest computing node and the
+number of relaxations is computed as the summation of those carried out by all computing nodes.
+
+In the sixth column of Table~\ref{ch13:tab:01} and in the eighth column of Table~\ref{ch13:tab:02},
+we give the gains in percentage obtained by using an asynchronous algorithm compared to a synchronous
+one. We can notice that the asynchronous version on CPU and GPU clusters is slightly faster than
+the synchronous one for both methods. Indeed, the cluster of tests is composed of local and homogeneous
+nodes communicating via low-latency connections. So, in the case of distant and/or heterogeneous
+nodes (or even with geographically distant clusters), the asynchronous version would be faster than
+the synchronous one. However, the gains obtained on the GPU cluster are better than those obtained
+on the CPU cluster. In fact, the computation times are reduced by accelerating the computations on
+GPUs while the communication times remain unchanged.
+
+\begin{table}
\centering
\begin{tabular}{|c|c|c|c|c|c|}
\hline
\multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-5}
- & $\mathbf{T_{cpu}}$ & {\bf \#relax.} & $\mathbf{T_{cpu}}$ & {\bf \#relax.} & \\ \hline \hline
+ & $\mathbf{T_{cpu}}$ & {\bf \# Relax.} & $\mathbf{T_{cpu}}$ & {\bf \# Relax.} & \\ \hline \hline
$256^{3}$ & $575.22$ & $198,288$ & $539.25$ & $198,613$ & $6.25$ \\ \hline \hline
\end{tabular}
\vspace{0.5cm}
\caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 24 CPU cores.}
-\label{tab:01}
+\label{ch13:tab:01}
\end{table}
-\begin{table}[!h]
+\begin{table}
\centering
+\begin{scriptsize}
\begin{tabular}{|c|c|c|c|c|c|c|c|}
\hline
-\multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous} & \multicolumn{3}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-7}
+\multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous} & \multicolumn{3}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-7}
- & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & \\ \hline \hline
+ & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & $\mathbf{\tau}$ & \\ \hline \hline
-$256^{3}$ & $29.67$ & $100,692$ & $19.39$ & $18.00$ & $94,215$ & $29.96$ & $39.33$ \\ \hline \hline
+$256^{3}$ & $29.67$ & $100,692$ & $19.39$ & $18.00$ & $94,215$ & $29.96$ & $39.33$ \\\hline \hline
-$512^{3}$ & $521.83$ & $381,300$ & $36.89$ & $425.15$ & $347,279$ & $42.89$ & $18.53$ \\ \hline \hline
+$512^{3}$ & $521.83$ & $381,300$ & $36.89$ & $425.15$ & $347,279$ & $42.89$ & $18.53$ \\\hline \hline
-$768^{3}$ & $4,112.68$ & $831,144$ & $50.13$ & $3,313.87$ & $750,232$ & $55.40$ & $19.42$ \\ \hline \hline
+$768^{3}$ & $4,112.68$ & $831,144$ & $50.13$ & $3,313.87$ & $750,232$ & $55.40$ & $19.42$ \\ \hline \hline
-$800^{3}$ & $3,950.87$ & $899,088$ & $56.22$ & $3,636.57$ & $834,900$ & $51.91$ & $7.95$ \\ \hline
+$800^{3}$ & $3,950.87$ & $899,088$ & $56.22$ & $3,636.57$ & $834,900$ & $51.91$ & $7.95$ \\ \hline
\end{tabular}
+\end{scriptsize}
\vspace{0.5cm}
\caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 12 GPUs.}
-\label{tab:02}
+\label{ch13:tab:02}
\end{table}
-The performance measures that we took into account are the execution times and the number of relaxations performed by the parallel iterative algorithms,
-both synchronous and asynchronous versions, on the GPU and CPU clusters. These algorithms are used for solving nonlinear systems derived from the discretization
-of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$ and $800^{3}$. In Table~\ref{tab:01} and Table~\ref{tab:02}, we show the performances
-of the parallel synchronous and asynchronous algorithms of the projected Richardson method implemented, respectively, on a cluster of $24$ CPU cores
-and on a cluster of $12$ GPUs. In these tables, the execution time defines the time spent by the slowest computing node and the number of relaxations
-is computed as the summation of those carried out by all computing nodes.
-
-In the sixth column of Table~\ref{tab:01} and in the eighth column of Table~\ref{tab:02}, we give the gains in $\%$ obtained by using an
-asynchronous algorithm compared to a synchronous one. We can notice that the asynchronous version on CPU and GPU clusters is slightly faster
-than the synchronous one for both methods. Indeed, the cluster of tests is composed of local and homogeneous nodes communicating via low-latency
-connections. So, in the case of distant and/or heterogeneous nodes (or even with geographically distant clusters) the asynchronous version
-would be faster than the synchronous one. However, the gains obtained on the GPU cluster are better than those obtained on the CPU cluster.
-In fact, the computation times are reduced by accelerating the computations on GPUs while the communication times still unchanged.
-
-The fourth and seventh columns of Table~\ref{tab:02} show the relative gains obtained by executing the parallel algorithms on the cluster
-of $12$ GPUs instead on the cluster of $24$ CPU cores. We compute the relative gain $\tau$ as a ratio between the execution time $T_{cpu}$
-spent on the CPU cluster over that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see from these ratios that
-solving large obstacle problems is faster on the GPU cluster than on the CPU cluster. Indeed, the GPUs are more efficient than their
-counterpart CPUs to execute large data-parallel operations. In addition, the projected Richardson method is implemented as a fixed point-based
-iteration and uses the Jacobi vector updates that allow a well thread-parallelization on GPUs, such that each GPU thread is in charge
-of one vector component at a time without being dependent on other vector components computed by other threads. Then, this allow to exploit
-at best the high performance computing of the GPUs by using all the GPU resources and avoiding the idle cores.
-
-Finally, the number of relaxations performed by the parallel synchronous algorithm is different in the CPU and GPU versions, because the number
-of computing nodes involved in the GPU cluster and in the CPU cluster is different. In the CPU case, $24$ computing nodes ($24$ CPU cores) are
-considered, whereas in the GPU case, $12$ computing nodes ($12$ GPUs) are considered. As the number of relaxations depends on the domain decomposition,
+The fourth and seventh columns of Table~\ref{ch13:tab:02} show the relative gains
+obtained by executing the parallel algorithms on the cluster of $12$ GPUs instead
+on the cluster of $24$ CPU cores. We compute the relative gain\index{relative gain}
+$\tau$ as a ratio between the execution time $T_{cpu}$ spent on the CPU cluster over
+that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see
+from these ratios that solving large obstacle problems is faster on the GPU cluster
+than on the CPU cluster. Indeed, the GPUs are more efficient than their counterpart
+CPUs at executing large data-parallel operations. In addition, the projected Richardson
+method is implemented as a fixed point based iteration and uses the Jacobi vector updates
+that allow a well-suited thread-parallelization on GPUs, such that each GPU thread is in charge
+of one vector component at a time without being dependent on other vector components
+computed by other threads. Then, this allows us to exploit at best the high performance
+computing of the GPUs by using all the GPU resources and avoiding the idle cores.
+
+Finally, the number of relaxations performed by the parallel synchronous algorithm
+is different in the CPU and GPU versions, because the number of computing nodes involved
+in the GPU cluster and in the CPU cluster is different. In the CPU case, $24$ computing
+nodes ($24$ CPU cores) are considered, whereas in the GPU case, $12$ computing nodes
+($12$ GPUs) are considered. As the number of relaxations depends on the domain decomposition,
consequently it also depends on the number of computing nodes.
+
%%--------------------------%%
%% SECTION 6 %%
%%--------------------------%%
-\section{Red-Black ordering technique}
-\label{sec:06}
-As is well-known, the Jacobi method is characterized by a slow convergence rate compared to some iterative methods (for example Gauss-Seidel method).
-So, in this section, we present some solutions to reduce the execution time and the number of relaxations and, more specifically, to speed up the
-convergence of the parallel projected Richardson method on the GPU cluster. We propose to use the point red-black ordering technique to accelerate
-the convergence. This technique is often used to increase the parallelism of iterative methods for solving linear systems~\cite{ref13,ref14,ref15}.
-We apply it to the projected Richardson method as a compromise between the Jacobi and Gauss-Seidel iterative methods.
-
-The general principle of the red-black technique is as follows. Let $t$ be the summation of the integer $x$-, $y$- and $z$-coordinates of a vector
-element $u(x,y,z)$ on a three-dimensional domain: $t=x+y+z$. As is shown in Figure~\ref{fig:06.01}, the red-black ordering technique consists in the
-parallel computing of the red vector elements having even value $t$ by using the values of the black ones then, the parallel computing of the black
-vector elements having odd values $t$ by using the new values of the red ones.
+\section{Red-black ordering technique}
+\label{ch13:sec:06}
+As is well known, the Jacobi method\index{iterative method!Jacobi} is characterized
+by a slow convergence\index{convergence} rate compared to some iterative methods\index{iterative method}
+(for example, Gauss-Seidel method\index{iterative method!Gauss-Seidel}). So, in this
+section, we present some solutions to reduce the execution time and the number of
+relaxations and, more specifically, to speed up the convergence of the parallel
+projected Richardson method on the GPU cluster. We propose to use the point red-black
+ordering technique\index{iterative method!red-black ordering} to accelerate the
+convergence. This technique is often used to increase the parallelism of iterative
+methods for solving linear systems~\cite{ch13:ref13,ch13:ref14,ch13:ref15}. We
+apply it to the projected Richardson method as a compromise between the Jacobi
+and Gauss-Seidel iterative methods.
+
+The general principle of the red-black technique is as follows. Let $t$ be the
+summation of the integer $x$-, $y$-, and $z$-coordinates of a vector element $u(x,y,z)$
+on a three-dimensional domain: $t=x+y+z$. As is shown in Figure~\ref{ch13:fig:06.01},
+the red-black ordering technique consists of the parallel computing of the red
+vector elements having even value $t$ by using the values of the black ones, then
+the parallel computing of the black vector elements having odd values $t$ by using
+the new values of the red ones.
+
+This technique can be implemented on the GPU in two different manners:
+\begin{itemize}
+\item among all launched threads ($NX\times ny$ threads), only one thread out of two computes its red or black vector element at a time or
+\item all launched threads (on average half of $NX\times ny$ threads) compute the red vector elements first, and then the black ones.
+\end{itemize}
+However, in both solutions, for each memory transaction, only half of the memory
+segment addressed by a half-warp is used. So, the computation of the red and black
+vector elements leads to using twice the initial number of memory transactions. Then,
+we apply the point red-black ordering\index{iterative method!red-black ordering}
+accordingly to the $y$-coordinate, as is shown in Figure~\ref{ch13:fig:06.02}. In
+this case, the vector elements having even $y$-coordinate are computed in parallel
+using the values of those having odd $y$-coordinate and then vice-versa. Moreover,
+in the GPU implementation of the parallel projected Richardson method (Section~\ref{ch13:sec:04}),
+we have shown that a subproblem of size $(NX\times ny\times nz)$ is decomposed into
+$nz$ grids of size $(NX\times ny)$. Then, each kernel is executed in parallel by
+$NX\times ny$ GPU threads, so that each thread is in charge of $nz$ vector elements
+along the $z$-axis (one vector element in each grid of the subproblem). So, we propose
+to use the new values of the vector elements computed in grid $i$ to compute those
+of the vector elements in grid $i+1$. Listing~\ref{ch13:list:04} describes the kernel
+of the matrix-vector multiplication and the kernel of the vector elements updates of
+the parallel projected Richardson method using the red-black ordering technique.
\begin{figure}
\centering
- \mbox{\subfigure[Red-black ordering on x, y and z axises]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir}\label{fig:06.01}}\quad
- \subfigure[Red-black ordering on y axis]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir-y}\label{fig:06.02}}}
+ \mbox{\subfigure[Red-black ordering on x, y, and z axises]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir_clair}\label{ch13:fig:06.01}}\quad
+ \subfigure[Red-black ordering on y axis]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir-y_clair}\label{ch13:fig:06.02}}}
\caption{Red-black ordering for computing the iterate vector elements in a three-dimensional space.}
\end{figure}
-This technique can be implemented on the GPU in two different manners:
-\begin{itemize}
-\item among all launched threads ($NX\times ny$ threads), only one thread out of two computes its red or black vector element at a time or,
-\item all launched threads (on average half of $NX\times ny$ threads) compute the red vector elements first and, then, the black ones.
-\end{itemize}
-However, in both solutions, for each memory transaction, only half of the memory segment addressed by a half-warp is used. So, the computation of the
-red and black vector elements leads to use twice the initial number of memory transactions. Then, we apply the point red-black ordering accordingly to
-the $y$-coordinate, as is shown in Figure~\ref{fig:06.02}. In this case, the vector elements having even $y$-coordinate are computed in parallel using
-the values of those having odd $y$-coordinate and then vice-versa. Moreover, in the GPU implementation of the parallel projected Richardson method (Section~\ref{sec:04}),
-we have shown that a sub-problem of size $(NX\times ny\times nz)$ is decomposed into $nz$ grids of size $(NX\times ny)$. Then, each kernel is executed
-in parallel by $NX\times ny$ GPU threads, so that each thread is in charge of $nz$ vector elements along $z$-axis (one vector element in each grid of
-the sub-problem). So, we propose to use the new values of the vector elements computed in grid $i$ to compute those of the vector elements in grid $i+1$.
-Listing~\ref{list:04} describes the kernel of the matrix-vector multiplication and the kernel of the vector elements updates of the parallel projected
-Richardson method using the red-black ordering technique.
-
-\lstinputlisting[label=list:04,caption=GPU kernels of the projected Richardson method using the red-black technique]{Chapters/chapter13/ex4.cu}
-
-Finally, we exploit the concurrent executions between the host functions and the GPU kernels provided by the GPU hardware and software. In fact, the kernel
-launches are asynchronous (when this environment variable is not disabled on the GPUs), such that the control is returned to the host (MPI process) before
-the GPU has completed the requested task (kernel)~\cite{ref12}. Therefore, all the kernels necessary to update the local vector elements, $u(x,y,z)$ where
-$0<y<(ny-1)$ and $0<z<(nz-1)$, are executed first. Then, the values associated to the bordering vector elements are exchanged between the neighbors. Finally,
-the values of the vector elements associated to the bordering vector elements are updated. In this case, the computation of the local vector elements is
-performed concurrently with the data exchanges between neighboring CPUs and this in both synchronous and asynchronous cases.
+\pagebreak
+\lstinputlisting[label=ch13:list:04,caption=GPU kernels of the projected Richardson method using the red-black technique]{Chapters/chapter13/ex4.cu}
+
+Finally, we exploit the concurrent executions between the host functions and the GPU
+kernels provided by the GPU hardware and software. In fact, the kernel launches are
+asynchronous (when this environment variable is not disabled on the GPUs), such that
+the control is returned to the host (MPI process) before the GPU has completed the
+requested task (kernel)~\cite{ch13:ref12}. Therefore, all the kernels necessary to
+update the local vector elements, $u(x,y,z)$ where $0<y<(ny-1)$ and $0<z<(nz-1)$,
+are executed first. Then, the values associated to the bordering vector elements
+are exchanged between the neighbors. Finally, the values of the vector elements
+associated to the bordering vector elements are updated. In this case, the computation
+of the local vector elements is performed concurrently with the data exchanges between
+neighboring CPUs and this in both synchronous and asynchronous cases.
+
+In Table~\ref{ch13:tab:03}, we report the execution times and the number of relaxations
+performed on a cluster of $12$ GPUs by the parallel projected Richardson algorithms; it
+can be noted that the performances of the projected Richardson algorithm are improved by using the
+point red-black ordering. We compare the performances of the parallel projected Richardson
+method with and without this later ordering (Tables~\ref{ch13:tab:02} and~\ref{ch13:tab:03}).
+We can notice that both parallel synchronous and asynchronous algorithms are faster when
+they use the red-black ordering. Indeed, we can see in Table~\ref{ch13:tab:03} that the
+execution times of these algorithms are reduced, on average, by $32\%$ compared to those
+shown in Table~\ref{ch13:tab:02}.
\begin{table}[!h]
\centering
\hline
\multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-5}
- & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & \\ \hline \hline
+ & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & \\ \hline \hline
$256^{3}$ & $18.37$ & $71,988$ & $12.58$ & $67,638$ & $31.52$ \\ \hline \hline
$800^{3}$ & $2,748.23$ & $638,916$ & $2,502.61$ & $592,525$ & $8.92$ \\ \hline
\end{tabular}
\vspace{0.5cm}
-\caption{Execution times in seconds of the parallel projected Richardson method using read-black ordering technique implemented on a cluster of 12 GPUs.}
-\label{tab:03}
+\caption{Execution times in seconds of the parallel projected Richardson method using red-black ordering technique implemented on a cluster of 12 GPUs.}
+\label{ch13:tab:03}
\end{table}
-In Table~\ref{tab:03}, we report the execution times and the number of relaxations performed on a cluster of $12$ GPUs by the parallel projected Richardson
-algorithms; it can be noted that the performances of the projected Richardson are improved by using the point read-black ordering. We compare the performances
-of the parallel projected Richardson method with and without this later ordering (Tables~\ref{tab:02} and~\ref{tab:03}). We can notice that both parallel synchronous
-and asynchronous algorithms are faster when they use the red-black ordering. Indeed, we can see in Table~\ref{tab:03} that the execution times of these algorithms
-are reduced, on average, by $32\%$ compared to those shown in Table~\ref{tab:02}.
-
\begin{figure}
-\centerline{\includegraphics[scale=0.9]{Chapters/chapter13/figures/scale}}
+\centerline{\includegraphics[scale=0.7]{Chapters/chapter13/figures/scale}}
\caption{Weak scaling of both synchronous and asynchronous algorithms of the projected Richardson method using red-black ordering technique.}
-\label{fig:07}
+\label{ch13:fig:07}
\end{figure}
-In Figure~\ref{fig:07}, we study the ratio between the computation time and the communication time of the parallel projected Richardson algorithms on a GPU cluster.
-The experimental tests are carried out on a cluster composed of one to ten Tesla GPUs. We have focused on the weak scaling of both parallel, synchronous and asynchronous,
-algorithms using the red-black ordering technique. For this, we have fixed the size of a sub-problem to $256^{3}$ per computing node (a CPU core and a GPU). Then,
-Figure~\ref{fig:07} shows the number of relaxations performed, on average, per second by a computing node. We can see from this figure that the efficiency of the
-asynchronous algorithm is almost stable, while that of the synchronous algorithm decreases (down to $81\%$ in this example) with the increasing of the number of
-computing nodes on the cluster. This is due to the fact that the ratio between the time of the computation over that of the communication is reduced when the computations
-are performed on GPUs. Indeed, GPUs compute faster than CPUs and communications are more time consuming. In this context, asynchronous algorithms are more scalable
-than synchronous ones. So, with large scale GPU clusters, synchronous algorithms might be more penalized by communications, as can be deduced from Figure~\ref{fig:07}.
-That is why we think that asynchronous iterative algorithms are all the more interesting in this case.
+In Figure~\ref{ch13:fig:07}, we study the ratio between the computation time and
+the communication time of the parallel projected Richardson algorithms on a GPU
+cluster. The experimental tests are carried out on a cluster composed of one to
+ten Tesla GPUs. We have focused on the weak scaling of both parallel, synchronous
+and asynchronous, algorithms using the red-black ordering technique. For this, we
+have fixed the size of a subproblem to $256^{3}$ per computing node (a CPU core
+and a GPU). Then, Figure~\ref{ch13:fig:07} shows the number of relaxations performed,
+on average, per second by a computing node. We can see from this figure that the
+efficiency of the asynchronous algorithm is almost stable, while that of the synchronous
+algorithm decreases (down to $81\%$ in this example) with the increase in the
+number of computing nodes on the cluster. This is due to the fact that the ratio
+between the time of the computation over that of the communication is reduced when
+the computations are performed on GPUs. Indeed, GPUs compute faster than CPUs and
+communications are more time-consuming. In this context, asynchronous algorithms
+are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{synchronous iterations}
+algorithms might be more penalized by communications, as can be deduced from Figure~\ref{ch13:fig:07}.
+That is why we think that asynchronous\index{asynchronous iterations} iterative algorithms
+are all the more interesting in this case.
%%--------------------------%%
%% SECTION 7 %%
%%--------------------------%%
\section{Conclusion}
-\label{sec:07}
-Our main contribution, in this chapter, is the parallel implementation of an asynchronous iterative method on GPU clusters for solving large scale nonlinear
-systems derived from the spatial discretization of three-dimensional obstacle problems. For this, we have implemented both synchronous and asynchronous algorithms of the
-Richardson iterative method using a projection on a convex set. Indeed, this method uses point-based iterations of the Jacobi method that are very easy to parallelize on
-parallel computers. We have shown that its adapted parallel algorithms to GPU architectures allows to exploit at best the computing power of the GPUs and to accelerate the
-resolution of large nonlinear systems. Consequently, the experimental results have shown that solving nonlinear systems of large obstacle problems with this method is about
-fifty times faster on a cluster of $12$ GPUs than on a cluster of $24$ CPU cores. Moreover, we have applied to this projected Richardson method the red-black ordering technique
-which allows it to improve its convergence rate. Thus, the execution times of both parallel algorithms performed on the cluster of $12$ GPUs are reduced on average of $32\%$.
-
-Afterwards, the experiments have shown that the asynchronous version is slightly more efficient than the synchronous one. In fact, the computations are accelerated by using GPUs
-while the communication times still unchanged. In addition, we have studied the weak-scaling in the synchronous and asynchronous cases, which has confirmed that the ratio between
-the computations and the communications are reduced when using a cluster of GPUs. We highlight that asynchronous iterative algorithms are more scalable than synchronous ones.
-Therefore, we can conclude that asynchronous iterations are well suited to tackle scalability issues on GPU clusters.
-
-In future works, we plan to perform experiments on large scale GPU clusters and on geographically distant GPU clusters, because we expect that asynchronous versions would
-be faster and more scalable on such architectures. Furthermore, we want to study the performance behavior and the scalability of other numerical algorithms which support,
-if possible, the model of asynchronous iterations.
+\label{ch13:sec:07}
+Our main contribution, in this chapter, is the parallel implementation of an asynchronous
+iterative method on GPU clusters for solving large scale nonlinear systems derived from the
+spatial discretization of three-dimensional obstacle problems. For this, we have implemented
+both synchronous and asynchronous algorithms of the Richardson iterative method using a projection
+on a convex set. Indeed, this method uses point-based iterations of the Jacobi method that
+are very easy to parallelize on parallel computers. We have shown that its adapted parallel
+algorithms to GPU architectures allow us to exploit at best the computing power of the GPUs and
+to accelerate the resolution of large nonlinear systems. Consequently, the experimental results
+have shown that solving nonlinear systems of large obstacle problems with this method is about
+fifty times faster on a cluster of $12$ GPUs than on a cluster of $24$ CPU cores. Moreover,
+we have applied to this projected Richardson method the red-black ordering technique which
+allows it to improve its convergence rate. Thus, the execution times of both parallel algorithms
+performed on the cluster of $12$ GPUs are reduced on average of $32\%$.
+
+Afterwards, the experiments have shown that the asynchronous version is slightly more efficient
+than the synchronous one. In fact, the computations are accelerated by using GPUs while the communication
+times are still unchanged. In addition, we have studied the weak-scaling in the synchronous and asynchronous
+cases, which has confirmed that the ratio between the computations and the communications are reduced
+when using a cluster of GPUs. We highlight that asynchronous iterative algorithms are more scalable
+than synchronous ones. Therefore, we can conclude that asynchronous iterations are well suited to
+tackle scalability issues on GPU clusters.
+
+In future works, we plan to perform experiments on large scale GPU clusters and on geographically
+distant GPU clusters, because we expect that asynchronous versions would be faster and more scalable
+on such architectures. Furthermore, we want to study the performance behavior and the scalability of
+other numerical algorithms which support, if possible, the model of asynchronous iterations.
\putbib[Chapters/chapter13/biblio13]