BookGPU/Chapters/chapter13/ch13.tex~

   1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2 %%                          %%
   3 %%       CHAPTER 13         %%
   4 %%                          %%
   5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   6
   7 \chapterauthor{}{}
   8 \newcommand{\scalprod}[2]%
   9 {\ensuremath{\langle #1 \, , #2 \rangle}}
  10 \chapter{Solving sparse nonlinear systems of obstacle problems on GPU clusters}
  11
  12 %%--------------------------%%
  13 %%       SECTION 1          %%
  14 %%--------------------------%%
  15 \section{Introduction}
  16 \label{sec:01}
  17 The obstacle problem is one kind of free boundary problems. It allows to model, for example, an elastic membrane covering a solid obstacle.
  18 In this case, the objective is to find an equilibrium position of this membrane constrained to be above the obstacle and which tends to minimize
  19 its surface and/or its energy. The study of such problems occurs in many applications, for example: fluid mechanics, bio-mathematics (tumour growth
  20 process) or financial mathematics (American or European option pricing).
  21
  22 In this chapter, we focus on solutions of large obstacle problems defined in a three-dimensional domain. Particularly, the present study consists
  23 in solving large nonlinear systems derived from the spatial discretization of these problems. Owing to the great size of such systems, in order to
  24 reduce computation times, we proceed by solving them by parallel synchronous or asynchronous iterative algorithms. Moreover, we aim at harnessing
  25 the computing power of GPUs to accelerate computations of these parallel algorithms. For this, we use an iterative method involving a projection
  26 on a convex set, which is: the projected Richardson method. We choose this method among other iterative methods because it is easy to implement on
  27 parallel computers and easy to adapt to GPU architectures.
  28
  29 In Section~\ref{sec:02}, we present the mathematical model of obstacle problems then, in Section~\ref{sec:03}, we describe the general principle of
  30 the parallel projected Richardson method. Next, in Section~\ref{sec:04}, we give the main key points of the parallel implementation of both synchronous
  31 and asynchronous algorithms of the projected Richardson method on a GPU cluster. In Section~\ref{sec:05}, we present the performances of both parallel
  32 algorithms obtained from simulations carried out on a CPU and GPU clusters. Finally, in Section~\ref{sec:06}, we use the read-black ordering technique
  33 to improve the convergence and, thus, the execution times of the parallel projected Richardson algorithms on the GPU cluster.
  34
  35
  36 %%--------------------------%%
  37 %%       SECTION 2          %%
  38 %%--------------------------%%
  39 \section{Obstacle problems}
  40 \label{sec:02}
  41 In this section, we present the mathematical model of obstacle problems defined in a three-dimensional domain.
  42 This model is based on that presented in~\cite{ref1}.
  43
  44 %%*******************
  45 %%*******************
  46 \subsection{Mathematical model}
  47 \label{sec:02.01}
  48 An obstacle problem, arising for example in mechanics or financial derivatives, consists in solving a time dependent
  49 nonlinear equation:
  50 \begin{equation}
  51 \left\{
  52 \begin{array}{l}
  53 \frac{\partial u}{\partial t}+b^t.\nabla u-\eta.\Delta u+c.u-f\geq 0\mbox{,~}u\geq\phi\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}\eta>0,\\
  54 (\frac{\partial u}{\partial t}+b^t.\nabla u-\eta.\Delta u+c.u-f)(u-\phi)=0\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega,\\
  55 u(0,x,y,z)=u_0(x,y,z),\\
  56 \mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega,
  57 \end{array}
  58 \right.
  59 \label{eq:01}
  60 \end{equation}
  61 where $u_0$ is the initial condition, $c\geq 0$, $b$ and $\eta$ are physical parameters, $T$ is the final time, $u=u(t,x,y,z)$
  62 is an element of the solution vector $U$ to compute, $f$ is the right-hand right that could represent, for example, the external
  63 forces, B.C. describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$, $\phi$ models a constraint
  64 imposed to $u$, $\Delta$ is the Laplacian operator, $\nabla$ is the gradient operator, a.e.w. means almost every where and ``.''
  65 defines the products between two scalars, a scalar and a vector or a matrix and a vector. In practice the boundary condition,
  66 generally considered, is the Dirichlet condition (where $u$ is fixed on $\partial\Omega$) or the Neumann condition (where the
  67 normal derivative of $u$ is fixed on $\partial\Omega$).
  68
  69 The time dependent equation~(\ref{eq:01}) is numerically solved by considering an implicit or a semi-implicit time marching,
  70 where at each time step $k$ a stationary nonlinear problem is solved:
  71 \begin{equation}
  72 \left\{
  73 \begin{array}{l}
  74 b^t.\nabla u-\eta.\Delta u+(c+\delta).u-g\geq 0\mbox{,~}u\geq\phi\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}\eta>0, \\
  75 (b^t.\nabla u-\eta.\Delta u+(c+\delta).u- g)(u-\phi)=0\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega, \\
  76 \mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega,
  77 \end{array}
  78 \right.
  79 \label{eq:02}
  80 \end{equation}
  81 where $\delta=\frac{1}{k}$ is the inverse of the time step $k$, $g=f+\delta u^{prev}$ and $u^{prev}$ is the solution computed at the
  82 previous time step.
  83
  84
  85 %%*******************
  86 %%*******************
  87 \subsection{Discretization}
  88 \label{sec:02.02}
  89 First, we note that the spatial discretization of the previous stationary problem~(\ref{eq:02}) does not provide a symmetric matrix,
  90 because the convection-diffusion operator is not self-adjoint. Moreover, the fact that the operator is self-adjoint or not plays an
  91 important role in the choice of the appropriate algorithm for solving nonlinear systems derived from the discretization of the obstacle
  92 problem. Nevertheless, since the convection coefficients arising in the operator~(\ref{eq:02}) are constant, we can formulate the same
  93 problem by an self-adjoint operator by performing a classical change of variables. Then, we can replace the stationary convection-diffusion
  94 problem:
  95 \begin{equation}
  96 b^{t}.\nabla v-\eta.\Delta v+(c+\delta).v=g\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}c\geq 0\mbox{,~}\delta\geq~0,
  97 \label{eq:03}
  98 \end{equation}
  99 by the following stationary diffusion operator:
 100 \begin{equation}
 101 -\eta.\Delta u+(\frac{\|b\|^{2}_{2}}{4\eta}+c+\delta).u=e^{-a}g=f,
 102 \label{eq:04}
 103 \end{equation}
 104 where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$ and $v=e^{-a}.u$ represents the general change of variables
 105 such that $a=\frac{b^{t}(x,y,z)}{2\eta}$. Consequently, the numerical resolution of the diffusion problem (the self-adjoint operator~(\ref{eq:04}))
 106 is done by optimization algorithms, in contrast, that of the convection-diffusion problem (non self-adjoint operator~(\ref{eq:03})) is
 107 done by relaxation algorithms. In the case of our studied algorithm, the convergence is ensured by M-matrix property then, the performance
 108 is linked to the magnitude of the spectral radius of the iteration matrix, which is independent of the condition number.
 109
 110 Next, the three-dimensional domain $\Omega\subset\mathbb{R}^{3}$ is set to $\Omega=\lbrack 0,1\rbrack^{3}$ and discretized with an uniform
 111 Cartesian mesh constituted by $M=m^3$ discretization points, where $m$ related to the spatial discretization step by $h=\frac{1}{m+1}$. This
 112 is carried out by using a classical order 2 finite difference approximation of the Laplacian. So, the complete discretization of both stationary
 113 boundary value problems~(\ref{eq:03}) and~(\ref{eq:04}) leads to the solution of a large discrete complementary problem of the following
 114 form, when both Dirichlet or Neumann boundary conditions are used:
 115 \begin{equation}
 116 \left\{
 117 \begin{array}{l}
 118 \mbox{Find~}U^{*}\in\mathbb{R}^{M}\mbox{~such~that} \\
 119 (A+\delta I)U^{*}-G\geq 0\mbox{,~}U^{*}\geq\bar{\Phi},\\
 120 ((A+\delta I)U^{*}-G)^{T}(U^{*}-\bar{\Phi})=0,\\
 121 \end{array}
 122 \right.
 123 \label{eq:05}
 124 \end{equation}
 125 where $A$ is a matrix obtained after the spatial discretization by a finite difference method, $G$ is derived from the Euler first order implicit time
 126 marching scheme and from the discretized right-hand side of the obstacle problem, $\delta$ is the inverse of the time step $k$ and $I$ is the identity
 127 matrix. The matrix $A$ is symmetric when the self-adjoint operator is considered and nonsymmetric otherwise.
 128
 129 According to the chosen discretization scheme of the Laplacian, $A$ is an M-matrix (irreducibly diagonal dominant, see~\cite{ref2}) and, consequently,
 130 the matrix $(A+\delta I)$ is also an M-matrix. This property is important to the convergence of iterative methods.
 131
 132
 133 %%--------------------------%%
 134 %%       SECTION 3          %%
 135 %%--------------------------%%
 136 \section{Parallel iterative method}
 137 \label{sec:03}
 138 Owing to the large size of the previous discrete complementary problem~(\ref{eq:05}), we will solve it by parallel synchronous or asynchronous iterative
 139 algorithms (see~\cite{ref3,ref4,ref5}). In this chapter, we aim at harnessing the computing power of GPU clusters for solving these large nonlinear systems.
 140 Then, we choose to use the projected Richardson iterative method for solving the diffusion problem~(\ref{eq:04}). Indeed, this method is based on the iterations
 141 of the Jacobi method which are easy to parallelize on parallel computers and easy to adapt to GPU architectures. Then, according to the boundary value problem
 142 formulation with a self-adjoint operator~(\ref{eq:04}), we can consider here the equivalent optimization problem and the fixed point mapping associated to
 143 its solution.
 144
 145 Assume that $E=\mathbb{R}^{M}$ is a Hilbert space, in which $\scalprod{.}{.}$ is the scalar product and $\|.\|$ its associated norm. So, the general fixed
 146 point problem to be solved is defined as follows:
 147 \begin{equation}
 148 \left\{
 149 \begin{array}{l}
 150 \mbox{Find~} U^{*} \in E \mbox{~such that} \\
 151 U^{*} = F(U^{*}), \\
 152 \end{array}
 153 \right.
 154 \label{eq:06}
 155 \end{equation}
 156 where $U\mapsto F(U)$ is an application from $E$ to $E$.
 157
 158 Let $K$ be a closed convex set defined by:
 159 \begin{equation}
 160 K = \{U | U \geq \Phi \mbox{~everywhere in~} E\},
 161 \label{eq:07}
 162 \end{equation}
 163 where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{eq:05}) is formulated as the following constrained optimization problem:
 164 \begin{equation}
 165 \left\{
 166 \begin{array}{l}
 167 \mbox{Find~} U^{*} \in K \mbox{~such that} \\
 168 \forall V \in K, J(U^{*}) \leq J(V), \\
 169 \end{array}
 170 \right.
 171 \label{eq:08}
 172 \end{equation}
 173 where the cost function is given by:
 174 \begin{equation}
 175 J(U) = \frac{1}{2}\scalprod{\mathcal{A}.U}{U} - \scalprod{G}{U},
 176 \label{eq:09}
 177 \end{equation}
 178 in which $\scalprod{.}{.}$ denotes the scalar product in $E$, $\mathcal{A}=A+\delta I$ is a symmetric positive definite, $A$ is the discretization matrix
 179 associated with the self-adjoint operator~(\ref{eq:04}) after change of variables.
 180
 181 For any $U\in E$, let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$, $\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected
 182 Richardson method is defined as follows:
 183 \begin{equation}
 184 U^{*} = F_{\gamma}(U^{*}) = P_K(U^{*} - \gamma(\mathcal{A}.U^{*} - G)).
 185 \label{eq:10}
 186 \end{equation}
 187 In order to reduce the computation time, the large optimization problem is solved in a numerical way by using a parallel asynchronous algorithm of the projected
 188 Richardson method on the convex set $K$. Particularly, we will consider an asynchronous parallel adaptation of the projected Richardson method~\cite{ref6}.
 189
 190 Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$ is a product of $\alpha$ subspaces $E_i$
 191 where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$, where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space in which $\scalprod{.}{.}_i$
 192 denotes the scalar product and $|.|_i$ the associated norm, for all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$
 193 is the scalar product on $E$.
 194
 195 Let $U\in E$, we consider the following decomposition of $U$ and the corresponding decomposition of $F_\gamma$ into $\alpha$ blocks:
 196 \begin{equation}
 197 \begin{array}{rcl}
 198 U    & = & (U_1,\ldots,U_{\alpha}), \\
 199 F_{\gamma}(U) & = & (F_{1,\gamma}(U),\ldots,F_{\alpha,\gamma}(U)). \\
 200 \end{array}
 201 \label{eq:11}
 202 \end{equation}
 203 Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$ and $K_i$ is a closed convex set.
 204 Let also $G=(G_1,\ldots,G_{\alpha})\in E$ and, for any $U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$ on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$
 205 is the projector from $E_i$ onto $K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{eq:10}) can be written in the following way:
 206 \begin{equation}
 207 \forall U\in E\mbox{,~}\forall i\in\{1,\ldots,\alpha\}\mbox{,~}F_{i,\gamma}(U) = P_{K_i}(U_i - \gamma(\mathcal{A}_i.U - G_i)).
 208 \label{eq:12}
 209 \end{equation}
 210 Note that $\displaystyle\mathcal{A}_i.U= \sum_{j=1}^{\alpha}\mathcal{A}_{i,j}.U_j$, where $\mathcal{A}_{i,j}$ denote block matrices of $\mathcal{A}$.
 211
 212 The parallel asynchronous iterations of the projected Richardson method for solving the obstacle problem~(\ref{eq:08}) are defined as follows: let $U^0\in E,U^0=(U^0_1,\ldots,U^0_\alpha)$ be
 213 the initial solution, then for all $p\in\mathbb{N}$, the iterate $U^{p+1}=(U^{p+1}_1,\ldots,U^{p+1}_{\alpha})$ is recursively defined by:
 214 \begin{equation}
 215 U_i^{p+1} =
 216 \left\{
 217 \begin{array}{l}
 218 F_{i,\gamma}(U_1^{\rho_1(p)}, \ldots, U_{\alpha}^{\rho_{\alpha}(p)}) \mbox{~if~} i\in s(p), \\
 219 U_i^p \mbox{~otherwise}, \\
 220 \end{array}
 221 \right.
 222 \label{eq:13}
 223 \end{equation}
 224 where
 225 \begin{equation}
 226 \left\{
 227 \begin{array}{l}
 228 \forall p\in\mathbb{N}, s(p)\subset\{1,\ldots,\alpha\}\mbox{~and~} s(p)\ne\emptyset, \\
 229 \forall i\in\{1,\ldots,\alpha\},\{p \ | \ i \in s(p)\}\mbox{~is denombrable},
 230 \end{array}
 231 \right.
 232 \label{eq:14}
 233 \end{equation}
 234 and $\forall j\in\{1,\ldots,\alpha\}$,
 235 \begin{equation}
 236 \left\{
 237 \begin{array}{l}
 238 \forall p\in\mathbb{N}, \rho_j(p)\in\mathbb{N}, 0\leq\rho_j(p)\leq p\mbox{~and~}\rho_j(p)=p\mbox{~if~} j\in s(p),\\
 239 \displaystyle\lim_{p\to\infty}\rho_j(p) = +\infty.\\
 240 \end{array}
 241 \right.
 242 \label{eq:15}
 243 \end{equation}
 244
 245 The previous asynchronous scheme of the projected Richardson method models computations that are carried out in parallel
 246 without order nor synchronization (according to the behavior of the parallel iterative method) and describes a subdomain
 247 method without overlapping. It is a general model that takes into account all possible situations of parallel computations
 248 and non-blocking message passing. So, the synchronous iterative scheme is defined by:
 249 \begin{equation}
 250 \forall j\in\{1,\ldots,\alpha\} \mbox{,~} \forall p\in\mathbb{N} \mbox{,~} \rho_j(p)=p.
 251 \label{eq:16}
 252 \end{equation}
 253 The values of $s(p)$ and $\rho_j(p)$ are defined dynamically and not explicitly by the parallel asynchronous or synchronous
 254 execution of the algorithm. Particularly, it enables one to consider distributed computations whereby processors compute at
 255 their own pace according to their intrinsic characteristics and computational load. The parallelism between the processors is
 256 well described by the set $s(p)$ which contains at each step $p$ the index of the components relaxed by each processor on a
 257 parallel way while the use of delayed components in~(\ref{eq:13}) permits one to model nondeterministic behavior and does not
 258 imply inefficiency of the considered distributed scheme of computation. Note that, according to~\cite{ref7}, theoretically,
 259 each component of the vector must be relaxed an infinity of time. The choice of the relaxed components to be used in the
 260 computational process may be guided by any criterion and, in particular, a natural criterion is to pick-up the most recently
 261 available values of the components computed by the processors. Furthermore, the asynchronous iterations are implemented by
 262 means of non-blocking MPI communication subroutines (asynchronous communications).
 263
 264 The important property ensuring the convergence of the parallel projected Richardson method, both  synchronous and asynchronous
 265 algorithms, is the fact that $\mathcal{A}$ is an M-matrix. Moreover, the convergence proceeds from a result of~\cite{ref6}.
 266 Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$, the parallel iterations~(\ref{eq:13}),
 267 (\ref{eq:14}) and~(\ref{eq:15}), associated to the fixed point mapping $F_\gamma$~(\ref{eq:12}), converge to the unique solution
 268 $U^{*}$ of the discretized problem.
 269
 270
 271 %%--------------------------%%
 272 %%       SECTION 4          %%
 273 %%--------------------------%%
 274 \section{Parallel implementation on a GPU cluster}
 275 \label{sec:04}
 276 In this section, we give the main key points of the parallel implementation of the projected Richardson method, both synchronous
 277 and asynchronous versions, on a GPU cluster, for solving the nonlinear systems derived from the discretization of large obstacle
 278 problems. More precisely, each nonlinear system is solved iteratively using the whole cluster. We use a heteregeneous CUDA/MPI
 279 programming. Indeed, the communication of data, at each iteration between the GPU computing nodes, can be either synchronous
 280 or asynchronous using the MPI communication subroutines, whereas inside each GPU node, a CUDA parallelization is performed.
 281
 282 \begin{figure}[!h]
 283 \centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitCPU}}
 284 \caption{Data partitioning of a problem to be solved among $S=3\times 4$ computing nodes.}
 285 \label{fig:01}
 286 \end{figure}
 287
 288 Let $S$ denote the number of computing nodes on the GPU cluster, where a computing node is composed of CPU core holding one MPI
 289 process and a GPU card. So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$ is split into $S$
 290 parallelepipedic sub-problems, each for a node (MPI process, GPU), as is shown in Figure~\ref{fig:01}. Indeed, the $NY$ and $NZ$
 291 dimensions (according to the $y$ and $z$ axises) of the three-dimensional problem are, respectively, split into $Sy$ and $Sz$ parts,
 292 such that $S=Sy\times Sz$. In this case, each computing node has at most four neighboring nodes. This kind of the data partitioning
 293 reduces the data exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning.
 294
 295 \begin{algorithm}[!t]
 296 \SetLine
 297 \linesnumbered
 298 Initialization of the parameters of the sub-problem\;
 299 Allocate and fill the data in the global memory GPU\;
 300 \For{$i=1$ {\bf to} $NbSteps$}{
 301    $G = \frac{1}{k}.U^0 + F$\;
 302    Solve($A$, $U^0$, $G$, $U$, $\varepsilon$, $MaxRelax$)\;
 303    $U^0 = U$\;
 304 }
 305 Copy the solution $U$ back from GPU memory\;
 306 \caption{Parallel solving of the obstacle problem on a GPU cluster}
 307 \label{alg:01}
 308 \end{algorithm}
 309
 310 All the computing nodes of the GPU cluster execute in parallel the same Algorithm~\ref{alg:01} but on different three-dimensional
 311 sub-problems of size $(NX\times ny\times nz)$. This algorithm gives the main key points for solving an obstacle problem defined in
 312 a three-dimensional domain, where $A$ is the discretization matrix, $G$ is the right-hand side and $U$ is the solution vector. After
 313 the initialization step, all the data generated from the partitioning operation are copied from the CPU memories to the GPU global
 314 memories, to be processed on the GPUs. Next, the algorithm uses $NbSteps$ time steps to solve the global obstacle problem. In fact,
 315 it uses a parallel algorithm adapted to GPUs of the projected Richardson iterative method for solving the nonlinear systems of the
 316 obstacle problem. This function is defined by {\it Solve()} in Algorithm~\ref{alg:01}. At every time step, the initial guess $U^0$
 317 for the iterative algorithm is set to the solution found at the previous time step. Moreover, the right-hand side $G$ is computed
 318 as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step, $U^{prev}$ is the solution computed in the previous time
 319 step and each element $f(x, y, z)$ of the vector $F$ is computed as follows:
 320 \begin{equation}
 321 f(x,y,z)=\cos(2\pi x)\cdot\cos(4\pi y)\cdot\cos(6\pi z).
 322 \label{eq:18}
 323 \end{equation}
 324 Finally, the solution $U$ of the obstacle problem is copied back from the GPU global memories to the CPU memories. We use the
 325 communication subroutines of the CUBLAS library~\cite{ref8} (CUDA Basic Linear Algebra Subroutines) for the memory allocations in
 326 the GPU (\verb+cublasAlloc()+) and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and \verb+cublasGetVector()+.
 327
 328 \begin{algorithm}[!t]
 329   \SetLine
 330   \linesnumbered
 331   $p = 0$\;
 332   $conv = false$\;
 333   $U = U^{0}$\;
 334   \Repeat{$(conv=true)$}{
 335     Determine\_Bordering\_Vector\_Elements($U$)\;
 336     Compute\_New\_Vector\_Elements($A$, $G$, $U$)\;
 337     $tmp = U^{0} - U$\;
 338     $error = \|tmp\|_{2}$\;
 339     $U^{0} = U$\;
 340     $p = p + 1$\;
 341     $conv$ = Convergence($error$, $p$, $\varepsilon$, $MaxRelax$)\;
 342   }
 343 \caption{Parallel iterative solving of the nonlinear systems on a GPU cluster ($Solve()$ function)}
 344 \label{alg:02}
 345 \end{algorithm}
 346
 347 As many other iterative methods, the algorithm of the projected Richardson method is based on algebraic functions operating on vectors
 348 and/or matrices, which are more efficient on parallel computers when they work on large vectors. Its parallel implementation on the GPU
 349 cluster is carried out so that the GPUs execute the vector operations as kernels and the CPUs execute the serial codes, supervise the
 350 kernel executions and the data exchanges with the neighboring nodes and supply the GPUs with data. Algorithm~\ref{alg:02} shows the
 351 main key points of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{alg:01}). All the vector operations inside
 352 the main loop ({\bf repeat} ... {\bf until}) are executed by the GPU. We use the following functions of the CUBLAS library:
 353 \begin{itemize*}
 354 \item \verb+cublasDaxpy()+ to compute the difference between the solution vectors $U^{p}$ and $U^{p+1}$ computed in two successive relaxations
 355 $p$ and $p+1$ (line~$7$ in Algorithm~\ref{alg:02}),
 356 \item \verb+cublasDnrm2()+ to perform the Euclidean norm (line~$8$) and,
 357 \item \verb+cublasDcpy()+ for the data copy of a vector to another one in the GPU memory (lines~$3$ and~$9$).
 358 \end{itemize*}
 359
 360 The dimensions of the grid and blocks of threads that execute a given kernel depend on the resources of the GPU multiprocessor and the
 361 resource requirements of the kernel. So, if $block$ defines the size of a thread block, which must not exceed the maximum size of a thread
 362 block, then the number of thread blocks in the grid, denoted by $grid$, can be computed according to the size of the local sub-problem
 363 as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\] However, when solving very large problems, the size of the thread
 364 grid can exceed the maximum number of thread blocks that can be executed on the GPUs (up-to $65.535$ thread blocks) and, thus, the kernel
 365 will fail to launch. Therefore, for each kernel, we decompose the three-dimensional sub-problem into $nz$ two-dimensional slices of size
 366 ($NX\times ny$), as is shown in Figure~\ref{fig:02}. All slices of the same kernel are executed using {\bf for} loop by $NX\times ny$ parallel
 367 threads organized in a two-dimensional grid of two-dimensional thread blocks, as is shown in Listing~\ref{list:01}. Each thread is in charge
 368 of $nz$ discretization points (one from each slice), accessed in the GPU memory with a constant stride $(NX\times ny)$.
 369
 370 \begin{figure}
 371 \centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitGPU}}
 372 \caption{Decomposition of a sub-problem in a GPU into $nz$ slices.}
 373 \label{fig:02}
 374 \end{figure}
 375
 376 \begin{center}
 377 \lstinputlisting[label=list:01,caption=Skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu}
 378 \end{center}
 379 The function $Determine\_Bordering\_Vector\_Elements()$ (line~$5$ in Algorithm~\ref{alg:02}) determines the values of the vector
 380 elements shared at the boundaries with neighboring computing nodes. Its main operations are defined as follows:
 381 \begin{enumerate*}
 382 \item define the values associated to the bordering points needed by the neighbors,
 383 \item copy the values associated to the bordering points from the GPU to the CPU,
 384 \item exchange the values associated to the bordering points between the neighboring CPUs,
 385 \item copy the received values associated to the bordering points from the CPU to the GPU,
 386 \end{enumerate*}
 387 The first operation of this function is implemented as kernels to be performed by the GPU:
 388 \begin{itemize*}
 389 \item a kernel executed by $NX\times nz$ threads to define the values associated to the bordering vector elements along $y$-axis and,
 390 \item a kernel executed by $NX\times ny$ threads to define the values associated to the bordering vector elements along $z$-axis.
 391 \end{itemize*}
 392 As mentioned before, we develop the \emph{synchronous} and \emph{asynchronous} algorithms of the projected Richardson method. Obviously,
 393 in this scope, the synchronous or asynchronous communications refer to the communications between the CPU cores (MPI processes) on the
 394 GPU cluster, in order to exchange the vector elements associated to subdomain boundaries. For the memory copies between a CPU core and
 395 its GPU, we use the synchronous communication routines of the CUBLAS library: \verb+cublasSetVector()+ and \verb+cublasGetVector()+
 396 in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsync()+ and \verb+cublasGetVectorAsync()+ in the
 397 asynchronous algorithm. Moreover, we use the communication routines of the MPI library to carry out the data exchanges between the neighboring
 398 nodes. We use the following communication routines: \verb+MPI_Isend()+ and \verb+MPI_Irecv()+ to perform non-blocking sends and receptions,
 399 respectively. For the synchronous algorithm, we use the MPI routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node
 400 in blocking status until all data exchanges with neighboring nodes (sends and receptions) are completed. In contrast, for the asynchronous
 401 algorithms, we use the MPI routine \verb+MPI_Test()+ which tests the completion of a data exchange (send or reception) without putting the
 402 MPI process in blocking status.
 403
 404 The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{alg:02}) computes, at each iteration, the new elements
 405 of the iterate vector $U$. Its general code is presented in Listing~\ref{list:01} (CPU function). The iterations of the projected
 406 Richardson method, based on those of the Jacobi method, are defined as follows:
 407 \begin{equation}
 408 \begin{array}{ll}
 409 u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\
 410 & West\cdot u^{p}(x-h,y,z) + East\cdot u^{p}(x+h,y,z) + \\
 411 & South\cdot u^{p}(x,y-h,z) + North\cdot u^{p}(x,y+h,z) + \\
 412 & Rear\cdot u^{p}(x,y,z-h) + Front\cdot u^{p}(x,y,z+h))),
 413 \end{array}
 414 \label{eq:17}
 415 \end{equation}
 416 where $u^{p}(x,y,z)$ is an element of the iterate vector $U$ computed at the iteration $p$ and $g(x,y,z)$ is a vector element of the
 417 right-hand side $G$. The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$ and $Front$ define constant coefficients of the
 418 block matrix $A$. Figure~\ref{fig:03} shows the positions of these coefficients in a three-dimensional domain.
 419
 420 \begin{figure}
 421 \centerline{\includegraphics[scale=0.35]{Chapters/chapter13/figures/matrix}}
 422 \caption{Matrix constant coefficients in a three-dimensional domain.}
 423 \label{fig:03}
 424 \end{figure}
 425
 426 The kernel implementations of the projected Richardson method on GPUs uses a perfect fine-grain multithreading parallelism. Since the
 427 projected Richardson algorithm is implemented as a fixed point method, each kernel is executed by a large number of GPU threads such
 428 that each thread is in charge of the computation of one element of the iterate vector $U$. Moreover, this method uses the vector elements
 429 updates of the Jacobi method, which means that each thread $i$ computes the new value of its element $u_{i}^{p+1}$ independently of the
 430 new values $u_{j}^{p+1}$, where $j\neq i$, of those computed in parallel by other threads at the same iteration $p+1$. Listing~\ref{list:02}
 431 shows the GPU implementations of the main kernels of the projected Richardson method, which are: the matrix-vector multiplication
 432 (\verb+MV_Multiplication()+) and the vector elements updates (\verb+Vector_Updates()+). The codes of these kernels are based on
 433 that presented in Listing~\ref{list:01}.
 434
 435 \lstinputlisting[label=list:02,caption=GPU kernels of the projected Richardson method]{Chapters/chapter13/ex2.cu}
 436
 437 \begin{figure}
 438 \centerline{\includegraphics[scale=0.3]{Chapters/chapter13/figures/points3D}}
 439 \caption{Computation of a vector element with the projected Richardson method.}
 440 \label{fig:04}
 441 \end{figure}
 442
 443 Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices of $(NX\times ny)$ vector elements are computed in
 444 a {\bf for} loop. In this case, each thread is in charge of one vector element from each slice (in total $nz$ vector elements
 445 along $z$-axis). We can notice from the formula~(\ref{eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$, by
 446 a thread at iteration $p+1$, requires seven vector elements computed at the previous iteration $p$: two vector elements in
 447 each dimension plus the vector element at the intersection of the three axises $x$, $y$ and $z$ (see Figure~\ref{fig:04}).
 448 So, to reduce the memory accesses to the high-latency global memory, the vector elements of the current slice can be stored
 449 in the low-latency shared memories of thread blocks, as is described in~\cite{ref9}. Nevertheless, the fact that the computation
 450 of a vector element requires only two elements in each dimension does not allow to maximize the data reuse from the shared memories.
 451 The computation of a slice involves in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block, to fill the
 452 required vector elements in the shared memory where $bx$ and $by$ are the dimensions of a thread block. Then, in order to optimize
 453 the memory accesses on GPUs, the elements of the iterate vector $U$ are filled in the cache texture memory (see~\cite{ref10}).
 454 In new GPU generations as Fermi or Kepler, the global memory accesses are always cached in L1 and L2 caches. For example, for
 455 a given kernel, we can favour the use of the L1 cache to that of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+.
 456 So, the initial access to the global memory loads the vector elements required by the threads of a block into the cache memory
 457 (texture or L1/L2 caches). Then, all the following memory accesses read from this cache memory. In Listing~\ref{list:02}, the
 458 function \verb+fetch_double(v,i)+ is used to read from the texture memory the $i^{th}$ element of the double-precision vector
 459 \verb+v+ (see Listing~\ref{list:03}). Moreover, the seven constant coefficients of matrix $A$ can be stored in the constant memory
 460 but, since they are reused $nz$ times by each thread, it is more interesting to fill them on the low-latency registers of each thread.
 461
 462 \lstinputlisting[label=list:03,caption=Memory access to the cache texture memory]{Chapters/chapter13/ex3.cu}
 463
 464 The function $Convergence()$ (line~$11$ in Algorithm~\ref{alg:02}) allows to detect the convergence of the parallel iterative algorithm
 465 and is based on the tolerance threshold $\varepsilon$ and the maximum number of relaxations $MaxRelax$. We take into account the number
 466 of relaxations since that of iterations cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{eq:13}) of
 467 a local iterate vector $U_i$ according to $F_i$. Then, counting the number of relaxations is possible in both synchronous and asynchronous
 468 cases. On the other hand, an iteration is the update of at least all vector components with $F_i$.
 469
 470 In the synchronous algorithm, the global convergence is detected when the maximal value of the absolute error, $error$, is sufficiently small
 471 and/or the maximum number of relaxations, $MaxRelax$, is reached, as follows:
 472 $$
 473 \begin{array}{l}
 474 error=\|U^{p}-U^{p+1}\|_{2}; \\
 475 AllReduce(error,\hspace{0.1cm}maxerror,\hspace{0.1cm}MAX); \\
 476 \text{if}((maxerror<\varepsilon)\hspace{0.2cm}\text{or}\hspace{0.2cm}(p\geq MaxRelax)) \\
 477 conv \leftarrow true;
 478 \end{array}
 479 $$
 480 where the function $AllReduce()$ uses the MPI reduction subroutine \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the
 481 local absolute errors, $error$, of all computing nodes and $p$ (in Algorithm~\ref{alg:02}) is used as a counter of the local relaxations carried
 482 out by a computing node. In the asynchronous algorithms, the global convergence is detected when all computing nodes locally converge. For this,
 483 we use a token ring architecture around which a boolean token travels, in one direction, from a computing node to another. Starting from node $0$,
 484 the boolean token is set to $true$ by node $i$ if the local convergence is reached or to $false$ otherwise and, then, it is sent to node $i+1$.
 485 Finally, the global convergence is detected when node $0$ receives from its neighbor node $S-1$, in the ring architecture, a token set to $true$.
 486 In this case, node $0$ sends a stop message (end of parallel solving) to all computing nodes in the cluster.
 487
 488
 489 %%--------------------------%%
 490 %%       SECTION 5          %%
 491 %%--------------------------%%
 492 \section{Experimental tests on a GPU cluster}
 493 \label{sec:05}
 494 The GPU cluster of tests, that we used in this chapter, is an $20Gbps$ Infiniband network of six machines. Each machine is a Quad-Core Xeon
 495 E5530 CPU running at $2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it is equipped with two Nvidia
 496 Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores running at $1.3$GHz. It provides $4$GB of global memory with a memory bandwidth
 497 of $102$GB/s, accessible by all its cores and also by the CPU through the PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s.
 498 Hence, the memory copy operations between the GPU and the CPU are about $12$ times slower than those of the Tesla GPU memory. We have performed
 499 our simulations on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{fig:05} describes the components of the GPU cluster
 500 of tests.
 501
 502 \begin{figure}
 503 \centerline{\includegraphics[scale=0.25]{Chapters/chapter13/figures/cluster}}
 504 \caption{GPU cluster of tests composed of 12 computing nodes (six machines, each with two GPUs.}
 505 \label{fig:05}
 506 \end{figure}
 507
 508 Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of the methods on both
 509 GPU cluster and CPU cluster. CUDA version 4.0~\cite{ref12} is used for programming GPUs, using CUBLAS library~\cite{ref8} to deal with vector
 510 operations in GPUs and, finally, MPI functions of OpenMPI 1.3.3 are used to carry out the synchronous and asynchronous communications between
 511 CPU cores. Indeed, in our experiments, a computing node is managed by a MPI process and it is composed of one CPU core and one GPU card.
 512
 513 All experimental results of the parallel projected Richardson algorithms are obtained from simulations made in double precision data. The obstacle
 514 problems to be solved are defined in constant three-dimensional domain $\Omega\subset\mathbb{R}^{3}$. The numerical values of the parameters of the
 515 obstacle problems are: $\eta=0.2$, $c=1.1$, $f$ is computed by formula~(\ref{eq:18}) and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$)
 516 are computed with $k=0.0066$. As the discretization matrix is constant along the time steps, the convergence properties of the iterative algorithms
 517 do not change. Thus, the performance characteristics obtained with three time steps will still be valid for more time steps. The initial function
 518 $u(0,x,y,z)$ of the obstacle problem~(\ref{eq:01}) is set to $0$, with a constraint $u\geq\phi=0$. The relaxation parameter $\gamma$ used by the
 519 projected Richardson method is computed automatically thanks to the diagonal entries of the discretization matrix. The formula and its proof can be
 520 found in~\cite{ref11}, Section~2.3. The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the maximum number of relaxations is
 521 limited to $10^{6}$ relaxations. Finally, the number of threads per block is set to $256$ threads, which gives, in general, good performances for
 522 most GPU applications. We have performed some tests for the execution configurations and we have noticed that the best configuration of the $256$
 523 threads per block is an organization into two dimensions of sizes $(64,4)$.
 524
 525 \begin{table}[!h]
 526 \centering
 527 \begin{tabular}{|c|c|c|c|c|c|}
 528 \hline
 529 \multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-5}
 530
 531                               & $\mathbf{T_{cpu}}$ & {\bf \#relax.}  & $\mathbf{T_{cpu}}$ & {\bf \#relax.}  &  \\ \hline \hline
 532
 533 $256^{3}$                     & $575.22$           & $198,288$        & $539.25$          & $198,613$        & $6.25$ \\ \hline \hline
 534
 535 $512^{3}$                     & $19,250.25$        & $750,912$        & $18,237.14$       & $769,611$        & $5.26$ \\ \hline \hline
 536
 537 $768^{3}$                     & $206,159.44$       & $1,635,264$      & $183,582.60$      & $1,577,004$      & $10.95$ \\ \hline \hline
 538
 539 $800^{3}$                     & $222,108.09$       & $1,769,232$      & $188,790.04$      & $1,701,735$      & $15.00$ \\ \hline
 540 \end{tabular}
 541 \vspace{0.5cm}
 542 \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 24 CPU cores.}
 543 \label{tab:01}
 544 \end{table}
 545
 546 \begin{table}[!h]
 547 \centering
 548 \begin{tabular}{|c|c|c|c|c|c|c|c|}
 549 \hline
 550 \multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous}                   & \multicolumn{3}{c|}{\bf Asynchronous}                   & \multirow{2}{*}{\bf Gain\%}  \\ \cline{2-7}
 551
 552                               & $\mathbf{T_{gpu}}$ & {\bf \#relax.}  & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \#relax.}  & $\mathbf{\tau}$ &           \\  \hline \hline
 553
 554 $256^{3}$                     & $29.67$            & $100,692$        &  $19.39$        & $18.00$           & $94,215$         & $29.96$         & $39.33$  \\ \hline \hline
 555
 556 $512^{3}$                     & $521.83$           & $381,300$        & $36.89$         & $425.15$          & $347,279$        & $42.89$         & $18.53$ \\ \hline \hline
 557
 558 $768^{3}$                     & $4,112.68$         & $831,144$        & $50.13$         & $3,313.87$        & $750,232$        & $55.40$         & $19.42$ \\ \hline \hline
 559
 560 $800^{3}$                     & $3,950.87$         & $899,088$        & $56.22$         & $3,636.57$        & $834,900$        & $51.91$         & $7.95$ \\ \hline
 561 \end{tabular}
 562 \vspace{0.5cm}
 563 \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 12 GPUs.}
 564 \label{tab:02}
 565 \end{table}
 566
 567 The performance measures that we took into account are the execution times and the number of relaxations performed by the parallel iterative algorithms,
 568 both synchronous and asynchronous versions, on the GPU and CPU clusters. These algorithms are used for solving nonlinear systems derived from the discretization
 569 of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$ and $800^{3}$. In Table~\ref{tab:01} and Table~\ref{tab:02}, we show the performances
 570 of the parallel synchronous and asynchronous algorithms of the projected Richardson method implemented, respectively, on a cluster of $24$ CPU cores
 571 and on a cluster of $12$ GPUs. In these tables, the execution time defines the time spent by the slowest computing node and the number of relaxations
 572 is computed as the summation of those carried out by all computing nodes.
 573
 574 In the sixth column of Table~\ref{tab:01} and in the eighth column of Table~\ref{tab:02}, we give the gains in $\%$ obtained by using an
 575 asynchronous algorithm compared to a synchronous one. We can notice that the asynchronous version on CPU and GPU clusters is slightly faster
 576 than the synchronous one for both methods. Indeed, the cluster of tests is composed of local and homogeneous nodes communicating via low-latency
 577 connections. So, in the case of distant and/or heterogeneous nodes (or even with geographically distant clusters) the asynchronous version
 578 would be faster than the synchronous one. However, the gains obtained on the GPU cluster are better than those obtained on the CPU cluster.
 579 In fact, the computation times are reduced by accelerating the computations on GPUs while the communication times still unchanged.
 580
 581 The fourth and seventh columns of Table~\ref{tab:02} show the relative gains obtained by executing the parallel algorithms on the cluster
 582 of $12$ GPUs instead on the cluster of $24$ CPU cores. We compute the relative gain $\tau$ as a ratio between the execution time $T_{cpu}$
 583 spent on the CPU cluster over that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see from these ratios that
 584 solving large obstacle problems is faster on the GPU cluster than on the CPU cluster. Indeed, the GPUs are more efficient than their
 585 counterpart CPUs to execute large data-parallel operations. In addition, the projected Richardson method is implemented as a fixed point-based
 586 iteration and uses the Jacobi vector updates that allow a well thread-parallelization on GPUs, such that each GPU thread is in charge
 587 of one vector component at a time without being dependent on other vector components computed by other threads. Then, this allow to exploit
 588 at best the high performance computing of the GPUs by using all the GPU resources and avoiding the idle cores.
 589
 590 Finally, the number of relaxations performed by the parallel synchronous algorithm is different in the CPU and GPU versions, because the number
 591 of computing nodes involved in the GPU cluster and in the CPU cluster is different. In the CPU case, $24$ computing nodes ($24$ CPU cores) are
 592 considered, whereas in the GPU case, $12$ computing nodes ($12$ GPUs) are considered. As the number of relaxations depends on the domain decomposition,
 593 consequently it also depends on the number of computing nodes.
 594
 595
 596 %%--------------------------%%
 597 %%       SECTION 6          %%
 598 %%--------------------------%%
 599 \section{Red-Black ordering technique}
 600 \label{sec:06}
 601 As is well-known, the Jacobi method is characterized by a slow convergence rate compared to some iterative methods (for example Gauss-Seidel method).
 602 So, in this section, we present some solutions to reduce the execution time and the number of relaxations and, more specifically, to speed up the
 603 convergence of the parallel projected Richardson method on the GPU cluster. We propose to use the point red-black ordering technique to accelerate
 604 the convergence. This technique is often used to increase the parallelism of iterative methods for solving linear systems~\cite{ref13,ref14,ref15}.
 605 We apply it to the projected Richardson method as a compromise between the Jacobi and Gauss-Seidel iterative methods.
 606
 607 The general principle of the red-black technique is as follows. Let $t$ be the summation of the integer $x$-, $y$- and $z$-coordinates of a vector
 608 element $u(x,y,z)$ on a three-dimensional domain: $t=x+y+z$. As is shown in Figure~\ref{fig:06.01}, the red-black ordering technique consists in the
 609 parallel computing of the red vector elements having even value $t$ by using the values of the black ones then, the parallel computing of the black
 610 vector elements having odd values $t$ by using the new values of the red ones.
 611
 612 \begin{figure}
 613 \centering
 614   \mbox{\subfigure[Red-black ordering on x, y and z axises]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir}\label{fig:06.01}}\quad
 615         \subfigure[Red-black ordering on y axis]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir-y}\label{fig:06.02}}}
 616 \caption{Red-black ordering for computing the iterate vector elements in a three-dimensional space.}
 617 \end{figure}
 618
 619 This technique can be implemented on the GPU in two different manners:
 620 \begin{itemize*}
 621 \item among all launched threads ($NX\times ny$ threads), only one thread out of two computes its red or black vector element at a time or,
 622 \item all launched threads (on average half of $NX\times ny$ threads) compute the red vector elements first and, then, the black ones.
 623 \end{itemize*}
 624 However, in both solutions, for each memory transaction, only half of the memory segment addressed by a half-warp is used. So, the computation of the
 625 red and black vector elements leads to use twice the initial number of memory transactions. Then, we apply the point red-black ordering accordingly to
 626 the $y$-coordinate, as is shown in Figure~\ref{fig:06.02}. In this case, the vector elements having even $y$-coordinate are computed in parallel using
 627 the values of those having odd $y$-coordinate and then vice-versa. Moreover, in the GPU implementation of the parallel projected Richardson method (Section~\ref{sec:04}),
 628 we have shown that a sub-problem of size $(NX\times ny\times nz)$ is decomposed into $nz$ grids of size $(NX\times ny)$. Then, each kernel is executed
 629 in parallel by $NX\times ny$ GPU threads, so that each thread is in charge of $nz$ vector elements along $z$-axis (one vector element in each grid of
 630 the sub-problem). So, we propose to use the new values of the vector elements computed in grid $i$ to compute those of the vector elements in grid $i+1$.
 631 Listing~\ref{list:04} describes the kernel of the matrix-vector multiplication and the kernel of the vector elements updates of the parallel projected
 632 Richardson method using the red-black ordering technique.
 633
 634 \lstinputlisting[label=list:04,caption=GPU kernels of the projected Richardson method using the red-black technique]{Chapters/chapter13/ex4.cu}
 635
 636 Finally, we exploit the concurrent executions between the host functions and the GPU kernels provided by the GPU hardware and software. In fact, the kernel
 637 launches are asynchronous (when this environment variable is not disabled on the GPUs), such that the control is returned to the host (MPI process) before
 638 the GPU has completed the requested task (kernel)~\cite{ref12}. Therefore, all the kernels necessary to update the local vector elements, $u(x,y,z)$ where
 639 $0<y<(ny-1)$ and $0<z<(nz-1)$, are executed first. Then, the values associated to the bordering vector elements are exchanged between the neighbors. Finally,
 640 the values of the vector elements associated to the bordering vector elements are updated. In this case, the computation of the local vector elements is
 641 performed concurrently with the data exchanges between neighboring CPUs and this in both synchronous and asynchronous cases.
 642
 643 \begin{table}[!h]
 644 \centering
 645 \begin{tabular}{|c|c|c|c|c|c|}
 646 \hline
 647 \multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%}  \\ \cline{2-5}
 648
 649                               & $\mathbf{T_{gpu}}$ & {\bf \#relax.}   & $\mathbf{T_{gpu}}$ & {\bf \#relax.}   &           \\  \hline \hline
 650
 651 $256^{3}$                     & $18.37$            & $71,988$         & $12.58$           & $67,638$         & $31.52$  \\ \hline \hline
 652
 653 $512^{3}$                     & $349.23$           & $271,188$        & $289.41$          & $246,036$        & $17.13$ \\ \hline \hline
 654
 655 $768^{3}$                     & $2,773.65$         & $590,652$        & $2,222.22$        & $532,806$        & $19.88$ \\ \hline \hline
 656
 657 $800^{3}$                     & $2,748.23$         & $638,916$        & $2,502.61$        & $592,525$        & $8.92$ \\ \hline
 658 \end{tabular}
 659 \vspace{0.5cm}
 660 \caption{Execution times in seconds of the parallel projected Richardson method using read-black ordering technique implemented on a cluster of 12 GPUs.}
 661 \label{tab:03}
 662 \end{table}
 663
 664 In Table~\ref{tab:03}, we report the execution times and the number of relaxations performed on a cluster of $12$ GPUs by the parallel projected Richardson
 665 algorithms; it can be noted that the performances of the projected Richardson are improved by using the point read-black ordering. We compare the performances
 666 of the parallel projected Richardson method with and without this later ordering (Tables~\ref{tab:02} and~\ref{tab:03}). We can notice that both parallel synchronous
 667 and asynchronous algorithms are faster when they use the red-black ordering. Indeed, we can see in Table~\ref{tab:03} that the execution times of these algorithms
 668 are reduced, on average, by $32\%$ compared to those shown in Table~\ref{tab:02}.
 669
 670 \begin{figure}
 671 \centerline{\includegraphics[scale=0.9]{Chapters/chapter13/figures/scale}}
 672 \caption{Weak scaling of both synchronous and asynchronous algorithms of the projected Richardson method using red-black ordering technique.}
 673 \label{fig:07}
 674 \end{figure}
 675
 676 In Figure~\ref{fig:07}, we study the ratio between the computation time and the communication time of the parallel projected Richardson algorithms on a GPU cluster.
 677 The experimental tests are carried out on a cluster composed of one to ten Tesla GPUs. We have focused on the weak scaling of both parallel, synchronous and asynchronous,
 678 algorithms using the red-black ordering technique. For this, we have fixed the size of a sub-problem to $256^{3}$ per computing node (a CPU core and a GPU). Then,
 679 Figure~\ref{fig:07} shows the number of relaxations performed, on average, per second by a computing node. We can see from this figure that the efficiency of the
 680 asynchronous algorithm is almost stable, while that of the synchronous algorithm decreases (down to $81\%$ in this example) with the increasing of the number of
 681 computing nodes on the cluster. This is due to the fact that the ratio between the time of the computation over that of the communication is reduced when the computations
 682 are performed on GPUs. Indeed, GPUs compute faster than CPUs and communications are more time consuming. In this context, asynchronous algorithms are more scalable
 683 than synchronous ones. So, with large scale GPU clusters, synchronous algorithms might be more penalized by communications, as can be deduced from Figure~\ref{fig:07}.
 684 That is why we think that asynchronous iterative algorithms are all the more interesting in this case.
 685
 686
 687 %%--------------------------%%
 688 %%       SECTION 7          %%
 689 %%--------------------------%%
 690 \section{Conclusion}
 691 \label{sec:07}
 692 Our main contribution, in this chapter, is the parallel implementation of an asynchronous iterative method on GPU clusters for solving large scale nonlinear
 693 systems derived from the spatial discretization of three-dimensional obstacle problems. For this, we have implemented both synchronous and asynchronous algorithms of the
 694 Richardson iterative method using a projection on a convex set. Indeed, this method uses point-based iterations of the Jacobi method that are very easy to parallelize on
 695 parallel computers. We have shown that its adapted parallel algorithms to GPU architectures allows to exploit at best the computing power of the GPUs and to accelerate the
 696 resolution of large nonlinear systems. Consequently, the experimental results have shown that solving nonlinear systems of large obstacle problems with this method is about
 697 fifty times faster on a cluster of $12$ GPUs than on a cluster of $24$ CPU cores. Moreover, we have applied to this projected Richardson method the red-black ordering technique
 698 which allows it to improve its convergence rate. Thus, the execution times of both parallel algorithms performed on the cluster of $12$ GPUs are reduced on average of $32\%$.
 699
 700 Afterwards, the experiments have shown that the asynchronous version is slightly more efficient than the synchronous one. In fact, the computations are accelerated by using GPUs
 701 while the communication times still unchanged. In addition, we have studied the weak-scaling in the synchronous and asynchronous cases, which has confirmed that the ratio between
 702 the computations and the communications are reduced when using a cluster of GPUs. We highlight that asynchronous iterative algorithms are more scalable than synchronous ones.
 703 Therefore, we can conclude that asynchronous iterations are well suited to tackle scalability issues on GPU clusters.
 704
 705 In future works, we plan to perform experiments on large scale GPU clusters and on geographically distant GPU clusters, because we expect that asynchronous versions would
 706 be faster and more scalable on such architectures. Furthermore, we want to study the performance behavior and the scalability of other numerical algorithms which support,
 707 if possible, the model of asynchronous iterations.
 708
 709 \putbib[Chapters/chapter13/biblio13]
 710