BookGPU/Chapters/chapter13/ch13.tex

   1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   2 %%                          %%
   3 %%       CHAPTER 13         %%
   4 %%                          %%
   5 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
   6
   7 %\chapterauthor{}{}
   8 \chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France}
   9 \chapterauthor{Ming Chau}{Advanced Solutions Accelerator, Castelnau Le Lez, France}
  10 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
  11 \chapterauthor{Pierre Spitéri}{ENSEEIHT-IRIT, Toulouse, France}
  12 \chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France}
  13
  14
  15 \chapter{Solving sparse nonlinear systems of obstacle problems on GPU clusters}
  16 \label{ch13}
  17
  18
  19
  20
  21 %%--------------------------%%
  22 %%       SECTION 1          %%
  23 %%--------------------------%%
  24 \section{Introduction}
  25 \label{ch13:sec:01}
  26 The obstacle problem is one kind of free boundary problems. It allows to model,
  27 for example, an elastic membrane covering a solid obstacle. In this case, the
  28 objective is to find an equilibrium position of this membrane constrained to be
  29 above the obstacle and which tends to minimize its surface and/or its energy.
  30 The study of such problems occurs in many applications, for example: fluid mechanics,
  31 bio-mathematics (tumor growth process) or financial mathematics (American or
  32 European option pricing).
  33
  34 In this chapter, we focus on solutions of large obstacle problems defined in a
  35 three-dimensional domain. Particularly, the present study consists in solving
  36 large nonlinear systems derived from the spatial discretization of these problems.
  37 Owing to the great size of such systems, in order to reduce computation times,
  38 we proceed by solving them by parallel synchronous or asynchronous iterative
  39 algorithms. Moreover, we aim at harnessing the computing power of GPUs to accelerate
  40 computations of these parallel algorithms. For this, we use an iterative method
  41 involving a projection on a convex set, which is: the projected Richardson method.
  42 We choose this method among other iterative methods because it is easy to implement
  43 on parallel computers and easy to adapt to GPU architectures.
  44
  45 In Section~\ref{ch13:sec:02}, we present the mathematical model of obstacle problems
  46 then, in Section~\ref{ch13:sec:03}, we describe the general principle of the parallel
  47 projected Richardson method. Next, in Section~\ref{ch13:sec:04}, we give the main key
  48 points of the parallel implementation of both synchronous and asynchronous algorithms
  49 of the projected Richardson method on a GPU cluster. In Section~\ref{ch13:sec:05}, we
  50 present the performances of both parallel algorithms obtained from simulations carried
  51 out on a CPU and GPU clusters. Finally, in Section~\ref{ch13:sec:06}, we use the read-black
  52 ordering technique to improve the convergence and, thus, the execution times of the parallel
  53 projected Richardson algorithms on the GPU cluster.
  54
  55
  56
  57 %%--------------------------%%
  58 %%       SECTION 2          %%
  59 %%--------------------------%%
  60 \section{Obstacle problems}
  61 \label{ch13:sec:02}
  62 In this section, we present the mathematical model of obstacle problems defined in a
  63 three-dimensional domain. This model is based on that presented in~\cite{ch13:ref1}.
  64
  65
  66 %%*******************
  67 %%*******************
  68 \subsection{Mathematical model}
  69 \label{ch13:sec:02.01}
  70 An obstacle problem\index{Obstacle~problem}, arising for example in mechanics or financial
  71 derivatives, consists in solving a time dependent nonlinear equation\index{Nonlinear}:
  72 \begin{equation}
  73 \left\{
  74 \begin{array}{l}
  75 \frac{\partial u}{\partial t}+b^t.\nabla u-\eta.\Delta u+c.u-f\geq 0\mbox{,~}u\geq\phi\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}\eta>0,\\
  76 (\frac{\partial u}{\partial t}+b^t.\nabla u-\eta.\Delta u+c.u-f)(u-\phi)=0\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega,\\
  77 u(0,x,y,z)=u_0(x,y,z),\\
  78 \mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega,
  79 \end{array}
  80 \right.
  81 \label{ch13:eq:01}
  82 \end{equation}
  83 where $u_0$ is the initial condition, $c\geq 0$, $b$ and $\eta$ are physical parameters,
  84 $T$ is the final time, $u=u(t,x,y,z)$ is an element of the solution vector $U$ to compute,
  85 $f$ is the right-hand side that could represent, for example, the external forces, B.C.
  86 describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$,
  87 $\phi$ models a constraint imposed to $u$, $\Delta$ is the Laplacian operator, $\nabla$
  88 is the gradient operator, a.e.w. means almost every where and ``.'' defines the products
  89 between two scalars, a scalar and a vector or a matrix and a vector. In practice the boundary
  90 condition, generally considered, is the Dirichlet condition (where $u$ is fixed on $\partial\Omega$)
  91 or the Neumann condition (where the normal derivative of $u$ is fixed on $\partial\Omega$).
  92
  93 The time dependent equation~(\ref{ch13:eq:01}) is numerically solved by considering an
  94 implicit or a semi-implicit time marching, where at each time step $k$ a stationary nonlinear
  95 problem\index{Nonlinear} is solved:
  96 \begin{equation}
  97 \left\{
  98 \begin{array}{l}
  99 b^t.\nabla u-\eta.\Delta u+(c+\delta).u-g\geq 0\mbox{,~}u\geq\phi\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}\eta>0, \\
 100 (b^t.\nabla u-\eta.\Delta u+(c+\delta).u- g)(u-\phi)=0\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega, \\
 101 \mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega,
 102 \end{array}
 103 \right.
 104 \label{ch13:eq:02}
 105 \end{equation}
 106 where $\delta=\frac{1}{k}$ is the inverse of the time step $k$, $g=f+\delta u^{prev}$ and
 107 $u^{prev}$ is the solution computed at the previous time step.
 108
 109
 110 %%*******************
 111 %%*******************
 112 \subsection{Discretization}
 113 \label{ch13:sec:02.02}
 114 First, we note that the spatial discretization of the previous stationary
 115 problem~(\ref{ch13:eq:02}) does not provide a symmetric matrix, because the
 116 convection-diffusion operator is not self-adjoint. Moreover, the fact that
 117 the operator is self-adjoint or not plays an important role in the choice
 118 of the appropriate algorithm for solving nonlinear systems derived from the
 119 discretization of the obstacle problem\index{Obstacle~problem}. Nevertheless,
 120 since the convection coefficients arising in the operator~(\ref{ch13:eq:02})
 121 are constant, we can formulate the same problem by an self-adjoint operator
 122 by performing a classical change of variables. Then, we can replace the stationary
 123 convection-diffusion problem:
 124 \begin{equation}
 125 b^{t}.\nabla v-\eta.\Delta v+(c+\delta).v=g\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}c\geq 0\mbox{,~}\delta\geq~0,
 126 \label{ch13:eq:03}
 127 \end{equation}
 128 by the following stationary diffusion operator:
 129 \begin{equation}
 130 -\eta.\Delta u+(\frac{\|b\|^{2}_{2}}{4\eta}+c+\delta).u=e^{-a}g=f,
 131 \label{ch13:eq:04}
 132 \end{equation}
 133 where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$ and
 134 $v=e^{-a}.u$ represents the general change of variables such that $a=\frac{b^{t}(x,y,z)}{2\eta}$.
 135 Consequently, the numerical resolution of the diffusion problem (the self-adjoint
 136 operator~(\ref{ch13:eq:04})) is done by optimization algorithms, in contrast, that
 137 of the convection-diffusion problem (non self-adjoint operator~(\ref{ch13:eq:03}))
 138 is done by relaxation algorithms. In the case of our studied algorithm, the convergence\index{Convergence}
 139 is ensured by M-matrix property then, the performance is linked to the magnitude of
 140 the spectral radius of the iteration matrix, which is independent of the condition
 141 number.
 142
 143 Next, the three-dimensional domain $\Omega\subset\mathbb{R}^{3}$ is set to $\Omega=\lbrack 0,1\rbrack^{3}$
 144 and discretized with an uniform Cartesian mesh constituted by $M=m^3$ discretization
 145 points, where $m$ related to the spatial discretization step by $h=\frac{1}{m+1}$. This
 146 is carried out by using a classical order 2 finite difference approximation of the Laplacian.
 147 So, the complete discretization of both stationary boundary value problems~(\ref{ch13:eq:03})
 148 and~(\ref{ch13:eq:04}) leads to the solution of a large discrete complementary problem
 149 of the following form, when both Dirichlet or Neumann boundary conditions are used:
 150 \begin{equation}
 151 \left\{
 152 \begin{array}{l}
 153 \mbox{Find~}U^{*}\in\mathbb{R}^{M}\mbox{~such~that} \\
 154 (A+\delta I)U^{*}-G\geq 0\mbox{,~}U^{*}\geq\bar{\Phi},\\
 155 ((A+\delta I)U^{*}-G)^{T}(U^{*}-\bar{\Phi})=0,\\
 156 \end{array}
 157 \right.
 158 \label{ch13:eq:05}
 159 \end{equation}
 160 where $A$ is a matrix obtained after the spatial discretization by a finite difference
 161 method, $G$ is derived from the Euler first order implicit time marching scheme and from
 162 the discretized right-hand side of the obstacle problem, $\delta$ is the inverse of the
 163 time step $k$ and $I$ is the identity matrix. The matrix $A$ is symmetric when the self-adjoint
 164 operator is considered and nonsymmetric otherwise.
 165
 166 According to the chosen discretization scheme of the Laplacian, $A$ is an M-matrix (irreducibly
 167 diagonal dominant, see~\cite{ch13:ref2}) and, consequently, the matrix $(A+\delta I)$ is also
 168 an M-matrix. This property is important to the convergence of iterative methods.
 169
 170
 171 %%--------------------------%%
 172 %%       SECTION 3          %%
 173 %%--------------------------%%
 174 \section{Parallel iterative method}
 175 \label{ch13:sec:03}
 176 Owing to the large size of the previous discrete complementary problem~(\ref{ch13:eq:05}),
 177 we will solve it by parallel synchronous or asynchronous iterative algorithms (see~\cite{ch13:ref3,ch13:ref4,ch13:ref5}).
 178 In this chapter, we aim at harnessing the computing power of GPU clusters for solving these
 179 large nonlinear systems\index{Nonlinear}. Then, we choose to use the projected Richardson
 180 iterative method\index{Iterative~method!Projected~Richardson} for solving the diffusion
 181 problem~(\ref{ch13:eq:04}). Indeed, this method is based on the iterations of the Jacobi
 182 method\index{Iterative~method!Jacobi} which are easy to parallelize on parallel computers
 183 and easy to adapt to GPU architectures. Then, according to the boundary value problem
 184 formulation with a self-adjoint operator~(\ref{ch13:eq:04}), we can consider here the
 185 equivalent optimization problem and the fixed point mapping associated to its solution.
 186
 187 Assume that $E=\mathbb{R}^{M}$ is a Hilbert space\index{Hilbert~space}, in which $\scalprod{.}{.}$
 188 is the scalar product and $\|.\|$ its associated norm. So, the general fixed point problem
 189 to be solved is defined as follows:
 190 \begin{equation}
 191 \left\{
 192 \begin{array}{l}
 193 \mbox{Find~} U^{*} \in E \mbox{~such that} \\
 194 U^{*} = F(U^{*}), \\
 195 \end{array}
 196 \right.
 197 \label{ch13:eq:06}
 198 \end{equation}
 199 where $U\mapsto F(U)$ is an application from $E$ to $E$.
 200
 201 Let $K$ be a closed convex set defined by:
 202 \begin{equation}
 203 K = \{U | U \geq \Phi \mbox{~everywhere in~} E\},
 204 \label{ch13:eq:07}
 205 \end{equation}
 206 where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{ch13:eq:05})\index{Obstacle~problem}
 207 is formulated as the following constrained optimization problem:
 208 \begin{equation}
 209 \left\{
 210 \begin{array}{l}
 211 \mbox{Find~} U^{*} \in K \mbox{~such that} \\
 212 \forall V \in K, J(U^{*}) \leq J(V), \\
 213 \end{array}
 214 \right.
 215 \label{ch13:eq:08}
 216 \end{equation}
 217 where the cost function is given by:
 218 \begin{equation}
 219 J(U) = \frac{1}{2}\scalprod{\mathcal{A}.U}{U} - \scalprod{G}{U},
 220 \label{ch13:eq:09}
 221 \end{equation}
 222 in which $\scalprod{.}{.}$ denotes the scalar product in $E$, $\mathcal{A}=A+\delta I$
 223 is a symmetric positive definite, $A$ is the discretization matrix associated with the
 224 self-adjoint operator~(\ref{ch13:eq:04}) after change of variables.
 225
 226 For any $U\in E$, let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$,
 227 $\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected Richardson method\index{Iterative~method!Projected~Richardson}
 228 is defined as follows:
 229 \begin{equation}
 230 U^{*} = F_{\gamma}(U^{*}) = P_K(U^{*} - \gamma(\mathcal{A}.U^{*} - G)).
 231 \label{ch13:eq:10}
 232 \end{equation}
 233 In order to reduce the computation time, the large optimization problem is solved in a
 234 numerical way by using a parallel asynchronous algorithm of the projected Richardson method
 235 on the convex set $K$. Particularly, we will consider an asynchronous parallel adaptation
 236 of the projected Richardson method~\cite{ch13:ref6}.
 237
 238 Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$
 239 is a product of $\alpha$ subspaces $E_i$ where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$,
 240 where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space\index{Hilbert~space}
 241 in which $\scalprod{.}{.}_i$ denotes the scalar product and $|.|_i$ the associated norm, for
 242 all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$
 243 is the scalar product on $E$.
 244
 245 Let $U\in E$, we consider the following decomposition of $U$ and the corresponding decomposition
 246 of $F_\gamma$ into $\alpha$ blocks:
 247 \begin{equation}
 248 \begin{array}{rcl}
 249 U    & = & (U_1,\ldots,U_{\alpha}), \\
 250 F_{\gamma}(U) & = & (F_{1,\gamma}(U),\ldots,F_{\alpha,\gamma}(U)). \\
 251 \end{array}
 252 \label{ch13:eq:11}
 253 \end{equation}
 254 Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$
 255 and $K_i$ is a closed convex set. Let also $G=(G_1,\ldots,G_{\alpha})\in E$ and, for any
 256 $U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$
 257 on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$ is the projector from $E_i$ onto
 258 $K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{ch13:eq:10})\index{Iterative~method!Projected~Richardson}
 259 can be written in the following way:
 260 \begin{equation}
 261 \forall U\in E\mbox{,~}\forall i\in\{1,\ldots,\alpha\}\mbox{,~}F_{i,\gamma}(U) = P_{K_i}(U_i - \gamma(\mathcal{A}_i.U - G_i)).
 262 \label{ch13:eq:12}
 263 \end{equation}
 264 Note that $\displaystyle\mathcal{A}_i.U= \sum_{j=1}^{\alpha}\mathcal{A}_{i,j}.U_j$, where
 265 $\mathcal{A}_{i,j}$ denote block matrices of $\mathcal{A}$.
 266
 267 The parallel asynchronous iterations of the projected Richardson method for solving the
 268 obstacle problem~(\ref{ch13:eq:08}) are defined as follows: let $U^0\in E,U^0=(U^0_1,\ldots,U^0_\alpha)$
 269 be the initial solution, then for all $p\in\mathbb{N}$, the iterate $U^{p+1}=(U^{p+1}_1,\ldots,U^{p+1}_{\alpha})$
 270 is recursively defined by:
 271 \begin{equation}
 272 U_i^{p+1} =
 273 \left\{
 274 \begin{array}{l}
 275 F_{i,\gamma}(U_1^{\rho_1(p)}, \ldots, U_{\alpha}^{\rho_{\alpha}(p)}) \mbox{~if~} i\in s(p), \\
 276 U_i^p \mbox{~otherwise}, \\
 277 \end{array}
 278 \right.
 279 \label{ch13:eq:13}
 280 \end{equation}
 281 where
 282 \begin{equation}
 283 \left\{
 284 \begin{array}{l}
 285 \forall p\in\mathbb{N}, s(p)\subset\{1,\ldots,\alpha\}\mbox{~and~} s(p)\ne\emptyset, \\
 286 \forall i\in\{1,\ldots,\alpha\},\{p \ | \ i \in s(p)\}\mbox{~is denombrable},
 287 \end{array}
 288 \right.
 289 \label{ch13:eq:14}
 290 \end{equation}
 291 and $\forall j\in\{1,\ldots,\alpha\}$,
 292 \begin{equation}
 293 \left\{
 294 \begin{array}{l}
 295 \forall p\in\mathbb{N}, \rho_j(p)\in\mathbb{N}, 0\leq\rho_j(p)\leq p\mbox{~and~}\rho_j(p)=p\mbox{~if~} j\in s(p),\\
 296 \displaystyle\lim_{p\to\infty}\rho_j(p) = +\infty.\\
 297 \end{array}
 298 \right.
 299 \label{ch13:eq:15}
 300 \end{equation}
 301
 302 The previous asynchronous scheme\index{Asynchronous} of the projected Richardson
 303 method models computations that are carried out in parallel without order nor
 304 synchronization (according to the behavior of the parallel iterative method) and
 305 describes a subdomain method without overlapping. It is a general model that takes
 306 into account all possible situations of parallel computations and nonblocking message
 307 passing. So, the synchronous iterative scheme\index{Synchronous} is defined by:
 308 \begin{equation}
 309 \forall j\in\{1,\ldots,\alpha\} \mbox{,~} \forall p\in\mathbb{N} \mbox{,~} \rho_j(p)=p.
 310 \label{ch13:eq:16}
 311 \end{equation}
 312 The values of $s(p)$ and $\rho_j(p)$ are defined dynamically and not explicitly by
 313 the parallel asynchronous or synchronous execution of the algorithm. Particularly,
 314 it enables one to consider distributed computations whereby processors compute at
 315 their own pace according to their intrinsic characteristics and computational load.
 316 The parallelism between the processors is well described by the set $s(p)$ which
 317 contains at each step $p$ the index of the components relaxed by each processor on
 318 a parallel way while the use of delayed components in~(\ref{ch13:eq:13}) permits one
 319 to model nondeterministic behavior and does not imply inefficiency of the considered
 320 distributed scheme of computation. Note that, according to~\cite{ch13:ref7}, theoretically,
 321 each component of the vector must be relaxed an infinity of time. The choice of the
 322 relaxed components to be used in the computational process may be guided by any criterion
 323 and, in particular, a natural criterion is to pick-up the most recently available
 324 values of the components computed by the processors. Furthermore, the asynchronous
 325 iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI~subroutines!Nonblocking}
 326 (asynchronous communications).
 327
 328 The important property ensuring the convergence of the parallel projected Richardson
 329 method, both synchronous and asynchronous algorithms, is the fact that $\mathcal{A}$
 330 is an M-matrix. Moreover, the convergence\index{Convergence} proceeds from a result
 331 of~\cite{ch13:ref6}. Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$,
 332 the parallel iterations~(\ref{ch13:eq:13}), (\ref{ch13:eq:14}) and~(\ref{ch13:eq:15}),
 333 associated to the fixed point mapping $F_\gamma$~(\ref{ch13:eq:12}), converge to the
 334 unique solution $U^{*}$ of the discretized problem.
 335
 336
 337 %%--------------------------%%
 338 %%       SECTION 4          %%
 339 %%--------------------------%%
 340 \section{Parallel implementation on a GPU cluster}
 341 \label{ch13:sec:04}
 342 In this section, we give the main key points of the parallel implementation of the
 343 projected Richardson method, both synchronous and asynchronous versions, on a GPU
 344 cluster, for solving the nonlinear systems derived from the discretization of large
 345 obstacle problems. More precisely, each nonlinear system is solved iteratively using
 346 the whole cluster. We use a heterogeneous CUDA/MPI programming. Indeed, the communication
 347 of data, at each iteration between the GPU computing nodes, can be either synchronous
 348 or asynchronous using the MPI communication subroutines, whereas inside each GPU node,
 349 a CUDA parallelization is performed.
 350
 351 \begin{figure}[!h]
 352 \centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitCPU}}
 353 \caption{Data partitioning of a problem to be solved among $S=3\times 4$ computing nodes.}
 354 \label{ch13:fig:01}
 355 \end{figure}
 356
 357 Let $S$ denote the number of computing nodes\index{Computing~node} on the GPU cluster,
 358 where a computing node is composed of CPU core holding one MPI process and a GPU card.
 359 So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$
 360 is split into $S$ parallelepipedic sub-problems, each for a node (MPI process, GPU), as
 361 is shown in Figure~\ref{ch13:fig:01}. Indeed, the $NY$ and $NZ$ dimensions (according
 362 to the $y$ and $z$ axises) of the three-dimensional problem are, respectively, split
 363 into $Sy$ and $Sz$ parts, such that $S=Sy\times Sz$. In this case, each computing node
 364 has at most four neighboring nodes. This kind of the data partitioning reduces the data
 365 exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning.
 366
 367 \begin{algorithm}[!t]
 368 Initialization of the parameters of the sub-problem\;
 369 Allocate and fill the data in the global memory GPU\;
 370 \For{$i=1$ {\bf to} $NbSteps$}{
 371    $G = \frac{1}{k}.U^0 + F$\;
 372    Solve($A$, $U^0$, $G$, $U$, $\varepsilon$, $MaxRelax$)\;
 373    $U^0 = U$\;
 374 }
 375 Copy the solution $U$ back from GPU memory\;
 376 \caption{Parallel solving of the obstacle problem on a GPU cluster}
 377 \label{ch13:alg:01}
 378 \end{algorithm}
 379
 380 All the computing nodes of the GPU cluster execute in parallel the same Algorithm~\ref{ch13:alg:01}
 381 but on different three-dimensional sub-problems of size $(NX\times ny\times nz)$.
 382 This algorithm gives the main key points for solving an obstacle problem\index{Obstacle~problem}
 383 defined in a three-dimensional domain, where $A$ is the discretization matrix, $G$
 384 is the right-hand side and $U$ is the solution vector. After the initialization step,
 385 all the data generated from the partitioning operation are copied from the CPU memories
 386 to the GPU global memories, to be processed on the GPUs. Next, the algorithm uses $NbSteps$
 387 time steps to solve the global obstacle problem. In fact, it uses a parallel algorithm
 388 adapted to GPUs of the projected Richardson iterative method for solving the nonlinear
 389 systems\index{Nonlinear} of the obstacle problem. This function is defined by {\it Solve()}
 390 in Algorithm~\ref{ch13:alg:01}. At every time step, the initial guess $U^0$ for the iterative
 391 algorithm is set to the solution found at the previous time step. Moreover, the right-hand
 392 side $G$ is computed as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step,
 393 $U^{prev}$ is the solution computed in the previous time step and each element $f(x, y, z)$
 394 of the vector $F$ is computed as follows:
 395 \begin{equation}
 396 f(x,y,z)=\cos(2\pi x)\cdot\cos(4\pi y)\cdot\cos(6\pi z).
 397 \label{ch13:eq:18}
 398 \end{equation}
 399 Finally, the solution $U$ of the obstacle problem is copied back from the GPU global
 400 memories to the CPU memories. We use the communication subroutines of the CUBLAS library~\cite{ch13:ref8}\index{CUBLAS}
 401 (CUDA Basic Linear Algebra Subroutines) for the memory allocations in the GPU (\verb+cublasAlloc()+)
 402 and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and \verb+cublasGetVector()+.
 403
 404 \begin{algorithm}[!t]
 405   $p = 0$\;
 406   $conv = false$\;
 407   $U = U^{0}$\;
 408   \Repeat{$(conv=true)$}{
 409     Determine\_Bordering\_Vector\_Elements($U$)\;
 410     Compute\_New\_Vector\_Elements($A$, $G$, $U$)\;
 411     $tmp = U^{0} - U$\;
 412     $error = \|tmp\|_{2}$\;
 413     $U^{0} = U$\;
 414     $p = p + 1$\;
 415     $conv$ = Convergence($error$, $p$, $\varepsilon$, $MaxRelax$)\;
 416   }
 417 \caption{Parallel iterative solving of the nonlinear systems on a GPU cluster ($Solve()$ function)}
 418 \label{ch13:alg:02}
 419 \end{algorithm}
 420
 421 As many other iterative methods, the algorithm of the projected Richardson
 422 method\index{Iterative~method!Projected~Richardson} is based on algebraic
 423 functions operating on vectors and/or matrices, which are more efficient on
 424 parallel computers when they work on large vectors. Its parallel implementation
 425 on the GPU cluster is carried out so that the GPUs execute the vector operations
 426 as kernels and the CPUs execute the serial codes, supervise the kernel executions
 427 and the data exchanges with the neighboring nodes\index{Neighboring~node} and
 428 supply the GPUs with data. Algorithm~\ref{ch13:alg:02} shows the main key points
 429 of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{ch13:alg:01}).
 430 All the vector operations inside the main loop ({\bf repeat} ... {\bf until})
 431 are executed by the GPU. We use the following functions of the CUBLAS library\index{CUBLAS}:
 432 \begin{itemize}
 433 \item \verb+cublasDaxpy()+ to compute the difference between the solution vectors $U^{p}$ and $U^{p+1}$ computed in two successive relaxations
 434 $p$ and $p+1$ (line~$7$ in Algorithm~\ref{ch13:alg:02}),
 435 \item \verb+cublasDnrm2()+ to perform the Euclidean norm (line~$8$) and,
 436 \item \verb+cublasDcpy()+ for the data copy of a vector to another one in the GPU memory (lines~$3$ and~$9$).
 437 \end{itemize}
 438
 439 The dimensions of the grid and blocks of threads that execute a given kernel
 440 depend on the resources of the GPU multiprocessor and the resource requirements
 441 of the kernel. So, if $block$ defines the size of a thread block, which must
 442 not exceed the maximum size of a thread block, then the number of thread blocks
 443 in the grid, denoted by $grid$, can be computed according to the size of the
 444 local sub-problem as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\]
 445 However, when solving very large problems, the size of the thread grid can exceed
 446 the maximum number of thread blocks that can be executed on the GPUs (up-to $65.535$
 447 thread blocks) and, thus, the kernel will fail to launch. Therefore, for each kernel,
 448 we decompose the three-dimensional sub-problem into $nz$ two-dimensional slices of size
 449 ($NX\times ny$), as is shown in Figure~\ref{ch13:fig:02}. All slices of the same kernel
 450 are executed using {\bf for} loop by $NX\times ny$ parallel threads organized in a
 451 two-dimensional grid of two-dimensional thread blocks, as is shown in Listing~\ref{ch13:list:01}.
 452 Each thread is in charge of $nz$ discretization points (one from each slice), accessed
 453 in the GPU memory with a constant stride $(NX\times ny)$.
 454
 455 \begin{figure}
 456 \centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitGPU}}
 457 \caption{Decomposition of a sub-problem in a GPU into $nz$ slices.}
 458 \label{ch13:fig:02}
 459 \end{figure}
 460
 461 \begin{center}
 462 \lstinputlisting[label=ch13:list:01,caption=Skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu}
 463 \end{center}
 464 The function $Determine\_Bordering\_Vector\_Elements()$ (line~$5$ in Algorithm~\ref{ch13:alg:02})
 465 determines the values of the vector elements shared at the boundaries with neighboring computing
 466 nodes. Its main operations are defined as follows:
 467 \begin{enumerate}
 468 \item define the values associated to the bordering points needed by the neighbors,
 469 \item copy the values associated to the bordering points from the GPU to the CPU,
 470 \item exchange the values associated to the bordering points between the neighboring CPUs,
 471 \item copy the received values associated to the bordering points from the CPU to the GPU,
 472 \end{enumerate}
 473 The first operation of this function is implemented as kernels to be performed by the GPU:
 474 \begin{itemize}
 475 \item a kernel executed by $NX\times nz$ threads to define the values associated to the bordering vector elements along $y$-axis and,
 476 \item a kernel executed by $NX\times ny$ threads to define the values associated to the bordering vector elements along $z$-axis.
 477 \end{itemize}
 478 As mentioned before, we develop the \emph{synchronous} and \emph{asynchronous}
 479 algorithms of the projected Richardson method. Obviously, in this scope, the
 480 synchronous\index{Synchronous} or asynchronous\index{Asynchronous} communications
 481 refer to the communications between the CPU cores (MPI processes) on the GPU cluster,
 482 in order to exchange the vector elements associated to subdomain boundaries. For
 483 the memory copies between a CPU core and its GPU, we use the synchronous communication
 484 routines of the CUBLAS library\index{CUBLAS}: \verb+cublasSetVector()+ and \verb+cublasGetVector()+
 485 in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsync()+
 486 and \verb+cublasGetVectorAsync()+ in the asynchronous algorithm. Moreover, we
 487 use the communication routines of the MPI library to carry out the data exchanges
 488 between the neighboring nodes. We use the following communication routines: \verb+MPI_Isend()+
 489 and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI~subroutines!Nonblocking}
 490 sends and receptions, respectively. For the synchronous algorithm, we use the MPI
 491 routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node in
 492 blocking status until all data exchanges with neighboring nodes (sends and receptions)
 493 are completed. In contrast, for the asynchronous algorithms, we use the MPI routine
 494 \verb+MPI_Test()+ which tests the completion of a data exchange (send or reception)
 495 without putting the MPI process in blocking status\index{MPI~subroutines!Blocking}.
 496
 497 The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{ch13:alg:02})
 498 computes, at each iteration, the new elements of the iterate vector $U$. Its general code
 499 is presented in Listing~\ref{ch13:list:01} (CPU function). The iterations of the projected
 500 Richardson method\index{Iterative~method!Projected~Richardson}, based on those of the Jacobi
 501 method\index{Iterative~method!Jacobi}, are defined as follows:
 502 \begin{equation}
 503 \begin{array}{ll}
 504 u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\
 505 & West\cdot u^{p}(x-h,y,z) + East\cdot u^{p}(x+h,y,z) + \\
 506 & South\cdot u^{p}(x,y-h,z) + North\cdot u^{p}(x,y+h,z) + \\
 507 & Rear\cdot u^{p}(x,y,z-h) + Front\cdot u^{p}(x,y,z+h))),
 508 \end{array}
 509 \label{ch13:eq:17}
 510 \end{equation}
 511 where $u^{p}(x,y,z)$ is an element of the iterate vector $U$ computed at the
 512 iteration $p$ and $g(x,y,z)$ is a vector element of the right-hand side $G$.
 513 The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$ and $Front$
 514 define constant coefficients of the block matrix $A$. Figure~\ref{ch13:fig:03}
 515 shows the positions of these coefficients in a three-dimensional domain.
 516
 517 \begin{figure}
 518 \centerline{\includegraphics[scale=0.35]{Chapters/chapter13/figures/matrix}}
 519 \caption{Matrix constant coefficients in a three-dimensional domain.}
 520 \label{ch13:fig:03}
 521 \end{figure}
 522
 523 The kernel implementations of the projected Richardson method on GPUs uses a
 524 perfect fine-grain multithreading parallelism. Since the projected Richardson
 525 algorithm is implemented as a fixed point method, each kernel is executed by
 526 a large number of GPU threads such that each thread is in charge of the computation
 527 of one element of the iterate vector $U$. Moreover, this method uses the vector
 528 elements updates of the Jacobi method, which means that each thread $i$ computes
 529 the new value of its element $u_{i}^{p+1}$ independently of the new values $u_{j}^{p+1}$,
 530 where $j\neq i$, of those computed in parallel by other threads at the same iteration
 531 $p+1$. Listing~\ref{ch13:list:02} shows the GPU implementations of the main kernels
 532 of the projected Richardson method, which are: the matrix-vector multiplication
 533 (\verb+MV_Multiplication()+) and the vector elements updates (\verb+Vector_Updates()+).
 534 The codes of these kernels are based on that presented in Listing~\ref{ch13:list:01}.
 535
 536 \lstinputlisting[label=ch13:list:02,caption=GPU kernels of the projected Richardson method]{Chapters/chapter13/ex2.cu}
 537
 538 \begin{figure}
 539 \centerline{\includegraphics[scale=0.3]{Chapters/chapter13/figures/points3D}}
 540 \caption{Computation of a vector element with the projected Richardson method.}
 541 \label{ch13:fig:04}
 542 \end{figure}
 543
 544 Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices
 545 of $(NX\times ny)$ vector elements are computed in a {\bf for} loop. In
 546 this case, each thread is in charge of one vector element from each slice
 547 (in total $nz$ vector elements along $z$-axis). We can notice from the
 548 formula~(\ref{ch13:eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$,
 549 by a thread at iteration $p+1$, requires seven vector elements computed
 550 at the previous iteration $p$: two vector elements in each dimension plus
 551 the vector element at the intersection of the three axises $x$, $y$ and $z$
 552 (see Figure~\ref{ch13:fig:04}). So, to reduce the memory accesses to the
 553 high-latency global memory, the vector elements of the current slice can
 554 be stored in the low-latency shared memories of thread blocks, as is described
 555 in~\cite{ch13:ref9}. Nevertheless, the fact that the computation of a vector
 556 element requires only two elements in each dimension does not allow to maximize
 557 the data reuse from the shared memories. The computation of a slice involves
 558 in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block,
 559 to fill the required vector elements in the shared memory where $bx$ and $by$
 560 are the dimensions of a thread block. Then, in order to optimize the memory
 561 accesses on GPUs, the elements of the iterate vector $U$ are filled in the
 562 cache texture memory (see~\cite{ch13:ref10}). In new GPU generations as Fermi
 563 or Kepler, the global memory accesses are always cached in L1 and L2 caches.
 564 For example, for a given kernel, we can favour the use of the L1 cache to that
 565 of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+.
 566 So, the initial access to the global memory loads the vector elements required
 567 by the threads of a block into the cache memory (texture or L1/L2 caches). Then,
 568 all the following memory accesses read from this cache memory. In Listing~\ref{ch13:list:02},
 569 the function \verb+fetch_double(v,i)+ is used to read from the texture memory
 570 the $i^{th}$ element of the double-precision vector \verb+v+ (see Listing~\ref{ch13:list:03}).
 571 Moreover, the seven constant coefficients of matrix $A$ can be stored in the
 572 constant memory but, since they are reused $nz$ times by each thread, it is more
 573 interesting to fill them on the low-latency registers of each thread.
 574
 575 \lstinputlisting[label=ch13:list:03,caption=Memory access to the cache texture memory]{Chapters/chapter13/ex3.cu}
 576
 577 The function $Convergence()$ (line~$11$ in Algorithm~\ref{ch13:alg:02}) allows
 578 to detect the convergence of the parallel iterative algorithm and is based on
 579 the tolerance threshold\index{Convergence!Tolerance~threshold} $\varepsilon$
 580 and the maximum number of relaxations\index{Convergence!Maximum~number~of~relaxations}
 581 $MaxRelax$. We take into account the number of relaxations since that of iterations
 582 cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{ch13:eq:13})
 583 of a local iterate vector $U_i$ according to $F_i$. Then, counting the number
 584 of relaxations is possible in both synchronous and asynchronous cases. On the
 585 other hand, an iteration is the update of at least all vector components with
 586 $F_i$.
 587
 588 In the synchronous\index{Synchronous} algorithm, the global convergence is detected
 589 when the maximal value of the absolute error, $error$, is sufficiently small and/or
 590 the maximum number of relaxations, $MaxRelax$, is reached, as follows:
 591 $$
 592 \begin{array}{l}
 593 error=\|U^{p}-U^{p+1}\|_{2}; \\
 594 AllReduce(error,\hspace{0.1cm}maxerror,\hspace{0.1cm}MAX); \\
 595 \text{if}((maxerror<\varepsilon)\hspace{0.2cm}\text{or}\hspace{0.2cm}(p\geq MaxRelax)) \\
 596 conv \leftarrow true;
 597 \end{array}
 598 $$
 599 where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI~subroutines!Global}
 600 \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local
 601 absolute errors, $error$, of all computing nodes and $p$ (in Algorithm~\ref{ch13:alg:02})
 602 is used as a counter of the local relaxations carried out by a computing node. In
 603 the asynchronous\index{Asynchronous} algorithms, the global convergence is detected
 604 when all computing nodes locally converge. For this, we use a token ring architecture
 605 around which a boolean token travels, in one direction, from a computing node to another.
 606 Starting from node $0$, the boolean token is set to $true$ by node $i$ if the local
 607 convergence is reached or to $false$ otherwise and, then, it is sent to node $i+1$.
 608 Finally, the global convergence is detected when node $0$ receives from its neighbor
 609 node $S-1$, in the ring architecture, a token set to $true$. In this case, node $0$
 610 sends a stop message (end of parallel solving) to all computing nodes in the cluster.
 611
 612
 613 %%--------------------------%%
 614 %%       SECTION 5          %%
 615 %%--------------------------%%
 616 \section{Experimental tests on a GPU cluster}
 617 \label{ch13:sec:05}
 618 The GPU cluster\index{GPU~cluster} of tests, that we used in this chapter, is an $20Gbps$
 619 Infiniband network of six machines. Each machine is a Quad-Core Xeon E5530 CPU running at
 620 $2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it
 621 is equipped with two Nvidia Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores
 622 running at $1.3$GHz. It provides $4$GB of global memory with a memory bandwidth of $102$GB/s,
 623 accessible by all its cores and also by the CPU through the PCI-Express 16x Gen 2.0 interface
 624 with a throughput of $8$GB/s. Hence, the memory copy operations between the GPU and the CPU
 625 are about $12$ times slower than those of the Tesla GPU memory. We have performed our simulations
 626 on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{ch13:fig:05} describes
 627 the components of the GPU cluster of tests.
 628
 629 Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for
 630 coding the parallel algorithms of the methods on both GPU cluster and CPU cluster. CUDA
 631 version 4.0~\cite{ch13:ref12} is used for programming GPUs, using CUBLAS library~\cite{ch13:ref8}
 632 to deal with vector operations in GPUs and, finally, MPI functions of OpenMPI 1.3.3 are
 633 used to carry out the synchronous and asynchronous communications between CPU cores. Indeed,
 634 in our experiments, a computing node is managed by a MPI process and it is composed of
 635 one CPU core and one GPU card.
 636
 637 All experimental results of the parallel projected Richardson algorithms are obtained
 638 from simulations made in double precision data. The obstacle problems to be solved are
 639 defined in constant three-dimensional domain $\Omega\subset\mathbb{R}^{3}$. The numerical
 640 values of the parameters of the obstacle problems are: $\eta=0.2$, $c=1.1$, $f$ is computed
 641 by formula~(\ref{ch13:eq:18}) and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$)
 642 are computed with $k=0.0066$. As the discretization matrix is constant along the time
 643 steps, the convergence properties of the iterative algorithms do not change. Thus, the
 644 performance characteristics obtained with three time steps will still be valid for more
 645 time steps. The initial function $u(0,x,y,z)$ of the obstacle problem~(\ref{ch13:eq:01})
 646 is set to $0$, with a constraint $u\geq\phi=0$. The relaxation parameter $\gamma$ used
 647 by the projected Richardson method is computed automatically thanks to the diagonal entries
 648 of the discretization matrix. The formula and its proof can be found in~\cite{ch13:ref11},
 649 Section~2.3. The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the
 650 maximum number of relaxations is limited to $10^{6}$ relaxations. Finally, the number of
 651 threads per block is set to $256$ threads, which gives, in general, good performances for
 652 most GPU applications. We have performed some tests for the execution configurations and
 653 we have noticed that the best configuration of the $256$ threads per block is an organization
 654 into two dimensions of sizes $(64,4)$.
 655
 656 \begin{figure}
 657 \centerline{\includegraphics[scale=0.25]{Chapters/chapter13/figures/cluster}}
 658 \caption{GPU cluster of tests composed of 12 computing nodes (six machines, each with two GPUs.}
 659 \label{ch13:fig:05}
 660 \end{figure}
 661
 662 The performance measures that we took into account are the execution times and the number
 663 of relaxations performed by the parallel iterative algorithms, both synchronous and asynchronous
 664 versions, on the GPU and CPU clusters. These algorithms are used for solving nonlinear systems
 665 derived from the discretization of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$
 666 and $800^{3}$. In Table~\ref{ch13:tab:01} and Table~\ref{ch13:tab:02}, we show the performances
 667 of the parallel synchronous and asynchronous algorithms of the projected Richardson method
 668 implemented, respectively, on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. In
 669 these tables, the execution time defines the time spent by the slowest computing node and the
 670 number of relaxations is computed as the summation of those carried out by all computing nodes.
 671
 672 In the sixth column of Table~\ref{ch13:tab:01} and in the eighth column of Table~\ref{ch13:tab:02},
 673 we give the gains in $\%$ obtained by using an asynchronous algorithm compared to a synchronous
 674 one. We can notice that the asynchronous version on CPU and GPU clusters is slightly faster than
 675 the synchronous one for both methods. Indeed, the cluster of tests is composed of local and homogeneous
 676 nodes communicating via low-latency connections. So, in the case of distant and/or heterogeneous
 677 nodes (or even with geographically distant clusters) the asynchronous version would be faster than
 678 the synchronous one. However, the gains obtained on the GPU cluster are better than those obtained
 679 on the CPU cluster. In fact, the computation times are reduced by accelerating the computations on
 680 GPUs while the communication times still unchanged.
 681
 682 \begin{table}
 683 \centering
 684 \begin{tabular}{|c|c|c|c|c|c|}
 685 \hline
 686 \multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-5}
 687
 688                               & $\mathbf{T_{cpu}}$ & {\bf \#relax.}  & $\mathbf{T_{cpu}}$ & {\bf \#relax.}  &  \\ \hline \hline
 689
 690 $256^{3}$                     & $575.22$           & $198,288$        & $539.25$          & $198,613$        & $6.25$ \\ \hline \hline
 691
 692 $512^{3}$                     & $19,250.25$        & $750,912$        & $18,237.14$       & $769,611$        & $5.26$ \\ \hline \hline
 693
 694 $768^{3}$                     & $206,159.44$       & $1,635,264$      & $183,582.60$      & $1,577,004$      & $10.95$ \\ \hline \hline
 695
 696 $800^{3}$                     & $222,108.09$       & $1,769,232$      & $188,790.04$      & $1,701,735$      & $15.00$ \\ \hline
 697 \end{tabular}
 698 \vspace{0.5cm}
 699 \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 24 CPU cores.}
 700 \label{ch13:tab:01}
 701 \end{table}
 702
 703 \begin{table}
 704 \centering
 705 \begin{tabular}{|c|c|c|c|c|c|c|c|}
 706 \hline
 707 \multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous}                 & \multicolumn{3}{c|}{\bf Asynchronous}                & \multirow{2}{*}{\bf Gain\%}  \\ \cline{2-7}
 708
 709                              & $\mathbf{T_{gpu}}$ & {\bf \#relax.}  & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \#relax.}  & $\mathbf{\tau}$ & \\  \hline \hline
 710
 711 $256^{3}$                    & $29.67$            & $100,692$       &  $19.39$        & $18.00$           & $94,215$         & $29.96$    & $39.33$ \\\hline \hline
 712
 713 $512^{3}$                    & $521.83$           & $381,300$       & $36.89$         & $425.15$          & $347,279$        & $42.89$     & $18.53$ \\\hline \hline
 714
 715 $768^{3}$                    & $4,112.68$         & $831,144$        & $50.13$         & $3,313.87$        & $750,232$        & $55.40$    & $19.42$ \\ \hline \hline
 716
 717 $800^{3}$                    & $3,950.87$         & $899,088$        & $56.22$         & $3,636.57$        & $834,900$        & $51.91$     & $7.95$ \\ \hline
 718 \end{tabular}
 719 \vspace{0.5cm}
 720 \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 12 GPUs.}
 721 \label{ch13:tab:02}
 722 \end{table}
 723
 724 The fourth and seventh columns of Table~\ref{ch13:tab:02} show the relative gains
 725 obtained by executing the parallel algorithms on the cluster of $12$ GPUs instead
 726 on the cluster of $24$ CPU cores. We compute the relative gain\index{Relative~gain}
 727 $\tau$ as a ratio between the execution time $T_{cpu}$ spent on the CPU cluster over
 728 that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see
 729 from these ratios that solving large obstacle problems is faster on the GPU cluster
 730 than on the CPU cluster. Indeed, the GPUs are more efficient than their counterpart
 731 CPUs to execute large data-parallel operations. In addition, the projected Richardson
 732 method is implemented as a fixed point-based iteration and uses the Jacobi vector updates
 733 that allow a well thread-parallelization on GPUs, such that each GPU thread is in charge
 734 of one vector component at a time without being dependent on other vector components
 735 computed by other threads. Then, this allow to exploit at best the high performance
 736 computing of the GPUs by using all the GPU resources and avoiding the idle cores.
 737
 738 Finally, the number of relaxations performed by the parallel synchronous algorithm
 739 is different in the CPU and GPU versions, because the number of computing nodes involved
 740 in the GPU cluster and in the CPU cluster is different. In the CPU case, $24$ computing
 741 nodes ($24$ CPU cores) are considered, whereas in the GPU case, $12$ computing nodes
 742 ($12$ GPUs) are considered. As the number of relaxations depends on the domain decomposition,
 743 consequently it also depends on the number of computing nodes.
 744
 745
 746
 747 %%--------------------------%%
 748 %%       SECTION 6          %%
 749 %%--------------------------%%
 750 \section{Red-Black ordering technique}
 751 \label{ch13:sec:06}
 752 As is well-known, the Jacobi method\index{Iterative~method!Jacobi} is characterized
 753 by a slow convergence\index{Convergence} rate compared to some iterative methods\index{Iterative~method}
 754 (for example Gauss-Seidel method\index{Iterative~method!Gauss-Seidel}). So, in this
 755 section, we present some solutions to reduce the execution time and the number of
 756 relaxations and, more specifically, to speed up the convergence of the parallel
 757 projected Richardson method on the GPU cluster. We propose to use the point red-black
 758 ordering technique\index{Iterative~method!Red-Black~ordering} to accelerate the
 759 convergence. This technique is often used to increase the parallelism of iterative
 760 methods for solving linear systems~\cite{ch13:ref13,ch13:ref14,ch13:ref15}. We
 761 apply it to the projected Richardson method as a compromise between the Jacobi
 762 and Gauss-Seidel iterative methods.
 763
 764 The general principle of the red-black technique is as follows. Let $t$ be the
 765 summation of the integer $x$-, $y$- and $z$-coordinates of a vector element $u(x,y,z)$
 766 on a three-dimensional domain: $t=x+y+z$. As is shown in Figure~\ref{ch13:fig:06.01},
 767 the red-black ordering technique consists in the parallel computing of the red
 768 vector elements having even value $t$ by using the values of the black ones then,
 769 the parallel computing of the black vector elements having odd values $t$ by using
 770 the new values of the red ones.
 771
 772 This technique can be implemented on the GPU in two different manners:
 773 \begin{itemize}
 774 \item among all launched threads ($NX\times ny$ threads), only one thread out of two computes its red or black vector element at a time or,
 775 \item all launched threads (on average half of $NX\times ny$ threads) compute the red vector elements first and, then, the black ones.
 776 \end{itemize}
 777 However, in both solutions, for each memory transaction, only half of the memory
 778 segment addressed by a half-warp is used. So, the computation of the red and black
 779 vector elements leads to use twice the initial number of memory transactions. Then,
 780 we apply the point red-black ordering\index{Iterative~method!Red-Black~ordering}
 781 accordingly to the $y$-coordinate, as is shown in Figure~\ref{ch13:fig:06.02}. In
 782 this case, the vector elements having even $y$-coordinate are computed in parallel
 783 using the values of those having odd $y$-coordinate and then vice-versa. Moreover,
 784 in the GPU implementation of the parallel projected Richardson method (Section~\ref{ch13:sec:04}),
 785 we have shown that a sub-problem of size $(NX\times ny\times nz)$ is decomposed into
 786 $nz$ grids of size $(NX\times ny)$. Then, each kernel is executed in parallel by
 787 $NX\times ny$ GPU threads, so that each thread is in charge of $nz$ vector elements
 788 along $z$-axis (one vector element in each grid of the sub-problem). So, we propose
 789 to use the new values of the vector elements computed in grid $i$ to compute those
 790 of the vector elements in grid $i+1$. Listing~\ref{ch13:list:04} describes the kernel
 791 of the matrix-vector multiplication and the kernel of the vector elements updates of
 792 the parallel projected Richardson method using the red-black ordering technique.
 793
 794 \begin{figure}
 795 \centering
 796   \mbox{\subfigure[Red-Black ordering on x, y and z axises]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir}\label{ch13:fig:06.01}}\quad
 797         \subfigure[Red-Black ordering on y axis]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir-y}\label{ch13:fig:06.02}}}
 798 \caption{Red-Black ordering for computing the iterate vector elements in a three-dimensional space.}
 799 \end{figure}
 800
 801 \lstinputlisting[label=ch13:list:04,caption=GPU kernels of the projected Richardson method using the red-black technique]{Chapters/chapter13/ex4.cu}
 802
 803 Finally, we exploit the concurrent executions between the host functions and the GPU
 804 kernels provided by the GPU hardware and software. In fact, the kernel launches are
 805 asynchronous (when this environment variable is not disabled on the GPUs), such that
 806 the control is returned to the host (MPI process) before the GPU has completed the
 807 requested task (kernel)~\cite{ch13:ref12}. Therefore, all the kernels necessary to
 808 update the local vector elements, $u(x,y,z)$ where $0<y<(ny-1)$ and $0<z<(nz-1)$,
 809 are executed first. Then, the values associated to the bordering vector elements
 810 are exchanged between the neighbors. Finally, the values of the vector elements
 811 associated to the bordering vector elements are updated. In this case, the computation
 812 of the local vector elements is performed concurrently with the data exchanges between
 813 neighboring CPUs and this in both synchronous and asynchronous cases.
 814
 815 In Table~\ref{ch13:tab:03}, we report the execution times and the number of relaxations
 816 performed on a cluster of $12$ GPUs by the parallel projected Richardson algorithms; it
 817 can be noted that the performances of the projected Richardson are improved by using the
 818 point read-black ordering. We compare the performances of the parallel projected Richardson
 819 method with and without this later ordering (Tables~\ref{ch13:tab:02} and~\ref{ch13:tab:03}).
 820 We can notice that both parallel synchronous and asynchronous algorithms are faster when
 821 they use the red-black ordering. Indeed, we can see in Table~\ref{ch13:tab:03} that the
 822 execution times of these algorithms are reduced, on average, by $32\%$ compared to those
 823 shown in Table~\ref{ch13:tab:02}.
 824
 825 \begin{table}[!h]
 826 \centering
 827 \begin{tabular}{|c|c|c|c|c|c|}
 828 \hline
 829 \multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%}  \\ \cline{2-5}
 830
 831                               & $\mathbf{T_{gpu}}$ & {\bf \#relax.}   & $\mathbf{T_{gpu}}$ & {\bf \#relax.}   &           \\  \hline \hline
 832
 833 $256^{3}$                     & $18.37$            & $71,988$         & $12.58$           & $67,638$         & $31.52$  \\ \hline \hline
 834
 835 $512^{3}$                     & $349.23$           & $271,188$        & $289.41$          & $246,036$        & $17.13$ \\ \hline \hline
 836
 837 $768^{3}$                     & $2,773.65$         & $590,652$        & $2,222.22$        & $532,806$        & $19.88$ \\ \hline \hline
 838
 839 $800^{3}$                     & $2,748.23$         & $638,916$        & $2,502.61$        & $592,525$        & $8.92$ \\ \hline
 840 \end{tabular}
 841 \vspace{0.5cm}
 842 \caption{Execution times in seconds of the parallel projected Richardson method using read-black ordering technique implemented on a cluster of 12 GPUs.}
 843 \label{ch13:tab:03}
 844 \end{table}
 845
 846 \begin{figure}
 847 \centerline{\includegraphics[scale=0.7]{Chapters/chapter13/figures/scale}}
 848 \caption{Weak scaling of both synchronous and asynchronous algorithms of the projected Richardson method using red-black ordering technique.}
 849 \label{ch13:fig:07}
 850 \end{figure}
 851
 852 In Figure~\ref{ch13:fig:07}, we study the ratio between the computation time and
 853 the communication time of the parallel projected Richardson algorithms on a GPU
 854 cluster. The experimental tests are carried out on a cluster composed of one to
 855 ten Tesla GPUs. We have focused on the weak scaling of both parallel, synchronous
 856 and asynchronous, algorithms using the red-black ordering technique. For this, we
 857 have fixed the size of a sub-problem to $256^{3}$ per computing node (a CPU core
 858 and a GPU). Then, Figure~\ref{ch13:fig:07} shows the number of relaxations performed,
 859 on average, per second by a computing node. We can see from this figure that the
 860 efficiency of the asynchronous algorithm is almost stable, while that of the synchronous
 861 algorithm decreases (down to $81\%$ in this example) with the increasing of the
 862 number of computing nodes on the cluster. This is due to the fact that the ratio
 863 between the time of the computation over that of the communication is reduced when
 864 the computations are performed on GPUs. Indeed, GPUs compute faster than CPUs and
 865 communications are more time consuming. In this context, asynchronous algorithms
 866 are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{Synchronous}
 867 algorithms might be more penalized by communications, as can be deduced from Figure~\ref{ch13:fig:07}.
 868 That is why we think that asynchronous\index{Asynchronous} iterative algorithms
 869 are all the more interesting in this case.
 870
 871
 872 %%--------------------------%%
 873 %%       SECTION 7          %%
 874 %%--------------------------%%
 875 \section{Conclusion}
 876 \label{ch13:sec:07}
 877 Our main contribution, in this chapter, is the parallel implementation of an asynchronous
 878 iterative method on GPU clusters for solving large scale nonlinear systems derived from the
 879 spatial discretization of three-dimensional obstacle problems. For this, we have implemented
 880 both synchronous and asynchronous algorithms of the Richardson iterative method using a projection
 881 on a convex set. Indeed, this method uses point-based iterations of the Jacobi method that
 882 are very easy to parallelize on parallel computers. We have shown that its adapted parallel
 883 algorithms to GPU architectures allows to exploit at best the computing power of the GPUs and
 884 to accelerate the resolution of large nonlinear systems. Consequently, the experimental results
 885 have shown that solving nonlinear systems of large obstacle problems with this method is about
 886 fifty times faster on a cluster of $12$ GPUs than on a cluster of $24$ CPU cores. Moreover,
 887 we have applied to this projected Richardson method the red-black ordering technique which
 888 allows it to improve its convergence rate. Thus, the execution times of both parallel algorithms
 889 performed on the cluster of $12$ GPUs are reduced on average of $32\%$.
 890
 891 Afterwards, the experiments have shown that the asynchronous version is slightly more efficient
 892 than the synchronous one. In fact, the computations are accelerated by using GPUs while the communication
 893 times still unchanged. In addition, we have studied the weak-scaling in the synchronous and asynchronous
 894 cases, which has confirmed that the ratio between the computations and the communications are reduced
 895 when using a cluster of GPUs. We highlight that asynchronous iterative algorithms are more scalable
 896 than synchronous ones. Therefore, we can conclude that asynchronous iterations are well suited to
 897 tackle scalability issues on GPU clusters.
 898
 899 In future works, we plan to perform experiments on large scale GPU clusters and on geographically
 900 distant GPU clusters, because we expect that asynchronous versions would be faster and more scalable
 901 on such architectures. Furthermore, we want to study the performance behavior and the scalability of
 902 other numerical algorithms which support, if possible, the model of asynchronous iterations.
 903
 904 \putbib[Chapters/chapter13/biblio13]
 905