X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/18cb85fd3967a4c46c9582e8515cef9af3448269..df81c0a751f143232818e778c671084a19ce8063:/BookGPU/Chapters/chapter13/ch13.tex diff --git a/BookGPU/Chapters/chapter13/ch13.tex b/BookGPU/Chapters/chapter13/ch13.tex index 635181d..7a1aa42 100755 --- a/BookGPU/Chapters/chapter13/ch13.tex +++ b/BookGPU/Chapters/chapter13/ch13.tex @@ -4,49 +4,71 @@ %% %% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\chapterauthor{}{} -\newcommand{\scalprod}[2]% -{\ensuremath{\langle #1 \, , #2 \rangle}} +%\chapterauthor{}{} +\chapterauthor{Lilia Ziane Khodja}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Ming Chau}{Advanced Solutions Accelerator, Castelnau Le Lez, France} +\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Pierre Spitéri}{ENSEEIHT-IRIT, Toulouse, France} +\chapterauthor{Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} + + \chapter{Solving sparse nonlinear systems of obstacle problems on GPU clusters} +\label{ch13} + + + %%--------------------------%% %% SECTION 1 %% %%--------------------------%% \section{Introduction} -\label{sec:01} -The obstacle problem is one kind of free boundary problems. It allows to model, for example, an elastic membrane covering a solid obstacle. -In this case, the objective is to find an equilibrium position of this membrane constrained to be above the obstacle and which tends to minimize -its surface and/or its energy. The study of such problems occurs in many applications, for example: fluid mechanics, bio-mathematics (tumour growth -process) or financial mathematics (American or European option pricing). - -In this chapter, we focus on solutions of large obstacle problems defined in a three-dimensional domain. Particularly, the present study consists -in solving large nonlinear systems derived from the spatial discretization of these problems. Owing to the great size of such systems, in order to -reduce computation times, we proceed by solving them by parallel synchronous or asynchronous iterative algorithms. Moreover, we aim at harnessing -the computing power of GPUs to accelerate computations of these parallel algorithms. For this, we use an iterative method involving a projection -on a convex set, which is: the projected Richardson method. We choose this method among other iterative methods because it is easy to implement on -parallel computers and easy to adapt to GPU architectures. - -In Section~\ref{sec:02}, we present the mathematical model of obstacle problems then, in Section~\ref{sec:03}, we describe the general principle of -the parallel projected Richardson method. Next, in Section~\ref{sec:04}, we give the main key points of the parallel implementation of both synchronous -and asynchronous algorithms of the projected Richardson method on a GPU cluster. In Section~\ref{sec:05}, we present the performances of both parallel -algorithms obtained from simulations carried out on a CPU and GPU clusters. Finally, in Section~\ref{sec:06}, we use the read-black ordering technique -to improve the convergence and, thus, the execution times of the parallel projected Richardson algorithms on the GPU cluster. +\label{ch13:sec:01} +The obstacle problem is one kind of free boundary problems. It allows to model, +for example, an elastic membrane covering a solid obstacle. In this case, the +objective is to find an equilibrium position of this membrane constrained to be +above the obstacle and which tends to minimize its surface and/or its energy. +The study of such problems occurs in many applications, for example: fluid mechanics, +bio-mathematics (tumor growth process) or financial mathematics (American or +European option pricing). + +In this chapter, we focus on solutions of large obstacle problems defined in a +three-dimensional domain. Particularly, the present study consists in solving +large nonlinear systems derived from the spatial discretization of these problems. +Owing to the great size of such systems, in order to reduce computation times, +we proceed by solving them by parallel synchronous or asynchronous iterative +algorithms. Moreover, we aim at harnessing the computing power of GPUs to accelerate +computations of these parallel algorithms. For this, we use an iterative method +involving a projection on a convex set, which is: the projected Richardson method. +We choose this method among other iterative methods because it is easy to implement +on parallel computers and easy to adapt to GPU architectures. + +In Section~\ref{ch13:sec:02}, we present the mathematical model of obstacle problems +then, in Section~\ref{ch13:sec:03}, we describe the general principle of the parallel +projected Richardson method. Next, in Section~\ref{ch13:sec:04}, we give the main key +points of the parallel implementation of both synchronous and asynchronous algorithms +of the projected Richardson method on a GPU cluster. In Section~\ref{ch13:sec:05}, we +present the performances of both parallel algorithms obtained from simulations carried +out on a CPU and GPU clusters. Finally, in Section~\ref{ch13:sec:06}, we use the read-black +ordering technique to improve the convergence and, thus, the execution times of the parallel +projected Richardson algorithms on the GPU cluster. + %%--------------------------%% %% SECTION 2 %% %%--------------------------%% \section{Obstacle problems} -\label{sec:02} -In this section, we present the mathematical model of obstacle problems defined in a three-dimensional domain. -This model is based on that presented in~\cite{ref1}. +\label{ch13:sec:02} +In this section, we present the mathematical model of obstacle problems defined in a +three-dimensional domain. This model is based on that presented in~\cite{ch13:ref1}. + %%******************* %%******************* \subsection{Mathematical model} -\label{sec:02.01} -An obstacle problem, arising for example in mechanics or financial derivatives, consists in solving a time dependent -nonlinear equation: +\label{ch13:sec:02.01} +An obstacle problem\index{Obstacle~problem}, arising for example in mechanics or financial +derivatives, consists in solving a time dependent nonlinear equation\index{Nonlinear}: \begin{equation} \left\{ \begin{array}{l} @@ -56,18 +78,21 @@ u(0,x,y,z)=u_0(x,y,z),\\ \mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega, \end{array} \right. -\label{eq:01} +\label{ch13:eq:01} \end{equation} -where $u_0$ is the initial condition, $c\geq 0$, $b$ and $\eta$ are physical parameters, $T$ is the final time, $u=u(t,x,y,z)$ -is an element of the solution vector $U$ to compute, $f$ is the right-hand right that could represent, for example, the external -forces, B.C. describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$, $\phi$ models a constraint -imposed to $u$, $\Delta$ is the Laplacian operator, $\nabla$ is the gradient operator, a.e.w. means almost every where and ``.'' -defines the products between two scalars, a scalar and a vector or a matrix and a vector. In practice the boundary condition, -generally considered, is the Dirichlet condition (where $u$ is fixed on $\partial\Omega$) or the Neumann condition (where the -normal derivative of $u$ is fixed on $\partial\Omega$). - -The time dependent equation~(\ref{eq:01}) is numerically solved by considering an implicit or a semi-implicit time marching, -where at each time step $k$ a stationary nonlinear problem is solved: +where $u_0$ is the initial condition, $c\geq 0$, $b$ and $\eta$ are physical parameters, +$T$ is the final time, $u=u(t,x,y,z)$ is an element of the solution vector $U$ to compute, +$f$ is the right-hand side that could represent, for example, the external forces, B.C. +describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$, +$\phi$ models a constraint imposed to $u$, $\Delta$ is the Laplacian operator, $\nabla$ +is the gradient operator, a.e.w. means almost every where and ``.'' defines the products +between two scalars, a scalar and a vector or a matrix and a vector. In practice the boundary +condition, generally considered, is the Dirichlet condition (where $u$ is fixed on $\partial\Omega$) +or the Neumann condition (where the normal derivative of $u$ is fixed on $\partial\Omega$). + +The time dependent equation~(\ref{ch13:eq:01}) is numerically solved by considering an +implicit or a semi-implicit time marching, where at each time step $k$ a stationary nonlinear +problem\index{Nonlinear} is solved: \begin{equation} \left\{ \begin{array}{l} @@ -76,42 +101,52 @@ b^t.\nabla u-\eta.\Delta u+(c+\delta).u-g\geq 0\mbox{,~}u\geq\phi\mbox{,~a.e.w. \mbox{B.C. on~}u(t,x,y,z)\mbox{~defined on~}\partial\Omega, \end{array} \right. -\label{eq:02} +\label{ch13:eq:02} \end{equation} -where $\delta=\frac{1}{k}$ is the inverse of the time step $k$, $g=f+\delta u^{prev}$ and $u^{prev}$ is the solution computed at the -previous time step. +where $\delta=\frac{1}{k}$ is the inverse of the time step $k$, $g=f+\delta u^{prev}$ and +$u^{prev}$ is the solution computed at the previous time step. %%******************* %%******************* \subsection{Discretization} -\label{sec:02.02} -First, we note that the spatial discretization of the previous stationary problem~(\ref{eq:02}) does not provide a symmetric matrix, -because the convection-diffusion operator is not self-adjoint. Moreover, the fact that the operator is self-adjoint or not plays an -important role in the choice of the appropriate algorithm for solving nonlinear systems derived from the discretization of the obstacle -problem. Nevertheless, since the convection coefficients arising in the operator~(\ref{eq:02}) are constant, we can formulate the same -problem by an self-adjoint operator by performing a classical change of variables. Then, we can replace the stationary convection-diffusion -problem: +\label{ch13:sec:02.02} +First, we note that the spatial discretization of the previous stationary +problem~(\ref{ch13:eq:02}) does not provide a symmetric matrix, because the +convection-diffusion operator is not self-adjoint. Moreover, the fact that +the operator is self-adjoint or not plays an important role in the choice +of the appropriate algorithm for solving nonlinear systems derived from the +discretization of the obstacle problem\index{Obstacle~problem}. Nevertheless, +since the convection coefficients arising in the operator~(\ref{ch13:eq:02}) +are constant, we can formulate the same problem by an self-adjoint operator +by performing a classical change of variables. Then, we can replace the stationary +convection-diffusion problem: \begin{equation} b^{t}.\nabla v-\eta.\Delta v+(c+\delta).v=g\mbox{,~a.e.w. in~}\lbrack 0,T\rbrack\times\Omega\mbox{,~}c\geq 0\mbox{,~}\delta\geq~0, -\label{eq:03} +\label{ch13:eq:03} \end{equation} by the following stationary diffusion operator: \begin{equation} -\eta.\Delta u+(\frac{\|b\|^{2}_{2}}{4\eta}+c+\delta).u=e^{-a}g=f, -\label{eq:04} +\label{ch13:eq:04} \end{equation} -where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$ and $v=e^{-a}.u$ represents the general change of variables -such that $a=\frac{b^{t}(x,y,z)}{2\eta}$. Consequently, the numerical resolution of the diffusion problem (the self-adjoint operator~(\ref{eq:04})) -is done by optimization algorithms, in contrast, that of the convection-diffusion problem (non self-adjoint operator~(\ref{eq:03})) is -done by relaxation algorithms. In the case of our studied algorithm, the convergence is ensured by M-matrix property then, the performance -is linked to the magnitude of the spectral radius of the iteration matrix, which is independent of the condition number. - -Next, the three-dimensional domain $\Omega\subset\mathbb{R}^{3}$ is set to $\Omega=\lbrack 0,1\rbrack^{3}$ and discretized with an uniform -Cartesian mesh constituted by $M=m^3$ discretization points, where $m$ related to the spatial discretization step by $h=\frac{1}{m+1}$. This -is carried out by using a classical order 2 finite difference approximation of the Laplacian. So, the complete discretization of both stationary -boundary value problems~(\ref{eq:03}) and~(\ref{eq:04}) leads to the solution of a large discrete complementary problem of the following -form, when both Dirichlet or Neumann boundary conditions are used: +where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$ and +$v=e^{-a}.u$ represents the general change of variables such that $a=\frac{b^{t}(x,y,z)}{2\eta}$. +Consequently, the numerical resolution of the diffusion problem (the self-adjoint +operator~(\ref{ch13:eq:04})) is done by optimization algorithms, in contrast, that +of the convection-diffusion problem (non self-adjoint operator~(\ref{ch13:eq:03})) +is done by relaxation algorithms. In the case of our studied algorithm, the convergence\index{Convergence} +is ensured by M-matrix property then, the performance is linked to the magnitude of +the spectral radius of the iteration matrix, which is independent of the condition +number. + +Next, the three-dimensional domain $\Omega\subset\mathbb{R}^{3}$ is set to $\Omega=\lbrack 0,1\rbrack^{3}$ +and discretized with an uniform Cartesian mesh constituted by $M=m^3$ discretization +points, where $m$ related to the spatial discretization step by $h=\frac{1}{m+1}$. This +is carried out by using a classical order 2 finite difference approximation of the Laplacian. +So, the complete discretization of both stationary boundary value problems~(\ref{ch13:eq:03}) +and~(\ref{ch13:eq:04}) leads to the solution of a large discrete complementary problem +of the following form, when both Dirichlet or Neumann boundary conditions are used: \begin{equation} \left\{ \begin{array}{l} @@ -120,30 +155,38 @@ form, when both Dirichlet or Neumann boundary conditions are used: ((A+\delta I)U^{*}-G)^{T}(U^{*}-\bar{\Phi})=0,\\ \end{array} \right. -\label{eq:05} +\label{ch13:eq:05} \end{equation} -where $A$ is a matrix obtained after the spatial discretization by a finite difference method, $G$ is derived from the Euler first order implicit time -marching scheme and from the discretized right-hand side of the obstacle problem, $\delta$ is the inverse of the time step $k$ and $I$ is the identity -matrix. The matrix $A$ is symmetric when the self-adjoint operator is considered and nonsymmetric otherwise. +where $A$ is a matrix obtained after the spatial discretization by a finite difference +method, $G$ is derived from the Euler first order implicit time marching scheme and from +the discretized right-hand side of the obstacle problem, $\delta$ is the inverse of the +time step $k$ and $I$ is the identity matrix. The matrix $A$ is symmetric when the self-adjoint +operator is considered and nonsymmetric otherwise. -According to the chosen discretization scheme of the Laplacian, $A$ is an M-matrix (irreducibly diagonal dominant, see~\cite{ref2}) and, consequently, -the matrix $(A+\delta I)$ is also an M-matrix. This property is important to the convergence of iterative methods. +According to the chosen discretization scheme of the Laplacian, $A$ is an M-matrix (irreducibly +diagonal dominant, see~\cite{ch13:ref2}) and, consequently, the matrix $(A+\delta I)$ is also +an M-matrix. This property is important to the convergence of iterative methods. %%--------------------------%% %% SECTION 3 %% %%--------------------------%% \section{Parallel iterative method} -\label{sec:03} -Owing to the large size of the previous discrete complementary problem~(\ref{eq:05}), we will solve it by parallel synchronous or asynchronous iterative -algorithms (see~\cite{ref3,ref4,ref5}). In this chapter, we aim at harnessing the computing power of GPU clusters for solving these large nonlinear systems. -Then, we choose to use the projected Richardson iterative method for solving the diffusion problem~(\ref{eq:04}). Indeed, this method is based on the iterations -of the Jacobi method which are easy to parallelize on parallel computers and easy to adapt to GPU architectures. Then, according to the boundary value problem -formulation with a self-adjoint operator~(\ref{eq:04}), we can consider here the equivalent optimization problem and the fixed point mapping associated to -its solution. - -Assume that $E=\mathbb{R}^{M}$ is a Hilbert space, in which $\scalprod{.}{.}$ is the scalar product and $\|.\|$ its associated norm. So, the general fixed -point problem to be solved is defined as follows: +\label{ch13:sec:03} +Owing to the large size of the previous discrete complementary problem~(\ref{ch13:eq:05}), +we will solve it by parallel synchronous or asynchronous iterative algorithms (see~\cite{ch13:ref3,ch13:ref4,ch13:ref5}). +In this chapter, we aim at harnessing the computing power of GPU clusters for solving these +large nonlinear systems\index{Nonlinear}. Then, we choose to use the projected Richardson +iterative method\index{Iterative~method!Projected~Richardson} for solving the diffusion +problem~(\ref{ch13:eq:04}). Indeed, this method is based on the iterations of the Jacobi +method\index{Iterative~method!Jacobi} which are easy to parallelize on parallel computers +and easy to adapt to GPU architectures. Then, according to the boundary value problem +formulation with a self-adjoint operator~(\ref{ch13:eq:04}), we can consider here the +equivalent optimization problem and the fixed point mapping associated to its solution. + +Assume that $E=\mathbb{R}^{M}$ is a Hilbert space\index{Hilbert~space}, in which $\scalprod{.}{.}$ +is the scalar product and $\|.\|$ its associated norm. So, the general fixed point problem +to be solved is defined as follows: \begin{equation} \left\{ \begin{array}{l} @@ -151,16 +194,17 @@ point problem to be solved is defined as follows: U^{*} = F(U^{*}), \\ \end{array} \right. -\label{eq:06} +\label{ch13:eq:06} \end{equation} where $U\mapsto F(U)$ is an application from $E$ to $E$. Let $K$ be a closed convex set defined by: \begin{equation} K = \{U | U \geq \Phi \mbox{~everywhere in~} E\}, -\label{eq:07} +\label{ch13:eq:07} \end{equation} -where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{eq:05}) is formulated as the following constrained optimization problem: +where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{ch13:eq:05})\index{Obstacle~problem} +is formulated as the following constrained optimization problem: \begin{equation} \left\{ \begin{array}{l} @@ -168,49 +212,62 @@ where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ \forall V \in K, J(U^{*}) \leq J(V), \\ \end{array} \right. -\label{eq:08} +\label{ch13:eq:08} \end{equation} where the cost function is given by: \begin{equation} J(U) = \frac{1}{2}\scalprod{\mathcal{A}.U}{U} - \scalprod{G}{U}, -\label{eq:09} +\label{ch13:eq:09} \end{equation} -in which $\scalprod{.}{.}$ denotes the scalar product in $E$, $\mathcal{A}=A+\delta I$ is a symmetric positive definite, $A$ is the discretization matrix -associated with the self-adjoint operator~(\ref{eq:04}) after change of variables. +in which $\scalprod{.}{.}$ denotes the scalar product in $E$, $\mathcal{A}=A+\delta I$ +is a symmetric positive definite, $A$ is the discretization matrix associated with the +self-adjoint operator~(\ref{ch13:eq:04}) after change of variables. -For any $U\in E$, let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$, $\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected -Richardson method is defined as follows: +For any $U\in E$, let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$, +$\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected Richardson method\index{Iterative~method!Projected~Richardson} +is defined as follows: \begin{equation} U^{*} = F_{\gamma}(U^{*}) = P_K(U^{*} - \gamma(\mathcal{A}.U^{*} - G)). -\label{eq:10} +\label{ch13:eq:10} \end{equation} -In order to reduce the computation time, the large optimization problem is solved in a numerical way by using a parallel asynchronous algorithm of the projected -Richardson method on the convex set $K$. Particularly, we will consider an asynchronous parallel adaptation of the projected Richardson method~\cite{ref6}. - -Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$ is a product of $\alpha$ subspaces $E_i$ -where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$, where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space in which $\scalprod{.}{.}_i$ -denotes the scalar product and $|.|_i$ the associated norm, for all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$ +In order to reduce the computation time, the large optimization problem is solved in a +numerical way by using a parallel asynchronous algorithm of the projected Richardson method +on the convex set $K$. Particularly, we will consider an asynchronous parallel adaptation +of the projected Richardson method~\cite{ch13:ref6}. + +Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$ +is a product of $\alpha$ subspaces $E_i$ where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$, +where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space\index{Hilbert~space} +in which $\scalprod{.}{.}_i$ denotes the scalar product and $|.|_i$ the associated norm, for +all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$ is the scalar product on $E$. -Let $U\in E$, we consider the following decomposition of $U$ and the corresponding decomposition of $F_\gamma$ into $\alpha$ blocks: +Let $U\in E$, we consider the following decomposition of $U$ and the corresponding decomposition +of $F_\gamma$ into $\alpha$ blocks: \begin{equation} \begin{array}{rcl} U & = & (U_1,\ldots,U_{\alpha}), \\ F_{\gamma}(U) & = & (F_{1,\gamma}(U),\ldots,F_{\alpha,\gamma}(U)). \\ \end{array} -\label{eq:11} +\label{ch13:eq:11} \end{equation} -Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$ and $K_i$ is a closed convex set. -Let also $G=(G_1,\ldots,G_{\alpha})\in E$ and, for any $U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$ on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$ -is the projector from $E_i$ onto $K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{eq:10}) can be written in the following way: +Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$ +and $K_i$ is a closed convex set. Let also $G=(G_1,\ldots,G_{\alpha})\in E$ and, for any +$U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$ +on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$ is the projector from $E_i$ onto +$K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{ch13:eq:10})\index{Iterative~method!Projected~Richardson} +can be written in the following way: \begin{equation} \forall U\in E\mbox{,~}\forall i\in\{1,\ldots,\alpha\}\mbox{,~}F_{i,\gamma}(U) = P_{K_i}(U_i - \gamma(\mathcal{A}_i.U - G_i)). -\label{eq:12} +\label{ch13:eq:12} \end{equation} -Note that $\displaystyle\mathcal{A}_i.U= \sum_{j=1}^{\alpha}\mathcal{A}_{i,j}.U_j$, where $\mathcal{A}_{i,j}$ denote block matrices of $\mathcal{A}$. +Note that $\displaystyle\mathcal{A}_i.U= \sum_{j=1}^{\alpha}\mathcal{A}_{i,j}.U_j$, where +$\mathcal{A}_{i,j}$ denote block matrices of $\mathcal{A}$. -The parallel asynchronous iterations of the projected Richardson method for solving the obstacle problem~(\ref{eq:08}) are defined as follows: let $U^0\in E,U^0=(U^0_1,\ldots,U^0_\alpha)$ be -the initial solution, then for all $p\in\mathbb{N}$, the iterate $U^{p+1}=(U^{p+1}_1,\ldots,U^{p+1}_{\alpha})$ is recursively defined by: +The parallel asynchronous iterations of the projected Richardson method for solving the +obstacle problem~(\ref{ch13:eq:08}) are defined as follows: let $U^0\in E,U^0=(U^0_1,\ldots,U^0_\alpha)$ +be the initial solution, then for all $p\in\mathbb{N}$, the iterate $U^{p+1}=(U^{p+1}_1,\ldots,U^{p+1}_{\alpha})$ +is recursively defined by: \begin{equation} U_i^{p+1} = \left\{ @@ -219,7 +276,7 @@ F_{i,\gamma}(U_1^{\rho_1(p)}, \ldots, U_{\alpha}^{\rho_{\alpha}(p)}) \mbox{~if~} U_i^p \mbox{~otherwise}, \\ \end{array} \right. -\label{eq:13} +\label{ch13:eq:13} \end{equation} where \begin{equation} @@ -229,7 +286,7 @@ where \forall i\in\{1,\ldots,\alpha\},\{p \ | \ i \in s(p)\}\mbox{~is denombrable}, \end{array} \right. -\label{eq:14} +\label{ch13:eq:14} \end{equation} and $\forall j\in\{1,\ldots,\alpha\}$, \begin{equation} @@ -239,62 +296,75 @@ and $\forall j\in\{1,\ldots,\alpha\}$, \displaystyle\lim_{p\to\infty}\rho_j(p) = +\infty.\\ \end{array} \right. -\label{eq:15} +\label{ch13:eq:15} \end{equation} -The previous asynchronous scheme of the projected Richardson method models computations that are carried out in parallel -without order nor synchronization (according to the behavior of the parallel iterative method) and describes a subdomain -method without overlapping. It is a general model that takes into account all possible situations of parallel computations -and non-blocking message passing. So, the synchronous iterative scheme is defined by: +The previous asynchronous scheme\index{Asynchronous} of the projected Richardson +method models computations that are carried out in parallel without order nor +synchronization (according to the behavior of the parallel iterative method) and +describes a subdomain method without overlapping. It is a general model that takes +into account all possible situations of parallel computations and nonblocking message +passing. So, the synchronous iterative scheme\index{Synchronous} is defined by: \begin{equation} \forall j\in\{1,\ldots,\alpha\} \mbox{,~} \forall p\in\mathbb{N} \mbox{,~} \rho_j(p)=p. -\label{eq:16} +\label{ch13:eq:16} \end{equation} -The values of $s(p)$ and $\rho_j(p)$ are defined dynamically and not explicitly by the parallel asynchronous or synchronous -execution of the algorithm. Particularly, it enables one to consider distributed computations whereby processors compute at -their own pace according to their intrinsic characteristics and computational load. The parallelism between the processors is -well described by the set $s(p)$ which contains at each step $p$ the index of the components relaxed by each processor on a -parallel way while the use of delayed components in~(\ref{eq:13}) permits one to model nondeterministic behavior and does not -imply inefficiency of the considered distributed scheme of computation. Note that, according to~\cite{ref7}, theoretically, -each component of the vector must be relaxed an infinity of time. The choice of the relaxed components to be used in the -computational process may be guided by any criterion and, in particular, a natural criterion is to pick-up the most recently -available values of the components computed by the processors. Furthermore, the asynchronous iterations are implemented by -means of non-blocking MPI communication subroutines (asynchronous communications). - -The important property ensuring the convergence of the parallel projected Richardson method, both synchronous and asynchronous -algorithms, is the fact that $\mathcal{A}$ is an M-matrix. Moreover, the convergence proceeds from a result of~\cite{ref6}. -Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$, the parallel iterations~(\ref{eq:13}), -(\ref{eq:14}) and~(\ref{eq:15}), associated to the fixed point mapping $F_\gamma$~(\ref{eq:12}), converge to the unique solution -$U^{*}$ of the discretized problem. +The values of $s(p)$ and $\rho_j(p)$ are defined dynamically and not explicitly by +the parallel asynchronous or synchronous execution of the algorithm. Particularly, +it enables one to consider distributed computations whereby processors compute at +their own pace according to their intrinsic characteristics and computational load. +The parallelism between the processors is well described by the set $s(p)$ which +contains at each step $p$ the index of the components relaxed by each processor on +a parallel way while the use of delayed components in~(\ref{ch13:eq:13}) permits one +to model nondeterministic behavior and does not imply inefficiency of the considered +distributed scheme of computation. Note that, according to~\cite{ch13:ref7}, theoretically, +each component of the vector must be relaxed an infinity of time. The choice of the +relaxed components to be used in the computational process may be guided by any criterion +and, in particular, a natural criterion is to pick-up the most recently available +values of the components computed by the processors. Furthermore, the asynchronous +iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI~subroutines!Nonblocking} +(asynchronous communications). + +The important property ensuring the convergence of the parallel projected Richardson +method, both synchronous and asynchronous algorithms, is the fact that $\mathcal{A}$ +is an M-matrix. Moreover, the convergence\index{Convergence} proceeds from a result +of~\cite{ch13:ref6}. Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$, +the parallel iterations~(\ref{ch13:eq:13}), (\ref{ch13:eq:14}) and~(\ref{ch13:eq:15}), +associated to the fixed point mapping $F_\gamma$~(\ref{ch13:eq:12}), converge to the +unique solution $U^{*}$ of the discretized problem. %%--------------------------%% %% SECTION 4 %% %%--------------------------%% \section{Parallel implementation on a GPU cluster} -\label{sec:04} -In this section, we give the main key points of the parallel implementation of the projected Richardson method, both synchronous -and asynchronous versions, on a GPU cluster, for solving the nonlinear systems derived from the discretization of large obstacle -problems. More precisely, each nonlinear system is solved iteratively using the whole cluster. We use a heteregeneous CUDA/MPI -programming. Indeed, the communication of data, at each iteration between the GPU computing nodes, can be either synchronous -or asynchronous using the MPI communication subroutines, whereas inside each GPU node, a CUDA parallelization is performed. +\label{ch13:sec:04} +In this section, we give the main key points of the parallel implementation of the +projected Richardson method, both synchronous and asynchronous versions, on a GPU +cluster, for solving the nonlinear systems derived from the discretization of large +obstacle problems. More precisely, each nonlinear system is solved iteratively using +the whole cluster. We use a heterogeneous CUDA/MPI programming. Indeed, the communication +of data, at each iteration between the GPU computing nodes, can be either synchronous +or asynchronous using the MPI communication subroutines, whereas inside each GPU node, +a CUDA parallelization is performed. \begin{figure}[!h] \centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitCPU}} \caption{Data partitioning of a problem to be solved among $S=3\times 4$ computing nodes.} -\label{fig:01} +\label{ch13:fig:01} \end{figure} -Let $S$ denote the number of computing nodes on the GPU cluster, where a computing node is composed of CPU core holding one MPI -process and a GPU card. So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$ is split into $S$ -parallelepipedic sub-problems, each for a node (MPI process, GPU), as is shown in Figure~\ref{fig:01}. Indeed, the $NY$ and $NZ$ -dimensions (according to the $y$ and $z$ axises) of the three-dimensional problem are, respectively, split into $Sy$ and $Sz$ parts, -such that $S=Sy\times Sz$. In this case, each computing node has at most four neighboring nodes. This kind of the data partitioning -reduces the data exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning. +Let $S$ denote the number of computing nodes\index{Computing~node} on the GPU cluster, +where a computing node is composed of CPU core holding one MPI process and a GPU card. +So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$ +is split into $S$ parallelepipedic sub-problems, each for a node (MPI process, GPU), as +is shown in Figure~\ref{ch13:fig:01}. Indeed, the $NY$ and $NZ$ dimensions (according +to the $y$ and $z$ axises) of the three-dimensional problem are, respectively, split +into $Sy$ and $Sz$ parts, such that $S=Sy\times Sz$. In this case, each computing node +has at most four neighboring nodes. This kind of the data partitioning reduces the data +exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning. \begin{algorithm}[!t] -%\SetLine -%\linesnumbered Initialization of the parameters of the sub-problem\; Allocate and fill the data in the global memory GPU\; \For{$i=1$ {\bf to} $NbSteps$}{ @@ -304,30 +374,34 @@ Allocate and fill the data in the global memory GPU\; } Copy the solution $U$ back from GPU memory\; \caption{Parallel solving of the obstacle problem on a GPU cluster} -\label{alg:01} +\label{ch13:alg:01} \end{algorithm} -All the computing nodes of the GPU cluster execute in parallel the same Algorithm~\ref{alg:01} but on different three-dimensional -sub-problems of size $(NX\times ny\times nz)$. This algorithm gives the main key points for solving an obstacle problem defined in -a three-dimensional domain, where $A$ is the discretization matrix, $G$ is the right-hand side and $U$ is the solution vector. After -the initialization step, all the data generated from the partitioning operation are copied from the CPU memories to the GPU global -memories, to be processed on the GPUs. Next, the algorithm uses $NbSteps$ time steps to solve the global obstacle problem. In fact, -it uses a parallel algorithm adapted to GPUs of the projected Richardson iterative method for solving the nonlinear systems of the -obstacle problem. This function is defined by {\it Solve()} in Algorithm~\ref{alg:01}. At every time step, the initial guess $U^0$ -for the iterative algorithm is set to the solution found at the previous time step. Moreover, the right-hand side $G$ is computed -as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step, $U^{prev}$ is the solution computed in the previous time -step and each element $f(x, y, z)$ of the vector $F$ is computed as follows: +All the computing nodes of the GPU cluster execute in parallel the same Algorithm~\ref{ch13:alg:01} +but on different three-dimensional sub-problems of size $(NX\times ny\times nz)$. +This algorithm gives the main key points for solving an obstacle problem\index{Obstacle~problem} +defined in a three-dimensional domain, where $A$ is the discretization matrix, $G$ +is the right-hand side and $U$ is the solution vector. After the initialization step, +all the data generated from the partitioning operation are copied from the CPU memories +to the GPU global memories, to be processed on the GPUs. Next, the algorithm uses $NbSteps$ +time steps to solve the global obstacle problem. In fact, it uses a parallel algorithm +adapted to GPUs of the projected Richardson iterative method for solving the nonlinear +systems\index{Nonlinear} of the obstacle problem. This function is defined by {\it Solve()} +in Algorithm~\ref{ch13:alg:01}. At every time step, the initial guess $U^0$ for the iterative +algorithm is set to the solution found at the previous time step. Moreover, the right-hand +side $G$ is computed as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step, +$U^{prev}$ is the solution computed in the previous time step and each element $f(x, y, z)$ +of the vector $F$ is computed as follows: \begin{equation} f(x,y,z)=\cos(2\pi x)\cdot\cos(4\pi y)\cdot\cos(6\pi z). -\label{eq:18} +\label{ch13:eq:18} \end{equation} -Finally, the solution $U$ of the obstacle problem is copied back from the GPU global memories to the CPU memories. We use the -communication subroutines of the CUBLAS library~\cite{ref8} (CUDA Basic Linear Algebra Subroutines) for the memory allocations in -the GPU (\verb+cublasAlloc()+) and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and \verb+cublasGetVector()+. +Finally, the solution $U$ of the obstacle problem is copied back from the GPU global +memories to the CPU memories. We use the communication subroutines of the CUBLAS library~\cite{ch13:ref8}\index{CUBLAS} +(CUDA Basic Linear Algebra Subroutines) for the memory allocations in the GPU (\verb+cublasAlloc()+) +and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and \verb+cublasGetVector()+. \begin{algorithm}[!t] -% \SetLine -% \linesnumbered $p = 0$\; $conv = false$\; $U = U^{0}$\; @@ -341,43 +415,55 @@ the GPU (\verb+cublasAlloc()+) and the data transfers between the CPU and its GP $conv$ = Convergence($error$, $p$, $\varepsilon$, $MaxRelax$)\; } \caption{Parallel iterative solving of the nonlinear systems on a GPU cluster ($Solve()$ function)} -\label{alg:02} +\label{ch13:alg:02} \end{algorithm} -As many other iterative methods, the algorithm of the projected Richardson method is based on algebraic functions operating on vectors -and/or matrices, which are more efficient on parallel computers when they work on large vectors. Its parallel implementation on the GPU -cluster is carried out so that the GPUs execute the vector operations as kernels and the CPUs execute the serial codes, supervise the -kernel executions and the data exchanges with the neighboring nodes and supply the GPUs with data. Algorithm~\ref{alg:02} shows the -main key points of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{alg:01}). All the vector operations inside -the main loop ({\bf repeat} ... {\bf until}) are executed by the GPU. We use the following functions of the CUBLAS library: +As many other iterative methods, the algorithm of the projected Richardson +method\index{Iterative~method!Projected~Richardson} is based on algebraic +functions operating on vectors and/or matrices, which are more efficient on +parallel computers when they work on large vectors. Its parallel implementation +on the GPU cluster is carried out so that the GPUs execute the vector operations +as kernels and the CPUs execute the serial codes, supervise the kernel executions +and the data exchanges with the neighboring nodes\index{Neighboring~node} and +supply the GPUs with data. Algorithm~\ref{ch13:alg:02} shows the main key points +of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{ch13:alg:01}). +All the vector operations inside the main loop ({\bf repeat} ... {\bf until}) +are executed by the GPU. We use the following functions of the CUBLAS library\index{CUBLAS}: \begin{itemize} \item \verb+cublasDaxpy()+ to compute the difference between the solution vectors $U^{p}$ and $U^{p+1}$ computed in two successive relaxations -$p$ and $p+1$ (line~$7$ in Algorithm~\ref{alg:02}), +$p$ and $p+1$ (line~$7$ in Algorithm~\ref{ch13:alg:02}), \item \verb+cublasDnrm2()+ to perform the Euclidean norm (line~$8$) and, \item \verb+cublasDcpy()+ for the data copy of a vector to another one in the GPU memory (lines~$3$ and~$9$). \end{itemize} -The dimensions of the grid and blocks of threads that execute a given kernel depend on the resources of the GPU multiprocessor and the -resource requirements of the kernel. So, if $block$ defines the size of a thread block, which must not exceed the maximum size of a thread -block, then the number of thread blocks in the grid, denoted by $grid$, can be computed according to the size of the local sub-problem -as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\] However, when solving very large problems, the size of the thread -grid can exceed the maximum number of thread blocks that can be executed on the GPUs (up-to $65.535$ thread blocks) and, thus, the kernel -will fail to launch. Therefore, for each kernel, we decompose the three-dimensional sub-problem into $nz$ two-dimensional slices of size -($NX\times ny$), as is shown in Figure~\ref{fig:02}. All slices of the same kernel are executed using {\bf for} loop by $NX\times ny$ parallel -threads organized in a two-dimensional grid of two-dimensional thread blocks, as is shown in Listing~\ref{list:01}. Each thread is in charge -of $nz$ discretization points (one from each slice), accessed in the GPU memory with a constant stride $(NX\times ny)$. +The dimensions of the grid and blocks of threads that execute a given kernel +depend on the resources of the GPU multiprocessor and the resource requirements +of the kernel. So, if $block$ defines the size of a thread block, which must +not exceed the maximum size of a thread block, then the number of thread blocks +in the grid, denoted by $grid$, can be computed according to the size of the +local sub-problem as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\] +However, when solving very large problems, the size of the thread grid can exceed +the maximum number of thread blocks that can be executed on the GPUs (up-to $65.535$ +thread blocks) and, thus, the kernel will fail to launch. Therefore, for each kernel, +we decompose the three-dimensional sub-problem into $nz$ two-dimensional slices of size +($NX\times ny$), as is shown in Figure~\ref{ch13:fig:02}. All slices of the same kernel +are executed using {\bf for} loop by $NX\times ny$ parallel threads organized in a +two-dimensional grid of two-dimensional thread blocks, as is shown in Listing~\ref{ch13:list:01}. +Each thread is in charge of $nz$ discretization points (one from each slice), accessed +in the GPU memory with a constant stride $(NX\times ny)$. \begin{figure} \centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitGPU}} \caption{Decomposition of a sub-problem in a GPU into $nz$ slices.} -\label{fig:02} +\label{ch13:fig:02} \end{figure} \begin{center} -\lstinputlisting[label=list:01,caption=Skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu} +\lstinputlisting[label=ch13:list:01,caption=Skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu} \end{center} -The function $Determine\_Bordering\_Vector\_Elements()$ (line~$5$ in Algorithm~\ref{alg:02}) determines the values of the vector -elements shared at the boundaries with neighboring computing nodes. Its main operations are defined as follows: +The function $Determine\_Bordering\_Vector\_Elements()$ (line~$5$ in Algorithm~\ref{ch13:alg:02}) +determines the values of the vector elements shared at the boundaries with neighboring computing +nodes. Its main operations are defined as follows: \begin{enumerate} \item define the values associated to the bordering points needed by the neighbors, \item copy the values associated to the bordering points from the GPU to the CPU, @@ -389,21 +475,30 @@ The first operation of this function is implemented as kernels to be performed b \item a kernel executed by $NX\times nz$ threads to define the values associated to the bordering vector elements along $y$-axis and, \item a kernel executed by $NX\times ny$ threads to define the values associated to the bordering vector elements along $z$-axis. \end{itemize} -As mentioned before, we develop the \emph{synchronous} and \emph{asynchronous} algorithms of the projected Richardson method. Obviously, -in this scope, the synchronous or asynchronous communications refer to the communications between the CPU cores (MPI processes) on the -GPU cluster, in order to exchange the vector elements associated to subdomain boundaries. For the memory copies between a CPU core and -its GPU, we use the synchronous communication routines of the CUBLAS library: \verb+cublasSetVector()+ and \verb+cublasGetVector()+ -in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsync()+ and \verb+cublasGetVectorAsync()+ in the -asynchronous algorithm. Moreover, we use the communication routines of the MPI library to carry out the data exchanges between the neighboring -nodes. We use the following communication routines: \verb+MPI_Isend()+ and \verb+MPI_Irecv()+ to perform non-blocking sends and receptions, -respectively. For the synchronous algorithm, we use the MPI routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node -in blocking status until all data exchanges with neighboring nodes (sends and receptions) are completed. In contrast, for the asynchronous -algorithms, we use the MPI routine \verb+MPI_Test()+ which tests the completion of a data exchange (send or reception) without putting the -MPI process in blocking status. - -The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{alg:02}) computes, at each iteration, the new elements -of the iterate vector $U$. Its general code is presented in Listing~\ref{list:01} (CPU function). The iterations of the projected -Richardson method, based on those of the Jacobi method, are defined as follows: +As mentioned before, we develop the \emph{synchronous} and \emph{asynchronous} +algorithms of the projected Richardson method. Obviously, in this scope, the +synchronous\index{Synchronous} or asynchronous\index{Asynchronous} communications +refer to the communications between the CPU cores (MPI processes) on the GPU cluster, +in order to exchange the vector elements associated to subdomain boundaries. For +the memory copies between a CPU core and its GPU, we use the synchronous communication +routines of the CUBLAS library\index{CUBLAS}: \verb+cublasSetVector()+ and \verb+cublasGetVector()+ +in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsync()+ +and \verb+cublasGetVectorAsync()+ in the asynchronous algorithm. Moreover, we +use the communication routines of the MPI library to carry out the data exchanges +between the neighboring nodes. We use the following communication routines: \verb+MPI_Isend()+ +and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI~subroutines!Nonblocking} +sends and receptions, respectively. For the synchronous algorithm, we use the MPI +routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node in +blocking status until all data exchanges with neighboring nodes (sends and receptions) +are completed. In contrast, for the asynchronous algorithms, we use the MPI routine +\verb+MPI_Test()+ which tests the completion of a data exchange (send or reception) +without putting the MPI process in blocking status\index{MPI~subroutines!Blocking}. + +The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{ch13:alg:02}) +computes, at each iteration, the new elements of the iterate vector $U$. Its general code +is presented in Listing~\ref{ch13:list:01} (CPU function). The iterations of the projected +Richardson method\index{Iterative~method!Projected~Richardson}, based on those of the Jacobi +method\index{Iterative~method!Jacobi}, are defined as follows: \begin{equation} \begin{array}{ll} u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\ @@ -411,64 +506,88 @@ u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\ & South\cdot u^{p}(x,y-h,z) + North\cdot u^{p}(x,y+h,z) + \\ & Rear\cdot u^{p}(x,y,z-h) + Front\cdot u^{p}(x,y,z+h))), \end{array} -\label{eq:17} +\label{ch13:eq:17} \end{equation} -where $u^{p}(x,y,z)$ is an element of the iterate vector $U$ computed at the iteration $p$ and $g(x,y,z)$ is a vector element of the -right-hand side $G$. The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$ and $Front$ define constant coefficients of the -block matrix $A$. Figure~\ref{fig:03} shows the positions of these coefficients in a three-dimensional domain. +where $u^{p}(x,y,z)$ is an element of the iterate vector $U$ computed at the +iteration $p$ and $g(x,y,z)$ is a vector element of the right-hand side $G$. +The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$ and $Front$ +define constant coefficients of the block matrix $A$. Figure~\ref{ch13:fig:03} +shows the positions of these coefficients in a three-dimensional domain. \begin{figure} \centerline{\includegraphics[scale=0.35]{Chapters/chapter13/figures/matrix}} \caption{Matrix constant coefficients in a three-dimensional domain.} -\label{fig:03} +\label{ch13:fig:03} \end{figure} -The kernel implementations of the projected Richardson method on GPUs uses a perfect fine-grain multithreading parallelism. Since the -projected Richardson algorithm is implemented as a fixed point method, each kernel is executed by a large number of GPU threads such -that each thread is in charge of the computation of one element of the iterate vector $U$. Moreover, this method uses the vector elements -updates of the Jacobi method, which means that each thread $i$ computes the new value of its element $u_{i}^{p+1}$ independently of the -new values $u_{j}^{p+1}$, where $j\neq i$, of those computed in parallel by other threads at the same iteration $p+1$. Listing~\ref{list:02} -shows the GPU implementations of the main kernels of the projected Richardson method, which are: the matrix-vector multiplication -(\verb+MV_Multiplication()+) and the vector elements updates (\verb+Vector_Updates()+). The codes of these kernels are based on -that presented in Listing~\ref{list:01}. - -\lstinputlisting[label=list:02,caption=GPU kernels of the projected Richardson method]{Chapters/chapter13/ex2.cu} +The kernel implementations of the projected Richardson method on GPUs uses a +perfect fine-grain multithreading parallelism. Since the projected Richardson +algorithm is implemented as a fixed point method, each kernel is executed by +a large number of GPU threads such that each thread is in charge of the computation +of one element of the iterate vector $U$. Moreover, this method uses the vector +elements updates of the Jacobi method, which means that each thread $i$ computes +the new value of its element $u_{i}^{p+1}$ independently of the new values $u_{j}^{p+1}$, +where $j\neq i$, of those computed in parallel by other threads at the same iteration +$p+1$. Listing~\ref{ch13:list:02} shows the GPU implementations of the main kernels +of the projected Richardson method, which are: the matrix-vector multiplication +(\verb+MV_Multiplication()+) and the vector elements updates (\verb+Vector_Updates()+). +The codes of these kernels are based on that presented in Listing~\ref{ch13:list:01}. + +\lstinputlisting[label=ch13:list:02,caption=GPU kernels of the projected Richardson method]{Chapters/chapter13/ex2.cu} \begin{figure} \centerline{\includegraphics[scale=0.3]{Chapters/chapter13/figures/points3D}} \caption{Computation of a vector element with the projected Richardson method.} -\label{fig:04} +\label{ch13:fig:04} \end{figure} -Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices of $(NX\times ny)$ vector elements are computed in -a {\bf for} loop. In this case, each thread is in charge of one vector element from each slice (in total $nz$ vector elements -along $z$-axis). We can notice from the formula~(\ref{eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$, by -a thread at iteration $p+1$, requires seven vector elements computed at the previous iteration $p$: two vector elements in -each dimension plus the vector element at the intersection of the three axises $x$, $y$ and $z$ (see Figure~\ref{fig:04}). -So, to reduce the memory accesses to the high-latency global memory, the vector elements of the current slice can be stored -in the low-latency shared memories of thread blocks, as is described in~\cite{ref9}. Nevertheless, the fact that the computation -of a vector element requires only two elements in each dimension does not allow to maximize the data reuse from the shared memories. -The computation of a slice involves in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block, to fill the -required vector elements in the shared memory where $bx$ and $by$ are the dimensions of a thread block. Then, in order to optimize -the memory accesses on GPUs, the elements of the iterate vector $U$ are filled in the cache texture memory (see~\cite{ref10}). -In new GPU generations as Fermi or Kepler, the global memory accesses are always cached in L1 and L2 caches. For example, for -a given kernel, we can favour the use of the L1 cache to that of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+. -So, the initial access to the global memory loads the vector elements required by the threads of a block into the cache memory -(texture or L1/L2 caches). Then, all the following memory accesses read from this cache memory. In Listing~\ref{list:02}, the -function \verb+fetch_double(v,i)+ is used to read from the texture memory the $i^{th}$ element of the double-precision vector -\verb+v+ (see Listing~\ref{list:03}). Moreover, the seven constant coefficients of matrix $A$ can be stored in the constant memory -but, since they are reused $nz$ times by each thread, it is more interesting to fill them on the low-latency registers of each thread. - -\lstinputlisting[label=list:03,caption=Memory access to the cache texture memory]{Chapters/chapter13/ex3.cu} - -The function $Convergence()$ (line~$11$ in Algorithm~\ref{alg:02}) allows to detect the convergence of the parallel iterative algorithm -and is based on the tolerance threshold $\varepsilon$ and the maximum number of relaxations $MaxRelax$. We take into account the number -of relaxations since that of iterations cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{eq:13}) of -a local iterate vector $U_i$ according to $F_i$. Then, counting the number of relaxations is possible in both synchronous and asynchronous -cases. On the other hand, an iteration is the update of at least all vector components with $F_i$. - -In the synchronous algorithm, the global convergence is detected when the maximal value of the absolute error, $error$, is sufficiently small -and/or the maximum number of relaxations, $MaxRelax$, is reached, as follows: +Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices +of $(NX\times ny)$ vector elements are computed in a {\bf for} loop. In +this case, each thread is in charge of one vector element from each slice +(in total $nz$ vector elements along $z$-axis). We can notice from the +formula~(\ref{ch13:eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$, +by a thread at iteration $p+1$, requires seven vector elements computed +at the previous iteration $p$: two vector elements in each dimension plus +the vector element at the intersection of the three axises $x$, $y$ and $z$ +(see Figure~\ref{ch13:fig:04}). So, to reduce the memory accesses to the +high-latency global memory, the vector elements of the current slice can +be stored in the low-latency shared memories of thread blocks, as is described +in~\cite{ch13:ref9}. Nevertheless, the fact that the computation of a vector +element requires only two elements in each dimension does not allow to maximize +the data reuse from the shared memories. The computation of a slice involves +in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block, +to fill the required vector elements in the shared memory where $bx$ and $by$ +are the dimensions of a thread block. Then, in order to optimize the memory +accesses on GPUs, the elements of the iterate vector $U$ are filled in the +cache texture memory (see~\cite{ch13:ref10}). In new GPU generations as Fermi +or Kepler, the global memory accesses are always cached in L1 and L2 caches. +For example, for a given kernel, we can favour the use of the L1 cache to that +of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+. +So, the initial access to the global memory loads the vector elements required +by the threads of a block into the cache memory (texture or L1/L2 caches). Then, +all the following memory accesses read from this cache memory. In Listing~\ref{ch13:list:02}, +the function \verb+fetch_double(v,i)+ is used to read from the texture memory +the $i^{th}$ element of the double-precision vector \verb+v+ (see Listing~\ref{ch13:list:03}). +Moreover, the seven constant coefficients of matrix $A$ can be stored in the +constant memory but, since they are reused $nz$ times by each thread, it is more +interesting to fill them on the low-latency registers of each thread. + +\lstinputlisting[label=ch13:list:03,caption=Memory access to the cache texture memory]{Chapters/chapter13/ex3.cu} + +The function $Convergence()$ (line~$11$ in Algorithm~\ref{ch13:alg:02}) allows +to detect the convergence of the parallel iterative algorithm and is based on +the tolerance threshold\index{Convergence!Tolerance~threshold} $\varepsilon$ +and the maximum number of relaxations\index{Convergence!Maximum~number~of~relaxations} +$MaxRelax$. We take into account the number of relaxations since that of iterations +cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{ch13:eq:13}) +of a local iterate vector $U_i$ according to $F_i$. Then, counting the number +of relaxations is possible in both synchronous and asynchronous cases. On the +other hand, an iteration is the update of at least all vector components with +$F_i$. + +In the synchronous\index{Synchronous} algorithm, the global convergence is detected +when the maximal value of the absolute error, $error$, is sufficiently small and/or +the maximum number of relaxations, $MaxRelax$, is reached, as follows: $$ \begin{array}{l} error=\|U^{p}-U^{p+1}\|_{2}; \\ @@ -477,52 +596,90 @@ AllReduce(error,\hspace{0.1cm}maxerror,\hspace{0.1cm}MAX); \\ conv \leftarrow true; \end{array} $$ -where the function $AllReduce()$ uses the MPI reduction subroutine \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the -local absolute errors, $error$, of all computing nodes and $p$ (in Algorithm~\ref{alg:02}) is used as a counter of the local relaxations carried -out by a computing node. In the asynchronous algorithms, the global convergence is detected when all computing nodes locally converge. For this, -we use a token ring architecture around which a boolean token travels, in one direction, from a computing node to another. Starting from node $0$, -the boolean token is set to $true$ by node $i$ if the local convergence is reached or to $false$ otherwise and, then, it is sent to node $i+1$. -Finally, the global convergence is detected when node $0$ receives from its neighbor node $S-1$, in the ring architecture, a token set to $true$. -In this case, node $0$ sends a stop message (end of parallel solving) to all computing nodes in the cluster. +where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI~subroutines!Global} +\verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local +absolute errors, $error$, of all computing nodes and $p$ (in Algorithm~\ref{ch13:alg:02}) +is used as a counter of the local relaxations carried out by a computing node. In +the asynchronous\index{Asynchronous} algorithms, the global convergence is detected +when all computing nodes locally converge. For this, we use a token ring architecture +around which a boolean token travels, in one direction, from a computing node to another. +Starting from node $0$, the boolean token is set to $true$ by node $i$ if the local +convergence is reached or to $false$ otherwise and, then, it is sent to node $i+1$. +Finally, the global convergence is detected when node $0$ receives from its neighbor +node $S-1$, in the ring architecture, a token set to $true$. In this case, node $0$ +sends a stop message (end of parallel solving) to all computing nodes in the cluster. %%--------------------------%% %% SECTION 5 %% %%--------------------------%% \section{Experimental tests on a GPU cluster} -\label{sec:05} -The GPU cluster of tests, that we used in this chapter, is an $20Gbps$ Infiniband network of six machines. Each machine is a Quad-Core Xeon -E5530 CPU running at $2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it is equipped with two Nvidia -Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores running at $1.3$GHz. It provides $4$GB of global memory with a memory bandwidth -of $102$GB/s, accessible by all its cores and also by the CPU through the PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. -Hence, the memory copy operations between the GPU and the CPU are about $12$ times slower than those of the Tesla GPU memory. We have performed -our simulations on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{fig:05} describes the components of the GPU cluster -of tests. +\label{ch13:sec:05} +The GPU cluster\index{GPU~cluster} of tests, that we used in this chapter, is an $20Gbps$ +Infiniband network of six machines. Each machine is a Quad-Core Xeon E5530 CPU running at +$2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it +is equipped with two Nvidia Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores +running at $1.3$GHz. It provides $4$GB of global memory with a memory bandwidth of $102$GB/s, +accessible by all its cores and also by the CPU through the PCI-Express 16x Gen 2.0 interface +with a throughput of $8$GB/s. Hence, the memory copy operations between the GPU and the CPU +are about $12$ times slower than those of the Tesla GPU memory. We have performed our simulations +on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{ch13:fig:05} describes +the components of the GPU cluster of tests. + +Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for +coding the parallel algorithms of the methods on both GPU cluster and CPU cluster. CUDA +version 4.0~\cite{ch13:ref12} is used for programming GPUs, using CUBLAS library~\cite{ch13:ref8} +to deal with vector operations in GPUs and, finally, MPI functions of OpenMPI 1.3.3 are +used to carry out the synchronous and asynchronous communications between CPU cores. Indeed, +in our experiments, a computing node is managed by a MPI process and it is composed of +one CPU core and one GPU card. + +All experimental results of the parallel projected Richardson algorithms are obtained +from simulations made in double precision data. The obstacle problems to be solved are +defined in constant three-dimensional domain $\Omega\subset\mathbb{R}^{3}$. The numerical +values of the parameters of the obstacle problems are: $\eta=0.2$, $c=1.1$, $f$ is computed +by formula~(\ref{ch13:eq:18}) and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$) +are computed with $k=0.0066$. As the discretization matrix is constant along the time +steps, the convergence properties of the iterative algorithms do not change. Thus, the +performance characteristics obtained with three time steps will still be valid for more +time steps. The initial function $u(0,x,y,z)$ of the obstacle problem~(\ref{ch13:eq:01}) +is set to $0$, with a constraint $u\geq\phi=0$. The relaxation parameter $\gamma$ used +by the projected Richardson method is computed automatically thanks to the diagonal entries +of the discretization matrix. The formula and its proof can be found in~\cite{ch13:ref11}, +Section~2.3. The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the +maximum number of relaxations is limited to $10^{6}$ relaxations. Finally, the number of +threads per block is set to $256$ threads, which gives, in general, good performances for +most GPU applications. We have performed some tests for the execution configurations and +we have noticed that the best configuration of the $256$ threads per block is an organization +into two dimensions of sizes $(64,4)$. \begin{figure} \centerline{\includegraphics[scale=0.25]{Chapters/chapter13/figures/cluster}} \caption{GPU cluster of tests composed of 12 computing nodes (six machines, each with two GPUs.} -\label{fig:05} +\label{ch13:fig:05} \end{figure} -Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of the methods on both -GPU cluster and CPU cluster. CUDA version 4.0~\cite{ref12} is used for programming GPUs, using CUBLAS library~\cite{ref8} to deal with vector -operations in GPUs and, finally, MPI functions of OpenMPI 1.3.3 are used to carry out the synchronous and asynchronous communications between -CPU cores. Indeed, in our experiments, a computing node is managed by a MPI process and it is composed of one CPU core and one GPU card. - -All experimental results of the parallel projected Richardson algorithms are obtained from simulations made in double precision data. The obstacle -problems to be solved are defined in constant three-dimensional domain $\Omega\subset\mathbb{R}^{3}$. The numerical values of the parameters of the -obstacle problems are: $\eta=0.2$, $c=1.1$, $f$ is computed by formula~(\ref{eq:18}) and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$) -are computed with $k=0.0066$. As the discretization matrix is constant along the time steps, the convergence properties of the iterative algorithms -do not change. Thus, the performance characteristics obtained with three time steps will still be valid for more time steps. The initial function -$u(0,x,y,z)$ of the obstacle problem~(\ref{eq:01}) is set to $0$, with a constraint $u\geq\phi=0$. The relaxation parameter $\gamma$ used by the -projected Richardson method is computed automatically thanks to the diagonal entries of the discretization matrix. The formula and its proof can be -found in~\cite{ref11}, Section~2.3. The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the maximum number of relaxations is -limited to $10^{6}$ relaxations. Finally, the number of threads per block is set to $256$ threads, which gives, in general, good performances for -most GPU applications. We have performed some tests for the execution configurations and we have noticed that the best configuration of the $256$ -threads per block is an organization into two dimensions of sizes $(64,4)$. - -\begin{table}[!h] +The performance measures that we took into account are the execution times and the number +of relaxations performed by the parallel iterative algorithms, both synchronous and asynchronous +versions, on the GPU and CPU clusters. These algorithms are used for solving nonlinear systems +derived from the discretization of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$ +and $800^{3}$. In Table~\ref{ch13:tab:01} and Table~\ref{ch13:tab:02}, we show the performances +of the parallel synchronous and asynchronous algorithms of the projected Richardson method +implemented, respectively, on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. In +these tables, the execution time defines the time spent by the slowest computing node and the +number of relaxations is computed as the summation of those carried out by all computing nodes. + +In the sixth column of Table~\ref{ch13:tab:01} and in the eighth column of Table~\ref{ch13:tab:02}, +we give the gains in $\%$ obtained by using an asynchronous algorithm compared to a synchronous +one. We can notice that the asynchronous version on CPU and GPU clusters is slightly faster than +the synchronous one for both methods. Indeed, the cluster of tests is composed of local and homogeneous +nodes communicating via low-latency connections. So, in the case of distant and/or heterogeneous +nodes (or even with geographically distant clusters) the asynchronous version would be faster than +the synchronous one. However, the gains obtained on the GPU cluster are better than those obtained +on the CPU cluster. In fact, the computation times are reduced by accelerating the computations on +GPUs while the communication times still unchanged. + +\begin{table} \centering \begin{tabular}{|c|c|c|c|c|c|} \hline @@ -540,105 +697,130 @@ $800^{3}$ & $222,108.09$ & $1,769,232$ & $188,790 \end{tabular} \vspace{0.5cm} \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 24 CPU cores.} -\label{tab:01} +\label{ch13:tab:01} \end{table} -\begin{table}[!h] +\begin{table} \centering \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline -\multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous} & \multicolumn{3}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-7} +\multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous} & \multicolumn{3}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-7} - & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & \\ \hline \hline + & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & \\ \hline \hline -$256^{3}$ & $29.67$ & $100,692$ & $19.39$ & $18.00$ & $94,215$ & $29.96$ & $39.33$ \\ \hline \hline +$256^{3}$ & $29.67$ & $100,692$ & $19.39$ & $18.00$ & $94,215$ & $29.96$ & $39.33$ \\\hline \hline -$512^{3}$ & $521.83$ & $381,300$ & $36.89$ & $425.15$ & $347,279$ & $42.89$ & $18.53$ \\ \hline \hline +$512^{3}$ & $521.83$ & $381,300$ & $36.89$ & $425.15$ & $347,279$ & $42.89$ & $18.53$ \\\hline \hline -$768^{3}$ & $4,112.68$ & $831,144$ & $50.13$ & $3,313.87$ & $750,232$ & $55.40$ & $19.42$ \\ \hline \hline +$768^{3}$ & $4,112.68$ & $831,144$ & $50.13$ & $3,313.87$ & $750,232$ & $55.40$ & $19.42$ \\ \hline \hline -$800^{3}$ & $3,950.87$ & $899,088$ & $56.22$ & $3,636.57$ & $834,900$ & $51.91$ & $7.95$ \\ \hline +$800^{3}$ & $3,950.87$ & $899,088$ & $56.22$ & $3,636.57$ & $834,900$ & $51.91$ & $7.95$ \\ \hline \end{tabular} \vspace{0.5cm} \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 12 GPUs.} -\label{tab:02} +\label{ch13:tab:02} \end{table} -The performance measures that we took into account are the execution times and the number of relaxations performed by the parallel iterative algorithms, -both synchronous and asynchronous versions, on the GPU and CPU clusters. These algorithms are used for solving nonlinear systems derived from the discretization -of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$ and $800^{3}$. In Table~\ref{tab:01} and Table~\ref{tab:02}, we show the performances -of the parallel synchronous and asynchronous algorithms of the projected Richardson method implemented, respectively, on a cluster of $24$ CPU cores -and on a cluster of $12$ GPUs. In these tables, the execution time defines the time spent by the slowest computing node and the number of relaxations -is computed as the summation of those carried out by all computing nodes. - -In the sixth column of Table~\ref{tab:01} and in the eighth column of Table~\ref{tab:02}, we give the gains in $\%$ obtained by using an -asynchronous algorithm compared to a synchronous one. We can notice that the asynchronous version on CPU and GPU clusters is slightly faster -than the synchronous one for both methods. Indeed, the cluster of tests is composed of local and homogeneous nodes communicating via low-latency -connections. So, in the case of distant and/or heterogeneous nodes (or even with geographically distant clusters) the asynchronous version -would be faster than the synchronous one. However, the gains obtained on the GPU cluster are better than those obtained on the CPU cluster. -In fact, the computation times are reduced by accelerating the computations on GPUs while the communication times still unchanged. - -The fourth and seventh columns of Table~\ref{tab:02} show the relative gains obtained by executing the parallel algorithms on the cluster -of $12$ GPUs instead on the cluster of $24$ CPU cores. We compute the relative gain $\tau$ as a ratio between the execution time $T_{cpu}$ -spent on the CPU cluster over that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see from these ratios that -solving large obstacle problems is faster on the GPU cluster than on the CPU cluster. Indeed, the GPUs are more efficient than their -counterpart CPUs to execute large data-parallel operations. In addition, the projected Richardson method is implemented as a fixed point-based -iteration and uses the Jacobi vector updates that allow a well thread-parallelization on GPUs, such that each GPU thread is in charge -of one vector component at a time without being dependent on other vector components computed by other threads. Then, this allow to exploit -at best the high performance computing of the GPUs by using all the GPU resources and avoiding the idle cores. - -Finally, the number of relaxations performed by the parallel synchronous algorithm is different in the CPU and GPU versions, because the number -of computing nodes involved in the GPU cluster and in the CPU cluster is different. In the CPU case, $24$ computing nodes ($24$ CPU cores) are -considered, whereas in the GPU case, $12$ computing nodes ($12$ GPUs) are considered. As the number of relaxations depends on the domain decomposition, +The fourth and seventh columns of Table~\ref{ch13:tab:02} show the relative gains +obtained by executing the parallel algorithms on the cluster of $12$ GPUs instead +on the cluster of $24$ CPU cores. We compute the relative gain\index{Relative~gain} +$\tau$ as a ratio between the execution time $T_{cpu}$ spent on the CPU cluster over +that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see +from these ratios that solving large obstacle problems is faster on the GPU cluster +than on the CPU cluster. Indeed, the GPUs are more efficient than their counterpart +CPUs to execute large data-parallel operations. In addition, the projected Richardson +method is implemented as a fixed point-based iteration and uses the Jacobi vector updates +that allow a well thread-parallelization on GPUs, such that each GPU thread is in charge +of one vector component at a time without being dependent on other vector components +computed by other threads. Then, this allow to exploit at best the high performance +computing of the GPUs by using all the GPU resources and avoiding the idle cores. + +Finally, the number of relaxations performed by the parallel synchronous algorithm +is different in the CPU and GPU versions, because the number of computing nodes involved +in the GPU cluster and in the CPU cluster is different. In the CPU case, $24$ computing +nodes ($24$ CPU cores) are considered, whereas in the GPU case, $12$ computing nodes +($12$ GPUs) are considered. As the number of relaxations depends on the domain decomposition, consequently it also depends on the number of computing nodes. + %%--------------------------%% %% SECTION 6 %% %%--------------------------%% \section{Red-Black ordering technique} -\label{sec:06} -As is well-known, the Jacobi method is characterized by a slow convergence rate compared to some iterative methods (for example Gauss-Seidel method). -So, in this section, we present some solutions to reduce the execution time and the number of relaxations and, more specifically, to speed up the -convergence of the parallel projected Richardson method on the GPU cluster. We propose to use the point red-black ordering technique to accelerate -the convergence. This technique is often used to increase the parallelism of iterative methods for solving linear systems~\cite{ref13,ref14,ref15}. -We apply it to the projected Richardson method as a compromise between the Jacobi and Gauss-Seidel iterative methods. - -The general principle of the red-black technique is as follows. Let $t$ be the summation of the integer $x$-, $y$- and $z$-coordinates of a vector -element $u(x,y,z)$ on a three-dimensional domain: $t=x+y+z$. As is shown in Figure~\ref{fig:06.01}, the red-black ordering technique consists in the -parallel computing of the red vector elements having even value $t$ by using the values of the black ones then, the parallel computing of the black -vector elements having odd values $t$ by using the new values of the red ones. - -\begin{figure} -\centering - \mbox{\subfigure[Red-black ordering on x, y and z axises]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir}\label{fig:06.01}}\quad - \subfigure[Red-black ordering on y axis]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir-y}\label{fig:06.02}}} -\caption{Red-black ordering for computing the iterate vector elements in a three-dimensional space.} -\end{figure} +\label{ch13:sec:06} +As is well-known, the Jacobi method\index{Iterative~method!Jacobi} is characterized +by a slow convergence\index{Convergence} rate compared to some iterative methods\index{Iterative~method} +(for example Gauss-Seidel method\index{Iterative~method!Gauss-Seidel}). So, in this +section, we present some solutions to reduce the execution time and the number of +relaxations and, more specifically, to speed up the convergence of the parallel +projected Richardson method on the GPU cluster. We propose to use the point red-black +ordering technique\index{Iterative~method!Red-Black~ordering} to accelerate the +convergence. This technique is often used to increase the parallelism of iterative +methods for solving linear systems~\cite{ch13:ref13,ch13:ref14,ch13:ref15}. We +apply it to the projected Richardson method as a compromise between the Jacobi +and Gauss-Seidel iterative methods. + +The general principle of the red-black technique is as follows. Let $t$ be the +summation of the integer $x$-, $y$- and $z$-coordinates of a vector element $u(x,y,z)$ +on a three-dimensional domain: $t=x+y+z$. As is shown in Figure~\ref{ch13:fig:06.01}, +the red-black ordering technique consists in the parallel computing of the red +vector elements having even value $t$ by using the values of the black ones then, +the parallel computing of the black vector elements having odd values $t$ by using +the new values of the red ones. This technique can be implemented on the GPU in two different manners: \begin{itemize} \item among all launched threads ($NX\times ny$ threads), only one thread out of two computes its red or black vector element at a time or, \item all launched threads (on average half of $NX\times ny$ threads) compute the red vector elements first and, then, the black ones. \end{itemize} -However, in both solutions, for each memory transaction, only half of the memory segment addressed by a half-warp is used. So, the computation of the -red and black vector elements leads to use twice the initial number of memory transactions. Then, we apply the point red-black ordering accordingly to -the $y$-coordinate, as is shown in Figure~\ref{fig:06.02}. In this case, the vector elements having even $y$-coordinate are computed in parallel using -the values of those having odd $y$-coordinate and then vice-versa. Moreover, in the GPU implementation of the parallel projected Richardson method (Section~\ref{sec:04}), -we have shown that a sub-problem of size $(NX\times ny\times nz)$ is decomposed into $nz$ grids of size $(NX\times ny)$. Then, each kernel is executed -in parallel by $NX\times ny$ GPU threads, so that each thread is in charge of $nz$ vector elements along $z$-axis (one vector element in each grid of -the sub-problem). So, we propose to use the new values of the vector elements computed in grid $i$ to compute those of the vector elements in grid $i+1$. -Listing~\ref{list:04} describes the kernel of the matrix-vector multiplication and the kernel of the vector elements updates of the parallel projected -Richardson method using the red-black ordering technique. - -\lstinputlisting[label=list:04,caption=GPU kernels of the projected Richardson method using the red-black technique]{Chapters/chapter13/ex4.cu} - -Finally, we exploit the concurrent executions between the host functions and the GPU kernels provided by the GPU hardware and software. In fact, the kernel -launches are asynchronous (when this environment variable is not disabled on the GPUs), such that the control is returned to the host (MPI process) before -the GPU has completed the requested task (kernel)~\cite{ref12}. Therefore, all the kernels necessary to update the local vector elements, $u(x,y,z)$ where -$0