X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/715679c7fdeac58424c5052fffe7677f15337801..21ada473e001305cb2a05a34a2da6c2fb1ecd126:/BookGPU/Chapters/chapter13/ch13.tex diff --git a/BookGPU/Chapters/chapter13/ch13.tex b/BookGPU/Chapters/chapter13/ch13.tex index 941f085..ce2c7c2 100755 --- a/BookGPU/Chapters/chapter13/ch13.tex +++ b/BookGPU/Chapters/chapter13/ch13.tex @@ -5,7 +5,7 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%% %\chapterauthor{}{} -\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} +\chapterauthor{Lilia Ziane Khodja, Raphaël Couturier, and Jacques Bahi}{Femto-ST Institute, University of Franche-Comte, France} \chapterauthor{Ming Chau}{Advanced Solutions Accelerator, Castelnau Le Lez, France} %\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} \chapterauthor{Pierre Spitéri}{ENSEEIHT-IRIT, Toulouse, France} @@ -23,23 +23,23 @@ %%--------------------------%% \section{Introduction} \label{ch13:sec:01} -The obstacle problem is one kind of free boundary problems. It allows to model, +The obstacle problem is one kind of free boundary problem. It allows us to model, for example, an elastic membrane covering a solid obstacle. In this case, the objective is to find an equilibrium position of this membrane constrained to be above the obstacle and which tends to minimize its surface and/or its energy. -The study of such problems occurs in many applications, for example: fluid mechanics, -bio-mathematics (tumor growth process) or financial mathematics (American or +The study of such problems occurs in many applications, for example, fluid mechanics, +biomathematics (tumor growth process), or financial mathematics (American or European option pricing). In this chapter, we focus on solutions of large obstacle problems defined in a -three-dimensional domain. Particularly, the present study consists in solving +three-dimensional domain. Particularly, the present study consists of solving large nonlinear systems derived from the spatial discretization of these problems. Owing to the great size of such systems, in order to reduce computation times, we proceed by solving them by parallel synchronous or asynchronous iterative algorithms. Moreover, we aim at harnessing the computing power of GPUs to accelerate computations of these parallel algorithms. For this, we use an iterative method -involving a projection on a convex set, which is: the projected Richardson method. -We choose this method among other iterative methods because it is easy to implement +involving a projection on a convex set, which is the projected Richardson method. +We chose this method among other iterative methods because it is easy to implement on parallel computers and easy to adapt to GPU architectures. In Section~\ref{ch13:sec:02}, we present the mathematical model of obstacle problems @@ -48,7 +48,7 @@ projected Richardson method. Next, in Section~\ref{ch13:sec:04}, we give the mai points of the parallel implementation of both synchronous and asynchronous algorithms of the projected Richardson method on a GPU cluster. In Section~\ref{ch13:sec:05}, we present the performances of both parallel algorithms obtained from simulations carried -out on a CPU and GPU clusters. Finally, in Section~\ref{ch13:sec:06}, we use the read-black +out on GPU clusters. Finally, in Section~\ref{ch13:sec:06}, we use the read-black ordering technique to improve the convergence and, thus, the execution times of the parallel projected Richardson algorithms on the GPU cluster. @@ -67,8 +67,8 @@ three-dimensional domain. This model is based on that presented in~\cite{ch13:re %%******************* \subsection{Mathematical model} \label{ch13:sec:02.01} -An obstacle problem\index{Obstacle~problem}, arising for example in mechanics or financial -derivatives, consists in solving a time dependent nonlinear equation\index{Nonlinear}: +An obstacle problem\index{obstacle problem}, arising for example in mechanics or financial +derivatives, consists of solving a time-dependent nonlinear equation\index{nonlinear}: \begin{equation} \left\{ \begin{array}{l} @@ -80,19 +80,19 @@ u(0,x,y,z)=u_0(x,y,z),\\ \right. \label{ch13:eq:01} \end{equation} -where $u_0$ is the initial condition, $c\geq 0$, $b$ and $\eta$ are physical parameters, -$T$ is the final time, $u=u(t,x,y,z)$ is an element of the solution vector $U$ to compute, -$f$ is the right-hand side that could represent, for example, the external forces, B.C. -describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$, -$\phi$ models a constraint imposed to $u$, $\Delta$ is the Laplacian operator, $\nabla$ -is the gradient operator, a.e.w. means almost every where and ``.'' defines the products -between two scalars, a scalar and a vector or a matrix and a vector. In practice the boundary +where $u_0$ is the initial condition; $c\geq 0$, $b$, and $\eta$ are physical parameters; +$T$ is the final time; $u=u(t,x,y,z)$ is an element of the solution vector $U$ to compute; +$f$ is the right-hand side that could represent, for example, the external forces; B.C. +describes the boundary conditions on the boundary $\partial\Omega$ of the domain $\Omega$; +$\phi$ models a constraint imposed to $u$; $\Delta$ is the Laplacian operator; $\nabla$ +is the gradient operator; a.e.w. means almost everywhere, and ``.'' defines the products +between two scalars, a scalar and a vector, or a matrix and a vector. In practice the boundary condition, generally considered, is the Dirichlet condition (where $u$ is fixed on $\partial\Omega$) or the Neumann condition (where the normal derivative of $u$ is fixed on $\partial\Omega$). -The time dependent equation~(\ref{ch13:eq:01}) is numerically solved by considering an +The time-dependent equation~(\ref{ch13:eq:01}) is numerically solved by considering an implicit or a semi-implicit time marching, where at each time step $k$ a stationary nonlinear -problem\index{Nonlinear} is solved: +problem\index{nonlinear} is solved: \begin{equation} \left\{ \begin{array}{l} @@ -116,9 +116,9 @@ problem~(\ref{ch13:eq:02}) does not provide a symmetric matrix, because the convection-diffusion operator is not self-adjoint. Moreover, the fact that the operator is self-adjoint or not plays an important role in the choice of the appropriate algorithm for solving nonlinear systems derived from the -discretization of the obstacle problem\index{Obstacle~problem}. Nevertheless, +discretization of the obstacle problem\index{obstacle problem}. Nevertheless, since the convection coefficients arising in the operator~(\ref{ch13:eq:02}) -are constant, we can formulate the same problem by an self-adjoint operator +are constant, we can formulate the same problem by self-adjoint operator by performing a classical change of variables. Then, we can replace the stationary convection-diffusion problem: \begin{equation} @@ -130,19 +130,19 @@ by the following stationary diffusion operator: -\eta.\Delta u+(\frac{\|b\|^{2}_{2}}{4\eta}+c+\delta).u=e^{-a}g=f, \label{ch13:eq:04} \end{equation} -where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$ and +where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$, and $v=e^{-a}.u$ represents the general change of variables such that $a=\frac{b^{t}(x,y,z)}{2\eta}$. Consequently, the numerical resolution of the diffusion problem (the self-adjoint -operator~(\ref{ch13:eq:04})) is done by optimization algorithms, in contrast, that +operator~(\ref{ch13:eq:04})) is done by optimization algorithms, in contrast to that of the convection-diffusion problem (non self-adjoint operator~(\ref{ch13:eq:03})) -is done by relaxation algorithms. In the case of our studied algorithm, the convergence\index{Convergence} -is ensured by M-matrix property then, the performance is linked to the magnitude of +which is done by relaxation algorithms. In the case of our studied algorithm, the convergence\index{convergence} +is ensured by M-matrix property; then, the performance is linked to the magnitude of the spectral radius of the iteration matrix, which is independent of the condition number. Next, the three-dimensional domain $\Omega\subset\mathbb{R}^{3}$ is set to $\Omega=\lbrack 0,1\rbrack^{3}$ -and discretized with an uniform Cartesian mesh constituted by $M=m^3$ discretization -points, where $m$ related to the spatial discretization step by $h=\frac{1}{m+1}$. This +and discretized with a uniform Cartesian mesh constituted by $M=m^3$ discretization +points, where $m$ is related to the spatial discretization step by $h=\frac{1}{m+1}$. This is carried out by using a classical order 2 finite difference approximation of the Laplacian. So, the complete discretization of both stationary boundary value problems~(\ref{ch13:eq:03}) and~(\ref{ch13:eq:04}) leads to the solution of a large discrete complementary problem @@ -160,7 +160,7 @@ of the following form, when both Dirichlet or Neumann boundary conditions are us where $A$ is a matrix obtained after the spatial discretization by a finite difference method, $G$ is derived from the Euler first order implicit time marching scheme and from the discretized right-hand side of the obstacle problem, $\delta$ is the inverse of the -time step $k$ and $I$ is the identity matrix. The matrix $A$ is symmetric when the self-adjoint +time step $k$, and $I$ is the identity matrix. The matrix $A$ is symmetric when the self-adjoint operator is considered and nonsymmetric otherwise. According to the chosen discretization scheme of the Laplacian, $A$ is an M-matrix (irreducibly @@ -176,15 +176,15 @@ an M-matrix. This property is important to the convergence of iterative methods. Owing to the large size of the previous discrete complementary problem~(\ref{ch13:eq:05}), we will solve it by parallel synchronous or asynchronous iterative algorithms (see~\cite{ch13:ref3,ch13:ref4,ch13:ref5}). In this chapter, we aim at harnessing the computing power of GPU clusters for solving these -large nonlinear systems\index{Nonlinear}. Then, we choose to use the projected Richardson -iterative method\index{Iterative~method!Projected~Richardson} for solving the diffusion +large nonlinear systems\index{nonlinear}. Then, we choose to use the projected Richardson +iterative method\index{iterative method!projected Richardson} for solving the diffusion problem~(\ref{ch13:eq:04}). Indeed, this method is based on the iterations of the Jacobi -method\index{Iterative~method!Jacobi} which are easy to parallelize on parallel computers +method\index{iterative method!Jacobi}, which are easy to parallelize on parallel computers and easy to adapt to GPU architectures. Then, according to the boundary value problem formulation with a self-adjoint operator~(\ref{ch13:eq:04}), we can consider here the equivalent optimization problem and the fixed point mapping associated to its solution. -Assume that $E=\mathbb{R}^{M}$ is a Hilbert space\index{Hilbert~space}, in which $\scalprod{.}{.}$ +Assume that $E=\mathbb{R}^{M}$ is a Hilbert space\index{Hilbert space}, in which $\scalprod{.}{.}$ is the scalar product and $\|.\|$ its associated norm. So, the general fixed point problem to be solved is defined as follows: \begin{equation} @@ -198,12 +198,12 @@ U^{*} = F(U^{*}), \\ \end{equation} where $U\mapsto F(U)$ is an application from $E$ to $E$. -Let $K$ be a closed convex set defined by: +Let $K$ be a closed convex set defined by \begin{equation} K = \{U | U \geq \Phi \mbox{~everywhere in~} E\}, \label{ch13:eq:07} \end{equation} -where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{ch13:eq:05})\index{Obstacle~problem} +where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{ch13:eq:05})\index{obstacle problem} is formulated as the following constrained optimization problem: \begin{equation} \left\{ @@ -214,17 +214,17 @@ is formulated as the following constrained optimization problem: \right. \label{ch13:eq:08} \end{equation} -where the cost function is given by: +where the cost function is given by \begin{equation} J(U) = \frac{1}{2}\scalprod{\mathcal{A}.U}{U} - \scalprod{G}{U}, \label{ch13:eq:09} \end{equation} in which $\scalprod{.}{.}$ denotes the scalar product in $E$, $\mathcal{A}=A+\delta I$ -is a symmetric positive definite, $A$ is the discretization matrix associated with the +is a symmetric positive definite, and $A$ is the discretization matrix associated with the self-adjoint operator~(\ref{ch13:eq:04}) after change of variables. -For any $U\in E$, let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$, -$\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected Richardson method\index{Iterative~method!Projected~Richardson} +For any $U\in E$; let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$, +$\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected Richardson method\index{iterative method!projected Richardson} is defined as follows: \begin{equation} U^{*} = F_{\gamma}(U^{*}) = P_K(U^{*} - \gamma(\mathcal{A}.U^{*} - G)). @@ -237,7 +237,7 @@ of the projected Richardson method~\cite{ch13:ref6}. Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$ is a product of $\alpha$ subspaces $E_i$ where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$, -where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space\index{Hilbert~space} +where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space\index{Hilbert space} in which $\scalprod{.}{.}_i$ denotes the scalar product and $|.|_i$ the associated norm, for all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$ is the scalar product on $E$. @@ -251,11 +251,11 @@ F_{\gamma}(U) & = & (F_{1,\gamma}(U),\ldots,F_{\alpha,\gamma}(U)). \\ \end{array} \label{ch13:eq:11} \end{equation} -Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$ -and $K_i$ is a closed convex set. Let also $G=(G_1,\ldots,G_{\alpha})\in E$ and, for any +Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that $\forall i\in\{1,\ldots,\alpha\},K_i\subset E_i$, +and $K_i$ is a closed convex set. Let also $G=(G_1,\ldots,G_{\alpha})\in E$; for any $U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$ on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$ is the projector from $E_i$ onto -$K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{ch13:eq:10})\index{Iterative~method!Projected~Richardson} +$K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{ch13:eq:10})\index{iterative method!projected Richardson} can be written in the following way: \begin{equation} \forall U\in E\mbox{,~}\forall i\in\{1,\ldots,\alpha\}\mbox{,~}F_{i,\gamma}(U) = P_{K_i}(U_i - \gamma(\mathcal{A}_i.U - G_i)). @@ -267,7 +267,7 @@ $\mathcal{A}_{i,j}$ denote block matrices of $\mathcal{A}$. The parallel asynchronous iterations of the projected Richardson method for solving the obstacle problem~(\ref{ch13:eq:08}) are defined as follows: let $U^0\in E,U^0=(U^0_1,\ldots,U^0_\alpha)$ be the initial solution, then for all $p\in\mathbb{N}$, the iterate $U^{p+1}=(U^{p+1}_1,\ldots,U^{p+1}_{\alpha})$ -is recursively defined by: +is recursively defined by \begin{equation} U_i^{p+1} = \left\{ @@ -283,7 +283,7 @@ where \left\{ \begin{array}{l} \forall p\in\mathbb{N}, s(p)\subset\{1,\ldots,\alpha\}\mbox{~and~} s(p)\ne\emptyset, \\ -\forall i\in\{1,\ldots,\alpha\},\{p \ | \ i \in s(p)\}\mbox{~is denombrable}, +\forall i\in\{1,\ldots,\alpha\},\{p \ | \ i \in s(p)\}\mbox{~is enumerable}, \end{array} \right. \label{ch13:eq:14} @@ -299,37 +299,37 @@ and $\forall j\in\{1,\ldots,\alpha\}$, \label{ch13:eq:15} \end{equation} -The previous asynchronous scheme\index{Asynchronous} of the projected Richardson -method models computations that are carried out in parallel without order nor +The previous asynchronous scheme\index{asynchronous iterations} of the projected Richardson +method models computations that are carried out in parallel without order or synchronization (according to the behavior of the parallel iterative method) and describes a subdomain method without overlapping. It is a general model that takes into account all possible situations of parallel computations and nonblocking message -passing. So, the synchronous iterative scheme\index{Synchronous} is defined by: +passing. So, the synchronous iterative scheme\index{synchronous iterations} is defined by \begin{equation} \forall j\in\{1,\ldots,\alpha\} \mbox{,~} \forall p\in\mathbb{N} \mbox{,~} \rho_j(p)=p. \label{ch13:eq:16} \end{equation} The values of $s(p)$ and $\rho_j(p)$ are defined dynamically and not explicitly by the parallel asynchronous or synchronous execution of the algorithm. Particularly, -it enables one to consider distributed computations whereby processors compute at +They allow us to consider distributed computations whereby processors compute at their own pace according to their intrinsic characteristics and computational load. The parallelism between the processors is well described by the set $s(p)$ which contains at each step $p$ the index of the components relaxed by each processor on a parallel way while the use of delayed components in~(\ref{ch13:eq:13}) permits one to model nondeterministic behavior and does not imply inefficiency of the considered distributed scheme of computation. Note that, according to~\cite{ch13:ref7}, theoretically, -each component of the vector must be relaxed an infinity of time. The choice of the -relaxed components to be used in the computational process may be guided by any criterion -and, in particular, a natural criterion is to pick-up the most recently available +each component of the vector must be relaxed an infinite number of times. The choice of the +relaxed components to be used in the computational process may be guided by any criterion, +and in particular, a natural criterion is to pickup the most recently available values of the components computed by the processors. Furthermore, the asynchronous -iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI~subroutines!Nonblocking} +iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI!nonblocking} (asynchronous communications). The important property ensuring the convergence of the parallel projected Richardson method, both synchronous and asynchronous algorithms, is the fact that $\mathcal{A}$ -is an M-matrix. Moreover, the convergence\index{Convergence} proceeds from a result +is an M-matrix. Moreover, the convergence\index{convergence} proceeds from a result of~\cite{ch13:ref6}. Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$, -the parallel iterations~(\ref{ch13:eq:13}), (\ref{ch13:eq:14}) and~(\ref{ch13:eq:15}), +the parallel iterations~(\ref{ch13:eq:13}), (\ref{ch13:eq:14}), and~(\ref{ch13:eq:15}), associated to the fixed point mapping $F_\gamma$~(\ref{ch13:eq:12}), converge to the unique solution $U^{*}$ of the discretized problem. @@ -343,64 +343,64 @@ In this section, we give the main key points of the parallel implementation of t projected Richardson method, both synchronous and asynchronous versions, on a GPU cluster, for solving the nonlinear systems derived from the discretization of large obstacle problems. More precisely, each nonlinear system is solved iteratively using -the whole cluster. We use a heterogeneous CUDA/MPI programming. Indeed, the communication +the whole cluster. We use a heterogeneous CUDA and MPI programming. Indeed, the communication of data, at each iteration between the GPU computing nodes, can be either synchronous or asynchronous using the MPI communication subroutines, whereas inside each GPU node, a CUDA parallelization is performed. -\begin{figure}[!h] -\centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitCPU}} -\caption{Data partitioning of a problem to be solved among $S=3\times 4$ computing nodes.} -\label{ch13:fig:01} -\end{figure} - -Let $S$ denote the number of computing nodes\index{Computing~node} on the GPU cluster, +Let $S$ denote the number of computing nodes\index{computing node} on the GPU cluster, where a computing node is composed of CPU core holding one MPI process and a GPU card. So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$ -is split into $S$ parallelepipedic sub-problems, each for a node (MPI process, GPU), as +is split into $S$ parallelepipedic subproblems, each for a node (MPI process, GPU), as is shown in Figure~\ref{ch13:fig:01}. Indeed, the $NY$ and $NZ$ dimensions (according -to the $y$ and $z$ axises) of the three-dimensional problem are, respectively, split +to the $y$ and $z$ axises) of the three-dimensional problem are split, respectively, into $Sy$ and $Sz$ parts, such that $S=Sy\times Sz$. In this case, each computing node -has at most four neighboring nodes. This kind of the data partitioning reduces the data +has at most four neighboring nodes. This kind of data partitioning reduces the data exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning. -\begin{algorithm}[!t] -Initialization of the parameters of the sub-problem\; -Allocate and fill the data in the global memory GPU\; -\For{$i=1$ {\bf to} $NbSteps$}{ - $G = \frac{1}{k}.U^0 + F$\; - Solve($A$, $U^0$, $G$, $U$, $\varepsilon$, $MaxRelax$)\; - $U^0 = U$\; -} -Copy the solution $U$ back from GPU memory\; -\caption{Parallel solving of the obstacle problem on a GPU cluster} -\label{ch13:alg:01} -\end{algorithm} +\begin{figure} +\centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitCPU}} +\caption{Data partitioning of a problem to be solved among $S=3\times 4$ computing nodes.} +\label{ch13:fig:01} +\end{figure} -All the computing nodes of the GPU cluster execute in parallel the same Algorithm~\ref{ch13:alg:01} -but on different three-dimensional sub-problems of size $(NX\times ny\times nz)$. -This algorithm gives the main key points for solving an obstacle problem\index{Obstacle~problem} +All the computing nodes of the GPU cluster execute in parallel the Algorithm~\ref{ch13:alg:01} +on a three-dimensional subproblems of size $(NX\times ny\times nz)$. +This algorithm gives the main key points for solving an obstacle problem\index{obstacle problem} defined in a three-dimensional domain, where $A$ is the discretization matrix, $G$ -is the right-hand side and $U$ is the solution vector. After the initialization step, +is the right-hand side, and $U$ is the solution vector. After the initialization step, all the data generated from the partitioning operation are copied from the CPU memories -to the GPU global memories, to be processed on the GPUs. Next, the algorithm uses $NbSteps$ +to the GPU global memories to be processed on the GPUs. Next, the algorithm uses $NbSteps$ time steps to solve the global obstacle problem. In fact, it uses a parallel algorithm -adapted to GPUs of the projected Richardson iterative method for solving the nonlinear -systems\index{Nonlinear} of the obstacle problem. This function is defined by {\it Solve()} +adapted to GPUs from the projected Richardson iterative method for solving the nonlinear +systems\index{nonlinear} of the obstacle problem. This function is defined by {\it Solve()} in Algorithm~\ref{ch13:alg:01}. At every time step, the initial guess $U^0$ for the iterative algorithm is set to the solution found at the previous time step. Moreover, the right-hand side $G$ is computed as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step, -$U^{prev}$ is the solution computed in the previous time step and each element $f(x, y, z)$ +$U^{prev}$ is the solution computed in the previous time step, and each element $f(x, y, z)$ of the vector $F$ is computed as follows: \begin{equation} f(x,y,z)=\cos(2\pi x)\cdot\cos(4\pi y)\cdot\cos(6\pi z). \label{ch13:eq:18} \end{equation} Finally, the solution $U$ of the obstacle problem is copied back from the GPU global -memories to the CPU memories. We use the communication subroutines of the CUBLAS library~\cite{ch13:ref8}\index{CUBLAS} -(CUDA Basic Linear Algebra Subroutines) for the memory allocations in the GPU (\verb+cublasAlloc()+) +memories to the CPU memories. We use the communication subroutines of the CUBLAS +(CUDA Basic Linear Algebra Subroutines) library~\cite{ch13:ref8}\index{CUBLAS} for the memory allocations in the GPU (\verb+cublasAlloc()+) and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and \verb+cublasGetVector()+. +\begin{algorithm}[t] +Initialization of the parameters of the subproblem\; +Allocate and fill the data in the global memory GPU\; +\For{$i=1$ {\bf to} $NbSteps$}{ + $G = \frac{1}{k}.U^0 + F$\; + Solve($A$, $U^0$, $G$, $U$, $\varepsilon$, $MaxRelax$)\; + $U^0 = U$\; +} +Copy the solution $U$ back from GPU memory\; +\caption{parallel solving of the obstacle problem on a GPU cluster} +\label{ch13:alg:01} +\end{algorithm} + \begin{algorithm}[!t] $p = 0$\; $conv = false$\; @@ -414,17 +414,17 @@ and the data transfers between the CPU and its GPU: \verb+cublasSetVector()+ and $p = p + 1$\; $conv$ = Convergence($error$, $p$, $\varepsilon$, $MaxRelax$)\; } -\caption{Parallel iterative solving of the nonlinear systems on a GPU cluster ($Solve()$ function)} +\caption{parallel iterative solving of the nonlinear systems on a GPU cluster ($Solve()$ function)} \label{ch13:alg:02} \end{algorithm} -As many other iterative methods, the algorithm of the projected Richardson -method\index{Iterative~method!Projected~Richardson} is based on algebraic +As are many other iterative methods, the algorithm of the projected Richardson +method\index{iterative method!projected Richardson} is based on algebraic functions operating on vectors and/or matrices, which are more efficient on parallel computers when they work on large vectors. Its parallel implementation on the GPU cluster is carried out so that the GPUs execute the vector operations as kernels and the CPUs execute the serial codes, supervise the kernel executions -and the data exchanges with the neighboring nodes\index{Neighboring~node} and +and the data exchanges with the neighboring nodes\index{neighboring node}, and supply the GPUs with data. Algorithm~\ref{ch13:alg:02} shows the main key points of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{ch13:alg:01}). All the vector operations inside the main loop ({\bf repeat} ... {\bf until}) @@ -432,7 +432,7 @@ are executed by the GPU. We use the following functions of the CUBLAS library\in \begin{itemize} \item \verb+cublasDaxpy()+ to compute the difference between the solution vectors $U^{p}$ and $U^{p+1}$ computed in two successive relaxations $p$ and $p+1$ (line~$7$ in Algorithm~\ref{ch13:alg:02}), -\item \verb+cublasDnrm2()+ to perform the Euclidean norm (line~$8$) and, +\item \verb+cublasDnrm2()+ to perform the Euclidean norm (line~$8$), and \item \verb+cublasDcpy()+ for the data copy of a vector to another one in the GPU memory (lines~$3$ and~$9$). \end{itemize} @@ -441,43 +441,43 @@ depend on the resources of the GPU multiprocessor and the resource requirements of the kernel. So, if $block$ defines the size of a thread block, which must not exceed the maximum size of a thread block, then the number of thread blocks in the grid, denoted by $grid$, can be computed according to the size of the -local sub-problem as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\] +local subproblem as follows: \[grid = \frac{(NX\times ny\times nz)+block-1}{block}.\] However, when solving very large problems, the size of the thread grid can exceed -the maximum number of thread blocks that can be executed on the GPUs (up-to $65.535$ -thread blocks) and, thus, the kernel will fail to launch. Therefore, for each kernel, -we decompose the three-dimensional sub-problem into $nz$ two-dimensional slices of size +the maximum number of thread blocks that can be executed on the GPUs (upto $65.535$ +thread blocks), and thus, the kernel will fail to launch. Therefore, for each kernel, +we decompose the three-dimensional subproblem into $nz$ two-dimensional slices of size ($NX\times ny$), as is shown in Figure~\ref{ch13:fig:02}. All slices of the same kernel -are executed using {\bf for} loop by $NX\times ny$ parallel threads organized in a +are executed using a {\bf for} loop by $NX\times ny$ parallel threads organized in a two-dimensional grid of two-dimensional thread blocks, as is shown in Listing~\ref{ch13:list:01}. Each thread is in charge of $nz$ discretization points (one from each slice), accessed in the GPU memory with a constant stride $(NX\times ny)$. \begin{figure} \centerline{\includegraphics[scale=0.30]{Chapters/chapter13/figures/splitGPU}} -\caption{Decomposition of a sub-problem in a GPU into $nz$ slices.} +\caption{Decomposition of a subproblem in a GPU into $nz$ slices.} \label{ch13:fig:02} \end{figure} \begin{center} -\lstinputlisting[label=ch13:list:01,caption=Skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu} +\lstinputlisting[label=ch13:list:01,caption=skeleton codes of a GPU kernel and a CPU function]{Chapters/chapter13/ex1.cu} \end{center} The function $Determine\_Bordering\_Vector\_Elements()$ (line~$5$ in Algorithm~\ref{ch13:alg:02}) determines the values of the vector elements shared at the boundaries with neighboring computing -nodes. Its main operations are defined as follows: +nodes. Its main operations are as follows: \begin{enumerate} \item define the values associated to the bordering points needed by the neighbors, \item copy the values associated to the bordering points from the GPU to the CPU, -\item exchange the values associated to the bordering points between the neighboring CPUs, -\item copy the received values associated to the bordering points from the CPU to the GPU, +\item exchange the values associated to the bordering points between the neighboring CPUs, and +\item copy the received values associated to the bordering points from the CPU to the GPU. \end{enumerate} The first operation of this function is implemented as kernels to be performed by the GPU: \begin{itemize} -\item a kernel executed by $NX\times nz$ threads to define the values associated to the bordering vector elements along $y$-axis and, -\item a kernel executed by $NX\times ny$ threads to define the values associated to the bordering vector elements along $z$-axis. +\item a kernel executed by $NX\times nz$ threads to define the values associated to the bordering vector elements along the $y$-axis, and +\item a kernel executed by $NX\times ny$ threads to define the values associated to the bordering vector elements along the $z$-axis. \end{itemize} -As mentioned before, we develop the \emph{synchronous} and \emph{asynchronous} +As mentioned previously, we develop the \emph{synchronous} and \emph{asynchronous} algorithms of the projected Richardson method. Obviously, in this scope, the -synchronous\index{Synchronous} or asynchronous\index{Asynchronous} communications +synchronous\index{synchronous iterations} or asynchronous\index{asynchronous iterations} communications refer to the communications between the CPU cores (MPI processes) on the GPU cluster, in order to exchange the vector elements associated to subdomain boundaries. For the memory copies between a CPU core and its GPU, we use the synchronous communication @@ -486,19 +486,19 @@ in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsy and \verb+cublasGetVectorAsync()+ in the asynchronous algorithm. Moreover, we use the communication routines of the MPI library to carry out the data exchanges between the neighboring nodes. We use the following communication routines: \verb+MPI_Isend()+ -and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI~subroutines!Nonblocking} -sends and receptions, respectively. For the synchronous algorithm, we use the MPI +and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI!nonblocking} +sends and receives, respectively. For the synchronous algorithm, we use the MPI routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node in -blocking status until all data exchanges with neighboring nodes (sends and receptions) +blocking status until all data exchanges with neighboring nodes (sends and receives) are completed. In contrast, for the asynchronous algorithms, we use the MPI routine -\verb+MPI_Test()+ which tests the completion of a data exchange (send or reception) -without putting the MPI process in blocking status\index{MPI~subroutines!Blocking}. +\verb+MPI_Test()+ which tests the completion of a data exchange (send or receives) +without putting the MPI process in blocking status\index{MPI!blocking}. The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{ch13:alg:02}) computes, at each iteration, the new elements of the iterate vector $U$. Its general code is presented in Listing~\ref{ch13:list:01} (CPU function). The iterations of the projected -Richardson method\index{Iterative~method!Projected~Richardson}, based on those of the Jacobi -method\index{Iterative~method!Jacobi}, are defined as follows: +Richardson method\index{iterative method!projected Richardson}, based on those of the Jacobi +method\index{iterative method!Jacobi}, are defined as follows: \begin{equation} \begin{array}{ll} u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\ @@ -510,7 +510,7 @@ u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\ \end{equation} where $u^{p}(x,y,z)$ is an element of the iterate vector $U$ computed at the iteration $p$ and $g(x,y,z)$ is a vector element of the right-hand side $G$. -The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$ and $Front$ +The scalars $Center$, $West$, $East$, $South$, $North$, $Rear$, and $Front$ define constant coefficients of the block matrix $A$. Figure~\ref{ch13:fig:03} shows the positions of these coefficients in a three-dimensional domain. @@ -533,6 +533,7 @@ of the projected Richardson method, which are: the matrix-vector multiplication (\verb+MV_Multiplication()+) and the vector elements updates (\verb+Vector_Updates()+). The codes of these kernels are based on that presented in Listing~\ref{ch13:list:01}. +\pagebreak \lstinputlisting[label=ch13:list:02,caption=GPU kernels of the projected Richardson method]{Chapters/chapter13/ex2.cu} \begin{figure} @@ -544,40 +545,41 @@ The codes of these kernels are based on that presented in Listing~\ref{ch13:list Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices of $(NX\times ny)$ vector elements are computed in a {\bf for} loop. In this case, each thread is in charge of one vector element from each slice -(in total $nz$ vector elements along $z$-axis). We can notice from the +(in total $nz$ vector elements along the $z$-axis). We can notice from the formula~(\ref{ch13:eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$, by a thread at iteration $p+1$, requires seven vector elements computed at the previous iteration $p$: two vector elements in each dimension plus -the vector element at the intersection of the three axises $x$, $y$ and $z$ +the vector element at the intersection of the three axes $x$, $y$, and $z$ (see Figure~\ref{ch13:fig:04}). So, to reduce the memory accesses to the high-latency global memory, the vector elements of the current slice can be stored in the low-latency shared memories of thread blocks, as is described in~\cite{ch13:ref9}. Nevertheless, the fact that the computation of a vector -element requires only two elements in each dimension does not allow to maximize +element requires only two elements in each dimension does not allow us to maximize the data reuse from the shared memories. The computation of a slice involves in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block, to fill the required vector elements in the shared memory where $bx$ and $by$ are the dimensions of a thread block. Then, in order to optimize the memory accesses on GPUs, the elements of the iterate vector $U$ are filled in the -cache texture memory (see~\cite{ch13:ref10}). In new GPU generations as Fermi +cache texture memory (see~\cite{ch13:ref10}). In new GPU hardware and software as Fermi or Kepler, the global memory accesses are always cached in L1 and L2 caches. -For example, for a given kernel, we can favour the use of the L1 cache to that +For example, for a given kernel, we can favor the use of the L1 cache to that of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+. So, the initial access to the global memory loads the vector elements required by the threads of a block into the cache memory (texture or L1/L2 caches). Then, all the following memory accesses read from this cache memory. In Listing~\ref{ch13:list:02}, the function \verb+fetch_double(v,i)+ is used to read from the texture memory -the $i^{th}$ element of the double-precision vector \verb+v+ (see Listing~\ref{ch13:list:03}). +the $ith$ element of the double-precision vector \verb+v+ (see Listing~\ref{ch13:list:03}). Moreover, the seven constant coefficients of matrix $A$ can be stored in the constant memory but, since they are reused $nz$ times by each thread, it is more -interesting to fill them on the low-latency registers of each thread. +efficient to fill them on the low-latency registers of each thread. -\lstinputlisting[label=ch13:list:03,caption=Memory access to the cache texture memory]{Chapters/chapter13/ex3.cu} +\pagebreak +\lstinputlisting[label=ch13:list:03,caption=memory access to the cache texture memory]{Chapters/chapter13/ex3.cu} -The function $Convergence()$ (line~$11$ in Algorithm~\ref{ch13:alg:02}) allows +The function $Convergence()$ (line~$11$ in Algorithm~\ref{ch13:alg:02}) allows us to detect the convergence of the parallel iterative algorithm and is based on -the tolerance threshold\index{Convergence!Tolerance~threshold} $\varepsilon$ -and the maximum number of relaxations\index{Convergence!Maximum~number~of~relaxations} +the tolerance threshold\index{convergence!tolerance threshold} $\varepsilon$ +and the maximum number of relaxations\index{convergence!maximum number of relaxations} $MaxRelax$. We take into account the number of relaxations since that of iterations cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{ch13:eq:13}) of a local iterate vector $U_i$ according to $F_i$. Then, counting the number @@ -585,7 +587,7 @@ of relaxations is possible in both synchronous and asynchronous cases. On the other hand, an iteration is the update of at least all vector components with $F_i$. -In the synchronous\index{Synchronous} algorithm, the global convergence is detected +In the synchronous\index{synchronous iterations} algorithm, the global convergence is detected when the maximal value of the absolute error, $error$, is sufficiently small and/or the maximum number of relaxations, $MaxRelax$, is reached, as follows: $$ @@ -596,15 +598,15 @@ AllReduce(error,\hspace{0.1cm}maxerror,\hspace{0.1cm}MAX); \\ conv \leftarrow true; \end{array} $$ -where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI~subroutines!Global} +where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI!global} \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local -absolute errors, $error$, of all computing nodes and $p$ (in Algorithm~\ref{ch13:alg:02}) +absolute errors, $error$, of all computing nodes, and $p$ (in Algorithm~\ref{ch13:alg:02}) is used as a counter of the local relaxations carried out by a computing node. In -the asynchronous\index{Asynchronous} algorithms, the global convergence is detected +the asynchronous\index{asynchronous iterations} algorithms, the global convergence is detected when all computing nodes locally converge. For this, we use a token ring architecture around which a boolean token travels, in one direction, from a computing node to another. Starting from node $0$, the boolean token is set to $true$ by node $i$ if the local -convergence is reached or to $false$ otherwise and, then, it is sent to node $i+1$. +convergence is reached or to $false$ otherwise, and then, it is sent to node $i+1$. Finally, the global convergence is detected when node $0$ receives from its neighbor node $S-1$, in the ring architecture, a token set to $true$. In this case, node $0$ sends a stop message (end of parallel solving) to all computing nodes in the cluster. @@ -615,54 +617,48 @@ sends a stop message (end of parallel solving) to all computing nodes in the clu %%--------------------------%% \section{Experimental tests on a GPU cluster} \label{ch13:sec:05} -The GPU cluster\index{GPU~cluster} of tests, that we used in this chapter, is an $20Gbps$ +The GPU cluster\index{GPU!cluster} of tests that we used in this chapter is an $20GB/s$ Infiniband network of six machines. Each machine is a Quad-Core Xeon E5530 CPU running at $2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it -is equipped with two Nvidia Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores +is equipped with two NVIDIA Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores running at $1.3$GHz. It provides $4$GB of global memory with a memory bandwidth of $102$GB/s, accessible by all its cores and also by the CPU through the PCI-Express 16x Gen 2.0 interface with a throughput of $8$GB/s. Hence, the memory copy operations between the GPU and the CPU are about $12$ times slower than those of the Tesla GPU memory. We have performed our simulations -on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{ch13:fig:05} describes +on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. Figure~\ref{ch12:fig:04} describes the components of the GPU cluster of tests. Linux cluster version 2.6.39 OS is installed on CPUs. C programming language is used for coding the parallel algorithms of the methods on both GPU cluster and CPU cluster. CUDA version 4.0~\cite{ch13:ref12} is used for programming GPUs, using CUBLAS library~\cite{ch13:ref8} -to deal with vector operations in GPUs and, finally, MPI functions of OpenMPI 1.3.3 are +to deal with vector operations in GPUs, and finally, MPI functions of OpenMPI 1.3.3 are used to carry out the synchronous and asynchronous communications between CPU cores. Indeed, -in our experiments, a computing node is managed by a MPI process and it is composed of +in our experiments, a computing node is managed by one MPI process and it is composed of one CPU core and one GPU card. All experimental results of the parallel projected Richardson algorithms are obtained from simulations made in double precision data. The obstacle problems to be solved are defined in constant three-dimensional domain $\Omega\subset\mathbb{R}^{3}$. The numerical -values of the parameters of the obstacle problems are: $\eta=0.2$, $c=1.1$, $f$ is computed -by formula~(\ref{ch13:eq:18}) and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$) +values of the parameters of the obstacle problems are $\eta=0.2$, $c=1.1$, $f$ is computed +by formula~(\ref{ch13:eq:18}), and final time $T=0.02$. Moreover, three time steps ($NbSteps=3$) are computed with $k=0.0066$. As the discretization matrix is constant along the time steps, the convergence properties of the iterative algorithms do not change. Thus, the performance characteristics obtained with three time steps will still be valid for more time steps. The initial function $u(0,x,y,z)$ of the obstacle problem~(\ref{ch13:eq:01}) is set to $0$, with a constraint $u\geq\phi=0$. The relaxation parameter $\gamma$ used by the projected Richardson method is computed automatically thanks to the diagonal entries -of the discretization matrix. The formula and its proof can be found in~\cite{ch13:ref11}, -Section~2.3. The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the +of the discretization matrix. The formula and its proof can be found in~\cite{ch13:ref11}. +The convergence tolerance threshold $\varepsilon$ is set to $1e$-$04$ and the maximum number of relaxations is limited to $10^{6}$ relaxations. Finally, the number of threads per block is set to $256$ threads, which gives, in general, good performances for most GPU applications. We have performed some tests for the execution configurations and -we have noticed that the best configuration of the $256$ threads per block is an organization +have noticed that the best configuration of the $256$ threads per block is an organization into two dimensions of sizes $(64,4)$. -\begin{figure} -\centerline{\includegraphics[scale=0.25]{Chapters/chapter13/figures/cluster}} -\caption{GPU cluster of tests composed of 12 computing nodes (six machines, each with two GPUs.} -\label{ch13:fig:05} -\end{figure} - The performance measures that we took into account are the execution times and the number of relaxations performed by the parallel iterative algorithms, both synchronous and asynchronous versions, on the GPU and CPU clusters. These algorithms are used for solving nonlinear systems -derived from the discretization of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$ +derived from the discretization of obstacle problems of sizes $256^{3}$, $512^{3}$, $768^{3}$, and $800^{3}$. In Table~\ref{ch13:tab:01} and Table~\ref{ch13:tab:02}, we show the performances of the parallel synchronous and asynchronous algorithms of the projected Richardson method implemented, respectively, on a cluster of $24$ CPU cores and on a cluster of $12$ GPUs. In @@ -670,14 +666,14 @@ these tables, the execution time defines the time spent by the slowest computing number of relaxations is computed as the summation of those carried out by all computing nodes. In the sixth column of Table~\ref{ch13:tab:01} and in the eighth column of Table~\ref{ch13:tab:02}, -we give the gains in $\%$ obtained by using an asynchronous algorithm compared to a synchronous +we give the gains in percentage obtained by using an asynchronous algorithm compared to a synchronous one. We can notice that the asynchronous version on CPU and GPU clusters is slightly faster than the synchronous one for both methods. Indeed, the cluster of tests is composed of local and homogeneous nodes communicating via low-latency connections. So, in the case of distant and/or heterogeneous -nodes (or even with geographically distant clusters) the asynchronous version would be faster than +nodes (or even with geographically distant clusters), the asynchronous version would be faster than the synchronous one. However, the gains obtained on the GPU cluster are better than those obtained on the CPU cluster. In fact, the computation times are reduced by accelerating the computations on -GPUs while the communication times still unchanged. +GPUs while the communication times remain unchanged. \begin{table} \centering @@ -685,7 +681,7 @@ GPUs while the communication times still unchanged. \hline \multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-5} - & $\mathbf{T_{cpu}}$ & {\bf \#relax.} & $\mathbf{T_{cpu}}$ & {\bf \#relax.} & \\ \hline \hline + & $\mathbf{T_{cpu}}$ & {\bf \# Relax.} & $\mathbf{T_{cpu}}$ & {\bf \# Relax.} & \\ \hline \hline $256^{3}$ & $575.22$ & $198,288$ & $539.25$ & $198,613$ & $6.25$ \\ \hline \hline @@ -702,11 +698,12 @@ $800^{3}$ & $222,108.09$ & $1,769,232$ & $188,790 \begin{table} \centering +\begin{scriptsize} \begin{tabular}{|c|c|c|c|c|c|c|c|} \hline \multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous} & \multicolumn{3}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-7} - & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{\tau}$ & \\ \hline \hline + & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & $\mathbf{\tau}$ & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & $\mathbf{\tau}$ & \\ \hline \hline $256^{3}$ & $29.67$ & $100,692$ & $19.39$ & $18.00$ & $94,215$ & $29.96$ & $39.33$ \\\hline \hline @@ -716,6 +713,7 @@ $768^{3}$ & $4,112.68$ & $831,144$ & $50.13$ $800^{3}$ & $3,950.87$ & $899,088$ & $56.22$ & $3,636.57$ & $834,900$ & $51.91$ & $7.95$ \\ \hline \end{tabular} +\end{scriptsize} \vspace{0.5cm} \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 12 GPUs.} \label{ch13:tab:02} @@ -723,16 +721,16 @@ $800^{3}$ & $3,950.87$ & $899,088$ & $56.22$ The fourth and seventh columns of Table~\ref{ch13:tab:02} show the relative gains obtained by executing the parallel algorithms on the cluster of $12$ GPUs instead -on the cluster of $24$ CPU cores. We compute the relative gain\index{Relative~gain} +on the cluster of $24$ CPU cores. We compute the relative gain\index{relative gain} $\tau$ as a ratio between the execution time $T_{cpu}$ spent on the CPU cluster over that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see from these ratios that solving large obstacle problems is faster on the GPU cluster than on the CPU cluster. Indeed, the GPUs are more efficient than their counterpart -CPUs to execute large data-parallel operations. In addition, the projected Richardson -method is implemented as a fixed point-based iteration and uses the Jacobi vector updates -that allow a well thread-parallelization on GPUs, such that each GPU thread is in charge +CPUs at executing large data-parallel operations. In addition, the projected Richardson +method is implemented as a fixed point based iteration and uses the Jacobi vector updates +that allow a well-suited thread-parallelization on GPUs, such that each GPU thread is in charge of one vector component at a time without being dependent on other vector components -computed by other threads. Then, this allow to exploit at best the high performance +computed by other threads. Then, this allows us to exploit at best the high performance computing of the GPUs by using all the GPU resources and avoiding the idle cores. Finally, the number of relaxations performed by the parallel synchronous algorithm @@ -747,45 +745,45 @@ consequently it also depends on the number of computing nodes. %%--------------------------%% %% SECTION 6 %% %%--------------------------%% -\section{Red-Black ordering technique} +\section{Red-black ordering technique} \label{ch13:sec:06} -As is well-known, the Jacobi method\index{Iterative~method!Jacobi} is characterized -by a slow convergence\index{Convergence} rate compared to some iterative methods\index{Iterative~method} -(for example Gauss-Seidel method\index{Iterative~method!Gauss-Seidel}). So, in this +As is well-known, the Jacobi method\index{iterative method!Jacobi} is characterized +by a slow convergence\index{convergence} rate compared to some iterative methods\index{iterative method} +(for example, Gauss-Seidel method\index{iterative method!Gauss-Seidel}). So, in this section, we present some solutions to reduce the execution time and the number of relaxations and, more specifically, to speed up the convergence of the parallel projected Richardson method on the GPU cluster. We propose to use the point red-black -ordering technique\index{Iterative~method!Red-Black~ordering} to accelerate the +ordering technique\index{iterative method!red-black ordering} to accelerate the convergence. This technique is often used to increase the parallelism of iterative methods for solving linear systems~\cite{ch13:ref13,ch13:ref14,ch13:ref15}. We apply it to the projected Richardson method as a compromise between the Jacobi and Gauss-Seidel iterative methods. The general principle of the red-black technique is as follows. Let $t$ be the -summation of the integer $x$-, $y$- and $z$-coordinates of a vector element $u(x,y,z)$ +summation of the integer $x$-, $y$-, and $z$-coordinates of a vector element $u(x,y,z)$ on a three-dimensional domain: $t=x+y+z$. As is shown in Figure~\ref{ch13:fig:06.01}, -the red-black ordering technique consists in the parallel computing of the red -vector elements having even value $t$ by using the values of the black ones then, +the red-black ordering technique consists of the parallel computing of the red +vector elements having even value $t$ by using the values of the black ones, then the parallel computing of the black vector elements having odd values $t$ by using the new values of the red ones. This technique can be implemented on the GPU in two different manners: \begin{itemize} -\item among all launched threads ($NX\times ny$ threads), only one thread out of two computes its red or black vector element at a time or, -\item all launched threads (on average half of $NX\times ny$ threads) compute the red vector elements first and, then, the black ones. +\item among all launched threads ($NX\times ny$ threads), only one thread out of two computes its red or black vector element at a time or +\item all launched threads (on average half of $NX\times ny$ threads) compute the red vector elements first, and then the black ones. \end{itemize} However, in both solutions, for each memory transaction, only half of the memory segment addressed by a half-warp is used. So, the computation of the red and black -vector elements leads to use twice the initial number of memory transactions. Then, -we apply the point red-black ordering\index{Iterative~method!Red-Black~ordering} +vector elements leads to using twice the initial number of memory transactions. Then, +we apply the point red-black ordering\index{iterative method!red-black ordering} accordingly to the $y$-coordinate, as is shown in Figure~\ref{ch13:fig:06.02}. In this case, the vector elements having even $y$-coordinate are computed in parallel using the values of those having odd $y$-coordinate and then vice-versa. Moreover, in the GPU implementation of the parallel projected Richardson method (Section~\ref{ch13:sec:04}), -we have shown that a sub-problem of size $(NX\times ny\times nz)$ is decomposed into +we have shown that a subproblem of size $(NX\times ny\times nz)$ is decomposed into $nz$ grids of size $(NX\times ny)$. Then, each kernel is executed in parallel by $NX\times ny$ GPU threads, so that each thread is in charge of $nz$ vector elements -along $z$-axis (one vector element in each grid of the sub-problem). So, we propose +along the $z$-axis (one vector element in each grid of the subproblem). So, we propose to use the new values of the vector elements computed in grid $i$ to compute those of the vector elements in grid $i+1$. Listing~\ref{ch13:list:04} describes the kernel of the matrix-vector multiplication and the kernel of the vector elements updates of @@ -793,11 +791,12 @@ the parallel projected Richardson method using the red-black ordering technique. \begin{figure} \centering - \mbox{\subfigure[Red-Black ordering on x, y and z axises]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir}\label{ch13:fig:06.01}}\quad - \subfigure[Red-Black ordering on y axis]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir-y}\label{ch13:fig:06.02}}} -\caption{Red-Black ordering for computing the iterate vector elements in a three-dimensional space.} + \mbox{\subfigure[Red-black ordering on x, y, and z axises]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir_clair}\label{ch13:fig:06.01}}\quad + \subfigure[Red-black ordering on y axis]{\includegraphics[width=2.3in]{Chapters/chapter13/figures/rouge-noir-y_clair}\label{ch13:fig:06.02}}} +\caption{Red-black ordering for computing the iterate vector elements in a three-dimensional space.} \end{figure} +\pagebreak \lstinputlisting[label=ch13:list:04,caption=GPU kernels of the projected Richardson method using the red-black technique]{Chapters/chapter13/ex4.cu} Finally, we exploit the concurrent executions between the host functions and the GPU @@ -814,8 +813,8 @@ neighboring CPUs and this in both synchronous and asynchronous cases. In Table~\ref{ch13:tab:03}, we report the execution times and the number of relaxations performed on a cluster of $12$ GPUs by the parallel projected Richardson algorithms; it -can be noted that the performances of the projected Richardson are improved by using the -point read-black ordering. We compare the performances of the parallel projected Richardson +can be noted that the performances of the projected Richardson algorithm are improved by using the +point red-black ordering. We compare the performances of the parallel projected Richardson method with and without this later ordering (Tables~\ref{ch13:tab:02} and~\ref{ch13:tab:03}). We can notice that both parallel synchronous and asynchronous algorithms are faster when they use the red-black ordering. Indeed, we can see in Table~\ref{ch13:tab:03} that the @@ -828,7 +827,7 @@ shown in Table~\ref{ch13:tab:02}. \hline \multirow{2}{*}{\bf Pb. size} & \multicolumn{2}{c|}{\bf Synchronous} & \multicolumn{2}{c|}{\bf Asynchronous} & \multirow{2}{*}{\bf Gain\%} \\ \cline{2-5} - & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & $\mathbf{T_{gpu}}$ & {\bf \#relax.} & \\ \hline \hline + & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & $\mathbf{T_{gpu}}$ & {\bf \# Relax.} & \\ \hline \hline $256^{3}$ & $18.37$ & $71,988$ & $12.58$ & $67,638$ & $31.52$ \\ \hline \hline @@ -839,7 +838,7 @@ $768^{3}$ & $2,773.65$ & $590,652$ & $2,222.2 $800^{3}$ & $2,748.23$ & $638,916$ & $2,502.61$ & $592,525$ & $8.92$ \\ \hline \end{tabular} \vspace{0.5cm} -\caption{Execution times in seconds of the parallel projected Richardson method using read-black ordering technique implemented on a cluster of 12 GPUs.} +\caption{Execution times in seconds of the parallel projected Richardson method using red-black ordering technique implemented on a cluster of 12 GPUs.} \label{ch13:tab:03} \end{table} @@ -854,18 +853,18 @@ the communication time of the parallel projected Richardson algorithms on a GPU cluster. The experimental tests are carried out on a cluster composed of one to ten Tesla GPUs. We have focused on the weak scaling of both parallel, synchronous and asynchronous, algorithms using the red-black ordering technique. For this, we -have fixed the size of a sub-problem to $256^{3}$ per computing node (a CPU core +have fixed the size of a subproblem to $256^{3}$ per computing node (a CPU core and a GPU). Then, Figure~\ref{ch13:fig:07} shows the number of relaxations performed, on average, per second by a computing node. We can see from this figure that the efficiency of the asynchronous algorithm is almost stable, while that of the synchronous -algorithm decreases (down to $81\%$ in this example) with the increasing of the +algorithm decreases (down to $81\%$ in this example) with the increase in the number of computing nodes on the cluster. This is due to the fact that the ratio between the time of the computation over that of the communication is reduced when the computations are performed on GPUs. Indeed, GPUs compute faster than CPUs and -communications are more time consuming. In this context, asynchronous algorithms -are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{Synchronous} +communications are more time-consuming. In this context, asynchronous algorithms +are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{synchronous iterations} algorithms might be more penalized by communications, as can be deduced from Figure~\ref{ch13:fig:07}. -That is why we think that asynchronous\index{Asynchronous} iterative algorithms +That is why we think that asynchronous\index{asynchronous iterations} iterative algorithms are all the more interesting in this case. @@ -880,7 +879,7 @@ spatial discretization of three-dimensional obstacle problems. For this, we have both synchronous and asynchronous algorithms of the Richardson iterative method using a projection on a convex set. Indeed, this method uses point-based iterations of the Jacobi method that are very easy to parallelize on parallel computers. We have shown that its adapted parallel -algorithms to GPU architectures allows to exploit at best the computing power of the GPUs and +algorithms to GPU architectures allow us to exploit at best the computing power of the GPUs and to accelerate the resolution of large nonlinear systems. Consequently, the experimental results have shown that solving nonlinear systems of large obstacle problems with this method is about fifty times faster on a cluster of $12$ GPUs than on a cluster of $24$ CPU cores. Moreover, @@ -890,7 +889,7 @@ performed on the cluster of $12$ GPUs are reduced on average of $32\%$. Afterwards, the experiments have shown that the asynchronous version is slightly more efficient than the synchronous one. In fact, the computations are accelerated by using GPUs while the communication -times still unchanged. In addition, we have studied the weak-scaling in the synchronous and asynchronous +times are still unchanged. In addition, we have studied the weak-scaling in the synchronous and asynchronous cases, which has confirmed that the ratio between the computations and the communications are reduced when using a cluster of GPUs. We highlight that asynchronous iterative algorithms are more scalable than synchronous ones. Therefore, we can conclude that asynchronous iterations are well suited to