new

[book_gpu.git] / BookGPU / Chapters / chapter13 / ch13.tex
diff --git a/BookGPU/Chapters/chapter13/ch13.tex b/BookGPU/Chapters/chapter13/ch13.tex

index 096f676687ec183102b7ad11ae4bf5f2a13f2d42..db2ab78b8b97321189b32af7ee3d556ccd772cbe 100755 (executable)
--- a/BookGPU/Chapters/chapter13/ch13.tex
+++ b/BookGPU/Chapters/chapter13/ch13.tex
@@ -67,8 +67,8 @@ three-dimensional domain. This model is based on that presented in~\cite{ch13:re
  %%*******************
  \subsection{Mathematical model}
  \label{ch13:sec:02.01}
-An obstacle problem\index{Obstacle~problem}, arising for example in mechanics or financial
-derivatives, consists of solving a time-dependent nonlinear equation\index{Nonlinear}:
+An obstacle problem\index{obstacle problem}, arising for example in mechanics or financial
+derivatives, consists of solving a time-dependent nonlinear equation\index{nonlinear}:
  \begin{equation}
  \left\{
  \begin{array}{l}
@@ -92,7 +92,7 @@ or the Neumann condition (where the normal derivative of $u$ is fixed on $\parti
  
  The time-dependent equation~(\ref{ch13:eq:01}) is numerically solved by considering an
  implicit or a semi-implicit time marching, where at each time step $k$ a stationary nonlinear
-problem\index{Nonlinear} is solved:
+problem\index{nonlinear} is solved:
  \begin{equation}
  \left\{
  \begin{array}{l}
@@ -116,7 +116,7 @@ problem~(\ref{ch13:eq:02}) does not provide a symmetric matrix, because the
  convection-diffusion operator is not self-adjoint. Moreover, the fact that
  the operator is self-adjoint or not plays an important role in the choice
  of the appropriate algorithm for solving nonlinear systems derived from the
-discretization of the obstacle problem\index{Obstacle~problem}. Nevertheless,
+discretization of the obstacle problem\index{obstacle problem}. Nevertheless,
  since the convection coefficients arising in the operator~(\ref{ch13:eq:02})
  are constant, we can formulate the same problem by self-adjoint operator
  by performing a classical change of variables. Then, we can replace the stationary
@@ -134,8 +134,8 @@ where $b=\{b_{1},b_{2},b_{3}\}$, $\|b\|_{2}$ denotes the Euclidean norm of $b$,
  $v=e^{-a}.u$ represents the general change of variables such that $a=\frac{b^{t}(x,y,z)}{2\eta}$.
  Consequently, the numerical resolution of the diffusion problem (the self-adjoint
  operator~(\ref{ch13:eq:04})) is done by optimization algorithms, in contrast to that
-of the convection-diffusion problem (non self-adjoint operator~(\ref{ch13:eq:03}))
-which is done by relaxation algorithms. In the case of our studied algorithm, the convergence\index{Convergence}
+of the convection-diffusion problem (nonself-adjoint operator~(\ref{ch13:eq:03}))
+which is done by relaxation algorithms. In the case of our studied algorithm, the convergence\index{convergence}
  is ensured by M-matrix property; then, the performance is linked to the magnitude of
  the spectral radius of the iteration matrix, which is independent of the condition
  number.
@@ -176,15 +176,15 @@ an M-matrix. This property is important to the convergence of iterative methods.
  Owing to the large size of the previous discrete complementary problem~(\ref{ch13:eq:05}),
  we will solve it by parallel synchronous or asynchronous iterative algorithms (see~\cite{ch13:ref3,ch13:ref4,ch13:ref5}).
  In this chapter, we aim at harnessing the computing power of GPU clusters for solving these
-large nonlinear systems\index{Nonlinear}. Then, we choose to use the projected Richardson
-iterative method\index{Iterative~method!Projected~Richardson} for solving the diffusion
+large nonlinear systems\index{nonlinear}. Then, we choose to use the projected Richardson
+iterative method\index{iterative method!projected Richardson} for solving the diffusion
  problem~(\ref{ch13:eq:04}). Indeed, this method is based on the iterations of the Jacobi
-method\index{Iterative~method!Jacobi}, which are easy to parallelize on parallel computers
+method\index{iterative method!Jacobi}, which are easy to parallelize on parallel computers
  and easy to adapt to GPU architectures. Then, according to the boundary value problem
  formulation with a self-adjoint operator~(\ref{ch13:eq:04}), we can consider here the
  equivalent optimization problem and the fixed point mapping associated to its solution.
  
-Assume that $E=\mathbb{R}^{M}$ is a Hilbert space\index{Hilbert~space}, in which $\scalprod{.}{.}$
+Assume that $E=\mathbb{R}^{M}$ is a Hilbert space\index{Hilbert space}, in which $\scalprod{.}{.}$
  is the scalar product and $\|.\|$ its associated norm. So, the general fixed point problem
  to be solved is defined as follows:
  \begin{equation}
@@ -203,7 +203,7 @@ Let $K$ be a closed convex set defined by
  K = \{U | U \geq \Phi \mbox{~everywhere in~} E\},
  \label{ch13:eq:07}
  \end{equation}
-where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{ch13:eq:05})\index{Obstacle~problem}
+where $\Phi$ is the discrete obstacle function. In fact, the obstacle problem~(\ref{ch13:eq:05})\index{obstacle problem}
  is formulated as the following constrained optimization problem:
  \begin{equation}
  \left\{
@@ -224,7 +224,7 @@ is a symmetric positive definite, and $A$ is the discretization matrix associate
  self-adjoint operator~(\ref{ch13:eq:04}) after change of variables.
  
  For any $U\in E$; let $P_K(U)$ be the projection of $U$ on $K$. For any $\gamma\in\mathbb{R}$,
-$\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected Richardson method\index{Iterative~method!Projected~Richardson}
+$\gamma>0$, the fixed point mapping $F_{\gamma}$ of the projected Richardson method\index{iterative method!projected Richardson}
  is defined as follows:
  \begin{equation}
  U^{*} = F_{\gamma}(U^{*}) = P_K(U^{*} - \gamma(\mathcal{A}.U^{*} - G)).
@@ -237,7 +237,7 @@ of the projected Richardson method~\cite{ch13:ref6}.
  
  Let $\alpha\in\mathbb{N}$ be a positive integer. We consider that the space $E=\displaystyle\prod_{i=1}^{\alpha} E_i$
  is a product of $\alpha$ subspaces $E_i$ where $i\in\{1,\ldots,\alpha\}$. Note that $E_i=\mathbb{R}^{m_i}$,
-where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space\index{Hilbert~space}
+where $\displaystyle\sum_{i=1}^{\alpha} m_{i}=M$, is also a Hilbert space\index{Hilbert space}
  in which $\scalprod{.}{.}_i$ denotes the scalar product and $|.|_i$ the associated norm, for
  all $i\in\{1,\ldots,\alpha\}$. Then, for all $u,v\in E$, $\scalprod{u}{v}=\displaystyle\sum_{i=1}^{\alpha}\scalprod{u_i}{v_i}_i$
  is the scalar product on $E$.
@@ -255,7 +255,7 @@ Assume that the convex set $K=\displaystyle\prod_{i=1}^{\alpha}K_{i}$, such that
  and $K_i$ is a closed convex set. Let also $G=(G_1,\ldots,G_{\alpha})\in E$; for any
  $U\in E$, $P_K(U)=(P_{K_1}(U_1),\ldots,P_{K_{\alpha}}(U_{\alpha}))$ is the projection of $U$
  on $K$ where $\forall i\in\{1,\ldots,\alpha\},P_{K_i}$ is the projector from $E_i$ onto
-$K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{ch13:eq:10})\index{Iterative~method!Projected~Richardson}
+$K_i$. So, the fixed point mapping of the projected Richardson method~(\ref{ch13:eq:10})\index{iterative method!projected Richardson}
  can be written in the following way:
  \begin{equation}
  \forall U\in E\mbox{,~}\forall i\in\{1,\ldots,\alpha\}\mbox{,~}F_{i,\gamma}(U) = P_{K_i}(U_i - \gamma(\mathcal{A}_i.U - G_i)).
@@ -299,12 +299,12 @@ and $\forall j\in\{1,\ldots,\alpha\}$,
  \label{ch13:eq:15}
  \end{equation}
  
-The previous asynchronous scheme\index{Asynchronous} of the projected Richardson
+The previous asynchronous scheme\index{asynchronous iterations} of the projected Richardson
  method models computations that are carried out in parallel without order or
  synchronization (according to the behavior of the parallel iterative method) and
  describes a subdomain method without overlapping. It is a general model that takes
  into account all possible situations of parallel computations and nonblocking message
-passing. So, the synchronous iterative scheme\index{Synchronous} is defined by
+passing. So, the synchronous iterative scheme\index{synchronous iterations} is defined by
  \begin{equation}
  \forall j\in\{1,\ldots,\alpha\} \mbox{,~} \forall p\in\mathbb{N} \mbox{,~} \rho_j(p)=p.
  \label{ch13:eq:16}
@@ -322,12 +322,12 @@ each component of the vector must be relaxed an infinite number of times. The ch
  relaxed components to be used in the computational process may be guided by any criterion,
  and in particular, a natural criterion is to pickup the most recently available
  values of the components computed by the processors. Furthermore, the asynchronous
-iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI~subroutines!Nonblocking}
+iterations are implemented by means of nonblocking MPI communication subroutines\index{MPI!nonblocking}
  (asynchronous communications).
  
  The important property ensuring the convergence of the parallel projected Richardson
  method, both synchronous and asynchronous algorithms, is the fact that $\mathcal{A}$
-is an M-matrix. Moreover, the convergence\index{Convergence} proceeds from a result
+is an M-matrix. Moreover, the convergence\index{convergence} proceeds from a result
  of~\cite{ch13:ref6}. Indeed, there exists a value $\gamma_0>0$, such that $\forall\gamma\in ]0,\gamma_0[$,
  the parallel iterations~(\ref{ch13:eq:13}), (\ref{ch13:eq:14}), and~(\ref{ch13:eq:15}),
  associated to the fixed point mapping $F_\gamma$~(\ref{ch13:eq:12}), converge to the
@@ -348,7 +348,7 @@ of data, at each iteration between the GPU computing nodes, can be either synchr
  or asynchronous using the MPI communication subroutines, whereas inside each GPU node,
  a CUDA parallelization is performed.
  
-Let $S$ denote the number of computing nodes\index{Computing~node} on the GPU cluster,
+Let $S$ denote the number of computing nodes\index{computing node} on the GPU cluster,
  where a computing node is composed of CPU core holding one MPI process and a GPU card.
  So, before starting computations, the obstacle problem of size $(NX\times NY\times NZ)$
  is split into $S$ parallelepipedic subproblems, each for a node (MPI process, GPU), as
@@ -366,14 +366,14 @@ exchanges at subdomain boundaries compared to a naive $z$-axis-wise partitioning
  
  All the computing nodes of the GPU cluster execute in parallel the Algorithm~\ref{ch13:alg:01}
  on a three-dimensional subproblems of size $(NX\times ny\times nz)$.
-This algorithm gives the main key points for solving an obstacle problem\index{Obstacle~problem}
+This algorithm gives the main key points for solving an obstacle problem\index{obstacle problem}
  defined in a three-dimensional domain, where $A$ is the discretization matrix, $G$
  is the right-hand side, and $U$ is the solution vector. After the initialization step,
  all the data generated from the partitioning operation are copied from the CPU memories
  to the GPU global memories to be processed on the GPUs. Next, the algorithm uses $NbSteps$
  time steps to solve the global obstacle problem. In fact, it uses a parallel algorithm
  adapted to GPUs from the projected Richardson iterative method for solving the nonlinear
-systems\index{Nonlinear} of the obstacle problem. This function is defined by {\it Solve()}
+systems\index{nonlinear} of the obstacle problem. This function is defined by {\it Solve()}
  in Algorithm~\ref{ch13:alg:01}. At every time step, the initial guess $U^0$ for the iterative
  algorithm is set to the solution found at the previous time step. Moreover, the right-hand
  side $G$ is computed as follows: \[G = \frac{1}{k}.U^{prev} + F\] where $k$ is the time step,
@@ -419,12 +419,12 @@ Copy the solution $U$ back from GPU memory\;
  \end{algorithm}
  
  As are many other iterative methods, the algorithm of the projected Richardson
-method\index{Iterative~method!Projected~Richardson} is based on algebraic
+method\index{iterative method!projected Richardson} is based on algebraic
  functions operating on vectors and/or matrices, which are more efficient on
  parallel computers when they work on large vectors. Its parallel implementation
  on the GPU cluster is carried out so that the GPUs execute the vector operations
  as kernels and the CPUs execute the serial codes, supervise the kernel executions
-and the data exchanges with the neighboring nodes\index{Neighboring~node}, and
+and the data exchanges with the neighboring nodes\index{neighboring node}, and
  supply the GPUs with data. Algorithm~\ref{ch13:alg:02} shows the main key points
  of the parallel iterative algorithm (function $Solve()$ in Algorithm~\ref{ch13:alg:01}).
  All the vector operations inside the main loop ({\bf repeat} ... {\bf until})
@@ -477,7 +477,7 @@ The first operation of this function is implemented as kernels to be performed b
  \end{itemize}
  As mentioned previously, we develop the \emph{synchronous} and \emph{asynchronous}
  algorithms of the projected Richardson method. Obviously, in this scope, the
-synchronous\index{Synchronous} or asynchronous\index{Asynchronous} communications
+synchronous\index{synchronous iterations} or asynchronous\index{asynchronous iterations} communications
  refer to the communications between the CPU cores (MPI processes) on the GPU cluster,
  in order to exchange the vector elements associated to subdomain boundaries. For
  the memory copies between a CPU core and its GPU, we use the synchronous communication
@@ -486,19 +486,19 @@ in the synchronous algorithm and the asynchronous ones: \verb+cublasSetVectorAsy
  and \verb+cublasGetVectorAsync()+ in the asynchronous algorithm. Moreover, we
  use the communication routines of the MPI library to carry out the data exchanges
  between the neighboring nodes. We use the following communication routines: \verb+MPI_Isend()+
-and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI~subroutines!Nonblocking}
+and \verb+MPI_Irecv()+ to perform nonblocking\index{MPI!nonblocking}
  sends and receives, respectively. For the synchronous algorithm, we use the MPI
  routine \verb+MPI_Waitall()+ which puts the MPI process of a computing node in
  blocking status until all data exchanges with neighboring nodes (sends and receives)
  are completed. In contrast, for the asynchronous algorithms, we use the MPI routine
  \verb+MPI_Test()+ which tests the completion of a data exchange (send or receives)
-without putting the MPI process in blocking status\index{MPI~subroutines!Blocking}.   
+without putting the MPI process in blocking status\index{MPI!blocking}.   
  
  The function $Compute\_New\_Vector\_Elements()$ (line~$6$ in Algorithm~\ref{ch13:alg:02})
  computes, at each iteration, the new elements of the iterate vector $U$. Its general code
  is presented in Listing~\ref{ch13:list:01} (CPU function). The iterations of the projected
-Richardson method\index{Iterative~method!Projected~Richardson}, based on those of the Jacobi
-method\index{Iterative~method!Jacobi}, are defined as follows: 
+Richardson method\index{iterative method!projected Richardson}, based on those of the Jacobi
+method\index{iterative method!Jacobi}, are defined as follows: 
  \begin{equation}
  \begin{array}{ll}
  u^{p+1}(x,y,z) =& \frac{1}{Center}(g(x,y,z) - (Center\cdot u^{p}(x,y,z) + \\
@@ -578,8 +578,8 @@ efficient to fill them on the low-latency registers of each thread.
  
  The function $Convergence()$ (line~$11$ in Algorithm~\ref{ch13:alg:02}) allows us
  to detect the convergence of the parallel iterative algorithm and is based on
-the tolerance threshold\index{Convergence!Tolerance~threshold} $\varepsilon$
-and the maximum number of relaxations\index{Convergence!Maximum~number~of~relaxations}
+the tolerance threshold\index{convergence!tolerance threshold} $\varepsilon$
+and the maximum number of relaxations\index{convergence!maximum number of relaxations}
  $MaxRelax$. We take into account the number of relaxations since that of iterations
  cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{ch13:eq:13})
  of a local iterate vector $U_i$ according to $F_i$. Then, counting the number
@@ -587,7 +587,7 @@ of relaxations is possible in both synchronous and asynchronous cases. On the
  other hand, an iteration is the update of at least all vector components with
  $F_i$.
  
-In the synchronous\index{Synchronous} algorithm, the global convergence is detected
+In the synchronous\index{synchronous iterations} algorithm, the global convergence is detected
  when the maximal value of the absolute error, $error$, is sufficiently small and/or
  the maximum number of relaxations, $MaxRelax$, is reached, as follows:
  $$
@@ -598,11 +598,11 @@ AllReduce(error,\hspace{0.1cm}maxerror,\hspace{0.1cm}MAX); \\
  conv \leftarrow true;
  \end{array}
  $$
-where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI~subroutines!Global}
+where the function $AllReduce()$ uses the MPI global reduction subroutine\index{MPI!global}
  \verb+MPI_Allreduce()+ to compute the maximal value, $maxerror$, among the local
  absolute errors, $error$, of all computing nodes, and $p$ (in Algorithm~\ref{ch13:alg:02})
  is used as a counter of the local relaxations carried out by a computing node. In
-the asynchronous\index{Asynchronous} algorithms, the global convergence is detected
+the asynchronous\index{asynchronous iterations} algorithms, the global convergence is detected
  when all computing nodes locally converge. For this, we use a token ring architecture
  around which a boolean token travels, in one direction, from a computing node to another.
  Starting from node $0$, the boolean token is set to $true$ by node $i$ if the local
@@ -617,7 +617,7 @@ sends a stop message (end of parallel solving) to all computing nodes in the clu
  %%--------------------------%%
  \section{Experimental tests on a GPU cluster}
  \label{ch13:sec:05}
-The GPU cluster\index{GPU~cluster} of tests that we used in this chapter is an $20GB/s$
+The GPU cluster\index{GPU!cluster} of tests that we used in this chapter is an $20GB/s$
  Infiniband network of six machines. Each machine is a Quad-Core Xeon E5530 CPU running at
  $2.4$GHz. It provides a RAM memory of $12$GB with a memory bandwidth of $25.6$GB/s and it
  is equipped with two NVIDIA Tesla C1060 GPUs. A Tesla GPU contains in total $240$ cores
@@ -698,6 +698,7 @@ $800^{3}$                     & $222,108.09$       & $1,769,232$      & $188,790
  
  \begin{table}
  \centering
+\begin{scriptsize}
  \begin{tabular}{|c|c|c|c|c|c|c|c|}
  \hline
  \multirow{2}{*}{\bf Pb. size} & \multicolumn{3}{c|}{\bf Synchronous}                 & \multicolumn{3}{c|}{\bf Asynchronous}                & \multirow{2}{*}{\bf Gain\%}  \\ \cline{2-7}
@@ -712,6 +713,7 @@ $768^{3}$                    & $4,112.68$         & $831,144$        & $50.13$
  
  $800^{3}$                    & $3,950.87$         & $899,088$        & $56.22$         & $3,636.57$        & $834,900$        & $51.91$     & $7.95$ \\ \hline 
  \end{tabular}
+\end{scriptsize}
  \vspace{0.5cm}
  \caption{Execution times in seconds of the parallel projected Richardson method implemented on a cluster of 12 GPUs.}
  \label{ch13:tab:02}
@@ -719,7 +721,7 @@ $800^{3}$                    & $3,950.87$         & $899,088$        & $56.22$
  
  The fourth and seventh columns of Table~\ref{ch13:tab:02} show the relative gains
  obtained by executing the parallel algorithms on the cluster of $12$ GPUs instead
-on the cluster of $24$ CPU cores. We compute the relative gain\index{Relative~gain}
+on the cluster of $24$ CPU cores. We compute the relative gain\index{relative gain}
  $\tau$ as a ratio between the execution time $T_{cpu}$ spent on the CPU cluster over
  that $T_{gpu}$ spent on the GPU cluster: \[\tau=\frac{T_{cpu}}{T_{gpu}}.\] We can see
  from these ratios that solving large obstacle problems is faster on the GPU cluster
@@ -745,13 +747,13 @@ consequently it also depends on the number of computing nodes.
  %%--------------------------%%
  \section{Red-black ordering technique}
  \label{ch13:sec:06}
-As is wellknown, the Jacobi method\index{Iterative~method!Jacobi} is characterized
-by a slow convergence\index{Convergence} rate compared to some iterative methods\index{Iterative~method}
-(for example, Gauss-Seidel method\index{Iterative~method!Gauss-Seidel}). So, in this
+As is well known, the Jacobi method\index{iterative method!Jacobi} is characterized
+by a slow convergence\index{convergence} rate compared to some iterative methods\index{iterative method}
+(for example, Gauss-Seidel method\index{iterative method!Gauss-Seidel}). So, in this
  section, we present some solutions to reduce the execution time and the number of
  relaxations and, more specifically, to speed up the convergence of the parallel
  projected Richardson method on the GPU cluster. We propose to use the point red-black
-ordering technique\index{Iterative~method!Red-Black~ordering} to accelerate the
+ordering technique\index{iterative method!red-black ordering} to accelerate the
  convergence. This technique is often used to increase the parallelism of iterative
  methods for solving linear systems~\cite{ch13:ref13,ch13:ref14,ch13:ref15}. We
  apply it to the projected Richardson method as a compromise between the Jacobi
@@ -773,10 +775,10 @@ This technique can be implemented on the GPU in two different manners:
  However, in both solutions, for each memory transaction, only half of the memory
  segment addressed by a half-warp is used. So, the computation of the red and black
  vector elements leads to using twice the initial number of memory transactions. Then,
-we apply the point red-black ordering\index{Iterative~method!Red-Black~ordering}
+we apply the point red-black ordering\index{iterative method!red-black ordering}
  accordingly to the $y$-coordinate, as is shown in Figure~\ref{ch13:fig:06.02}. In
  this case, the vector elements having even $y$-coordinate are computed in parallel
-using the values of those having odd $y$-coordinate and then viceversa. Moreover,
+using the values of those having odd $y$-coordinate and then vice-versa. Moreover,
  in the GPU implementation of the parallel projected Richardson method (Section~\ref{ch13:sec:04}),
  we have shown that a subproblem of size $(NX\times ny\times nz)$ is decomposed into
  $nz$ grids of size $(NX\times ny)$. Then, each kernel is executed in parallel by
@@ -860,9 +862,9 @@ number of computing nodes on the cluster. This is due to the fact that the ratio
  between the time of the computation over that of the communication is reduced when
  the computations are performed on GPUs. Indeed, GPUs compute faster than CPUs and
  communications are more time-consuming. In this context, asynchronous algorithms
-are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{Synchronous}
+are more scalable than synchronous ones. So, with large scale GPU clusters, synchronous\index{synchronous iterations}
  algorithms might be more penalized by communications, as can be deduced from Figure~\ref{ch13:fig:07}.
-That is why we think that asynchronous\index{Asynchronous} iterative algorithms
+That is why we think that asynchronous\index{asynchronous iterations} iterative algorithms
  are all the more interesting in this case.