new

[book_gpu.git] / BookGPU / Chapters / chapter14 / ch14.tex
diff --git a/BookGPU/Chapters/chapter14/ch14.tex b/BookGPU/Chapters/chapter14/ch14.tex

index 1acb0807d6fbee3870c139747cad7e500e2815f8..2ce9331300ab9c7fbaa4e4182c9edf49f10b96c0 100755 (executable)
--- a/BookGPU/Chapters/chapter14/ch14.tex
+++ b/BookGPU/Chapters/chapter14/ch14.tex
@@ -13,14 +13,14 @@ The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overv
  dynamics problems.  It provides a way to solve the incompressible
  isothermal Navier-Stokes equations and has the attractive features of
  being both explicit in time and local in space. This makes the LB
  dynamics problems.  It provides a way to solve the incompressible
  isothermal Navier-Stokes equations and has the attractive features of
  being both explicit in time and local in space. This makes the LB
-method wellsuited to parallel computation. Many efficient parallel
+method well suited to parallel computation. Many efficient parallel
  implementations of the LB method have been undertaken, typically using
  a combination of distributed domain decomposition and the Message
  Passing Interface (MPI). However, the potential
  performance benefits offered by GPUs has motivated a new ``mixed-mode''
  approach to address very large problems. Here, fine-grained
  parallelism is implemented on the GPU, while MPI is reserved for
  implementations of the LB method have been undertaken, typically using
  a combination of distributed domain decomposition and the Message
  Passing Interface (MPI). However, the potential
  performance benefits offered by GPUs has motivated a new ``mixed-mode''
  approach to address very large problems. Here, fine-grained
  parallelism is implemented on the GPU, while MPI is reserved for
-largerscale parallelism.  This mixed mode is of increasing interest
+larger scale parallelism.  This mixed mode is of increasing interest
  to application programmers at a time when many supercomputing services
  are moving
  toward clusters of GPU accelerated nodes. The design questions which
  to application programmers at a time when many supercomputing services
  are moving
  toward clusters of GPU accelerated nodes. The design questions which
@@ -156,9 +156,9 @@ site, may be thought of as follows. A matrix-vector multiplication
  ${\cal M}_{ij}f_j$ is used to transform the distributions into the
  hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
  matrix related to the choice of
  ${\cal M}_{ij}f_j$ is used to transform the distributions into the
  hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
  matrix related to the choice of
-$\mathbf{c}_i$. The nonconserved hydrodynamic quantities are then
+$\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
  relaxed toward their (known) equilibrium values and are transformed
  relaxed toward their (known) equilibrium values and are transformed
-back to new postcollision distributions via the inverse transformation
+back to new post-collision distributions via the inverse transformation
  ${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
  floating point multiplications per lattice site. (Additional operations are
  required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
  ${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
  floating point multiplications per lattice site. (Additional operations are
  required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
@@ -193,12 +193,13 @@ Listing~\ref{ch14:listing1}: the fact that consecutive loads are from
  consecutive memory addresses allows the prefetcher to engage fully.
  (The temporary scalar \texttt{a\_tmp} allows caching of the
  intermediate accumulated value in the innermost loop.)  A variety of
  consecutive memory addresses allows the prefetcher to engage fully.
  (The temporary scalar \texttt{a\_tmp} allows caching of the
  intermediate accumulated value in the innermost loop.)  A variety of
-standard sequential optimisations are relevant for the collision stage
+standard sequential optimizations are relevant for the collision stage
  (loop unrolling, inclusion of SIMD operations, and so on
  \cite{wellein2006}).  For example, the collision stage in
  \textit{Ludwig} includes explicit SIMD code, which is useful if the
  (loop unrolling, inclusion of SIMD operations, and so on
  \cite{wellein2006}).  For example, the collision stage in
  \textit{Ludwig} includes explicit SIMD code, which is useful if the
-compiler on a given platform cannot identify it.  The propagation
-stage is separate and is organised as a ``pull'' (again, various
+compiler on a given platform cannot identify the appropriate vectorization.
+The propagation
+stage is separate and is organized as a ``pull'' (again, various
  optimizations have been considered, e.g.,
  \cite{pohl2003,mattila2007,wittmann2012}).  No further optimization is
  done here beyond ensuring that the ordering of the discrete velocities
  optimizations have been considered, e.g.,
  \cite{pohl2003,mattila2007,wittmann2012}).  No further optimization is
  done here beyond ensuring that the ordering of the discrete velocities
@@ -206,7 +207,7 @@ allows memory access to be as efficient as possible. While these
  optimizations are important, it should be remembered that for some
  complex fluid problems, the hydrodynamics embodied in the LB
  calculation is a relatively small part of the total computational cost
  optimizations are important, it should be remembered that for some
  complex fluid problems, the hydrodynamics embodied in the LB
  calculation is a relatively small part of the total computational cost
-(at the 10\%level in some cases).  This means optimization effort may
+(at the 10\% level in some cases).  This means optimization effort may
  be better concentrated elsewhere.
  
  
  be better concentrated elsewhere.
  
  
@@ -249,7 +250,7 @@ involving relevant data can take place, and so on. We note that only
  For the D3Q19 model, this reduces the volume of data traffic from 19 to
  5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
  version, the necessary transfers are implemented in place using
  For the D3Q19 model, this reduces the volume of data traffic from 19 to
  5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
  version, the necessary transfers are implemented in place using
-a vector of appropriately strided MPI datatypes for each direction.
+a vector of MPI datatypes with appropriate stride for each direction.
  
  
  \section{Single GPU implementation}\label{ch14:sec:singlegpu}
  
  
  \section{Single GPU implementation}\label{ch14:sec:singlegpu}
@@ -301,7 +302,7 @@ of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
  assigned to the fast {\it constant} on-chip device memory.
  
  For the propagation stage, the GPU implementation adds a second time
  assigned to the fast {\it constant} on-chip device memory.
  
  For the propagation stage, the GPU implementation adds a second time
-level of distribution values. The datadependencies inherent in the
+level of distribution values. The data dependencies inherent in the
  propagation mean that the in-place propagation of the CPU version
  cannot be parallelized effectively without the additional time level.
  As both time levels may remain resident on the GPU, this is not a
  propagation mean that the in-place propagation of the CPU version
  cannot be parallelized effectively without the additional time level.
  As both time levels may remain resident on the GPU, this is not a
@@ -350,7 +351,8 @@ complication is that halo transfers between GPUs must be staged
  through the host (in the future, direct GPU-to-GPU data transfers via
  MPI may be possible, obviating the need for these steps). This
  means host MPI sends must be preceded by appropriate device-to-host
  through the host (in the future, direct GPU-to-GPU data transfers via
  MPI may be possible, obviating the need for these steps). This
  means host MPI sends must be preceded by appropriate device-to-host
-transfers and host MPI receives must be followed by corresponding host-to-device transfers.
+transfers and host MPI receives must be followed by corresponding
+host-to-device transfers.
  
  %To execute on multiple GPUs in parallel we decompose the problem using
  %the existing high-level parallelisaton framework: each MPI task acts
  
  %To execute on multiple GPUs in parallel we decompose the problem using
  %the existing high-level parallelisaton framework: each MPI task acts
@@ -393,7 +395,7 @@ retrieved by the host, these can be exchanged using MPI between
  hosts at the same time as kernels for packing and retrieving of
  data for the second coordinate direction are
  executed. This
  hosts at the same time as kernels for packing and retrieving of
  data for the second coordinate direction are
  executed. This
-overlapping must respect the synchronisation required to ensure
+overlapping must respect the synchronization required to ensure
  that data values at the corners of the subdomain are transferred
  correctly.
  We use a separate CUDA stream for each coordinate direction:
  that data values at the corners of the subdomain are transferred
  correctly.
  We use a separate CUDA stream for each coordinate direction:
@@ -503,7 +505,7 @@ scaling for the larger systems.
  \begin{figure}[t]
  \centering
  %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
  \begin{figure}[t]
  \centering
  %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
-\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new}
+\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new_gray}
  \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
  A two-dimensional schematic picture of spherical particles on the lattice.
  Left: a particle is allowed
  \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
  A two-dimensional schematic picture of spherical particles on the lattice.
  Left: a particle is allowed
@@ -541,11 +543,11 @@ to efficient GPU implementation of an LB code such as \textit{Ludwig}.
  In this section, we give a brief overview of the essential features
  of the CPU implementation, and how the considerations raised in the
  previous sections---maximizing parallelism and minimising both
  In this section, we give a brief overview of the essential features
  of the CPU implementation, and how the considerations raised in the
  previous sections---maximizing parallelism and minimising both
-host-to-device and GPU to GPU data movement---shape the design decisions
+host-to-device and GPU-to-GPU data movement---shape the design decisions
  for a GPU implementation. We restrict our discussion to a simple fluid
  in this section; additional solid-fluid boundary conditions (e.g.,
  wetting at a fluid-fluid-solid contact line) usually arise elsewhere
  for a GPU implementation. We restrict our discussion to a simple fluid
  in this section; additional solid-fluid boundary conditions (e.g.,
  wetting at a fluid-fluid-solid contact line) usually arise elsewhere
-in the calculation broadly independent of the hydrodynamic boundary
+in the calculation and are broadly independent of the hydrodynamic boundary
  conditions which are described in what follows.
  
  Moving solid particles (here, spheres) are defined by a center position
  conditions which are described in what follows.
  
  Moving solid particles (here, spheres) are defined by a center position
@@ -556,7 +558,7 @@ where a discrete velocity propagation $\mathbf{c}_i \Delta t$ would
  intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
  Hydrodynamic boundary conditions are then implemented via the standard
  approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
  intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
  Hydrodynamic boundary conditions are then implemented via the standard
  approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
-relevant postcollision distribution values are reversed at the
+relevant post-collision distribution values are reversed at the
  propagation stage with an appropriate correction to allow for the
  solid body motion. The exchange of momentum at each link must then be
  accumulated around the entire particle surface to provide the net
  propagation stage with an appropriate correction to allow for the
  solid body motion. The exchange of momentum at each link must then be
  accumulated around the entire particle surface to provide the net
@@ -565,8 +567,8 @@ and torque on the sphere. The particle motion can then be updated in a
  molecular dynamics-like step.
  
  For the CPU implementation, a number of additional MPI communications
  molecular dynamics-like step.
  
  For the CPU implementation, a number of additional MPI communications
-are required: (1) to exchange the center position, radius, etc of each
-particle as it moves and (2), to allow the accumulation of the force
+are required: (1) to exchange the center position, radius, etc.\ of each
+particle as it moves and (2) to allow the accumulation of the force
  and torque for each particle (the links of which may be distributed
  between up to 8 MPI tasks in three dimensions). Appropriate marshaling
  of these data can provide an effective parallelization for relatively
  and torque for each particle (the links of which may be distributed
  between up to 8 MPI tasks in three dimensions). Appropriate marshaling
  of these data can provide an effective parallelization for relatively
@@ -579,7 +581,7 @@ With these features in mind, we can identify a number of competing
  concerns which are relevant to a GPU implementation:
  \begin{enumerate}
  \item
  concerns which are relevant to a GPU implementation:
  \begin{enumerate}
  \item
-Minimisation of host-device data transfer would argue for moving the
+Minimization of host-device data transfer would argue for moving the
  entire particle code to the GPU. However, the code in question involves
  largely conditional logic (e.g., identifying cut surface links) and
  irregular memory accesses (e.g., access to distribution elements around
  entire particle code to the GPU. However, the code in question involves
  largely conditional logic (e.g., identifying cut surface links) and
  irregular memory accesses (e.g., access to distribution elements around
@@ -603,10 +605,11 @@ particle information.
  
  
  We have implemented the second option as follows. For each subdomain,
  
  
  We have implemented the second option as follows. For each subdomain,
-a list of boundary-cutting links is assembled on the CPU which includes
-the identity of the relevant element of the distribution. This list,
+a list of boundary-cutting links is assembled on the CPU; the list includes
+the identity of the relevant element of the distribution in each case.
+This list,
  together with the particle information required to compute the correct
  together with the particle information required to compute the correct
-bounce-back term, are transferred to the GPU. The updates to the relevant
+bounce-back term, is transferred to the GPU. The updates to the relevant
  elements of the distribution can then take place on the GPU. The
  corresponding information to compute the update of the particle dynamics
  is returned
  elements of the distribution can then take place on the GPU. The
  corresponding information to compute the update of the particle dynamics
  is returned
@@ -741,7 +744,7 @@ we have tried to maintain the modular structure of the CPU where
  possible. For each data structure, such as the distribution, a separate
  analogue is maintained in both the CPU and GPU memory spaces. However,
  the GPU copy does not include the complete CPU structure: in
  possible. For each data structure, such as the distribution, a separate
  analogue is maintained in both the CPU and GPU memory spaces. However,
  the GPU copy does not include the complete CPU structure: in
-particular, nonintrinsic datatypes such as MPI datatypes are not
+particular, non-intrinsic datatypes such as MPI datatypes are not
  required on the GPU. Functions to marshal data between CPU and GPU
  are provided for each data structure, abstracting the underlying
  CUDA implementation. (This reasonably lightweight abstraction layer
  required on the GPU. Functions to marshal data between CPU and GPU
  are provided for each data structure, abstracting the underlying
  CUDA implementation. (This reasonably lightweight abstraction layer
@@ -934,7 +937,7 @@ complex fluid problems.
  
  
  % use section* for acknowledgement
  
  
  % use section* for acknowledgement
-\section*{Acknowledgments}
+\section{Acknowledgments}
  AG was supported by the CRESTA project, which has received
  funding from the European Community's Seventh Framework Programme
  (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported
  AG was supported by the CRESTA project, which has received
  funding from the European Community's Seventh Framework Programme
  (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported