new

[book_gpu.git] / BookGPU / Chapters / chapter14 / ch14.tex
diff --git a/BookGPU/Chapters/chapter14/ch14.tex b/BookGPU/Chapters/chapter14/ch14.tex

index d50db31ad82a385645f028a1e9001e0599974392..89a72fdbadf2769908fa206d7b31ca652b85a68f 100755 (executable)
--- a/BookGPU/Chapters/chapter14/ch14.tex
+++ b/BookGPU/Chapters/chapter14/ch14.tex
@@ -1,25 +1,26 @@
-\chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh}
+\chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh, United Kingdom}
  
-\chapter{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
+\chapter[Ludwig: multiple GPUs for a fluid lattice Boltzmann
+application]{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
  application}
  
  %\putbib[biblio]
  
  
  \section{Introduction}
-The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overview see, e.g.,
+The lattice Boltzmann (LB) method \index{lattice Boltzmann method} (for an overview see, e.g.,
  \cite{succi-book}) has become a popular approach to a variety of fluid
-dynamics problems.  It provides a way to solve the incompressible,
+dynamics problems.  It provides a way to solve the incompressible
  isothermal Navier-Stokes equations and has the attractive features of
  being both explicit in time and local in space. This makes the LB
-method well-suited to parallel computation. Many efficient parallel
+method well suited to parallel computation. Many efficient parallel
  implementations of the LB method have been undertaken, typically using
  a combination of distributed domain decomposition and the Message
  Passing Interface (MPI). However, the potential
-performance benefits offered by GPUs has motivated a new `mixed-mode'
+performance benefits offered by GPUs has motivated a new ``mixed-mode''
  approach to address very large problems. Here, fine-grained
  parallelism is implemented on the GPU, while MPI is reserved for
-larger-scale parallelism.  This mixed mode is of increasing interest
+larger scale parallelism.  This mixed mode is of increasing interest
  to application programmers at a time when many supercomputing services
  are moving
  toward clusters of GPU accelerated nodes. The design questions which
@@ -60,7 +61,7 @@ an LB application developed specifically for complex fluids
  (\textit{Ludwig} was named for Boltzmann, 1844--1906).
  We will present the steps
  required to allow \textit{Ludwig} to exploit efficiently both a single
-GPU, and also many GPUs in parallel.  We show that
+GPU and many GPUs in parallel.  We show that
  \textit{Ludwig} scales excellently to at least the one thousand GPU level
  (the largest resource available at the time of writing) with
  indications that the code will scale to much larger systems as
@@ -68,7 +69,7 @@ they become available. In addition, we consider the steps required
  to represent fully-resolved moving solid particles (usually referred
  to as colloids in the context of complex fluids). Such particles
  need to have their surface resolved on the lattice scale in order
-to include relevant surface physics, and must be able to move: e.g.,
+to include relevant surface physics and must be able to move, e.g.,
  to execute Brownian motion in response to random forces from the
  fluid. Standard methods are available to represent such particles
  (e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
@@ -81,7 +82,7 @@ In the following section we provide a brief overview of the lattice
  Boltzmann method, and mention some of the general issues which can
  influence performance in the CPU context.
  In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
-which are required to exploit the GPU architecture effectively, and
+which are required to exploit the GPU architecture effectively and
  highlight how the resulting code differs from the CPU version. In
  Section \ref{ch14:sec:parallelgpu}, we extend this description to
  include the steps required to allow exploitation of many GPUs in
@@ -127,8 +128,8 @@ are chosen to capture the correct symmetries of the Navier-Stokes
  equations. A typical choice, used here, is the so-called D3Q19
  basis in three dimensions where there is one velocity such that
  $\mathbf{c} \Delta t$ is zero, along with six extending to the nearest
-neighbour
-lattice sites, and twelve extending to the next-nearest neighbour sites
+neighbor
+lattice sites, and twelve extending to the next-nearest neighbor sites
  ($\Delta t$ being the discrete time step). The fundamental object
  in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
  moments are related to the local hydrodynamic quantities: the fluid
@@ -138,16 +139,16 @@ function is described by a discrete Boltzmann equation
  f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t) 
  = - {\cal L}_{ij} f_j(\mathbf{r};t).
  \end{equation}
-It is convenient to think of this in two stages. First, the right hand
+It is convenient to think of this in two stages. First, the right-hand
  side represents the action of a collision operator ${\cal L}_{ij}$,
  which is local to each lattice site and relaxes the distribution toward
  a local equilibrium at a rate ultimately related to the fluid viscosity.
-Second, the left hand side represents a propagation step (sometimes referred
+Second, the left-hand side represents a propagation step (sometimes referred
  to as streaming step), in which each element $i$ of the distribution is
  displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
  appropriate direction per discrete time step. 
  
-More specifically, we store a vector of 19 double precision floating
+More specifically, we store a vector of 19 double-precision floating
  point values at each lattice site for the distribution function
  $f_i(\mathbf{r};t)$.
  The collision operation ${\cal L}_{ij}$, which is local at each lattice
@@ -155,19 +156,35 @@ site, may be thought of as follows. A matrix-vector multiplication
  ${\cal M}_{ij}f_j$ is used to transform the distributions into the
  hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
  matrix related to the choice of
-$\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
-relaxed toward their (known) equilibrium values, and are transformed
+$\mathbf{c}_i$. The nonconserved hydrodynamic quantities are then
+relaxed toward their (known) equilibrium values and are transformed
  back to new post-collision distributions via the inverse transformation
-${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of 2x19$^2$
+${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
  floating point multiplications per lattice site. (Additional operations are
  required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
  In contrast, the
  propagation stage consists solely of moving the distribution values
-between lattice sites, and involves no floating point operations.
+between lattice sites and involves no floating point operations.
+
+
+\begin{figure}[tbp]
+\centering
+\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
+\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
+  clarity we show a 2D decomposition of a 3D lattice, but in practice
+  we decompose in all 3 dimensions. Halo cells are added to each
+  subdomain (as shown on the upper right for a single slice) which store
+  data retrieved from remote neighbors in the halo exchange. Lower
+  right: the D3Q19 velocity set resident on a lattice site;
+  highlighted are the 5 ``outgoing'' elements to be transferred in a
+  specific direction.}
+\label{ch14:fig:decomphalo}
+\end{figure}
+
  
  In the CPU version, the \textit{Ludwig} implementation stores one time
  level of distribution values $f_i(\mathbf{r}; t)$. This distribution
-is stored as a flat array of C data type \texttt{double}, and laid out
+is stored as a flat array of C data type \texttt{double} and laid out
  so that all elements of the velocity are contiguous at a given site
  (often referred to as ``array-of-structures'' order). This is
  appropriate for sums over the distribution required locally at the
@@ -176,24 +193,27 @@ Listing~\ref{ch14:listing1}: the fact that consecutive loads are from
  consecutive memory addresses allows the prefetcher to engage fully.
  (The temporary scalar \texttt{a\_tmp} allows caching of the
  intermediate accumulated value in the innermost loop.)  A variety of
-standard sequential optimisations are relevant for the collision stage
+standard sequential optimizations are relevant for the collision stage
  (loop unrolling, inclusion of SIMD operations, and so on
  \cite{wellein2006}).  For example, the collision stage in
  \textit{Ludwig} includes explicit SIMD code, which is useful if the
-compiler on a given platform cannot identify it.  The propagation
-stage is separate, and is organised as a ``pull'' (again, various
-optimisation have been considered, e.g.,
-\cite{pohl2003,mattila2007,wittmann2012}).  No further optimisation is
+compiler on a given platform cannot identify the appropriate vectorization.
+The propagation
+stage is separate and is organized as a ``pull'' (again, various
+optimizations have been considered, e.g.,
+\cite{pohl2003,mattila2007,wittmann2012}).  No further optimization is
  done here beyond ensuring that the ordering of the discrete velocities
  allows memory access to be as efficient as possible. While these
-optimisations are important, it should be remembered that for some
+optimizations are important, it should be remembered that for some
  complex fluid problems, the hydrodynamics embodied in the LB
  calculation is a relatively small part of the total computational cost
-(at the 10\%-level in some cases).  This means optimisation effort may
+(at the 10\% level in some cases).  This means optimization effort may
  be better concentrated elsewhere.
  
+
+
  \begin{lstlisting}[float, label=ch14:listing1,
-caption = Collision schematic for CPU.]
+caption = collision schematic for CPU.]
  /* loop over lattice sites */
  for (is = 0; is < nsites; is++) {
    ...
@@ -214,44 +234,32 @@ for (is = 0; is < nsites; is++) {
  \end{lstlisting}
  
  
-\begin{figure}[!t]
-\centering
-\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
-\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
-  clarity we show a 2D decomposition of a 3D lattice, but in practice
-  we decompose in all 3 dimensions. Halo cells are added to each
-  sub-domain (as shown on the upper right for a single slice) which store
-  data retrieved from remote neighbours in the halo exchange. Lower
-  right: the D3Q19 velocity set resident on a lattice site;
-  highlighted are the 5 ``outgoing'' elements to be transferred in a
-  specific direction.}
-\label{ch14:fig:decomphalo}
-\end{figure}
+
  
  
  The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
-Each local sub-domain is surrounded by a halo, or ghost, region of width one
+Each local subdomain is surrounded by a halo, or ghost, region of width one
  lattice site. While the collision is local, elements of the distribution
  must be exchanged at the edges of the domains to facilitate the propagation.
  To achieve the full 3D halo exchange, the standard approach of shifting the
-relevant data in each co-ordinate direction in turn is adopted. This
-requires appropriate synchronisation, i.e., a receive in the the first
-co-ordinate direction must be complete before a send in the second direction
+relevant data in each coordinate direction in turn is adopted. This
+requires appropriate synchronization, i.e., a receive in the the first
+coordinate direction must be complete before a send in the second direction
  involving relevant data can take place, and so on. We note that only
  ``outgoing'' elements of the distribution need to be sent at each edge.
  For the D3Q19 model, this reduces the volume of data traffic from 19 to
  5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
  version, the necessary transfers are implemented in place using
-a vector of appropriately strided MPI datatypes for each direction.
-
+a vector of MPI datatypes with appropriate stride for each direction.
  
-\section{Single GPU Implementation}\label{ch14:sec:singlegpu}
+\clearpage
+\section{Single GPU implementation}\label{ch14:sec:singlegpu}
  
  
  In this section we describe the steps taken to enable \textit{Ludwig}
  for the GPU. There are a number of crucial issues: first, the
-minimisation of data traffic between host and device; second, the
-optimal mapping of available parallelism onto the architecture and
+minimization of data traffic between host and device; second, the
+optimal mapping of available parallelism onto the architecture; and
  third, the issue of memory coalescing. We discuss each of these in
  turn.
  
@@ -273,12 +281,12 @@ parallelism inherent in the GPU architecture, particularly for
  those matrix-vector operations within the collision kernel.
  The GPU architecture features a hierarchy of parallelism. At the
  lowest level, groups of 32 threads (warps) operate in
-lock-step on different data elements: this is SIMD style vector-level
+lock-step on different data elements: this is SIMD-style vector-level
  parallelism. Multiple warps are combined into a thread block (in which
-communication and synchronisation are possible), and multiple blocks can
+communication and synchronization are possible), and multiple blocks can
  run concurrently across the streaming multiprocessors in the GPU
  (with no communication or
-synchronisation possible across blocks).  To decompose on the GPU, we
+synchronization possible across blocks).  To decompose on the GPU, we
  must choose which part of the collision to assign to which level of
  parallelism.
  
@@ -289,14 +297,14 @@ to all levels of parallelism, i.e., we use a separate CUDA thread for
  every lattice site, and each thread performs the full matrix-vector
  operation. We find that a block size of 256 performs well on current
  devices: we therefore decompose
-the lattice into groups of 256 sites, and assign each group to a block
+the lattice into groups of 256 sites and assign each group to a block
  of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
  assigned to the fast {\it constant} on-chip device memory.
  
  For the propagation stage, the GPU implementation adds a second time
-level of distribution values. The data-dependencies inherent in the
+level of distribution values. The data dependencies inherent in the
  propagation mean that the in-place propagation of the CPU version
-cannot be parallelised effectively without the additional time level.
+cannot be parallelized effectively without the additional time level.
  As both time levels may remain resident on the GPU, this is not a
  significant overhead.
  
@@ -307,7 +315,7 @@ is only achieved when data are structured such that threads within a
  segment in a single transaction: this is memory coalescing. It can be
  seen that the array-of-structures ordering used for the distribution
  in the CPU code would not be suitable for coalescing; in fact, it would
-result in serialised memory accesses and relative poor performance.
+result in serialized memory accesses and relative poor performance.
  To meet the coalescing criteria and allow consecutive threads to read
  consecutive memory addresses on the GPU, we transpose the layout of the
  distribution so that, for each velocity component, consecutive sites
@@ -315,7 +323,7 @@ are contiguous in memory (``structure-of-arrays'' order). A schematic of
  the GPU collision code is given in Listing~\ref{ch14:listing2}.
  
  \begin{lstlisting}[float, label=ch14:listing2,
-caption = Collision schematic for GPU.]
+caption = collision schematic for GPU.]
  /* compute current site index 'is' from CUDA thread and block */
  /* indices and load distribution from "structure-of-arrays" */
  
@@ -333,18 +341,18 @@ defer some further comments on software engineering aspects of the code to
  the summary.
  
  
-\section{Multiple GPU Implementation}\label{ch14:sec:parallelgpu}
+\section{Multiple GPU implementation}\label{ch14:sec:parallelgpu}
  
  To execute on multiple GPUs, we use the same domain decomposition
  and message passing framework as the CPU version. Within each
-sub-domain (allocated to one MPI task) the GPU implementation
+subdomain (allocated to one MPI task) the GPU implementation
  proceeds as described in the previous section. The only additional
  complication is that halo transfers between GPUs must be staged
-through the host (in future, direct GPU to GPU data transfers via
+through the host (in the future, direct GPU-to-GPU data transfers via
  MPI may be possible, obviating the need for these steps). This
-means host MPI sends must be preceded by appropriate device to host
-transfers and host MPI receives must be followed by corresponding host
-to device transfers.
+means host MPI sends must be preceded by appropriate device-to-host
+transfers and host MPI receives must be followed by corresponding
+host-to-device transfers.
  
  %To execute on multiple GPUs in parallel we decompose the problem using
  %the existing high-level parallelisaton framework: each MPI task acts
@@ -379,18 +387,18 @@ to device transfers.
  
  In practice, this data movement requires additional GPU kernels to
  pack and unpack the relevant data before and after corresponding MPI
-calls. However, the standard shift algorithm, in which each co-ordinate
+calls. However, the standard shift algorithm, in which each coordinate
  direction is treated in turn, does provide some scope for the
  overlapping of different operations.
-For example, after the data for the first co-ordinate direction have been
+For example, after the data for the first coordinate direction have been
  retrieved by the host, these can be exchanged using MPI between
  hosts at the same time as kernels for packing and retrieving of
-data for the second co-ordinate direction are
+data for the second coordinate direction are
  executed. This
-overlapping must respect the synchronisation required to ensure
-that data values at the corners of the sub-domain are transferred
+overlapping must respect the synchronization required to ensure
+that data values at the corners of the subdomain are transferred
  correctly.
-We use a separate CUDA stream for each co-ordinate direction:
+We use a separate CUDA stream for each coordinate direction:
  this allows some of the host-device communication time to be
  effectively ``hidden'' behind the host-host MPI communication,
  resulting in an overall speedup.
@@ -414,7 +422,7 @@ CPUs per node, and with nodes interconnected using Cray Gemini technology.
  For GPU results, a Cray XK6 system is used: this is very similar to the
  XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
  node in the GPU system therefore features a single Opteron CPU acting
-as a host to a single GPU. The inter-node interconnect architecture is
+as a host to a single GPU. The internode interconnect architecture is
  the same as for the Cray XE6. The GPU performance tests use a prototype
  Cray XK6 system with 936 nodes (the largest available at the time of writing).
  To provide a fair comparison, we compare scaling on a \textit{per node}
@@ -464,7 +472,7 @@ a simulation in terms of accounting budgets, or electricity.
  
  Figure \ref{ch14:fig:scaling} shows the results of both weak and
  strong scaling tests (top and bottom panels, respectively). For
-weak scaling, where the local sub-domain size is kept fixed
+weak scaling, where the local subdomain size is kept fixed
  (here a relatively large 196$^3$ lattice), the time taken by
  an ideal code would remain constant. It can be seen that while
  scaling of the CPU version shows a pronounced upward slope, the
@@ -474,9 +482,9 @@ has a factor of 32 fewer MPI tasks per node, so communications
  require a significantly smaller number of larger data transfers.
  The performance
  advantage of the GPU version ranges from a factor of around 1.5 to
-around 1.8. Careful study by other authors \cite{williams2011} have
-found the absolute performance (in terms of floating point
-operations per second, or floating point operations per Watt) to
+around 1.8. Careful studies by other authors \cite{williams2011} have
+found the absolute performance (in terms of floating-point
+operations per second, or floating-point operations per Watt) to
  be remarkably similar between architectures.
  
  Perhaps of more interest is the strong scaling picture (lower
@@ -492,18 +500,18 @@ In contrast, the GPU version shows
  reasonably robust scaling even for the smaller system sizes and good
  scaling for the larger systems.
  
-\section{Moving Solid Particles}\label{ch14:sec:particles}
+\section{Moving solid particles}\label{ch14:sec:particles}
  
  \begin{figure}[t]
  \centering
  %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
-\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new}
+\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new_gray}
  \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
  A two-dimensional schematic picture of spherical particles on the lattice.
  Left: a particle is allowed
  to move continuously across the lattice, and the position of the
-surface defines fluid lattice sites (light blue) and solid lattice
-sites (dark red). The discrete surface is defined by links
+surface defines fluid lattice sites (gray) and solid lattice
+sites (black). The discrete surface is defined by links
  where propagation would intersect the surface (arrows). Note the
  discrete shape of the two particles is different. Right: post-collision
  distributions are reversed at the surface by the process of bounce-back
@@ -534,15 +542,15 @@ The introduction of moving solid particles poses an additional hurdle
  to efficient GPU implementation of an LB code such as \textit{Ludwig}.
  In this section, we give a brief overview of the essential features
  of the CPU implementation, and how the considerations raised in the
-previous sections --- maximising parallelism and minimising both
-host to device and GPU to GPU data movement --- shape the design decisions
+previous sections---maximizing parallelism and minimising both
+host-to-device and GPU-to-GPU data movement---shape the design decisions
  for a GPU implementation. We restrict our discussion to a simple fluid
  in this section; additional solid-fluid boundary conditions (e.g.,
  wetting at a fluid-fluid-solid contact line) usually arise elsewhere
-in the calculation and broadly independent of the hydrodynamic boundary
+in the calculation and are broadly independent of the hydrodynamic boundary
  conditions which are described in what follows.
  
-Moving solid particles (here, spheres) are defined by a centre position
+Moving solid particles (here, spheres) are defined by a center position
  which is allowed to move continuously across the space of the lattice,
  and a fixed radius which is typically a few lattice spacings $\Delta r$.
  The surface of the particle is defined by a series of \textit{links}
@@ -559,11 +567,11 @@ and torque on the sphere. The particle motion can then be updated in a
  molecular dynamics-like step.
  
  For the CPU implementation, a number of additional MPI communications
-are required: (1) to exchange the centre position, radius, etc of each
-particle as it moves and (2), to allow the accumulation of the force
+are required: (1) to exchange the center position, radius, etc.\ of each
+particle as it moves and (2) to allow the accumulation of the force
  and torque for each particle (the links of which may be distributed
  between up to 8 MPI tasks in three dimensions). Appropriate marshaling
-of these data can provide an effective parallelisation for relatively
+of these data can provide an effective parallelization for relatively
  dense particle suspensions of up to  40\% solid volume fraction
  \cite{stratford2008}. As a final consideration, fluid distributions
  must be removed and replaced consistently as a given particle moves
@@ -573,12 +581,12 @@ With these features in mind, we can identify a number of competing
  concerns which are relevant to a GPU implementation:
  \begin{enumerate}
  \item
-Minimisation of host-device data transfer would argue for moving the
+Minimization of host-device data transfer would argue for moving the
  entire particle code to the GPU. However, the code in question involves
  largely conditional logic (e.g., identifying cut surface links) and
  irregular memory accesses (e.g., access to distribution elements around
  a spherical particle). These operations seem poorly suited to effective
-parallelisation on the GPU. As an additional complication, the sums
+parallelization on the GPU. As an additional complication, the sums
  required over the particle surface would involve potentially tricky
  and inefficient reductions in GPU memory.
  \item
@@ -596,11 +604,12 @@ particle information.
  \end{enumerate}
  
  
-We have implemented the second option as follows. For each sub-domain,
-a list of boundary-cutting links is assembled on the CPU which includes
-the identity of the relevant element of the distribution. This list,
+We have implemented the second option as follows. For each subdomain,
+a list of boundary-cutting links is assembled on the CPU; the list includes
+the identity of the relevant element of the distribution in each case.
+This list,
  together with the particle information required to compute the correct
-bounce-back term, are transferred to the GPU. The updates to the relevant
+bounce-back term, is transferred to the GPU. The updates to the relevant
  elements of the distribution can then take place on the GPU. The
  corresponding information to compute the update of the particle dynamics
  is returned
@@ -617,14 +626,14 @@ at solid lattice points by consulting a look-up table of solid/fluid
  status. On the GPU, it is perhaps preferable to perform the collision
  stage everywhere instead of moving the look-up table to the GPU and
  introducing the associated logic.
-Ultimately, the GPU might favour other boundary methods which treat solid and
+Ultimately, the GPU might favor other boundary methods which treat solid and
  fluid on a somewhat more equal basis, for example, the immersed boundary
  method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method
  \cite{ch14:spm}.
  However, the approach adopted here  allows us to exploit
-the GPU for the intensive fluid simulation whilst maintaining the complex
+the GPU for the intensive fluid simulation while maintaining the complex
  code required for particles on the CPU. Overheads of CPU-GPU transfer are
-minimised by transferring only those data relevant to the hydrodynamic
+minimized by transferring only those data relevant to the hydrodynamic
  interaction implemented via bounce-back on links.
  
  % END NEW SOLID TEXT
@@ -735,7 +744,7 @@ we have tried to maintain the modular structure of the CPU where
  possible. For each data structure, such as the distribution, a separate
  analogue is maintained in both the CPU and GPU memory spaces. However,
  the GPU copy does not include the complete CPU structure: in
-particular, non-intrinsic datatypes such as MPI datatypes are not
+particular, nonintrinsic datatypes such as MPI datatypes are not
  required on the GPU. Functions to marshal data between CPU and GPU
  are provided for each data structure, abstracting the underlying
  CUDA implementation. (This reasonably lightweight abstraction layer
@@ -764,11 +773,11 @@ keep the entire problem resident on the device.
  
  
  
-
+\clearpage
  \section{Summary}
  \label{ch14:sec:summary}
  
-We have described the steps take to implement, on both single and
+We have described the steps taken to implement, on both single and
  multiple GPUs, the \textit{Ludwig} code which was originally
  designed for complex fluid problems on a conventional CPU
  architecture. We have added the necessary functionality using
@@ -776,14 +785,14 @@ NVIDIA CUDA and discussed the important changes to the main
  data structures. By retaining domain decomposition and message
  passing via MPI, we have demonstrated it is possible to scale
  complex fluid problems to large numbers of GPUs in parallel.
-By following the key design criteria of maximising parallelism
-and minimising host-device data transfers, we have confirmed
+By following the key design criteria of maximizing parallelism
+and minimizing host-device data transfers, we have confirmed that
  the mixed MPI-GPU approach is viable for scientific applications.
  For the more intricate problem of moving solid particles, we find
  it is possible to retain the more serial elements related to
  particle link operations on the CPU, while offloading only the
  parallel lattice-based operations involving the LB distribution
-to the GPU. Again, this minimises host-device movement of data.
+to the GPU. Again, this minimizes host-device movement of data.
  
  From the software engineering viewpoint, some duplication of code
  to allow efficient implementation on both host and device is
@@ -796,7 +805,7 @@ codes which also exhibit portable performance (perhaps in conjunction
  with automatic tuning approaches).
  
  So, while the challenges in designing portable and efficient scientific
-applications remain very real, this work provides some hope the large
+applications remain very real, this work provides some hope that large
  clusters of GPU machines can be used effectively for a wide range of
  complex fluid problems.
  
@@ -928,7 +937,7 @@ complex fluid problems.
  
  
  % use section* for acknowledgement
-\section*{Acknowledgments}
+\section{Acknowledgments}
  AG was supported by the CRESTA project, which has received
  funding from the European Community's Seventh Framework Programme
  (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported
@@ -958,6 +967,7 @@ by UK EPSRC under grant EV/J007404/1.
  % set second argument of \begin to the number of references
  % (used to reserve space for the reference number labels box)
  
+\clearpage
  \putbib[Chapters/chapter14/biblio14]
  %\begin{thebibliography}{1}