-\chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh}
+\chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh, United Kingdom}
\chapter[Ludwig: multiple GPUs for a fluid lattice Boltzmann
application]{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
\section{Introduction}
-The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overview see, e.g.,
+The lattice Boltzmann (LB) method \index{lattice Boltzmann method} (for an overview see, e.g.,
\cite{succi-book}) has become a popular approach to a variety of fluid
-dynamics problems. It provides a way to solve the incompressible,
+dynamics problems. It provides a way to solve the incompressible
isothermal Navier-Stokes equations and has the attractive features of
being both explicit in time and local in space. This makes the LB
-method well-suited to parallel computation. Many efficient parallel
+method well suited to parallel computation. Many efficient parallel
implementations of the LB method have been undertaken, typically using
a combination of distributed domain decomposition and the Message
Passing Interface (MPI). However, the potential
-performance benefits offered by GPUs has motivated a new `mixed-mode'
+performance benefits offered by GPUs has motivated a new ``mixed-mode''
approach to address very large problems. Here, fine-grained
parallelism is implemented on the GPU, while MPI is reserved for
-larger-scale parallelism. This mixed mode is of increasing interest
+larger scale parallelism. This mixed mode is of increasing interest
to application programmers at a time when many supercomputing services
are moving
toward clusters of GPU accelerated nodes. The design questions which
(\textit{Ludwig} was named for Boltzmann, 1844--1906).
We will present the steps
required to allow \textit{Ludwig} to exploit efficiently both a single
-GPU, and also many GPUs in parallel. We show that
+GPU and many GPUs in parallel. We show that
\textit{Ludwig} scales excellently to at least the one thousand GPU level
(the largest resource available at the time of writing) with
indications that the code will scale to much larger systems as
to represent fully-resolved moving solid particles (usually referred
to as colloids in the context of complex fluids). Such particles
need to have their surface resolved on the lattice scale in order
-to include relevant surface physics, and must be able to move: e.g.,
+to include relevant surface physics and must be able to move, e.g.,
to execute Brownian motion in response to random forces from the
fluid. Standard methods are available to represent such particles
(e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
Boltzmann method, and mention some of the general issues which can
influence performance in the CPU context.
In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
-which are required to exploit the GPU architecture effectively, and
+which are required to exploit the GPU architecture effectively and
highlight how the resulting code differs from the CPU version. In
Section \ref{ch14:sec:parallelgpu}, we extend this description to
include the steps required to allow exploitation of many GPUs in
equations. A typical choice, used here, is the so-called D3Q19
basis in three dimensions where there is one velocity such that
$\mathbf{c} \Delta t$ is zero, along with six extending to the nearest
-neighbour
-lattice sites, and twelve extending to the next-nearest neighbour sites
+neighbor
+lattice sites, and twelve extending to the next-nearest neighbor sites
($\Delta t$ being the discrete time step). The fundamental object
in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
moments are related to the local hydrodynamic quantities: the fluid
f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t)
= - {\cal L}_{ij} f_j(\mathbf{r};t).
\end{equation}
-It is convenient to think of this in two stages. First, the right hand
+It is convenient to think of this in two stages. First, the right-hand
side represents the action of a collision operator ${\cal L}_{ij}$,
which is local to each lattice site and relaxes the distribution toward
a local equilibrium at a rate ultimately related to the fluid viscosity.
-Second, the left hand side represents a propagation step (sometimes referred
+Second, the left-hand side represents a propagation step (sometimes referred
to as streaming step), in which each element $i$ of the distribution is
displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
appropriate direction per discrete time step.
-More specifically, we store a vector of 19 double precision floating
+More specifically, we store a vector of 19 double-precision floating
point values at each lattice site for the distribution function
$f_i(\mathbf{r};t)$.
The collision operation ${\cal L}_{ij}$, which is local at each lattice
hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
matrix related to the choice of
$\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
-relaxed toward their (known) equilibrium values, and are transformed
+relaxed toward their (known) equilibrium values and are transformed
back to new post-collision distributions via the inverse transformation
-${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of 2x19$^2$
+${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
floating point multiplications per lattice site. (Additional operations are
required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
In contrast, the
propagation stage consists solely of moving the distribution values
-between lattice sites, and involves no floating point operations.
+between lattice sites and involves no floating point operations.
+
+
+\begin{figure}[tbp]
+\centering
+\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
+\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
+ clarity we show a 2D decomposition of a 3D lattice, but in practice
+ we decompose in all 3 dimensions. Halo cells are added to each
+ subdomain (as shown on the upper right for a single slice) which store
+ data retrieved from remote neighbors in the halo exchange. Lower
+ right: the D3Q19 velocity set resident on a lattice site;
+ highlighted are the 5 ``outgoing'' elements to be transferred in a
+ specific direction.}
+\label{ch14:fig:decomphalo}
+\end{figure}
+
In the CPU version, the \textit{Ludwig} implementation stores one time
level of distribution values $f_i(\mathbf{r}; t)$. This distribution
-is stored as a flat array of C data type \texttt{double}, and laid out
+is stored as a flat array of C data type \texttt{double} and laid out
so that all elements of the velocity are contiguous at a given site
(often referred to as ``array-of-structures'' order). This is
appropriate for sums over the distribution required locally at the
consecutive memory addresses allows the prefetcher to engage fully.
(The temporary scalar \texttt{a\_tmp} allows caching of the
intermediate accumulated value in the innermost loop.) A variety of
-standard sequential optimisations are relevant for the collision stage
+standard sequential optimizations are relevant for the collision stage
(loop unrolling, inclusion of SIMD operations, and so on
\cite{wellein2006}). For example, the collision stage in
\textit{Ludwig} includes explicit SIMD code, which is useful if the
-compiler on a given platform cannot identify it. The propagation
-stage is separate, and is organised as a ``pull'' (again, various
-optimisation have been considered, e.g.,
-\cite{pohl2003,mattila2007,wittmann2012}). No further optimisation is
+compiler on a given platform cannot identify the appropriate vectorization.
+The propagation
+stage is separate and is organized as a ``pull'' (again, various
+optimizations have been considered, e.g.,
+\cite{pohl2003,mattila2007,wittmann2012}). No further optimization is
done here beyond ensuring that the ordering of the discrete velocities
allows memory access to be as efficient as possible. While these
-optimisations are important, it should be remembered that for some
+optimizations are important, it should be remembered that for some
complex fluid problems, the hydrodynamics embodied in the LB
calculation is a relatively small part of the total computational cost
-(at the 10\%-level in some cases). This means optimisation effort may
+(at the 10\% level in some cases). This means optimization effort may
be better concentrated elsewhere.
+
+
\begin{lstlisting}[float, label=ch14:listing1,
-caption = Collision schematic for CPU.]
+caption = collision schematic for CPU.]
/* loop over lattice sites */
for (is = 0; is < nsites; is++) {
...
\end{lstlisting}
-\begin{figure}[!t]
-\centering
-\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
-\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
- clarity we show a 2D decomposition of a 3D lattice, but in practice
- we decompose in all 3 dimensions. Halo cells are added to each
- sub-domain (as shown on the upper right for a single slice) which store
- data retrieved from remote neighbours in the halo exchange. Lower
- right: the D3Q19 velocity set resident on a lattice site;
- highlighted are the 5 ``outgoing'' elements to be transferred in a
- specific direction.}
-\label{ch14:fig:decomphalo}
-\end{figure}
+
The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
-Each local sub-domain is surrounded by a halo, or ghost, region of width one
+Each local subdomain is surrounded by a halo, or ghost, region of width one
lattice site. While the collision is local, elements of the distribution
must be exchanged at the edges of the domains to facilitate the propagation.
To achieve the full 3D halo exchange, the standard approach of shifting the
-relevant data in each co-ordinate direction in turn is adopted. This
-requires appropriate synchronisation, i.e., a receive in the the first
-co-ordinate direction must be complete before a send in the second direction
+relevant data in each coordinate direction in turn is adopted. This
+requires appropriate synchronization, i.e., a receive in the the first
+coordinate direction must be complete before a send in the second direction
involving relevant data can take place, and so on. We note that only
``outgoing'' elements of the distribution need to be sent at each edge.
For the D3Q19 model, this reduces the volume of data traffic from 19 to
5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
version, the necessary transfers are implemented in place using
-a vector of appropriately strided MPI datatypes for each direction.
+a vector of MPI datatypes with appropriate stride for each direction.
-\section{Single GPU Implementation}\label{ch14:sec:singlegpu}
+\section{Single GPU implementation}\label{ch14:sec:singlegpu}
In this section we describe the steps taken to enable \textit{Ludwig}
for the GPU. There are a number of crucial issues: first, the
-minimisation of data traffic between host and device; second, the
-optimal mapping of available parallelism onto the architecture and
+minimization of data traffic between host and device; second, the
+optimal mapping of available parallelism onto the architecture; and
third, the issue of memory coalescing. We discuss each of these in
turn.
those matrix-vector operations within the collision kernel.
The GPU architecture features a hierarchy of parallelism. At the
lowest level, groups of 32 threads (warps) operate in
-lock-step on different data elements: this is SIMD style vector-level
+lock-step on different data elements: this is SIMD-style vector-level
parallelism. Multiple warps are combined into a thread block (in which
-communication and synchronisation are possible), and multiple blocks can
+communication and synchronization are possible), and multiple blocks can
run concurrently across the streaming multiprocessors in the GPU
(with no communication or
-synchronisation possible across blocks). To decompose on the GPU, we
+synchronization possible across blocks). To decompose on the GPU, we
must choose which part of the collision to assign to which level of
parallelism.
every lattice site, and each thread performs the full matrix-vector
operation. We find that a block size of 256 performs well on current
devices: we therefore decompose
-the lattice into groups of 256 sites, and assign each group to a block
+the lattice into groups of 256 sites and assign each group to a block
of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
assigned to the fast {\it constant} on-chip device memory.
For the propagation stage, the GPU implementation adds a second time
-level of distribution values. The data-dependencies inherent in the
+level of distribution values. The data dependencies inherent in the
propagation mean that the in-place propagation of the CPU version
-cannot be parallelised effectively without the additional time level.
+cannot be parallelized effectively without the additional time level.
As both time levels may remain resident on the GPU, this is not a
significant overhead.
segment in a single transaction: this is memory coalescing. It can be
seen that the array-of-structures ordering used for the distribution
in the CPU code would not be suitable for coalescing; in fact, it would
-result in serialised memory accesses and relative poor performance.
+result in serialized memory accesses and relative poor performance.
To meet the coalescing criteria and allow consecutive threads to read
consecutive memory addresses on the GPU, we transpose the layout of the
distribution so that, for each velocity component, consecutive sites
the GPU collision code is given in Listing~\ref{ch14:listing2}.
\begin{lstlisting}[float, label=ch14:listing2,
-caption = Collision schematic for GPU.]
+caption = collision schematic for GPU.]
/* compute current site index 'is' from CUDA thread and block */
/* indices and load distribution from "structure-of-arrays" */
the summary.
-\section{Multiple GPU Implementation}\label{ch14:sec:parallelgpu}
+\section{Multiple GPU implementation}\label{ch14:sec:parallelgpu}
To execute on multiple GPUs, we use the same domain decomposition
and message passing framework as the CPU version. Within each
-sub-domain (allocated to one MPI task) the GPU implementation
+subdomain (allocated to one MPI task) the GPU implementation
proceeds as described in the previous section. The only additional
complication is that halo transfers between GPUs must be staged
-through the host (in future, direct GPU to GPU data transfers via
+through the host (in the future, direct GPU-to-GPU data transfers via
MPI may be possible, obviating the need for these steps). This
-means host MPI sends must be preceded by appropriate device to host
-transfers and host MPI receives must be followed by corresponding host
-to device transfers.
+means host MPI sends must be preceded by appropriate device-to-host
+transfers and host MPI receives must be followed by corresponding
+host-to-device transfers.
%To execute on multiple GPUs in parallel we decompose the problem using
%the existing high-level parallelisaton framework: each MPI task acts
In practice, this data movement requires additional GPU kernels to
pack and unpack the relevant data before and after corresponding MPI
-calls. However, the standard shift algorithm, in which each co-ordinate
+calls. However, the standard shift algorithm, in which each coordinate
direction is treated in turn, does provide some scope for the
overlapping of different operations.
-For example, after the data for the first co-ordinate direction have been
+For example, after the data for the first coordinate direction have been
retrieved by the host, these can be exchanged using MPI between
hosts at the same time as kernels for packing and retrieving of
-data for the second co-ordinate direction are
+data for the second coordinate direction are
executed. This
-overlapping must respect the synchronisation required to ensure
-that data values at the corners of the sub-domain are transferred
+overlapping must respect the synchronization required to ensure
+that data values at the corners of the subdomain are transferred
correctly.
-We use a separate CUDA stream for each co-ordinate direction:
+We use a separate CUDA stream for each coordinate direction:
this allows some of the host-device communication time to be
effectively ``hidden'' behind the host-host MPI communication,
resulting in an overall speedup.
For GPU results, a Cray XK6 system is used: this is very similar to the
XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
node in the GPU system therefore features a single Opteron CPU acting
-as a host to a single GPU. The inter-node interconnect architecture is
+as a host to a single GPU. The internode interconnect architecture is
the same as for the Cray XE6. The GPU performance tests use a prototype
Cray XK6 system with 936 nodes (the largest available at the time of writing).
To provide a fair comparison, we compare scaling on a \textit{per node}
Figure \ref{ch14:fig:scaling} shows the results of both weak and
strong scaling tests (top and bottom panels, respectively). For
-weak scaling, where the local sub-domain size is kept fixed
+weak scaling, where the local subdomain size is kept fixed
(here a relatively large 196$^3$ lattice), the time taken by
an ideal code would remain constant. It can be seen that while
scaling of the CPU version shows a pronounced upward slope, the
require a significantly smaller number of larger data transfers.
The performance
advantage of the GPU version ranges from a factor of around 1.5 to
-around 1.8. Careful study by other authors \cite{williams2011} have
-found the absolute performance (in terms of floating point
-operations per second, or floating point operations per Watt) to
+around 1.8. Careful studies by other authors \cite{williams2011} have
+found the absolute performance (in terms of floating-point
+operations per second, or floating-point operations per Watt) to
be remarkably similar between architectures.
Perhaps of more interest is the strong scaling picture (lower
reasonably robust scaling even for the smaller system sizes and good
scaling for the larger systems.
-\section{Moving Solid Particles}\label{ch14:sec:particles}
+\section{Moving solid particles}\label{ch14:sec:particles}
\begin{figure}[t]
\centering
%\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
-\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new}
+\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new_gray}
\caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
A two-dimensional schematic picture of spherical particles on the lattice.
Left: a particle is allowed
to move continuously across the lattice, and the position of the
-surface defines fluid lattice sites (light blue) and solid lattice
-sites (dark red). The discrete surface is defined by links
+surface defines fluid lattice sites (gray) and solid lattice
+sites (black). The discrete surface is defined by links
where propagation would intersect the surface (arrows). Note the
discrete shape of the two particles is different. Right: post-collision
distributions are reversed at the surface by the process of bounce-back
to efficient GPU implementation of an LB code such as \textit{Ludwig}.
In this section, we give a brief overview of the essential features
of the CPU implementation, and how the considerations raised in the
-previous sections --- maximising parallelism and minimising both
-host to device and GPU to GPU data movement --- shape the design decisions
+previous sections---maximizing parallelism and minimising both
+host-to-device and GPU-to-GPU data movement---shape the design decisions
for a GPU implementation. We restrict our discussion to a simple fluid
in this section; additional solid-fluid boundary conditions (e.g.,
wetting at a fluid-fluid-solid contact line) usually arise elsewhere
-in the calculation and broadly independent of the hydrodynamic boundary
+in the calculation and are broadly independent of the hydrodynamic boundary
conditions which are described in what follows.
-Moving solid particles (here, spheres) are defined by a centre position
+Moving solid particles (here, spheres) are defined by a center position
which is allowed to move continuously across the space of the lattice,
and a fixed radius which is typically a few lattice spacings $\Delta r$.
The surface of the particle is defined by a series of \textit{links}
molecular dynamics-like step.
For the CPU implementation, a number of additional MPI communications
-are required: (1) to exchange the centre position, radius, etc of each
-particle as it moves and (2), to allow the accumulation of the force
+are required: (1) to exchange the center position, radius, etc.\ of each
+particle as it moves and (2) to allow the accumulation of the force
and torque for each particle (the links of which may be distributed
between up to 8 MPI tasks in three dimensions). Appropriate marshaling
-of these data can provide an effective parallelisation for relatively
+of these data can provide an effective parallelization for relatively
dense particle suspensions of up to 40\% solid volume fraction
\cite{stratford2008}. As a final consideration, fluid distributions
must be removed and replaced consistently as a given particle moves
concerns which are relevant to a GPU implementation:
\begin{enumerate}
\item
-Minimisation of host-device data transfer would argue for moving the
+Minimization of host-device data transfer would argue for moving the
entire particle code to the GPU. However, the code in question involves
largely conditional logic (e.g., identifying cut surface links) and
irregular memory accesses (e.g., access to distribution elements around
a spherical particle). These operations seem poorly suited to effective
-parallelisation on the GPU. As an additional complication, the sums
+parallelization on the GPU. As an additional complication, the sums
required over the particle surface would involve potentially tricky
and inefficient reductions in GPU memory.
\item
\end{enumerate}
-We have implemented the second option as follows. For each sub-domain,
-a list of boundary-cutting links is assembled on the CPU which includes
-the identity of the relevant element of the distribution. This list,
+We have implemented the second option as follows. For each subdomain,
+a list of boundary-cutting links is assembled on the CPU; the list includes
+the identity of the relevant element of the distribution in each case.
+This list,
together with the particle information required to compute the correct
-bounce-back term, are transferred to the GPU. The updates to the relevant
+bounce-back term, is transferred to the GPU. The updates to the relevant
elements of the distribution can then take place on the GPU. The
corresponding information to compute the update of the particle dynamics
is returned
status. On the GPU, it is perhaps preferable to perform the collision
stage everywhere instead of moving the look-up table to the GPU and
introducing the associated logic.
-Ultimately, the GPU might favour other boundary methods which treat solid and
+Ultimately, the GPU might favor other boundary methods which treat solid and
fluid on a somewhat more equal basis, for example, the immersed boundary
method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method
\cite{ch14:spm}.
However, the approach adopted here allows us to exploit
-the GPU for the intensive fluid simulation whilst maintaining the complex
+the GPU for the intensive fluid simulation while maintaining the complex
code required for particles on the CPU. Overheads of CPU-GPU transfer are
-minimised by transferring only those data relevant to the hydrodynamic
+minimized by transferring only those data relevant to the hydrodynamic
interaction implemented via bounce-back on links.
% END NEW SOLID TEXT
\section{Summary}
\label{ch14:sec:summary}
-We have described the steps take to implement, on both single and
+We have described the steps taken to implement, on both single and
multiple GPUs, the \textit{Ludwig} code which was originally
designed for complex fluid problems on a conventional CPU
architecture. We have added the necessary functionality using
data structures. By retaining domain decomposition and message
passing via MPI, we have demonstrated it is possible to scale
complex fluid problems to large numbers of GPUs in parallel.
-By following the key design criteria of maximising parallelism
-and minimising host-device data transfers, we have confirmed
+By following the key design criteria of maximizing parallelism
+and minimizing host-device data transfers, we have confirmed that
the mixed MPI-GPU approach is viable for scientific applications.
For the more intricate problem of moving solid particles, we find
it is possible to retain the more serial elements related to
particle link operations on the CPU, while offloading only the
parallel lattice-based operations involving the LB distribution
-to the GPU. Again, this minimises host-device movement of data.
+to the GPU. Again, this minimizes host-device movement of data.
From the software engineering viewpoint, some duplication of code
to allow efficient implementation on both host and device is
with automatic tuning approaches).
So, while the challenges in designing portable and efficient scientific
-applications remain very real, this work provides some hope the large
+applications remain very real, this work provides some hope that large
clusters of GPU machines can be used effectively for a wide range of
complex fluid problems.
% use section* for acknowledgement
-\section*{Acknowledgments}
+\section{Acknowledgments}
AG was supported by the CRESTA project, which has received
funding from the European Community's Seventh Framework Programme
(ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported