+\chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh}
+
+\chapter{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
+application}
+
+%\putbib[biblio]
+
+
+\section{Introduction}
+The lattice Boltzmann (LB) method (for an overview see, e.g.,
+\cite{succi-book}) has become a popular approach to a variety of fluid
+dynamics problems. It provides a way to solve the incompressible,
+isothermal Navier-Stokes equations and has the attractive features of
+being both explicit in time and local in space. This makes the LB
+method well-suited to parallel computation. Many efficient parallel
+implementations of the LB method have been undertaken, typically using
+a combination of distributed domain decomposition and the Message
+Passing Interface (MPI). However, the potential
+performance benefits offered by GPUs have motivated a new `mixed-mode'
+approach to address very large problems. Here, fine-grained
+parallelism is implemented on the GPU, while MPI is reserved for
+larger-scale parallelism. This mixed mode is of increasing interest
+to application programmers at a time when many supercomputing services
+are moving
+toward clusters of GPU-accelerated nodes. The design questions which
+arise when developing a lattice Boltzmann code for this type of
+heterogeneous system are therefore worth studying. Furthermore, similar
+questions also recur in many other types of stencil-based algorithms.
+
+The first applications of LB on GPUs were to achieve fluid-like
+effects in computer animation, rather than scientific applications per
+se. These early works include simple fluids~\cite{wei2004}, miscible
+two-component flow~\cite{zhu2006}, and various image processing tasks
+based on the use of partial differential equations~\cite{zhao2007}.
+While these early works used relatively low level graphics APIs, the
+first CUDA runtime interface implementation was a two-dimensional
+simple fluid problem~\cite{toelke2010}. Following pioneering work on
+clusters of GPUs coupled via MPI to study air
+pollution~\cite{fan2004}, more recent work has included mixed OpenMP
+and CUDA~\cite{myre2011}, Posix threads and CUDA~\cite{obrecht2011},
+and MPI and CUDA for increasingly large GPU clusters
+\cite{bernaschi2010,xian2011,feichtinger2011}. The heterogeneous
+nature of these systems has also spurred interest in approaches
+including automatic code generation \cite{walshsaar2012} and auto-tuning
+\cite{williams2011} to aid application performance.
+
+Many of these authors make use of
+another attractive feature of LB: the ability to include fixed
+solid-fluid boundary conditions as a straightforward addition to
+the algorithm to study, for example, flow in
+porous media. This points to an important
+application area for LB methods: that of complex fluids.
+Complex fluids include mixtures, surfactants, liquid crystals,
+and particle suspensions, and typically require additional physics
+beyond the bare Navier-Stokes equations to provide a full
+description~\cite{aidun2010}. The representation of this extra
+physics raises additional design questions for the application
+programmer. Here, we consider the \textit{Ludwig} code \cite{desplat},
+an LB application developed specifically for complex fluids
+(\textit{Ludwig} was named for Boltzmann, 1844--1906).
+We will present the steps
+required to allow \textit{Ludwig} to exploit efficiently both a single
+GPU, and also many GPUs in parallel. We show that
+\textit{Ludwig} scales excellently to at least the one thousand GPU level
+(the largest resource available at the time of writing) with
+indications that the code will scale to much larger systems as
+they become available. In addition, we consider the steps required
+to represent fully-resolved moving solid particles (usually referred
+to as colloids in the context of complex fluids). Such particles
+need to have their surface resolved on the lattice scale in order
+to include relevant surface physics, and must be able to move: e.g.,
+to execute Brownian motion in response to random forces from the
+fluid. Standard methods are available to represent such particles
+(e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
+domain decomposition and message passing \cite{stratford2008}.
+We describe below the additional considerations which arise
+for such moving particles when developing an
+implementation on a GPU.
+
+In the following section we provide a brief overview of the lattice
+Boltzmann method, and mention some of the general issues which can
+influence performance in the CPU context.
+In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
+which are required to exploit the GPU architecture effectively, and
+highlight how the resulting code differs from the CPU version. In
+Section \ref{ch14:sec:parallelgpu}, we extend this description to
+include the steps required to allow exploitation of many GPUs in
+parallel while retaining effective scaling. We also present results for
+a typical benchmark for a fluid-only problem in Section
+\ref{ch14:sec:parallelgpu} to demonstrate the success of the approach.
+In Section \ref{ch14:sec:particles}, we describe the design choices
+which are relevant to a GPU implementation of moving particles. Finally,
+we include a number of general observations on software
+engineering and maintenance which arise from our experience.
+A summary is provided in Section~\ref{ch14:sec:summary}.
+
+
+\section{Background}\label{ch14:sec:background}
+
+%\begin{figure}[!t]
+%\centering
+%\includegraphics[width=12cm]{Chapters/chapter14/figures/basiclattice}
+%\caption{The lattice Boltzmann approach discritises space into a 3D lattice (left). The fluid is represented at each lattice site (right), by the {\it distribution} vector: each component corresponding to a specific velocity direction. Here we are interested in the {\it D3Q19} model, in which there are 3 spatial directions and 19 velocity components (including a zero component).}
+%\label{ch14:fig:basiclattice}
+%\end{figure}
+
+
+For a general complex fluid problem the starting point is the
+fluid velocity field $\mathbf{u}(\mathbf{r})$, whose evolution
+obeys the Navier-Stokes equations describing the conservation of mass
+(or density $\rho$), and momentum:
+\begin{equation}
+\rho [ \partial_t \mathbf{u} + (\mathbf{u} \cdot \nabla)\mathbf{u} ]
+= -\nabla p + \eta \nabla^2 \mathbf{u} + \mathbf{f}(\mathbf{r}),
+\end{equation}
+where $p$ is the isotropic pressure and $\eta$ is the viscosity.
+A local force $\mathbf{f}(\mathbf{r})$ provides a means for coupling
+to other complex fluid constituents, e.g., it might represent the force
+exerted on the fluid by a curved interface between different phases or
+components.
+
+The LB approach makes use of a regular three-dimensional
+lattice (see Figure \ref{ch14:fig:decomphalo}) with discrete spacing
+$\Delta r$. It also makes use of a
+discrete velocity space $\mathbf{c}_i$, where the $\mathbf{c}_i$
+are chosen to capture the correct symmetries of the Navier-Stokes
+equations. A typical choice, used here, is the so-called D3Q19
+basis in three dimensions where there is one velocity such that
+$\mathbf{c}_i \Delta t$ is zero, along with six extending to the nearest
+neighbour
+lattice sites, and twelve extending to the next-nearest neighbour sites
+($\Delta t$ being the discrete time step). The fundamental object
+in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
+moments are related to the local hydrodynamic quantities: the fluid
+density, momentum, and stress. The time evolution of the distribution
+function is described by a discrete Boltzmann equation
+\begin{equation}
+f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t + \Delta t) - f_i(\mathbf{r}; t)
+= - {\cal L}_{ij} f_j(\mathbf{r};t).
+\end{equation}
+It is convenient to think of this in two stages. First, the right hand
+side represents the action of a collision operator ${\cal L}_{ij}$,
+which is local to each lattice site and relaxes the distribution toward
+a local equilibrium at a rate ultimately related to the fluid viscosity.
+Second, the left hand side represents a propagation step (sometimes referred
+to as the streaming step), in which each element $i$ of the distribution is
+displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
+appropriate direction per discrete time step.
+
+More specifically, we store a vector of 19 double precision floating
+point values at each lattice site for the distribution function
+$f_i(\mathbf{r};t)$.
+The collision operation ${\cal L}_{ij}$, which is local at each lattice
+site, may be thought of as follows. A matrix-vector multiplication
+${\cal M}_{ij}f_j$ is used to transform the distributions into the
+hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant $19\times 19$
+matrix related to the choice of
+$\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
+relaxed toward their (known) equilibrium values, and are transformed
+back to new post-collision distributions via the inverse transformation
+${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
+floating point multiplications per lattice site. (Additional operations are
+required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
+In contrast, the
+propagation stage consists solely of moving the distribution values
+between lattice sites, and involves no floating point operations.
+
+In the CPU version, the \textit{Ludwig} implementation stores one time
+level of distribution values $f_i(\mathbf{r}; t)$. This distribution
+is stored as a flat array of C data type \texttt{double}, and laid out
+so that all 19 velocity components are contiguous at a given site
+(often referred to as ``array-of-structures'' order). This is
+appropriate for sums over the distribution required locally at the
+collision stage as illustrated schematically in
+Listing~\ref{ch14:listing1}: the fact that consecutive loads are from
+consecutive memory addresses allows the prefetcher to engage fully.
+(The temporary scalar \texttt{a\_tmp} allows caching of the
+intermediate accumulated value in the innermost loop.) A variety of
+standard sequential optimisations are relevant for the collision stage
+(loop unrolling, inclusion of SIMD operations, and so on
+\cite{wellein2006}). For example, the collision stage in
+\textit{Ludwig} includes explicit SIMD code, which is useful if the
+compiler on a given platform cannot identify it. The propagation
+stage is separate, and is organised as a ``pull'' (again, various
+optimisations have been considered, e.g.,
+\cite{pohl2003,mattila2007,wittmann2012}). No further optimisation is
+done here beyond ensuring that the ordering of the discrete velocities
+allows memory access to be as efficient as possible. While these
+optimisations are important, it should be remembered that for some
+complex fluid problems, the hydrodynamics embodied in the LB
+calculation is a relatively small part of the total computational cost
+(at the 10\%-level in some cases). This means optimisation effort may
+be better concentrated elsewhere.
+
+\begin{lstlisting}[float, label=ch14:listing1,
+caption = Collision schematic for CPU.]
+/* loop over lattice sites */
+for (is = 0; is < nsites; is++) {
+ ...
+ /* load distribution from ``array-of-structures'' */
+ for (i = 0; i < 19; i++)
+    f[i] = f_aos[19*is + i];
+ ...
+ /* perform matrix-vector multiplication */
+ for (i = 0; i < 19; i++) {
+ a_tmp = 0.0;
+ for (j = 0; j < 19; j++) {
+ a_tmp += f[j]*M[i][j];
+ }
+ a[i] = a_tmp;
+ }
+ ...
+}
+\end{lstlisting}
+
+
+\begin{figure}[!t]
+\centering
+\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
+\caption{Left: the lattice is decomposed between MPI tasks. For
+ clarity we show a 2D decomposition of a 3D lattice, but in practice
+ we decompose in all 3 dimensions. Halo cells are added to each
+ sub-domain (as shown on the upper right for a single slice) which store
+ data retrieved from remote neighbours in the halo exchange. Lower
+ right: the D3Q19 velocity set resident on a lattice site;
+ highlighted are the 5 ``outgoing'' elements to be transferred in a
+ specific direction.}
+\label{ch14:fig:decomphalo}
+\end{figure}
+
+
+The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
+Each local sub-domain is surrounded by a halo, or ghost, region of width one
+lattice site. While the collision is local, elements of the distribution
+must be exchanged at the edges of the domains to facilitate the propagation.
+To achieve the full 3D halo exchange, the standard approach of shifting the
+relevant data in each co-ordinate direction in turn is adopted. This
+requires appropriate synchronisation, i.e., a receive in the first
+co-ordinate direction must be complete before a send in the second direction
+involving relevant data can take place, and so on. We note that only
+``outgoing'' elements of the distribution need to be sent at each edge.
+For the D3Q19 model, this reduces the volume of data traffic from 19 to
+5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
+version, the necessary transfers are implemented in place using
+a vector of appropriately strided MPI datatypes for each direction.
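+
+As a concrete illustration of this approach (a sketch only, not the
+actual \textit{Ludwig} datatypes), the following constructs a strided
+MPI datatype describing one $x$-$z$ halo plane at fixed $y$ for the
+array-of-structures layout described above. For simplicity the sketch
+describes all 19 elements per site rather than only the outgoing five.
+
+\begin{lstlisting}
+#include <mpi.h>
+
+/* Sketch: datatype for one x-z halo plane (fixed y), assuming the
+   layout f[19*index + i] with index = x*nyh*nzh + y*nzh + z, where
+   nxh, nyh, nzh are the local extents including the halo. */
+
+MPI_Datatype halo_plane_y(int nxh, int nyh, int nzh) {
+
+  MPI_Datatype plane;
+
+  /* nxh blocks, each one x-row of sites at fixed y (19*nzh doubles),
+     separated by the stride between consecutive x values */
+  MPI_Type_vector(nxh, 19*nzh, 19*nyh*nzh, MPI_DOUBLE, &plane);
+  MPI_Type_commit(&plane);
+
+  return plane;
+}
+\end{lstlisting}
+
+A single send of one element of this datatype, starting at the address
+of the first site in the plane, then transfers the whole plane in place.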
+
+
+\section{Single GPU Implementation}\label{ch14:sec:singlegpu}
+
+
+In this section we describe the steps taken to enable \textit{Ludwig}
+for the GPU. There are a number of crucial issues: first, the
+minimisation of data traffic between host and device; second, the
+optimal mapping of available parallelism onto the architecture; and
+third, the issue of memory coalescing. We discuss each of these in
+turn.
+
+While the most important part of the LB algorithm in terms of
+floating-point performance is the collision stage, this cannot be the
+only consideration for a GPU implementation. It is essential to
+offload all computational activity which involves the main data
+structures (such as the distribution) to the GPU. This includes
+kernels with relatively low computational demand, such as the
+propagation stage. All relevant data then remain resident on the GPU,
+to avoid expensive host-device data transfer at each iteration of
+the algorithm. Such transfers would negate any benefit of GPU
+acceleration. We note that for a complex fluid code, this requirement
+can extend to a considerable number of kernels, although we limit the
+discussion to collision and propagation for brevity.
+
+To achieve optimal performance, it is vital to exploit fully the
+parallelism inherent in the GPU architecture, particularly for
+those matrix-vector operations within the collision kernel.
+The GPU architecture features a hierarchy of parallelism. At the
+lowest level, groups of 32 threads (warps) operate in
+lock-step on different data elements: this is SIMD style vector-level
+parallelism. Multiple warps are combined into a thread block (in which
+communication and synchronisation are possible), and multiple blocks can
+run concurrently across the streaming multiprocessors in the GPU
+(with no communication or
+synchronisation possible across blocks). To decompose on the GPU, we
+must choose which part of the collision to assign to which level of
+parallelism.
+
+While there exists parallelism within the matrix-vector operation, the
+length of each vector (here, 19) is smaller than the warp size and typical
+thread block sizes. So we simply decompose the loop over lattice sites
+to all levels of parallelism, i.e., we use a separate CUDA thread for
+every lattice site, and each thread performs the full matrix-vector
+operation. We find that a block size of 256 performs well on current
+devices: we therefore decompose
+the lattice into groups of 256 sites, and assign each group to a block
+of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
+assigned to the fast {\it constant} on-chip device memory.
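+
+A minimal sketch of the resulting decomposition is given below; the
+kernel and function names are illustrative rather than taken from the
+\textit{Ludwig} source, and the body of the collision follows the CPU
+schematic of Listing~\ref{ch14:listing1}.
+
+\begin{lstlisting}
+#define NVEL      19
+#define BLOCKSIZE 256
+
+/* the fixed transformation matrix resides in constant memory */
+__constant__ double M_cd[NVEL][NVEL];
+
+__global__ void collision_kernel(double *f_d, int nsites) {
+
+  /* one CUDA thread per lattice site */
+  int is = blockIdx.x*blockDim.x + threadIdx.x;
+  if (is >= nsites) return;
+
+  /* ... load the 19 elements for site 'is' and perform the
+     matrix-vector operation using M_cd, as in the CPU schematic ... */
+}
+
+void collide(double *f_d, double M_h[NVEL][NVEL], int nsites) {
+
+  /* copy the constant matrix to the device; launch one thread per
+     site in blocks of 256 */
+  cudaMemcpyToSymbol(M_cd, M_h, NVEL*NVEL*sizeof(double));
+  int nblocks = (nsites + BLOCKSIZE - 1)/BLOCKSIZE;
+  collision_kernel<<<nblocks, BLOCKSIZE>>>(f_d, nsites);
+}
+\end{lstlisting}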
+
+For the propagation stage, the GPU implementation adds a second time
+level of distribution values. The data-dependencies inherent in the
+propagation mean that the in-place propagation of the CPU version
+cannot be parallelised effectively without the additional time level.
+As both time levels may remain resident on the GPU, this is not a
+significant overhead.
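+
+A sketch of a ``pull'' propagation kernel using the two time levels is
+given below (names are illustrative). It assumes the
+structure-of-arrays data layout introduced in the following paragraphs;
+for clarity the neighbour indices are wrapped periodically, whereas in
+practice the halo sites supply the required neighbouring values.
+
+\begin{lstlisting}
+__global__ void propagation_kernel(const double *f_old, double *f_new,
+                                   const int *cx, const int *cy,
+                                   const int *cz,
+                                   int nx, int ny, int nz) {
+  int nsites = nx*ny*nz;
+  int is = blockIdx.x*blockDim.x + threadIdx.x;
+  if (is >= nsites) return;
+
+  int iz = is % nz;
+  int iy = (is/nz) % ny;
+  int ix = is/(ny*nz);
+
+  /* each thread writes its own site by pulling each element from the
+     appropriate "upstream" neighbour in the old time level */
+  for (int i = 0; i < 19; i++) {
+    int jx = (ix - cx[i] + nx) % nx;
+    int jy = (iy - cy[i] + ny) % ny;
+    int jz = (iz - cz[i] + nz) % nz;
+    int js = (jx*ny + jy)*nz + jz;
+    f_new[nsites*i + is] = f_old[nsites*i + js];
+  }
+}
+\end{lstlisting}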
+
+An architectural
+constraint of GPUs means that optimal global memory bandwidth
+is only achieved when data are structured such that threads within a
+{\it half-warp} (a group of 16 threads) load data from the same memory
+segment in a single transaction: this is memory coalescing. It can be
+seen that the array-of-structures ordering used for the distribution
+in the CPU code would not be suitable for coalescing; in fact, it would
+result in serialised memory accesses and relatively poor performance.
+To meet the coalescing criteria and allow consecutive threads to read
+consecutive memory addresses on the GPU, we transpose the layout of the
+distribution so that, for each velocity component, consecutive sites
+are contiguous in memory (``structure-of-arrays'' order). A schematic of
+the GPU collision code is given in Listing~\ref{ch14:listing2}.
+
+\begin{lstlisting}[float, label=ch14:listing2,
+caption = Collision schematic for GPU.]
+/* compute current site index 'is' from CUDA thread and block */
+/* indices and load distribution from "structure-of-arrays" */
+
+for (i = 0; i < 19; i++)
+  f[i] = f_soa[nsites*i + is];
+
+/* perform matrix-vector multiplication as before */
+...
+\end{lstlisting}
+
+
+\textit{Ludwig} was modified to allow a choice of distribution data layout
+at compilation time depending on the target architecture: CPU or GPU. We
+defer some further comments on software engineering aspects of the code to
+the summary.
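+
+One way to realise this compile-time choice (a sketch; the macro and
+symbol names are illustrative) is to route all accesses to the
+distribution through a single addressing macro, so that only the macro
+differs between the CPU and GPU builds:
+
+\begin{lstlisting}
+#ifdef TARGET_GPU
+/* structure-of-arrays: consecutive sites contiguous per velocity */
+#define F_INDEX(nsites, is, i) ((nsites)*(i) + (is))
+#else
+/* array-of-structures: 19 velocities contiguous per site */
+#define F_INDEX(nsites, is, i) (19*(is) + (i))
+#endif
+
+/* usage: f[F_INDEX(nsites, is, i)] in both collision and propagation */
+\end{lstlisting}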
+
+
+\section{Multiple GPU Implementation}\label{ch14:sec:parallelgpu}
+
+To execute on multiple GPUs, we use the same domain decomposition
+and message passing framework as the CPU version. Within each
+sub-domain (allocated to one MPI task) the GPU implementation
+proceeds as described in the previous section. The only additional
+complication is that halo transfers between GPUs must be staged
+through the host (in future, direct GPU to GPU data transfers via
+MPI may be possible, obviating the need for these steps). This
+means host MPI sends must be preceded by appropriate device to host
+transfers and host MPI receives must be followed by corresponding host
+to device transfers.
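+
+A schematic of one such staged exchange in a single co-ordinate
+direction is given below; the packing and unpacking kernels and all
+names are illustrative, and only one of the two sends required in each
+direction is shown.
+
+\begin{lstlisting}
+void halo_exchange_x(double *f_d, double *sendbuf_d, double *recvbuf_d,
+                     double *sendbuf_h, double *recvbuf_h,
+                     int ncount, int rank_dn, int rank_up,
+                     MPI_Comm comm) {
+
+  int nblocks = (ncount + 255)/256;
+
+  /* device-side pack of the "outgoing" elements, then device to host */
+  pack_halo_x<<<nblocks, 256>>>(f_d, sendbuf_d, ncount);
+  cudaMemcpy(sendbuf_h, sendbuf_d, ncount*sizeof(double),
+             cudaMemcpyDeviceToHost);
+
+  /* host-to-host exchange */
+  MPI_Sendrecv(sendbuf_h, ncount, MPI_DOUBLE, rank_dn, 0,
+               recvbuf_h, ncount, MPI_DOUBLE, rank_up, 0,
+               comm, MPI_STATUS_IGNORE);
+
+  /* return path: host to device, then device-side unpack */
+  cudaMemcpy(recvbuf_d, recvbuf_h, ncount*sizeof(double),
+             cudaMemcpyHostToDevice);
+  unpack_halo_x<<<nblocks, 256>>>(f_d, recvbuf_d, ncount);
+}
+\end{lstlisting}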
+
+%To execute on multiple GPUs in parallel we decompose the problem using
+%the existing high-level parallelisaton framework: each MPI task acts
+%as a host to a separate GPU, with each GPU kernel operating on the
+%subset of the lattice volume corresponding to the partition assigned
+%to that task. As described in Section \ref{ch14:sec:background}, halo
+%data exchanges are required at each timestep. Significant
+%developments are made for the GPU adaptation of this communication
+%phase to achieve good inter-GPU communication performance and allow
+%scaling to many GPUs.
+
+%The CPU code uses strided MPI datatypes to define the six
+%planes of lattice data to be sent or received, and performs the
+%communications {\it in-place}, i.e. there is no explicit buffering of
+%data in the application. For the GPU adaptation, the halo planes must
+%be exchanged between GPUs, on which MPI functionality is not
+%available, so transfers must be staged through the host CPUs. Buffer
+%packing/unpacking of the halo planes is explicitly performed using
+%CUDA kernels, and CUDA memory copy calls are used to transfer data
+%from a device to its host (using pinned host memory) through the PCI-e
+%bus (and vice versa). MPI calls are used to transfer the buffers
+%between hosts. The CPU version of Ludwig has a mechanism, using MPI built-in
+%data types, to reduce the amount of data communicated by around a
+%factor of four by only sending those elements of the distribution
+%propagating outward from a local domain. For the GPU adaptation, this
+%filtering of data is explicitly performed (again since MPI
+%functionality is not available on the GPU) in the buffer
+%packing/unpacking kernels. Special care had to be taken for those
+%lattice sites on the corners of the halo planes, which had to be sent
+%in more than one direction (since the propagation includes accesses to
+%diagonal neighbours).
+
+In practice, this data movement requires additional GPU kernels to
+pack and unpack the relevant data before and after corresponding MPI
+calls. However, the standard shift algorithm, in which each co-ordinate
+direction is treated in turn, does provide some scope for the
+overlapping of different operations.
+For example, after the data for the first co-ordinate direction have been
+retrieved by the host, these can be exchanged using MPI between
+hosts at the same time as kernels for packing and retrieving of
+data for the second co-ordinate direction are
+executed. This
+overlapping must respect the synchronisation required to ensure
+that data values at the corners of the sub-domain are transferred
+correctly.
+We use a separate CUDA stream for each co-ordinate direction:
+this allows some of the host-device communication time to be
+effectively ``hidden'' behind the host-host MPI communication,
+resulting in an overall speedup.
+The improvement is more pronounced for smaller local lattice sizes,
+perhaps because there is less contention for CPU memory bandwidth. The
+overlapping is then a particularly valuable aid to strong scaling,
+where the local system size decreases as the number of GPUs increases.
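+
+The structure of this overlap is sketched below (all names are
+illustrative, and the corner synchronisation and the host-to-device
+return path are folded into a hypothetical per-direction exchange
+routine; pinned host memory is assumed for the asynchronous copies).
+
+\begin{lstlisting}
+cudaStream_t stream[3];
+for (int d = 0; d < 3; d++) cudaStreamCreate(&stream[d]);
+
+for (int d = 0; d < 3; d++) {
+  /* start packing and retrieving direction d without blocking the host */
+  pack_halo<<<nblocks[d], 256, 0, stream[d]>>>(f_d, sendbuf_d[d], d);
+  cudaMemcpyAsync(sendbuf_h[d], sendbuf_d[d], nbytes[d],
+                  cudaMemcpyDeviceToHost, stream[d]);
+
+  if (d > 0) {
+    /* the MPI exchange for the previous direction proceeds on the
+       host while the device works on direction d */
+    exchange_direction(d - 1);
+  }
+  cudaStreamSynchronize(stream[d]);
+}
+exchange_direction(2);
+\end{lstlisting}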
+
+
+%\section{Performance}\label{ch14:sec:performance}
+
+To demonstrate the effectiveness of our approach, we compare the
+performance of both CPU and GPU versions of \textit{Ludwig}. To
+test the complex fluid nature of the code, the problem is
+actually an immiscible fluid mixture which uses a second distribution
+function to introduce a composition variable. The interested reader
+is referred to \cite{ch14:stratford-jsp2005} for further details. The
+largest total problem size used is $2548\times 1764\times 1568$.
+The CPU system is a Cray XE6, with two 16-core AMD Opteron
+CPUs per node and nodes connected by the Cray Gemini interconnect.
+For GPU results, a Cray XK6 system is used: this is very similar to the
+XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
+node in the GPU system therefore features a single Opteron CPU acting
+as a host to a single GPU. The inter-node interconnect architecture is
+the same as for the Cray XE6. The GPU performance tests use a prototype
+Cray XK6 system with 936 nodes (the largest available at the time of writing).
+To provide a fair comparison, we compare scaling on a \textit{per node}
+basis. That is, we compare 1 fully occupied 32-core CPU node
+(running 32 MPI tasks) with 1 GPU node (host running 1 MPI task,
+and 1 device). We believe this is representative of the true ``cost'' of
+a simulation in terms of accounting budgets, or electricity.
+
+
+%\begin{figure}[!t]
+%\centering
+%\includegraphics[width=10cm]{Chapters/chapter14/figures/length_scale}
+%\caption{The dependence of the characteristic length scale of a binary
+% fluid mixture on the simulation timestep number. Superimposed are
+% three snapshots showing increased mixture separation with time:
+% these are obtained through the order parameter and each show one
+% eighth of the total simulation size, which is $1548\times 1764\times 1568$. }\label{ch14:fig:length_scale}
+%\end{figure}
+
+\begin{figure}[!t]
+\centering
+\includegraphics[width=10cm]{Chapters/chapter14/figures/two_graphs_ud}
+\caption{The weak (top) and strong (bottom) scaling of \textit{Ludwig}. Closed
+ shapes denote results using the CPU version run on the Cray
+ XE6 (using two 16-core AMD Interlagos CPUs per node), while open
+ shapes denote results using the GPU version on the Cray XK6 (using a
+ single NVIDIA X2090 GPU per node). Top: the benchmark time is shown
+ where the problem size per node is constant. Bottom: performance is
+ shown where, for each of the four problem sizes presented, the
+ results are scaled by the lattice volume, and all results are
+ normalized by the same arbitrary constant. }
+\label{ch14:fig:scaling}
+\end{figure}
+
+
+%The top of Figure \ref{ch14:fig:scaling} shows the benchmark time,
+%where the problem size is increased with the number of nodes. This
+%shows weak scaling: the time taken by a perfectly scaling code would
+%remain constant. The figure plots the time against the number of {\it
+% nodes}: for the GPU version we utilize the one GPU per Cray XK6 node
+%whilst for the CPU version we fully utilize two 16-core CPUs
+%per Cray XE6 node, i.e. one GPU is compared with 32 CPU cores. We use
+%a single MPI task per Cray XK6 node to host the GPU, and 32 MPI tasks
+%to fully utilize the CPUs in the Cray XE6 node. Note that our CPU
+%version is highly optimised: in particular it has been tuned for
+%optimal utilisation of the SIMD vector units in each CPU core.
+
+Figure \ref{ch14:fig:scaling} shows the results of both weak and
+strong scaling tests (top and bottom panels, respectively). For
+weak scaling, where the local sub-domain size is kept fixed
+(here a relatively large 196$^3$ lattice), the time taken by
+an ideal code would remain constant. It can be seen that while
+scaling of the CPU version shows a pronounced upward slope, the
+GPU version scales almost perfectly. The superior scaling of the GPU
+version stems from the fact that it uses a factor of 32 fewer MPI
+tasks per node, so communications consist of a smaller number of
+larger data transfers.
+The performance
+advantage of the GPU version ranges from a factor of around 1.5 to
+around 1.8. Careful study by other authors \cite{williams2011} has
+found the absolute performance (in terms of floating point
+operations per second, or floating point operations per Watt) to
+be remarkably similar between architectures.
+
+Perhaps of more interest is the strong scaling picture (lower
+panel in Figure \ref{ch14:fig:scaling}) where the performance
+as a function of the number of nodes is measured for fixed problem
+size. We consider four different fixed problem sizes on both CPU
+(up to 512 nodes shown) and GPU (up to 768 nodes). To allow comparison,
+the results are scaled by total system size in each case. For strong
+scaling, the disparity in the number of MPI tasks is clearly revealed
+in the failure of the CPU version to provide any significant benefit
+beyond a modest number of nodes as communication overheads dominate.
+In contrast, the GPU version shows
+reasonably robust scaling even for the smaller system sizes and good
+scaling for the larger systems.
+
+\section{Moving Solid Particles}\label{ch14:sec:particles}
+
+\begin{figure}[t]
+\centering
+%\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
+\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new}
+\caption{
+A two-dimensional schematic picture of spherical particles on the lattice.
+Left: a particle is allowed
+to move continuously across the lattice, and the position of the
+surface defines fluid lattice sites (light blue) and solid lattice
+sites (dark red). The discrete surface is defined by links
+where propagation would intersect the surface (arrows). Note the
+discrete shape of the two particles is different. Right: post-collision
+distributions are reversed at the surface by the process of bounce-back
+on links, which replaces the propagation.
+%A 2D illustration of colloidal particles (grey circles)
+% interacting with the lattice, where each lattice point is
+% represented by a cross. Left: each particle spans multiple sites.
+% Right: the particle-facing distribution components are moved inside
+% the particle with the direction reversed, such that the ``bounce
+% back'' will be completed when the fluid propagates.
+}
+\label{ch14:fig:bbl}
+\end{figure}
+
+
+%\begin{figure}[t]
+%\centering
+%\includegraphics[width=6cm]{Chapters/chapter14/figures/particlemove}
+%\caption{A 2D illustration of a colloidal particle moving across the
+% lattice. The old and new positions are represented by the open and
+% filled circles respectively.}
+%\label{ch14:fig:particlemove}
+%\end{figure}
+
+% BEING NEW SOLID TEXT
+
+The introduction of moving solid particles poses an additional hurdle
+to efficient GPU implementation of an LB code such as \textit{Ludwig}.
+In this section, we give a brief overview of the essential features
+of the CPU implementation, and how the considerations raised in the
+previous sections --- maximising parallelism and minimising both
+host to device and GPU to GPU data movement --- shape the design decisions
+for a GPU implementation. We restrict our discussion to a simple fluid
+in this section; additional solid-fluid boundary conditions (e.g.,
+wetting at a fluid-fluid-solid contact line) usually arise elsewhere
+in the calculation and are broadly independent of the hydrodynamic boundary
+conditions which are described in what follows.
+
+Moving solid particles (here, spheres) are defined by a centre position
+which is allowed to move continuously across the space of the lattice,
+and a fixed radius which is typically a few lattice spacings $\Delta r$.
+The surface of the particle is defined by a series of \textit{links}
+where a discrete velocity propagation $\mathbf{c}_i \Delta t$ would
+intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
+Hydrodynamic boundary conditions are then implemented via the standard
+approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
+relevant post-collision distribution values are reversed at the
+propagation stage with an appropriate correction to allow for the
+solid body motion. The exchange of momentum at each link must then be
+accumulated around the entire particle surface to provide the net
+hydrodynamic force
+and torque on the sphere. The particle motion can then be updated in a
+molecular dynamics-like step.
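+
+For a link between a fluid site $\mathbf{r}$ and a neighbouring solid
+site in direction $\mathbf{c}_i$, the bounce-back rule with a moving
+boundary takes the schematic form (see, e.g., \cite{ladd1994})
+\begin{equation}
+f_{i'}(\mathbf{r}; t + \Delta t) = f^{*}_i(\mathbf{r}; t)
+- \frac{2 w_i \rho\, \mathbf{u}_b \cdot \mathbf{c}_i}{c_s^2},
+\end{equation}
+where $i'$ denotes the direction opposite to $i$, $f^{*}_i$ is the
+post-collision distribution, $\mathbf{u}_b$ is the velocity of the
+solid surface at the link, $w_i$ is the quadrature weight associated
+with $\mathbf{c}_i$, and $c_s$ is the speed of sound. The momentum
+change associated with each such link is what must be accumulated to
+give the net force and torque on the particle.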
+
+For the CPU implementation, a number of additional MPI communications
+are required: (1) to exchange the centre position, radius, etc., of each
+particle as it moves and (2), to allow the accumulation of the force
+and torque for each particle (the links of which may be distributed
+between up to 8 MPI tasks in three dimensions). Appropriate marshalling
+of these data can provide an effective parallelisation for relatively
+dense particle suspensions of up to 40\% solid volume fraction
+\cite{stratford2008}. As a final consideration, fluid distributions
+must be removed and replaced consistently as a given particle moves
+across the lattice and changes its discrete shape.
+
+With these features in mind, we can identify a number of competing
+concerns which are relevant to a GPU implementation:
+\begin{enumerate}
+\item
+Minimisation of host-device data transfer would argue for moving the
+entire particle code to the GPU. However, the code in question involves
+largely conditional logic (e.g., identifying cut surface links) and
+irregular memory accesses (e.g., access to distribution elements around
+a spherical particle). These operations seem poorly suited to effective
+parallelisation on the GPU. As an additional complication, the sums
+required over the particle surface would involve potentially tricky
+and inefficient reductions in GPU memory.
+\item
+The alternative is to retain the relevant code on the CPU. While the
+transfer of the entire distribution $f_i(\mathbf{r};t)$ between host
+and device at each time step is prohibitively expensive owing to the
+limited PCIe bus bandwidth, the transfer of only the relevant distribution
+information
+to allow bounce-back on links is possible. At modest solid volumes
+only a very small fraction of the distribution data is
+involved in bounce-back (e.g., a 2\% solid volume fraction of particles
+of radius 2.3$\Delta r$ would involve approximately 1\% of the distribution
+data). This option also has the advantage that no further host-device
+data transfers are necessary to allow the MPI exchanges required for
+particle information.
+\end{enumerate}
+
+
+We have implemented the second option as follows. For each sub-domain,
+a list of boundary-cutting links is assembled on the CPU which includes
+the identity of the relevant element of the distribution. This list,
+together with the particle information required to compute the correct
+bounce-back term, is transferred to the GPU. The updates to the relevant
+elements of the distribution can then take place on the GPU. The
+corresponding information to compute the update of the particle dynamics
+is returned
+to the CPU, where the reduction over the surface links is computed.
+The change of particle shape may be dealt with in a similar manner:
+the relatively small number of updates required at any one time step
+(or however frequently the particle position is updated) can be
+marshalled to the GPU as necessary. This preliminary implementation
+is found to be effective on problems involving up to 4\%
+solid volume fraction.
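+
+A sketch of the corresponding device kernel is given below (the
+structure and all names are illustrative, and the exact bookkeeping of
+coefficients is omitted). Each thread handles one link: the index
+arrays identify the relevant elements of the distribution, the
+precomputed correction accounts for the solid body motion, and a
+per-link contribution is returned for the reduction performed on the
+CPU.
+
+\begin{lstlisting}
+__global__ void bounce_back_links(double *f_d,
+                                  const int *idx_out, const int *idx_in,
+                                  const double *correction,
+                                  double *dmomentum, int nlinks) {
+
+  int n = blockIdx.x*blockDim.x + threadIdx.x;
+  if (n >= nlinks) return;
+
+  double fout = f_d[idx_out[n]];             /* post-collision value  */
+  f_d[idx_in[n]] = fout - correction[n];     /* reversed distribution */
+  dmomentum[n]   = 2.0*fout - correction[n]; /* returned to the CPU   */
+}
+\end{lstlisting}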
+
+We note that the CPU version actually avoids the collision calculation
+at solid lattice points by consulting a look-up table of solid/fluid
+status. On the GPU, it is perhaps preferable to perform the collision
+stage everywhere instead of moving the look-up table to the GPU and
+introducing the associated logic.
+Ultimately, the GPU might favour other boundary methods which treat solid and
+fluid on a somewhat more equal basis, for example, the immersed boundary
+method \cite{ch14:immersed,ch14:immersed-lb} or smoothed profile method
+\cite{ch14:spm}.
+However, the approach adopted here allows us to exploit
+the GPU for the intensive fluid simulation whilst maintaining the complex
+code required for particles on the CPU. Overheads of CPU-GPU transfer are
+minimised by transferring only those data relevant to the hydrodynamic
+interaction implemented via bounce-back on links.
+
+% END NEW SOLID TEXT
+
+%It is of interest, in many situations, to include colloidal particles
+%in the simulation **more from Kevin***. These are by represented by
+%moving spheres that interact with the fluid through transfer of
+%momenta: fluid coming into contact with a particle will bounce off it,
+%giving the particle a kick in the process. As illustrated on the left
+%of Figure \ref{ch14:fig:bbl} (which shows 2 dimensions only for
+%clarity - in practice, the particles will be represented by spheres on
+%a 3D lattice, and will be much larger relative to the lattice
+%spacing), typically each particle spans multiple lattice sites, and
+%therefore ``intercepts'' multiple inter-site links. Ludwig stores
+%particle information using nested linked lists: a list of particles,
+%each with a list of all the links intercepted by that particle (plus
+%other quantities). The fluid-particle interaction proceeds using the
+%so-called {\it bounce back on links} procedure during each timestep. As
+%illustrated in the right of Figure \ref{ch14:fig:bbl}, for each of the
+%intercepting links, those velocity components of the distribution that
+%face inwards to each particle are moved from the site
+%outside the particle to the site inside the particle, and
+%reversed in direction. This is done before the propagation stage,
+%which then propagates these same velocity components back out of the
+%particles: the velocity bounces back from the particles. The
+%momenta transfer from fluid to particles is also determined
+%from the distribution values.
+
+%A simplified schematic of the implementation on the CPU is as follows,
+%noting that multiplicative coefficients are required:
+%{\footnotesize
+%\begin{verbatim}
+%For each particle:
+% For each link:
+% Calculate coefficient
+% Bounce back fluid using coefficient
+% Update particle momentum
+%\end{verbatim}
+%}
+%This process is relatively inexpensive on the CPU, since the total
+%number of particles (and intercepted links) is always relatively
+%small, but it does require access to distribution data which we want
+%to keep resident on the GPU throughout the simulation. The overhead of
+%transferring the full distribution to the CPU and back every timestep
+%would be too high. But on the other hand, the full codebase required
+%to manage particles is quite complex and comprehensive: completely
+%porting it to the GPU, such that the particles effectively also remain
+%resident on the GPU, would be a large effort and would cause a major
+%diversion of CPU and GPU codebases. Further complications would arise
+%when running on multiple GPUs due to the fact that particles can move
+%between parallel subdomains. Our solution, which minimises overhead,
+%is to keep the particles resident on CPU, but offload only their
+%interaction with the fluid to the GPU, as we describe below.
+
+%On the CPU, the code in the above listing can be restructured into
+%three distinct stages as (introducing arrays to temporrily store
+%indices and data):
+%{\footnotesize
+%\begin{verbatim}
+%1) For each particle, for each link:
+% Store site and velocity indices associated with link
+% Calculate and store coefficient
+%2) For each particle, for each link:
+% Update velocity data using stored values
+% Calculate and store particle momenta updates
+%3) For each particle, for each link:
+% Update particle momentum using stored values
+%\end{verbatim}
+%}
+%Stage 2 is only one that that accesses fluid data: this can therefore
+%be moved to the GPU, while stages 1 and 3 remain on the CPU. Stage 2
+%therefore becomes, for the GPU implementation
+%{\footnotesize
+%\begin{verbatim}
+%Transfer coefficient and link index arrays to GPU
+%Run CUDA kernel:
+% Update fluid data using coefficient and link index arrays
+% Calculate and store particle momenta updates from fluid data
+%Transfer particle momenta updates array back to CPU
+%\end{verbatim}
+%}
+%Since the total number of links involves is relatively small, these
+%partial data transfers have minimal overhead. A complication arises
+%due to the fact that the particles move through the lattice, albeit
+%relatively slowly, and therefore the list of links that they intercept
+%occasionally changes. When a
+%particle moves off a lattice site, the fluid on that site must be
+%reconstructed. Similarly, the fluid on the newly obstructed site must
+%be removed. For the same reasons given above, we want to avoid moving
+%all the code dealing with this complication to the GPU, whilst
+%minimising overheads. Since, on any one timestep, the number of
+%lattice sites affected is relatively low, we implement functionality
+%to perform partial data transfers of the distribution. Only those
+%sites affected are packed into a buffer on the GPU, transferred to the
+%CPU and unpacked into the CPU version of the distribution. Then, the
+%existing code can be used to update those affected sites, before the
+%reverse partial data transfer is performed to update the GPU version
+%of data.
+
+It is perhaps interesting at this point to make some more general
+observations on the software engineering challenge presented when
+extending an existing CPU code to the GPU. The task of maintaining
+the code in a portable fashion while preserving performance is
+already formidable. To help this process,
+we have followed a number of basic principles. First,
+in order to port to the GPU in an incremental fashion,
+we have tried to maintain the modular structure of the CPU version where
+possible. For each data structure, such as the distribution, a separate
+analogue is maintained in both the CPU and GPU memory spaces. However,
+the GPU copy does not include the complete CPU structure: in
+particular, non-intrinsic datatypes such as MPI datatypes are not
+required on the GPU. Functions to marshal data between CPU and GPU
+are provided for each data structure, abstracting the underlying
+CUDA implementation. (This reasonably lightweight abstraction layer
+could also allow an OpenCL implementation to be developed.)
+This makes it easy to switch between the CPU and GPU for different
+components in the code, which is useful in development
+and testing. GPU functionality can be added incrementally while
+retaining a code that runs correctly (albeit slowly due to data
+transfer overheads). Once all relevant components are moved to
+the GPU, it becomes possible to remove such data transfers and
+keep the entire problem resident on the device.
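+
+As a sketch of the kind of marshalling functions involved (the names
+are illustrative), each structure has a device analogue and a pair of
+routines hiding the underlying CUDA calls; any transposition between
+the array-of-structures and structure-of-arrays layouts, omitted here,
+would also be handled at this level.
+
+\begin{lstlisting}
+void distribution_to_device(double *f_d, const double *f_h, int nsites) {
+  cudaMemcpy(f_d, f_h, 19*nsites*sizeof(double), cudaMemcpyHostToDevice);
+}
+
+void distribution_from_device(double *f_h, const double *f_d, int nsites) {
+  cudaMemcpy(f_h, f_d, 19*nsites*sizeof(double), cudaMemcpyDeviceToHost);
+}
+\end{lstlisting}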
+
+%To ensure that \textit{Ludwig} remains suitable for CPU-based systems,
+%we augment rather than modify the original code. All new GPU-related
+%functionality are implemented in an additive manner, mirroring the
+%modular structure of the original code. The new GPU modules have
+%interfaces similar to the original CPU functions, and the ability to
+%switch between versions is achieved at compile-time via C
+%pre-processing directives. All GPU compute kernels are implemented using
+%CUDA. Wrapper routines are developed to specify the
+%decomposition, invoke the CUDA kernels and perform the necessary data
+%management operations: the latter are facilitated through newly
+%developed modules for creation and destruction of device memory
+%structures, the copying of data between host and device, and where
+%necessary the packing and buffering of data.
+
+
+
+
+\section{Summary}
+\label{ch14:sec:summary}
+
+We have described the steps taken to implement, on both single and
+multiple GPUs, the \textit{Ludwig} code which was originally
+designed for complex fluid problems on a conventional CPU
+architecture. We have added the necessary functionality using
+NVIDIA CUDA and discussed the important changes to the main
+data structures. By retaining domain decomposition and message
+passing via MPI, we have demonstrated that it is possible to scale
+complex fluid problems to large numbers of GPUs in parallel.
+By following the key design criteria of maximising parallelism
+and minimising host-device data transfers, we have confirmed
+the mixed MPI-GPU approach is viable for scientific applications.
+For the more intricate problem of moving solid particles, we find
+it is possible to retain the more serial elements related to
+particle link operations on the CPU, while offloading only the
+parallel lattice-based operations involving the LB distribution
+to the GPU. Again, this minimises host-device movement of data.
+
+From the software engineering viewpoint, some duplication of code
+to allow efficient implementation on both host and device is
+currently required. This issue might be addressed by approaches
+such as automatic kernel generation, but may also be addressed
+naturally in time as GPU and CPU hardware converge. Portable
+abstractions and APIs, perhaps based on approaches such as OpenCL,
+will also facilitate the development and maintenance of portable
+codes which also exhibit portable performance (perhaps in conjunction
+with automatic tuning approaches).
+
+So, while the challenges in designing portable and efficient scientific
+applications remain very real, this work provides some hope that large
+clusters of GPU-accelerated machines can be used effectively for a wide range of
+complex fluid problems.
+
+
+%The \textit{Ludwig} LB application, which specializes
+%in complex fluid problems, has undergone the developments required to
+%use large-scale parallel GPU accelerated supercomputers effectively.
+%The new functionality augments the original code, so that
+%\textit{Ludwig} can run on either CPU-based or GPU-accelerated systems.
+%Using the NVIDIA CUDA programming language, all the computational
+%kernels were offloaded to the GPU. This included not only those
+%responsible for the majority of the compute time, but any that
+%accessed the relevant data structures. This allows data to be kept
+%resident on the GPU and avoids expensive data transfers. We described
+%steps taken to optimise the performance on the GPU architecture by
+%reducing off-chip memory accesses and restructuring the data layout
+%in memory to ensure the available memory bandwidth was exploited fully.
+
+%A new halo-exchange communication phase for the code was developed to
+%allow efficient parallel scaling to many GPUs. The CPU version
+%relies on MPI datatype functionality for CPU to CPU communication; we
+%have explicitly written buffer packing/unpacking and transfer
+%operations to for GPU to GPU data exchange, staged through the host
+%CPUs. This includes the filtering of velocity components to only
+%transfer those required in a specific direction, hence reducing the
+%overall volume of data transferred. We used CUDA stream functionality
+%to overlap separate stages within the communication phase, and reduce
+%the overall communication time. The effect was measured to be more
+%profound for smaller local system sizes, i.e. it is especially useful
+%in aiding strong scaling.
+
+%We described work to enable colloidal particles in the simulation with
+%minimal overhead. We implemented these in such a way that we avoid a
+%major diversion of the CPU and GPU codebases, whilst minimising data
+%transfers between the CPU and GPU at each timestep. We keep the
+%majority of the (relatively inexpensive) particle related code on the
+%CPU, while offloading only those parts responsible for the interaction
+%with the fluid to the GPU (where the fluid data resides).
+
+%Our resulting package is able to exploit the largest of GPU
+%accelerated architectures to address a wide range of complex problems.
+
+
+
+% An example of a floating figure using the graphicx package.
+% Note that \label must occur AFTER (or within) \caption.
+% For figures, \caption should occur after the \includegraphics.
+% Note that IEEEtran v1.7 and later has special internal code that
+% is designed to preserve the operation of \label within \caption
+% even when the captionsoff option is in effect. However, because
+% of issues like this, it may be the safest practice to put all your
+% \label just after \caption rather than within \caption{}.
+%
+% Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
+% option should be used if it is desired that the figures are to be
+% displayed while in draft mode.
+%
+%\begin{figure}[!t]
+%\centering
+%\includegraphics[width=2.5in]{myfigure}
+% where an .eps filename suffix will be assumed under latex,
+% and a .pdf suffix will be assumed for pdflatex; or what has been declared
+% via \DeclareGraphicsExtensions.
+%\caption{Simulation Results}
+%\label{fig_sim}
+%\end{figure}
+
+% Note that IEEE typically puts floats only at the top, even when this
+% results in a large percentage of a column being occupied by floats.
+
+
+% An example of a double column floating figure using two subfigures.
+% (The subfig.sty package must be loaded for this to work.)
+% The subfigure \label commands are set within each subfloat command, the
+% \label for the overall figure must come after \caption.
+% \hfil must be used as a separator to get equal spacing.
+% The subfigure.sty package works much the same way, except \subfigure is
+% used instead of \subfloat.
+%
+%\begin{figure*}[!t]
+%\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
+%\label{fig_first_case}}
+%\hfil
+%\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
+%\label{fig_second_case}}}
+%\caption{Simulation results}
+%\label{fig_sim}
+%\end{figure*}
+%
+% Note that often IEEE papers with subfigures do not employ subfigure
+% captions (using the optional argument to \subfloat), but instead will
+% reference/describe all of them (a), (b), etc., within the main caption.
+
+
+% An example of a floating table. Note that, for IEEE style tables, the
+% \caption command should come BEFORE the table. Table text will default to
+% \footnotesize as IEEE normally uses this smaller font for tables.
+% The \label must come after \caption as always.
+%
+%\begin{table}[!t]
+%% increase table row spacing, adjust to taste
+%\renewcommand{\arraystretch}{1.3}
+% if using array.sty, it might be a good idea to tweak the value of
+% \extrarowheight as needed to properly center the text within the cells
+%\caption{An Example of a Table}
+%\label{table_example}
+%\centering
+%% Some packages, such as MDW tools, offer better commands for making tables
+%% than the plain LaTeX2e tabular which is used here.
+%\begin{tabular}{|c||c|}
+%\hline
+%One & Two\\
+%\hline
+%Three & Four\\
+%\hline
+%\end{tabular}
+%\end{table}
+
+
+% Note that IEEE does not put floats in the very first column - or typically
+% anywhere on the first page for that matter. Also, in-text middle ("here")
+% positioning is not used. Most IEEE journals/conferences use top floats
+% exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
+% footnotes above bottom floats. This can be corrected via the \fnbelowfloat
+% command of the stfloats package.
+
+
+% conference papers do not normally have an appendix
+
+
+% use section* for acknowledgement
+\section*{Acknowledgments}
+AG was supported by the CRESTA project, which has received
+funding from the European Community's Seventh Framework Programme
+(ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported
+by UK EPSRC under grant EP/J007404/1.
+
+
+% trigger a \newpage just before the given reference
+% number - used to balance the columns on the last page
+% adjust value as needed - may need to be readjusted if
+% the document is modified later
+%\IEEEtriggeratref{8}
+% The "triggered" command can be changed if desired:
+%\IEEEtriggercmd{\enlargethispage{-5in}}
+
+% references section
+
+% can use a bibliography generated by BibTeX as a .bbl file
+% BibTeX documentation can be easily obtained at:
+% http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
+% The IEEEtran BibTeX style support page is at:
+% http://www.michaelshell.org/tex/ieeetran/bibtex/
+%\bibliographystyle{IEEEtran}
+% argument is your BibTeX string definitions and bibliography database(s)
+%\bibliography{IEEEabrv,../bib/paper}
+%
+% <OR> manually copy in the resultant .bbl file
+% set second argument of \begin to the number of references
+% (used to reserve space for the reference number labels box)
+
+
+\begin{thebibliography}{1}
+
+%\bibitem{IEEEhowto:kopka}
+%H.~Kopka and P.~W. Daly, \emph{A Guide to \LaTeX}, 3rd~ed.\hskip 1em plus
+% 0.5em minus 0.4em\relax Harlow, England: Addison-Wesley, 1999.
+
+
+
+\bibitem{succi-book}
+S. Succi, \textit{The lattice Boltzmann equation and beyond},
+Oxford University Press, Oxford, 2001.
+
+%\bibitem{mpi-standard}
+%Message Passing Interface Forum, http://www.mpi-forum.org
+
+\bibitem{desplat}
+Desplat, J.-C., I. Pagonabarraga, and P. Bladon,
+\textit{LUDWIG: A parallel lattice-Boltzmann code for complex fluids}.
+Comput. Phys. Comms., \textbf{134}, 273, 2001.
+
+\bibitem{aidun2010}
+C.K. Aidun and J.R. Clausen,
+\textit{Lattice Boltzmann method for complex flows},
+Ann. Rev. Fluid Mech., \textbf{42} 439--472 (2010).
+
+%\bibitem{bray1994}
+%A.J. Bray,
+%\textit{Theory of phase-ordering kinetics},
+%Adv. Phys., \textbf{43} 357--459 (1994).
+
+%\bibitem{kendon2001}
+%V.M. Kendon, M.E. Cates, I. Pangonabarraga, J.-C. Desplat, and P. Bladon,
+%\textit{Inertial effects in the three-dimensional spinodal decomposition of a
+%symmetric binary fluid mixture; a lattice Boltzmann study},
+%J. Fluid Mech., \textbf{440}, 147--203 (2001).
+
+%\bibitem{swift1996}
+%M.R. Swift, E. Orlandini, W.R. Osborn, and J.M. Yeomans,
+%\textit{Lattice Boltzmann simulations of liquid-gas and binary
+%fluid systems},
+%Phys. Rev. E, \textbf{54}, 5041--5052 (1996).
+
+\bibitem{stratford2008}
+K. Stratford and I. Pagonabarraga,
+Parallel domain decomposition for lattice Boltzmann with moving particles,
+\textit{Comput. Math. with Applications} \textbf{55}, 1585 (2008).
+
+%\bibitem{xe6}
+%Cray XE6 Product Brochure, available from
+%http://www.cray.com/Products/XE/Resources.aspx (2010)
+
+%\bibitem{hector} HECToR home page, http://www.hector.ac.uk
+
+%\bibitem{xk6}
+%Cray XK6 Product Brochure, available from
+%http://www.cray.com/Products/XK6/XK6.aspx (2011)
+
+
+\bibitem{wei2004}
+X. Wei, W. Li, K. M\"uller, and A.E. Kaufman,
+\textit{The lattice Boltzmann method for simulating gaseous phenomena},
+IEEE Transactions on Visualization and Computer Graphics,
+\textbf{10}, 164--176 (2004).
+%% Apparently first LB via GPU; a serious contribution using single fluid
+%%d3q19 single relaxation time.
+
+\bibitem{zhu2006}
+H. Zhu, X. Liu, Y. Liu, and E. Wu,
+\textit{Simulation of miscible binary mixtures based on lattice Boltzmann method},
+Comp. Anim. Virtual Worlds, \textbf{17}, 403--410 (2006).
+%% Single relaxation (MRT mentioned) time d3q19 apparently sound although
+%%not a lot of detail. Pre-cuda so graphics code.
+
+\bibitem{zhao2007}
+Y. Zhao,
+\textit{Lattice Boltzmann based PDE solver on the GPU},
+Visual Comput., doi:10.1007/s00371-007-0191-y (2007).
+
+\bibitem{toelke2010}
+J. T\"olke,
+Implementation of a lattice Boltzmann kernel using the compute unified
+device architecture developed by nVIDIA,
+\textit{Comput. Visual Sci.} \textbf{13} 29--39 (2010).
+
+\bibitem{fan2004}
+Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover,
+\textit{GPU cluster for high performance computing},
+Proceedings of ACM/IEEE Supercomputing Conference, pp. 47--59,
+IEEE Computer Society Press, Pittsburgh, PA (2004).
+
+\bibitem{myre2011}
+J. Myre, S.D.C. Walsh, D. Lilja, and M.O. Saar,
+\textit{Performance analysis of single-phase, multiphase, and multicomponent
+lattice Boltzmann fluid flow simulations on GPU clusters},
+Concurrency Computat.: Pract. Exper., \textbf{23}, 332--350 (2011).
+
+\bibitem{obrecht2011}
+C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux,
+\textit{Multi-GPU implementation of the lattice Boltzmann method},
+Comput. Math. with Applications,
+doi:10.1016/j.camwa.2011.02.020 (2011).
+
+\bibitem{bernaschi2010}
+M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras,
+\textit{A flexible high-performance lattice Boltzmann GPU code for the
+simulations of fluid flow in complex geometries},
+Concurrency Computat.: Pract. Exper., \textbf{22}, 1--14 (2010).
+
+\bibitem{xian2011}
+W. Xian and A. Takayuki,
+\textit{Multi-GPU performance of incompressible flow computation by
+lattice Boltzmann method on GPU cluster},
+Parallel Comput., doi:10.1016/j.parco.2011.02.007 (2011).
+
+\bibitem{feichtinger2011}
+C. Feichtinger, J. Habich, H. K\"ostler, G. Hager, U. R\"ude, and
+G. Wellein,
+A flexible patch-based lattice Boltzmann parallelization approach
+for heterogeneous GPU-CPU clusters,
+\textit{Parallel Computing} \textbf{37} 536--549 (2011).
+
+\bibitem{wellein2006}
+G. Wellein, T. Zeiser, G. Hager, and S. Donath,
+On the single processor performance of simple lattice Boltzmann kernels,
+\textit{Computers and Fluids}, \textbf{35}, 910--919 (2006).
+
+\bibitem{pohl2003}
+T. Pohl, M. Kowarschik, J. Wilke, K. Iglberger, and U. R\"ude,
+Optimization and profiling of the cache performance of parallel
+lattice Boltzmann codes,
+\textit{Parallel Process. Lett.} \textbf{13} 549--560 (2003).
+
+\bibitem{mattila2007}
+K. Mattila, J. Hyv\"aluoma, T. Rossi, M. Aspn\"as and J. Westerholm,
+An efficient swap algorithm for the lattice Boltzmann method,
+\textit{Comput. Phys. Comms.} \textit{176} 200-210 (2007).
+
+\bibitem{wittmann2012}
+M. Wittmann, T. Zeiser, G. Hager, and G. Wellein,
+Comparison of different propagation steps for lattice Boltzmann methods,
+\textit{Comput. Math. with Appl.}, doi:10.1016/j.camwa.2012.05.002 (2012).
+
+\bibitem{walshsaar2012}
+S.D.C. Walsh and M.O. Saar,
+Developing extensible lattice Boltzmann simulators for general-purpose
+graphics-processing units,
+\textit{Comm. Comput. Phys.}, \textbf{13} 867--879 (2013).
+
+
+\bibitem{williams2011}
+S. Williams, L. Oliker, J. Carter, and J. Shalf,
+Extracting ultra-scale lattice Boltzmann performance via
+hierarchical and distributed auto-tuning,
+\textit{Proc. SC2011}.
+
+
+\bibitem{ch14:stratford-jsp2005}
+K. Stratford, R. Adhikari, I. Pagonabarraga, and J.-C. Desplat,
+\textit{Lattice Boltzmann for Binary Fluids with Suspended Colloids},
+J. Stat. Phys. \textbf{121}, 163 (2005).
+
+\bibitem{ladd1994}
+A.J.C. Ladd,
+Numerical simulations of particle suspensions via a discretized
+Boltzmann equation. Part 1. Theoretical foundation,
+\textit{J. Fluid Mech.} \textbf{271} 285--309 (1994);
+Part II. Numerical results,
+\textit{ibid.} \textbf{271} 311--339 (1994).
+
+
+\bibitem{nguyen2002}
+N.-Q. Nguyen and A.J.C. Ladd,
+Lubrication corrections for lattice Boltzmann simulations of particle
+suspensions,
+\textit{Phys. Rev. E} \textbf{66} 046708 (2002).
+
+\bibitem{ch14:immersed}
+C.S. Peskin,
+Flow patterns around heart valves; a numerical method,
+\textit{J. Comp. Phys.}, \textbf{10}, 252--271 (1972);
+C.S. Peskin,
+The immersed boundary method,
+\textit{Acta Numerica} \textbf{11} 479--517 (2002).
+
+\bibitem{ch14:immersed-lb}
+Z.-G. Feng and E.E. Michaelides,
+The immersed boundary-lattice Boltzmann method for solving
+fluid-particles interaction problems,
+\textit{J. Comp. Phys.}, \textbf{195} 602--628 (2004).
+
+\bibitem{ch14:spm}
+Y. Nakayama and R. Yamamoto,
+Simulation method to resolve hydrodynamic interactions in colloidal
+dispersions,
+\textit{Phys. Rev. E}, \textbf{71} 036707 (2005).
+
+
+\end{thebibliography}
+
+
+
+
+% that's all folks
+\end{document}
+
+
+
+% LocalWords: CRESTA EPSRC Succi Desplat Pagonabarraga