BookGPU/Chapters/chapter14/ch14.tex

   1 \chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh, United Kingdom}
   2
   3 \chapter[Ludwig: multiple GPUs for a fluid lattice Boltzmann
   4 application]{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
   5 application}
   6
   7 %\putbib[biblio]
   8
   9
  10 \section{Introduction}
  11 The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overview see, e.g.,
  12 \cite{succi-book}) has become a popular approach to a variety of fluid
  13 dynamics problems.  It provides a way to solve the incompressible,
  14 isothermal Navier-Stokes equations and has the attractive features of
  15 being both explicit in time and local in space. This makes the LB
  16 method well-suited to parallel computation. Many efficient parallel
  17 implementations of the LB method have been undertaken, typically using
  18 a combination of distributed domain decomposition and the Message
  19 Passing Interface (MPI). However, the potential
  20 performance benefits offered by GPUs has motivated a new `mixed-mode'
  21 approach to address very large problems. Here, fine-grained
  22 parallelism is implemented on the GPU, while MPI is reserved for
  23 larger-scale parallelism.  This mixed mode is of increasing interest
  24 to application programmers at a time when many supercomputing services
  25 are moving
  26 toward clusters of GPU accelerated nodes. The design questions which
  27 arise when developing a lattice Boltzmann code for this type of
  28 heterogeneous system are therefore worth studying. Furthermore, similar
  29 questions also recur in many other types of stencil-based algorithms.
  30
  31 The first applications of LB on GPUs were to achieve fluid-like
  32 effects in computer animation, rather than scientific applications per
  33 se.  These early works include simple fluids~\cite{wei2004}, miscible
  34 two-component flow~\cite{zhu2006}, and various image processing tasks
  35 based on the use of partial differential equations~\cite{zhao2007}.
  36 While these early works used relatively low level graphics APIs, the
  37 first CUDA runtime interface implementation was a two-dimensional
  38 simple fluid problem~\cite{toelke2010}.  Following pioneering work on
  39 clusters of GPUs coupled via MPI to study air
  40 pollution~\cite{fan2004}, more recent work has included mixed OpenMP
  41 and CUDA~\cite{myre2011}, Posix threads and CUDA~\cite{obrecht2011},
  42 and MPI and CUDA for increasingly large GPU clusters
  43 \cite{bernaschi2010,xian2011,feichtinger2011}. The heterogeneous
  44 nature of these systems has also spurred interest in approaches
  45 including automatic code generation \cite{walshsaar2012} and auto-tuning
  46 \cite{williams2011} to aid application performance.
  47
  48 Many of these authors make use of
  49 another attractive feature of LB: the ability to include fixed
  50 solid-fluid boundary conditions as a straightforward addition to
  51 the algorithm to study, for example, flow in
  52 porous media. This points to an important
  53 application area for LB methods: that of complex fluids.
  54 Complex fluids include mixtures, surfactants, liquid crystals,
  55 and particle suspensions, and typically require additional physics
  56 beyond the bare Navier-Stokes equations to provide a full
  57 description~\cite{aidun2010}. The representation of this extra
  58 physics raises additional design questions for the application
  59 programmer. Here, we consider the \textit{Ludwig} code \cite{desplat}\index{Ludwig code},
  60 an LB application developed specifically for complex fluids
  61 (\textit{Ludwig} was named for Boltzmann, 1844--1906).
  62 We will present the steps
  63 required to allow \textit{Ludwig} to exploit efficiently both a single
  64 GPU, and also many GPUs in parallel.  We show that
  65 \textit{Ludwig} scales excellently to at least the one thousand GPU level
  66 (the largest resource available at the time of writing) with
  67 indications that the code will scale to much larger systems as
  68 they become available. In addition, we consider the steps required
  69 to represent fully-resolved moving solid particles (usually referred
  70 to as colloids in the context of complex fluids). Such particles
  71 need to have their surface resolved on the lattice scale in order
  72 to include relevant surface physics, and must be able to move: e.g.,
  73 to execute Brownian motion in response to random forces from the
  74 fluid. Standard methods are available to represent such particles
  75 (e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
  76 domain decomposition and message passing \cite{stratford2008}.
  77 We describe below the additional considerations which arise
  78 for such moving particles when developing an
  79 implementation on a GPU.
  80
  81 In the following section we provide a brief overview of the lattice
  82 Boltzmann method, and mention some of the general issues which can
  83 influence performance in the CPU context.
  84 In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
  85 which are required to exploit the GPU architecture effectively, and
  86 highlight how the resulting code differs from the CPU version. In
  87 Section \ref{ch14:sec:parallelgpu}, we extend this description to
  88 include the steps required to allow exploitation of many GPUs in
  89 parallel while retaining effective scaling. We also present results for
  90 a typical benchmark for a fluid-only problem in Section
  91 \ref{ch14:sec:parallelgpu} to demonstrate the success of the approach.
  92 In Section \ref{ch14:sec:particles}, we describe the design choices
  93 which are relevant to a GPU implementation of moving particles. Finally,
  94 we include a number of general observations on software
  95 engineering and maintenance which arise from our experience.
  96 A summary is provided in Section~\ref{ch14:sec:summary}.
  97
  98
  99 \section{Background}\label{ch14:sec:background}
 100
 101 %\begin{figure}[!t]
 102 %\centering
 103 %\includegraphics[width=12cm]{Chapters/chapter14/figures/basiclattice}
 104 %\caption{The lattice Boltzmann approach discritises space into a 3D lattice (left). The fluid is represented at each lattice site (right), by the {\it distribution} vector: each component corresponding to a specific velocity direction. Here we are interested in the {\it D3Q19} model, in which there are 3 spatial directions and 19 velocity components (including a zero component).}
 105 %\label{ch14:fig:basiclattice}
 106 %\end{figure}
 107
 108
 109 For a general complex fluid problem the starting point is the
 110 fluid velocity field $\mathbf{u}(\mathbf{r})$, whose evolution
 111 obeys the Navier-Stokes equations describing the conservation of mass
 112 (or density $\rho$), and momentum:
 113 \begin{equation}
 114 \rho  [ \partial_t \mathbf{u} + (\mathbf{u}.\mathbf{\nabla})\mathbf{u} ]
 115 = -\nabla p + \eta \nabla^2 \mathbf{u} + \mathbf{f}(\mathbf{r}),
 116 \end{equation}
 117 where $p$ is the isotropic pressure and $\eta$ is the viscosity.
 118 A local force $\mathbf{f}(\mathbf{r})$ provides a means for coupling
 119 to other complex fluid constituents, e.g., it might represent the force
 120 exerted on the fluid by a curved interface between different phases or
 121 components.
 122
 123 The LB approach makes use of a regular three-dimensional
 124 lattice (see Figure \ref{ch14:fig:decomphalo}) with discrete spacing
 125 $\Delta r$. It also makes use of a
 126 discrete velocity space $\mathbf{c}_i$, where the $\mathbf{c}_i$
 127 are chosen to capture the correct symmetries of the Navier-Stokes
 128 equations. A typical choice, used here, is the so-called D3Q19
 129 basis in three dimensions where there is one velocity such that
 130 $\mathbf{c} \Delta t$ is zero, along with six extending to the nearest
 131 neighbour
 132 lattice sites, and twelve extending to the next-nearest neighbour sites
 133 ($\Delta t$ being the discrete time step). The fundamental object
 134 in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
 135 moments are related to the local hydrodynamic quantities: the fluid
 136 density, momentum, and stress. The time evolution of the distribution
 137 function is described by a discrete Boltzmann equation
 138 \begin{equation}
 139 f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t)
 140 = - {\cal L}_{ij} f_j(\mathbf{r};t).
 141 \end{equation}
 142 It is convenient to think of this in two stages. First, the right hand
 143 side represents the action of a collision operator ${\cal L}_{ij}$,
 144 which is local to each lattice site and relaxes the distribution toward
 145 a local equilibrium at a rate ultimately related to the fluid viscosity.
 146 Second, the left hand side represents a propagation step (sometimes referred
 147 to as streaming step), in which each element $i$ of the distribution is
 148 displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
 149 appropriate direction per discrete time step.
 150
 151 More specifically, we store a vector of 19 double precision floating
 152 point values at each lattice site for the distribution function
 153 $f_i(\mathbf{r};t)$.
 154 The collision operation ${\cal L}_{ij}$, which is local at each lattice
 155 site, may be thought of as follows. A matrix-vector multiplication
 156 ${\cal M}_{ij}f_j$ is used to transform the distributions into the
 157 hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
 158 matrix related to the choice of
 159 $\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
 160 relaxed toward their (known) equilibrium values, and are transformed
 161 back to new post-collision distributions via the inverse transformation
 162 ${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of 2x19$^2$
 163 floating point multiplications per lattice site. (Additional operations are
 164 required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
 165 In contrast, the
 166 propagation stage consists solely of moving the distribution values
 167 between lattice sites, and involves no floating point operations.
 168
 169 In the CPU version, the \textit{Ludwig} implementation stores one time
 170 level of distribution values $f_i(\mathbf{r}; t)$. This distribution
 171 is stored as a flat array of C data type \texttt{double}, and laid out
 172 so that all elements of the velocity are contiguous at a given site
 173 (often referred to as ``array-of-structures'' order). This is
 174 appropriate for sums over the distribution required locally at the
 175 collision stage as illustrated schematically in
 176 Listing~\ref{ch14:listing1}: the fact that consecutive loads are from
 177 consecutive memory addresses allows the prefetcher to engage fully.
 178 (The temporary scalar \texttt{a\_tmp} allows caching of the
 179 intermediate accumulated value in the innermost loop.)  A variety of
 180 standard sequential optimisations are relevant for the collision stage
 181 (loop unrolling, inclusion of SIMD operations, and so on
 182 \cite{wellein2006}).  For example, the collision stage in
 183 \textit{Ludwig} includes explicit SIMD code, which is useful if the
 184 compiler on a given platform cannot identify it.  The propagation
 185 stage is separate, and is organised as a ``pull'' (again, various
 186 optimisation have been considered, e.g.,
 187 \cite{pohl2003,mattila2007,wittmann2012}).  No further optimisation is
 188 done here beyond ensuring that the ordering of the discrete velocities
 189 allows memory access to be as efficient as possible. While these
 190 optimisations are important, it should be remembered that for some
 191 complex fluid problems, the hydrodynamics embodied in the LB
 192 calculation is a relatively small part of the total computational cost
 193 (at the 10\%-level in some cases).  This means optimisation effort may
 194 be better concentrated elsewhere.
 195
 196 \begin{lstlisting}[float, label=ch14:listing1,
 197 caption = Collision schematic for CPU.]
 198 /* loop over lattice sites */
 199 for (is = 0; is < nsites; is++) {
 200   ...
 201   /* load distribution from ``array-of-structures'' */
 202   for (i = 0; i < 19; i++)
 203     f[i] = f_aos[19*is + i]
 204   ...
 205   /* perform matrix-vector multiplication */
 206   for (i = 0; i < 19; i++) {
 207     a_tmp = 0.0;
 208     for (j = 0; j < 19; j++) {
 209       a_tmp += f[j]*M[i][j];
 210     }
 211     a[i] = a_tmp;
 212   }
 213   ...
 214 }
 215 \end{lstlisting}
 216
 217
 218 \begin{figure}[!t]
 219 \centering
 220 \includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
 221 \caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
 222   clarity we show a 2D decomposition of a 3D lattice, but in practice
 223   we decompose in all 3 dimensions. Halo cells are added to each
 224   sub-domain (as shown on the upper right for a single slice) which store
 225   data retrieved from remote neighbours in the halo exchange. Lower
 226   right: the D3Q19 velocity set resident on a lattice site;
 227   highlighted are the 5 ``outgoing'' elements to be transferred in a
 228   specific direction.}
 229 \label{ch14:fig:decomphalo}
 230 \end{figure}
 231
 232
 233 The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
 234 Each local sub-domain is surrounded by a halo, or ghost, region of width one
 235 lattice site. While the collision is local, elements of the distribution
 236 must be exchanged at the edges of the domains to facilitate the propagation.
 237 To achieve the full 3D halo exchange, the standard approach of shifting the
 238 relevant data in each co-ordinate direction in turn is adopted. This
 239 requires appropriate synchronisation, i.e., a receive in the the first
 240 co-ordinate direction must be complete before a send in the second direction
 241 involving relevant data can take place, and so on. We note that only
 242 ``outgoing'' elements of the distribution need to be sent at each edge.
 243 For the D3Q19 model, this reduces the volume of data traffic from 19 to
 244 5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
 245 version, the necessary transfers are implemented in place using
 246 a vector of appropriately strided MPI datatypes for each direction.
 247
 248
 249 \section{Single GPU Implementation}\label{ch14:sec:singlegpu}
 250
 251
 252 In this section we describe the steps taken to enable \textit{Ludwig}
 253 for the GPU. There are a number of crucial issues: first, the
 254 minimisation of data traffic between host and device; second, the
 255 optimal mapping of available parallelism onto the architecture and
 256 third, the issue of memory coalescing. We discuss each of these in
 257 turn.
 258
 259 While the most important section of the LB in terms of
 260 floating-point performance is the collision stage, this cannot be the
 261 only consideration for a GPU implementation. It is essential to
 262 offload all computational activity which involves the main data
 263 structures (such as the distribution) to the GPU. This includes
 264 kernels with relatively low computational demand, such as the
 265 propagation stage. All relevant data then remain resident on the GPU,
 266 to avoid expensive host-device data transfer at each iteration of
 267 the algorithm. Such transfers would negate any benefit of GPU
 268 acceleration.  We note that for a complex fluid code, this requirement
 269 can extend to a considerable number of kernels, although we limit the
 270 discussion to collision and propagation for brevity.
 271
 272 To achieve optimal performance, it is vital to exploit fully the
 273 parallelism inherent in the GPU architecture, particularly for
 274 those matrix-vector operations within the collision kernel.
 275 The GPU architecture features a hierarchy of parallelism. At the
 276 lowest level, groups of 32 threads (warps) operate in
 277 lock-step on different data elements: this is SIMD style vector-level
 278 parallelism. Multiple warps are combined into a thread block (in which
 279 communication and synchronisation are possible), and multiple blocks can
 280 run concurrently across the streaming multiprocessors in the GPU
 281 (with no communication or
 282 synchronisation possible across blocks).  To decompose on the GPU, we
 283 must choose which part of the collision to assign to which level of
 284 parallelism.
 285
 286 While there exists parallelism within the matrix-vector operation, the
 287 length of each vector (here, 19) is smaller than the warp size and typical
 288 thread block sizes. So we simply decompose the loop over lattice sites
 289 to all levels of parallelism, i.e., we use a separate CUDA thread for
 290 every lattice site, and each thread performs the full matrix-vector
 291 operation. We find that a block size of 256 performs well on current
 292 devices: we therefore decompose
 293 the lattice into groups of 256 sites, and assign each group to a block
 294 of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
 295 assigned to the fast {\it constant} on-chip device memory.
 296
 297 For the propagation stage, the GPU implementation adds a second time
 298 level of distribution values. The data-dependencies inherent in the
 299 propagation mean that the in-place propagation of the CPU version
 300 cannot be parallelised effectively without the additional time level.
 301 As both time levels may remain resident on the GPU, this is not a
 302 significant overhead.
 303
 304 An architectural
 305 constraint of GPUs means that optimal global memory bandwidth
 306 is only achieved when data are structured such that threads within a
 307 {\it half-warp} (a group of 16 threads) load data from the same memory
 308 segment in a single transaction: this is memory coalescing. It can be
 309 seen that the array-of-structures ordering used for the distribution
 310 in the CPU code would not be suitable for coalescing; in fact, it would
 311 result in serialised memory accesses and relative poor performance.
 312 To meet the coalescing criteria and allow consecutive threads to read
 313 consecutive memory addresses on the GPU, we transpose the layout of the
 314 distribution so that, for each velocity component, consecutive sites
 315 are contiguous in memory (``structure-of-arrays'' order). A schematic of
 316 the GPU collision code is given in Listing~\ref{ch14:listing2}.
 317
 318 \begin{lstlisting}[float, label=ch14:listing2,
 319 caption = Collision schematic for GPU.]
 320 /* compute current site index 'is' from CUDA thread and block */
 321 /* indices and load distribution from "structure-of-arrays" */
 322
 323 for (i = 0; i < 19; i++)
 324   f[i] = f_soa[nsites*i + is]
 325
 326 /* perform matrix-vector multiplication as before */
 327 ...
 328 \end{lstlisting}
 329
 330
 331 \textit{Ludwig} was modified to allow a choice of distribution data layout
 332 at compilation time depending on the target architecture: CPU or GPU. We
 333 defer some further comments on software engineering aspects of the code to
 334 the summary.
 335
 336
 337 \section{Multiple GPU Implementation}\label{ch14:sec:parallelgpu}
 338
 339 To execute on multiple GPUs, we use the same domain decomposition
 340 and message passing framework as the CPU version. Within each
 341 sub-domain (allocated to one MPI task) the GPU implementation
 342 proceeds as described in the previous section. The only additional
 343 complication is that halo transfers between GPUs must be staged
 344 through the host (in future, direct GPU to GPU data transfers via
 345 MPI may be possible, obviating the need for these steps). This
 346 means host MPI sends must be preceded by appropriate device to host
 347 transfers and host MPI receives must be followed by corresponding host
 348 to device transfers.
 349
 350 %To execute on multiple GPUs in parallel we decompose the problem using
 351 %the existing high-level parallelisaton framework: each MPI task acts
 352 %as a host to a separate GPU, with each GPU kernel operating on the
 353 %subset of the lattice volume corresponding to the partition assigned
 354 %to that task.  As described in Section \ref{ch14:sec:background}, halo
 355 %data exchanges are required at each timestep.  Significant
 356 %developments are made for the GPU adaptation of this communication
 357 %phase to achieve good inter-GPU communication performance and allow
 358 %scaling to many GPUs.
 359
 360 %The CPU code uses strided MPI datatypes to define the six
 361 %planes of lattice data to be sent or received, and performs the
 362 %communications {\it in-place}, i.e.  there is no explicit buffering of
 363 %data in the application.  For the GPU adaptation, the halo planes must
 364 %be exchanged between GPUs, on which MPI functionality is not
 365 %available, so transfers must be staged through the host CPUs. Buffer
 366 %packing/unpacking of the halo planes is explicitly performed using
 367 %CUDA kernels, and CUDA memory copy calls are used to transfer data
 368 %from a device to its host (using pinned host memory) through the PCI-e
 369 %bus (and vice versa). MPI calls are used to transfer the buffers
 370 %between hosts. The CPU version of Ludwig has a mechanism, using MPI built-in
 371 %data types, to reduce the amount of data communicated by around a
 372 %factor of four by only sending those elements of the distribution
 373 %propagating outward from a local domain. For the GPU adaptation, this
 374 %filtering of data is explicitly performed (again since MPI
 375 %functionality is not available on the GPU) in the buffer
 376 %packing/unpacking kernels.  Special care had to be taken for those
 377 %lattice sites on the corners of the halo planes, which had to be sent
 378 %in more than one direction (since the propagation includes accesses to
 379 %diagonal neighbours).
 380
 381 In practice, this data movement requires additional GPU kernels to
 382 pack and unpack the relevant data before and after corresponding MPI
 383 calls. However, the standard shift algorithm, in which each co-ordinate
 384 direction is treated in turn, does provide some scope for the
 385 overlapping of different operations.
 386 For example, after the data for the first co-ordinate direction have been
 387 retrieved by the host, these can be exchanged using MPI between
 388 hosts at the same time as kernels for packing and retrieving of
 389 data for the second co-ordinate direction are
 390 executed. This
 391 overlapping must respect the synchronisation required to ensure
 392 that data values at the corners of the sub-domain are transferred
 393 correctly.
 394 We use a separate CUDA stream for each co-ordinate direction:
 395 this allows some of the host-device communication time to be
 396 effectively ``hidden'' behind the host-host MPI communication,
 397 resulting in an overall speedup.
 398 The improvement is more pronounced for the smaller local lattice size,
 399 perhaps because of less CPU memory bandwidth contention. The
 400 overlapping is then a particularly valuable aid to strong scaling,
 401 where the local system size decreases as the number of GPUs increases.
 402
 403
 404 %\section{Performance}\label{ch14:sec:performance}
 405
 406 To demonstrate the effectiveness of our approach, we compare the
 407 performance of both CPU and GPU versions of \textit{Ludwig}. To
 408 test the complex fluid nature of the code, the problem is
 409 actually an immiscible fluid mixture which uses a second distribution
 410 function to introduce a composition variable. The interested reader
 411 is referred to \cite{ch14:stratford-jsp2005} for further details. The
 412 largest total problem size used is $2548\times 1764\times 1568$.
 413 The CPU system features a Cray XE6 architecture with 2 16-core AMD Opteron
 414 CPUs per node, and with nodes interconnected using Cray Gemini technology.
 415 For GPU results, a Cray XK6 system is used: this is very similar to the
 416 XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
 417 node in the GPU system therefore features a single Opteron CPU acting
 418 as a host to a single GPU. The inter-node interconnect architecture is
 419 the same as for the Cray XE6. The GPU performance tests use a prototype
 420 Cray XK6 system with 936 nodes (the largest available at the time of writing).
 421 To provide a fair comparison, we compare scaling on a \textit{per node}
 422 basis. That is, we compare 1 fully occupied 32-core CPU node
 423 (running 32 MPI tasks) with 1 GPU node (host running 1 MPI task,
 424 and 1 device). We believe this is representative of the true ``cost'' of
 425 a simulation in terms of accounting budgets, or electricity.
 426
 427
 428 %\begin{figure}[!t]
 429 %\centering
 430 %\includegraphics[width=10cm]{Chapters/chapter14/figures/length_scale}
 431 %\caption{The dependence of the characteristic length scale of a binary
 432 %  fluid mixture on the simulation timestep number. Superimposed are
 433 %  three snapshots showing increased mixture separation with time:
 434 %  these are obtained through the order parameter and each show one
 435 %  eighth of the total simulation size, which is $1548\times 1764\times 1568$. }\label{ch14:fig:length_scale}
 436 %\end{figure}
 437
 438 \begin{figure}[!t]
 439 \centering
 440 \includegraphics[width=10cm]{Chapters/chapter14/figures/two_graphs_ud}
 441 \caption[The weak (top) and strong (bottom) scaling of \textit{Ludwig}.]{The weak (top) and strong (bottom) scaling of \textit{Ludwig}.  Closed
 442   shapes denote results using the CPU version run on the Cray
 443   XE6 (using two 16-core AMD Interlagos CPUs per node), while open
 444   shapes denote results using the GPU version on the Cray XK6 (using a
 445   single NVIDIA X2090 GPU per node). Top: the benchmark time is shown
 446   where the problem size per node is constant. Bottom: performance is
 447   shown where, for each of the four problem sizes presented, the
 448   results are scaled by the lattice volume, and all results are
 449   normalized by the same arbitrary constant.  }
 450 \label{ch14:fig:scaling}
 451 \end{figure}
 452
 453
 454 %The top of Figure \ref{ch14:fig:scaling} shows the benchmark time,
 455 %where the problem size is increased with the number of nodes. This
 456 %shows weak scaling: the time taken by a perfectly scaling code would
 457 %remain constant. The figure plots the time against the number of {\it
 458 %  nodes}: for the GPU version we utilize the one GPU per Cray XK6 node
 459 %whilst for the CPU version we fully utilize two 16-core CPUs
 460 %per Cray XE6 node, i.e. one GPU is compared with 32 CPU cores.  We use
 461 %a single MPI task per Cray XK6 node to host the GPU, and 32 MPI tasks
 462 %to fully utilize the CPUs in the Cray XE6 node.  Note that our CPU
 463 %version is highly optimised: in particular it has been tuned for
 464 %optimal utilisation of the SIMD vector units in each CPU core.
 465
 466 Figure \ref{ch14:fig:scaling} shows the results of both weak and
 467 strong scaling tests (top and bottom panels, respectively). For
 468 weak scaling, where the local sub-domain size is kept fixed
 469 (here a relatively large 196$^3$ lattice), the time taken by
 470 an ideal code would remain constant. It can be seen that while
 471 scaling of the CPU version shows a pronounced upward slope, the
 472 GPU version scales almost perfectly. The advantage of the GPU
 473 version over the CPU version is the fact that the GPU version
 474 has a factor of 32 fewer MPI tasks per node, so communications
 475 require a significantly smaller number of larger data transfers.
 476 The performance
 477 advantage of the GPU version ranges from a factor of around 1.5 to
 478 around 1.8. Careful study by other authors \cite{williams2011} have
 479 found the absolute performance (in terms of floating point
 480 operations per second, or floating point operations per Watt) to
 481 be remarkably similar between architectures.
 482
 483 Perhaps of more interest is the strong scaling picture (lower
 484 panel in Figure \ref{ch14:fig:scaling}) where the performance
 485 as a function of the number of nodes is measured for fixed problem
 486 size. We consider four different fixed problem sizes on both CPU
 487 (up to 512 nodes shown) and GPU (up to 768 nodes). To allow comparison,
 488 the results are scaled by total system size in each case. For strong
 489 scaling, the disparity in the number of MPI tasks is clearly revealed
 490 in the failure of the CPU version to provide any significant benefit
 491 beyond a modest number of nodes as communication overheads dominate.
 492 In contrast, the GPU version shows
 493 reasonably robust scaling even for the smaller system sizes and good
 494 scaling for the larger systems.
 495
 496 \section{Moving Solid Particles}\label{ch14:sec:particles}
 497
 498 \begin{figure}[t]
 499 \centering
 500 %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
 501 \includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new}
 502 \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
 503 A two-dimensional schematic picture of spherical particles on the lattice.
 504 Left: a particle is allowed
 505 to move continuously across the lattice, and the position of the
 506 surface defines fluid lattice sites (light blue) and solid lattice
 507 sites (dark red). The discrete surface is defined by links
 508 where propagation would intersect the surface (arrows). Note the
 509 discrete shape of the two particles is different. Right: post-collision
 510 distributions are reversed at the surface by the process of bounce-back
 511 on links, which replaces the propagation.
 512 %A 2D illustration of colloidal particles (grey circles)
 513 %  interacting with the lattice, where each lattice point is
 514 %  represented by a cross. Left: each particle spans multiple sites.
 515 %  Right: the particle-facing distribution components are moved inside
 516 %  the particle with the direction reversed, such that the ``bounce
 517 %  back'' will be completed when the fluid propagates.
 518 }
 519 \label{ch14:fig:bbl}
 520 \end{figure}
 521
 522
 523 %\begin{figure}[t]
 524 %\centering
 525 %\includegraphics[width=6cm]{Chapters/chapter14/figures/particlemove}
 526 %\caption{A 2D illustration of a colloidal particle moving across the
 527 %    lattice. The old and new positions are represented by the open and
 528 %    filled circles respectively.}
 529 %\label{ch14:fig:particlemove}
 530 %\end{figure}
 531
 532 % BEING NEW SOLID TEXT
 533
 534 The introduction of moving solid particles poses an additional hurdle
 535 to efficient GPU implementation of an LB code such as \textit{Ludwig}.
 536 In this section, we give a brief overview of the essential features
 537 of the CPU implementation, and how the considerations raised in the
 538 previous sections --- maximising parallelism and minimising both
 539 host to device and GPU to GPU data movement --- shape the design decisions
 540 for a GPU implementation. We restrict our discussion to a simple fluid
 541 in this section; additional solid-fluid boundary conditions (e.g.,
 542 wetting at a fluid-fluid-solid contact line) usually arise elsewhere
 543 in the calculation and broadly independent of the hydrodynamic boundary
 544 conditions which are described in what follows.
 545
 546 Moving solid particles (here, spheres) are defined by a centre position
 547 which is allowed to move continuously across the space of the lattice,
 548 and a fixed radius which is typically a few lattice spacings $\Delta r$.
 549 The surface of the particle is defined by a series of \textit{links}
 550 where a discrete velocity propagation $\mathbf{c}_i \Delta t$ would
 551 intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
 552 Hydrodynamic boundary conditions are then implemented via the standard
 553 approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
 554 relevant post-collision distribution values are reversed at the
 555 propagation stage with an appropriate correction to allow for the
 556 solid body motion. The exchange of momentum at each link must then be
 557 accumulated around the entire particle surface to provide the net
 558 hydrodynamic force
 559 and torque on the sphere. The particle motion can then be updated in a
 560 molecular dynamics-like step.
 561
 562 For the CPU implementation, a number of additional MPI communications
 563 are required: (1) to exchange the centre position, radius, etc of each
 564 particle as it moves and (2), to allow the accumulation of the force
 565 and torque for each particle (the links of which may be distributed
 566 between up to 8 MPI tasks in three dimensions). Appropriate marshaling
 567 of these data can provide an effective parallelisation for relatively
 568 dense particle suspensions of up to  40\% solid volume fraction
 569 \cite{stratford2008}. As a final consideration, fluid distributions
 570 must be removed and replaced consistently as a given particle moves
 571 across the lattice and changes its discrete shape.
 572
 573 With these features in mind, we can identify a number of competing
 574 concerns which are relevant to a GPU implementation:
 575 \begin{enumerate}
 576 \item
 577 Minimisation of host-device data transfer would argue for moving the
 578 entire particle code to the GPU. However, the code in question involves
 579 largely conditional logic (e.g., identifying cut surface links) and
 580 irregular memory accesses (e.g., access to distribution elements around
 581 a spherical particle). These operations seem poorly suited to effective
 582 parallelisation on the GPU. As an additional complication, the sums
 583 required over the particle surface would involve potentially tricky
 584 and inefficient reductions in GPU memory.
 585 \item
 586 The alternative is to retain the relevant code on the CPU. While the
 587 transfer of the entire distribution $f_i(\mathbf{r};t)$ between host
 588 and device at each time step is unconscionable owing to PCIe bus bandwidth
 589 considerations, the transfer of only relevant distribution information
 590 to allow bounce-back on links is possible. At modest solid volumes
 591 only a very small fraction of the distribution data is
 592 involved in bounce-back (e.g., a 2\% solid volume fraction of particles
 593 radius 2.3$\Delta r$ would involve approximately 1\% of the distribution
 594 data). This option also has the advantage that no further host-device
 595 data transfers are necessary to allow the MPI exchanges required for
 596 particle information.
 597 \end{enumerate}
 598
 599
 600 We have implemented the second option as follows. For each sub-domain,
 601 a list of boundary-cutting links is assembled on the CPU which includes
 602 the identity of the relevant element of the distribution. This list,
 603 together with the particle information required to compute the correct
 604 bounce-back term, are transferred to the GPU. The updates to the relevant
 605 elements of the distribution can then take place on the GPU. The
 606 corresponding information to compute the update of the particle dynamics
 607 is returned
 608 to the CPU, where the reduction over the surface links is computed.
 609 The change of particle shape may be dealt with in a similar manner:
 610 the relatively small number of updates required at any one time step
 611 (or however frequently the particle position is updated) can be
 612 marshaled to the GPU as necessary. This preliminary implementation
 613 is found to be effective on problems involving up to 4\%
 614 solid volume fraction.
 615
 616 We note that the CPU version actually avoids the collision calculation
 617 at solid lattice points by consulting a look-up table of solid/fluid
 618 status. On the GPU, it is perhaps preferable to perform the collision
 619 stage everywhere instead of moving the look-up table to the GPU and
 620 introducing the associated logic.
 621 Ultimately, the GPU might favour other boundary methods which treat solid and
 622 fluid on a somewhat more equal basis, for example, the immersed boundary
 623 method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method
 624 \cite{ch14:spm}.
 625 However, the approach adopted here  allows us to exploit
 626 the GPU for the intensive fluid simulation whilst maintaining the complex
 627 code required for particles on the CPU. Overheads of CPU-GPU transfer are
 628 minimised by transferring only those data relevant to the hydrodynamic
 629 interaction implemented via bounce-back on links.
 630
 631 % END NEW SOLID TEXT
 632
 633 %It is of interest, in many situations, to include colloidal particles
 634 %in the simulation **more from Kevin***. These are by represented by
 635 %moving spheres that interact with the fluid through transfer of
 636 %momenta: fluid coming into contact with a particle will bounce off it,
 637 %giving the particle a kick in the process. As illustrated on the left
 638 %of Figure \ref{ch14:fig:bbl} (which shows 2 dimensions only for
 639 %clarity - in practice, the particles will be represented by spheres on
 640 %a 3D lattice, and will be much larger relative to the lattice
 641 %spacing), typically each particle spans multiple lattice sites, and
 642 %therefore ``intercepts'' multiple inter-site links. Ludwig stores
 643 %particle information using nested linked lists: a list of particles,
 644 %each with a list of all the links intercepted by that particle (plus
 645 %other quantities). The fluid-particle interaction proceeds using the
 646 %so-called {\it bounce back on links} procedure during each timestep. As
 647 %illustrated in the right of Figure \ref{ch14:fig:bbl}, for each of the
 648 %intercepting links, those velocity components of the distribution that
 649 %face inwards to each particle are moved from the site
 650 %outside the particle to the site inside the particle, and
 651 %reversed in direction. This is done before the propagation stage,
 652 %which then propagates these same velocity components back out of the
 653 %particles: the velocity bounces back from the particles. The
 654 %momenta transfer from fluid to particles is also determined
 655 %from the distribution values.
 656
 657 %A simplified schematic of the implementation on the CPU is as follows,
 658 %noting that multiplicative coefficients are required:
 659 %{\footnotesize
 660 %\begin{verbatim}
 661 %For each particle:
 662 %  For each link:
 663 %    Calculate coefficient
 664 %    Bounce back fluid using coefficient
 665 %    Update particle momentum
 666 %\end{verbatim}
 667 %}
 668 %This process is relatively inexpensive on the CPU, since the total
 669 %number of particles (and intercepted links) is always relatively
 670 %small, but it does require access to distribution data which we want
 671 %to keep resident on the GPU throughout the simulation. The overhead of
 672 %transferring the full distribution to the CPU and back every timestep
 673 %would be too high. But on the other hand, the full codebase required
 674 %to manage particles is quite complex and comprehensive: completely
 675 %porting it to the GPU, such that the particles effectively also remain
 676 %resident on the GPU, would be a large effort and would cause a major
 677 %diversion of CPU and GPU codebases. Further complications would arise
 678 %when running on multiple GPUs due to the fact that particles can move
 679 %between parallel subdomains. Our solution, which minimises overhead,
 680 %is to keep the particles resident on CPU, but offload only their
 681 %interaction with the fluid to the GPU, as we describe below.
 682
 683 %On the CPU, the code in the above listing can be restructured into
 684 %three distinct stages as (introducing arrays to temporrily store
 685 %indices and data):
 686 %{\footnotesize
 687 %\begin{verbatim}
 688 %1) For each particle, for each link:
 689 %     Store site and velocity indices associated with link
 690 %     Calculate and store coefficient
 691 %2) For each particle, for each link:
 692 %     Update velocity data using stored values
 693 %     Calculate and store particle momenta updates
 694 %3) For each particle, for each link:
 695 %     Update particle momentum using stored values
 696 %\end{verbatim}
 697 %}
 698 %Stage 2 is only one that that accesses fluid data: this can therefore
 699 %be moved to the GPU, while stages 1 and 3 remain on the CPU. Stage 2
 700 %therefore becomes, for the GPU implementation
 701 %{\footnotesize
 702 %\begin{verbatim}
 703 %Transfer coefficient and link index arrays to GPU
 704 %Run CUDA kernel:
 705 %  Update fluid data using coefficient and link index arrays
 706 %  Calculate and store particle momenta updates from fluid data
 707 %Transfer particle momenta updates array back to CPU
 708 %\end{verbatim}
 709 %}
 710 %Since the total number of links involves is relatively small, these
 711 %partial data transfers have minimal overhead.  A complication arises
 712 %due to the fact that the particles move through the lattice, albeit
 713 %relatively slowly, and therefore the list of links that they intercept
 714 %occasionally changes.  When a
 715 %particle moves off a lattice site, the fluid on that site must be
 716 %reconstructed. Similarly, the fluid on the newly obstructed site must
 717 %be removed. For the same reasons given above, we want to avoid moving
 718 %all the code dealing with this complication to the GPU, whilst
 719 %minimising overheads. Since, on any one timestep, the number of
 720 %lattice sites affected is relatively low, we implement functionality
 721 %to perform partial data transfers of the distribution. Only those
 722 %sites affected are packed into a buffer on the GPU, transferred to the
 723 %CPU and unpacked into the CPU version of the distribution. Then, the
 724 %existing code can be used to update those affected sites, before the
 725 %reverse partial data transfer is performed to update the GPU version
 726 %of data.
 727
 728 It is perhaps interesting at this point to make some more general
 729 observations on the software engineering challenge presented when
 730 extending an existing CPU code to the GPU. The already complex
 731 task of maintaining the code in a portable fashion while also
 732 maintaining performance is currently formidable. To help this process,
 733 we have followed a number of basic principles. First,
 734 in order to port to the GPU in an incremental fashion,
 735 we have tried to maintain the modular structure of the CPU where
 736 possible. For each data structure, such as the distribution, a separate
 737 analogue is maintained in both the CPU and GPU memory spaces. However,
 738 the GPU copy does not include the complete CPU structure: in
 739 particular, non-intrinsic datatypes such as MPI datatypes are not
 740 required on the GPU. Functions to marshal data between CPU and GPU
 741 are provided for each data structure, abstracting the underlying
 742 CUDA implementation. (This reasonably lightweight abstraction layer
 743 could also allow an OpenCL implementation to be developed.)
 744 This makes it easy to switch between the CPU and GPU for different
 745 components in the code, which is useful in development
 746 and testing. GPU functionality can be added incrementally while
 747 retaining a code that runs correctly (albeit slowly due to data
 748 transfer overheads).  Once all relevant components are moved to
 749 the GPU, it becomes possible to remove such data transfers and
 750 keep the entire problem resident on the device.
 751
 752 %To ensure that \textit{Ludwig} remains suitable for CPU-based systems,
 753 %we augment rather than modify the original code. All new GPU-related
 754 %functionality are implemented in an additive manner, mirroring the
 755 %modular structure of the original code. The new GPU modules have
 756 %interfaces similar to the original CPU functions, and the ability to
 757 %switch between versions is achieved at compile-time via C
 758 %pre-processing directives.  All GPU compute kernels are implemented using
 759 %CUDA.  Wrapper routines are developed to specify the
 760 %decomposition, invoke the CUDA kernels and perform the necessary data
 761 %management operations: the latter are facilitated through newly
 762 %developed modules for creation and destruction of device memory
 763 %structures, the copying of data between host and device, and where
 764 %necessary the packing and buffering of data.
 765
 766
 767
 768
 769 \section{Summary}
 770 \label{ch14:sec:summary}
 771
 772 We have described the steps take to implement, on both single and
 773 multiple GPUs, the \textit{Ludwig} code which was originally
 774 designed for complex fluid problems on a conventional CPU
 775 architecture. We have added the necessary functionality using
 776 NVIDIA CUDA and discussed the important changes to the main
 777 data structures. By retaining domain decomposition and message
 778 passing via MPI, we have demonstrated it is possible to scale
 779 complex fluid problems to large numbers of GPUs in parallel.
 780 By following the key design criteria of maximising parallelism
 781 and minimising host-device data transfers, we have confirmed
 782 the mixed MPI-GPU approach is viable for scientific applications.
 783 For the more intricate problem of moving solid particles, we find
 784 it is possible to retain the more serial elements related to
 785 particle link operations on the CPU, while offloading only the
 786 parallel lattice-based operations involving the LB distribution
 787 to the GPU. Again, this minimises host-device movement of data.
 788
 789 From the software engineering viewpoint, some duplication of code
 790 to allow efficient implementation on both host and device is
 791 currently required. This issue might be addressed by approaches
 792 such as automatic kernel generation, but may also be addressed
 793 naturally in time as GPU and CPU hardware converge. Portable
 794 abstractions and APIs, perhaps based on approaches such as OpenCL,
 795 will also facilitate the development and maintenance of portable
 796 codes which also exhibit portable performance (perhaps in conjunction
 797 with automatic tuning approaches).
 798
 799 So, while the challenges in designing portable and efficient scientific
 800 applications remain very real, this work provides some hope the large
 801 clusters of GPU machines can be used effectively for a wide range of
 802 complex fluid problems.
 803
 804
 805 %The \textit{Ludwig} LB application, which specializes
 806 %in complex fluid problems, has undergone the developments required to
 807 %use large-scale parallel GPU accelerated supercomputers effectively.
 808 %The new functionality augments the original code, so that
 809 %\textit{Ludwig} can run on either CPU-based or GPU-accelerated systems.
 810 %Using the NVIDIA CUDA programming language, all the computational
 811 %kernels were offloaded to the GPU. This included not only those
 812 %responsible for the majority of the compute time, but any that
 813 %accessed the relevant data structures. This allows data to be kept
 814 %resident on the GPU and avoids expensive data transfers. We described
 815 %steps taken to optimise the performance on the GPU architecture by
 816 %reducing off-chip memory accesses and restructuring the data layout
 817 %in memory to ensure the available memory bandwidth was exploited fully.
 818
 819 %A new halo-exchange communication phase for the code was developed to
 820 %allow efficient parallel scaling to many GPUs. The CPU version
 821 %relies on MPI datatype functionality for CPU to CPU communication; we
 822 %have explicitly written buffer packing/unpacking and transfer
 823 %operations to for GPU to GPU data exchange, staged through the host
 824 %CPUs. This includes the filtering of velocity components to only
 825 %transfer those required in a specific direction, hence reducing the
 826 %overall volume of data transferred. We used CUDA stream functionality
 827 %to overlap separate stages within the communication phase, and reduce
 828 %the overall communication time. The effect was measured to be more
 829 %profound for smaller local system sizes, i.e. it is especially useful
 830 %in aiding strong scaling.
 831
 832 %We described work to enable colloidal particles in the simulation with
 833 %minimal overhead. We implemented these in such a way that we avoid a
 834 %major diversion of the CPU and GPU codebases, whilst minimising data
 835 %transfers between the CPU and GPU at each timestep. We keep the
 836 %majority of the (relatively inexpensive) particle related code on the
 837 %CPU, while offloading only those parts responsible for the interaction
 838 %with the fluid to the GPU (where the fluid data resides).
 839
 840 %Our resulting package is able to exploit the largest of GPU
 841 %accelerated architectures to address a wide range of complex problems.
 842
 843
 844
 845 % An example of a floating figure using the graphicx package.
 846 % Note that \label must occur AFTER (or within) \caption.
 847 % For figures, \caption should occur after the \includegraphics.
 848 % Note that IEEEtran v1.7 and later has special internal code that
 849 % is designed to preserve the operation of \label within \caption
 850 % even when the captionsoff option is in effect. However, because
 851 % of issues like this, it may be the safest practice to put all your
 852 % \label just after \caption rather than within \caption{}.
 853 %
 854 % Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
 855 % option should be used if it is desired that the figures are to be
 856 % displayed while in draft mode.
 857 %
 858 %\begin{figure}[!t]
 859 %\centering
 860 %\includegraphics[width=2.5in]{myfigure}
 861 % where an .eps filename suffix will be assumed under latex,
 862 % and a .pdf suffix will be assumed for pdflatex; or what has been declared
 863 % via \DeclareGraphicsExtensions.
 864 %\caption{Simulation Results}
 865 %\label{fig_sim}
 866 %\end{figure}
 867
 868 % Note that IEEE typically puts floats only at the top, even when this
 869 % results in a large percentage of a column being occupied by floats.
 870
 871
 872 % An example of a double column floating figure using two subfigures.
 873 % (The subfig.sty package must be loaded for this to work.)
 874 % The subfigure \label commands are set within each subfloat command, the
 875 % \label for the overall figure must come after \caption.
 876 % \hfil must be used as a separator to get equal spacing.
 877 % The subfigure.sty package works much the same way, except \subfigure is
 878 % used instead of \subfloat.
 879 %
 880 %\begin{figure*}[!t]
 881 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
 882 %\label{fig_first_case}}
 883 %\hfil
 884 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
 885 %\label{fig_second_case}}}
 886 %\caption{Simulation results}
 887 %\label{fig_sim}
 888 %\end{figure*}
 889 %
 890 % Note that often IEEE papers with subfigures do not employ subfigure
 891 % captions (using the optional argument to \subfloat), but instead will
 892 % reference/describe all of them (a), (b), etc., within the main caption.
 893
 894
 895 % An example of a floating table. Note that, for IEEE style tables, the
 896 % \caption command should come BEFORE the table. Table text will default to
 897 % \footnotesize as IEEE normally uses this smaller font for tables.
 898 % The \label must come after \caption as always.
 899 %
 900 %\begin{table}[!t]
 901 %% increase table row spacing, adjust to taste
 902 %\renewcommand{\arraystretch}{1.3}
 903 % if using array.sty, it might be a good idea to tweak the value of
 904 % \extrarowheight as needed to properly center the text within the cells
 905 %\caption{An Example of a Table}
 906 %\label{table_example}
 907 %\centering
 908 %% Some packages, such as MDW tools, offer better commands for making tables
 909 %% than the plain LaTeX2e tabular which is used here.
 910 %\begin{tabular}{|c||c|}
 911 %\hline
 912 %One & Two\\
 913 %\hline
 914 %Three & Four\\
 915 %\hline
 916 %\end{tabular}
 917 %\end{table}
 918
 919
 920 % Note that IEEE does not put floats in the very first column - or typically
 921 % anywhere on the first page for that matter. Also, in-text middle ("here")
 922 % positioning is not used. Most IEEE journals/conferences use top floats
 923 % exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
 924 % footnotes above bottom floats. This can be corrected via the \fnbelowfloat
 925 % command of the stfloats package.
 926
 927
 928 % conference papers do not normally have an appendix
 929
 930
 931 % use section* for acknowledgement
 932 \section*{Acknowledgments}
 933 AG was supported by the CRESTA project, which has received
 934 funding from the European Community's Seventh Framework Programme
 935 (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported
 936 by UK EPSRC under grant EV/J007404/1.
 937
 938
 939 % trigger a \newpage just before the given reference
 940 % number - used to balance the columns on the last page
 941 % adjust value as needed - may need to be readjusted if
 942 % the document is modified later
 943 %\IEEEtriggeratref{8}
 944 % The "triggered" command can be changed if desired:
 945 %\IEEEtriggercmd{\enlargethispage{-5in}}
 946
 947 % references section
 948
 949 % can use a bibliography generated by BibTeX as a .bbl file
 950 % BibTeX documentation can be easily obtained at:
 951 % http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
 952 % The IEEEtran BibTeX style support page is at:
 953 % http://www.michaelshell.org/tex/ieeetran/bibtex/
 954 %\bibliographystyle{IEEEtran}
 955 % argument is your BibTeX string definitions and bibliography database(s)
 956 %\bibliography{IEEEabrv,../bib/paper}
 957 %
 958 % <OR> manually copy in the resultant .bbl file
 959 % set second argument of \begin to the number of references
 960 % (used to reserve space for the reference number labels box)
 961
 962 \putbib[Chapters/chapter14/biblio14]
 963 %\begin{thebibliography}{1}
 964
 965 %\bibitem{IEEEhowto:kopka}
 966 %H.~Kopka and P.~W. Daly, \emph{A Guide to \LaTeX}, 3rd~ed.\hskip 1em plus
 967 %  0.5em minus 0.4em\relax Harlow, England: Addison-Wesley, 1999.
 968
 969
 970
 971 %% \bibitem{succi-book}
 972 %% S. Succi, \textit{The lattice Boltzmann equation and beyond},
 973 %% Oxford University Press, Oxford, 2001.
 974
 975 %\bibitem{mpi-standard}
 976 %Message Passing Interface Forum, http://www.mpi-forum.org
 977
 978 %% \bibitem{desplat}
 979 %% Desplat, J.-C., I. Pagonabarraga, and P. Bladon,
 980 %% \textit{LUDWIG: A parallel lattice-Boltzmann code for complex fluids}.
 981 %% Comput. Phys. Comms., \textbf{134}, 273, 2001.
 982
 983 %% \bibitem{aidun2010}
 984 %% C.K. Aidun and J.R. Clausen,
 985 %% \textit{Lattice Boltzmann method for complex flows},
 986 %% Ann. Rev. Fluid Mech., \textbf{42} 439--472 (2010).
 987
 988 %\bibitem{bray1994}
 989 %A.J. Bray,
 990 %\textit{Theory of phase-ordering kinetics},
 991 %Adv. Phys., \textbf{43} 357--459 (1994).
 992
 993 %\bibitem{kendon2001}
 994 %V.M. Kendon, M.E. Cates, I. Pangonabarraga, J.-C. Desplat, and P. Bladon,
 995 %\textit{Inertial effects in the three-dimensional spinodal decomposition of a
 996 %symmetric binary fluid mixture; a lattice Boltzmann study},
 997 %J. Fluid Mech., \textbf{440}, 147--203 (2001).
 998
 999 %\bibitem{swift1996}
1000 %M.R. Swift, E. Orlandini, W.R. Osborn, and J.M. Yeomans,
1001 %\textit{Lattice Boltzmann simulations of liquid-gas and binary
1002 %fluid systems},
1003 %Phys. Rev. E, \textbf{54}, 5041--5052 (1996).
1004
1005 %% \bibitem{stratford2008}
1006 %% K. Stratford and I. Pagonabarraga,
1007 %% Parallel domain decomposition for lattice Boltzmann with moving particles,
1008 %% \textit{Comput. Math. with Applications} \textbf{55}, 1585 (2008).
1009
1010 %\bibitem{xe6}
1011 %Cray XE6 Product Brochure, available from
1012 %http://www.cray.com/Products/XE/Resources.aspx (2010)
1013
1014 %\bibitem{hector} HECToR home page, http://www.hector.ac.uk
1015
1016 %\bibitem{xk6}
1017 %Cray XK6 Product Brochure, available from
1018 %http://www.cray.com/Products/XK6/XK6.aspx (2011)
1019
1020
1021 %% \bibitem{wei2004}
1022 %% X. Wei, W. Li, K. M\"uller, and A.E. Kaufman,
1023 %% \textit{The lattice Boltzmann method for simulating gaseous phenomena},
1024 %% IEEE Transactions on Visualization and Computer Graphics,
1025 %% \textbf{10}, 164--176 (2004).
1026 %% Apparently first LB via GPU; a serious contribution using single fluid
1027 %%d3q19 single relaxation time.
1028
1029 %% \bibitem{zhu2006}
1030 %% H. Zhu, X. Liu, Y. Liu, and E. Wu,
1031 %% \textit{Simulation of miscible binary mixtures based on lattice Boltzmann method},
1032 %% Comp. Anim. Virtual Worlds, \textbf{17}, 403--410 (2006).
1033 %% Single relaxation (MRT mentioned) time d3q19 apparently sound although
1034 %%not a lot of detail. Pre-cuda so graphics code.
1035
1036 %% \bibitem{zhao2007}
1037 %% Y. Zhao,
1038 %% \textit{Lattice Boltzmann based PDE solver on the GPU},
1039 %% Visual Comput., doi 10.1007/s00371-0070191-y (2007).
1040
1041 %% \bibitem{toelke2010}
1042 %% J. T\"olke,
1043 %% Implementation of a lattice Boltzmann kernel using the compute unified
1044 %% device architecture developed by nVIDIA,
1045 %% Comput. Visual Sci. 13 29--39 (2010).
1046
1047 %% \bibitem{fan2004}
1048 %% Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover,
1049 %% \textit{GPU cluster for high performance computing},
1050 %% Proceedings of ACM/IEEE Supercomputing Conference, pp. 47--59,
1051 %% IEEE Computer Society Press, Pittsburgh, PA (2004).
1052
1053 %% \bibitem{myre2011}
1054 %% J. Myre, S.D.C. Walsh, D. Lilja, and M.O. Saar,
1055 %% \textit{Performance analysis of single-phase, multiphase, and multicomponent
1056 %% lattice Boltzmann fluid flow simulations on GPU clusters},
1057 %% Concurrency Computat.: Pract. Exper., \textbf{23}, 332--350 (2011).
1058
1059 %% \bibitem{obrecht2011}
1060 %% C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux,
1061 %% \textit{Multi-GPU implementation of the lattice Boltzmann method},
1062 %% Comput. Math. with Applications,
1063 %% doi:10.1016/j.camwa.2011.02.020 (2011).
1064
1065 %% \bibitem{bernaschi2010}
1066 %% M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras,
1067 %% \textit{A flexible high-performance lattice Boltzmann GPU code for the
1068 %% simulations of fluid flow in complex geometries},
1069 %% Concurrency Computat.: Pract. Exper., \textbf{22}, 1--14 (2010).
1070
1071 %% \bibitem{xian2011}
1072 %% W. Xian and A. Takayuki,
1073 %% \textit{Multi-GPU performance of incompressible flow computation by
1074 %% lattice Boltzmann method on GPU cluster},
1075 %% Parallel Comput., doi:10.1016/j.parco.2011.02.007 (2011).
1076
1077 %% \bibitem{feichtinger2011}
1078 %% C. Feichtinger, J. Habich, H. K\"ostler, G. Hager, U. R\"ude, and
1079 %% G. Wellein,
1080 %% A flexible patch-based lattice Boltzmann parallelization approach
1081 %% for heterogeneous GPU-CPU clusters,
1082 %% \textit{Parallel Computing} \textbf{37} 536--549 (2011).
1083
1084 %% \bibitem{wellein2006}
1085 %% G. Wellein, T. Zeiser, G Hager, and S. Donath,
1086 %% On the single processor performance of simple lattice Boltzmann kernels,
1087 %% \textit{Computers and Fluids}, \textbf{35}, 910--919 (2006).
1088
1089 %% \bibitem{pohl2003}
1090 %% T. Pohl, M. Kowarschik, J. Wilke, K. Igelberger, and U. R\"ude,
1091 %% Optimization and profiling of the cache performance of parallel
1092 %% lattice Boltzmann code,
1093 %% \textit{Parallel Process Lett.} \textit{13} 549--560 (2003).
1094
1095 %% \bibitem{mattila2007}
1096 %% K. Mattila, J. Hyv\"aluoma, T. Rossi, M. Aspn\"as and J. Westerholm,
1097 %% An efficient swap algorithm for the lattice Boltzmann method,
1098 %% \textit{Comput. Phys. Comms.} \textit{176} 200-210 (2007).
1099
1100 %% \bibitem{wittmann2012}
1101 %% M. Wittmann, T. Zeiser, G. Hager, and G. Wellein,
1102 %% Comparison of different propagation steps for lattice Boltzmann methods,
1103 %% \textit{Comput. Math with Appl.} doi:10.1016/j.camwa.2012.05.002 (2012).
1104
1105 %% \bibitem{walshsaar2012}
1106 %% S.D.C. Walsh and M.O. Saar,
1107 %% Developing extensible lattice Boltzmann simulators for general-purpose
1108 %% graphics-processing units,
1109 %% \textit{Comm. Comput. Phys.}, \textbf{13} 867--879 (2013).
1110
1111
1112 %% \bibitem{williams2011}
1113 %% S. Williams, L. Oliker, J. Carter, and J Shalf,
1114 %% Extracting ultra-scale lattice Boltzmann performance via
1115 %% hierarchical and distributed auto-tuning,
1116 %% \textit{Proc. SC2011}.
1117
1118
1119 %% \bibitem{ch14:stratford-jsp2005}
1120 %% K. Stratford, R. Adhikari, I. Pagonabarraga, and J.-C. Desplat,
1121 %% \textit{Lattice Boltzmann for Binary Fluids with Suspended Colloids},
1122 %% J. Stat. Phys. \textbf{121}, 163 (2005).
1123
1124 %% \bibitem{ladd1994}
1125 %% A.J.C. Ladd,
1126 %% Numerical simulations of particle suspensions via a discretized
1127 %% Boltzmann equation. Part 1. Theoretical foundation,
1128 %% \textit{J. Fluid Mech.} \textbf{271} 285--309 (1994);
1129 %% Part II. Numerical results,
1130 %% \textit{ibid.} \textbf{271} 311--339 (1994).
1131
1132
1133 %% \bibitem{nguyen2002}
1134 %% N.-Q. Nguyen and A.J.C. Ladd,
1135 %% Lubrication corrections for lattice Boltzmann simulations of particle
1136 %% suspensions,
1137 %% \textit{Phys. Rev. E} \textbf{66} 046708 (2002).
1138
1139 %% \bibitem{ch14:immersed}
1140 %% C.S. Peskin,
1141 %% Flow patterns around heart valves; a numerical method,
1142 %% \textit{J. Comp. Phys.}, \textbf{10}, 252--271 (1972);
1143 %% C.S. Peskin,
1144 %% The immersed boundary method,
1145 %% \textit{Acta Nummerica} \textbf{11} 479--517 (2002).
1146
1147 %% \bibitem{ch14:immersed-lb}
1148 %% Z.-G. Feng and E.E. Michaelides,
1149 %% The immersed boundary-lattice Boltzmann method for solving
1150 %% fluid-particles interaction problem,
1151 %% \textit{J. Comp. Phys.}, \textbf{195} 602--628 (2004).
1152
1153 %% \bibitem{ch14:spm}
1154 %% Y. Nakayama and R. Yammamoto,
1155 %% Simulation method to resolve hydrodynamic interactions in colloidal
1156 %% dispersions,
1157 %% \textit{Phys. Rev. E}, \textbf{71} 036707 (2005).
1158
1159
1160 %\end{thebibliography}
1161
1162
1163
1164
1165
1166
1167
1168 % LocalWords:  CRESTA EPSRC Succi Desplat Pagonabarraga