BookGPU/Chapters/chapter14/ch14.tex

   1 \chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh, United Kingdom}
   2
   3 \chapter[Ludwig: multiple GPUs for a fluid lattice Boltzmann
   4 application]{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
   5 application}
   6
   7 %\putbib[biblio]
   8
   9
  10 \section{Introduction}
  11 The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overview see, e.g.,
  12 \cite{succi-book}) has become a popular approach to a variety of fluid
  13 dynamics problems.  It provides a way to solve the incompressible
  14 isothermal Navier-Stokes equations and has the attractive features of
  15 being both explicit in time and local in space. This makes the LB
  16 method wellsuited to parallel computation. Many efficient parallel
  17 implementations of the LB method have been undertaken, typically using
  18 a combination of distributed domain decomposition and the Message
  19 Passing Interface (MPI). However, the potential
  20 performance benefits offered by GPUs has motivated a new ``mixed-mode''
  21 approach to address very large problems. Here, fine-grained
  22 parallelism is implemented on the GPU, while MPI is reserved for
  23 largerscale parallelism.  This mixed mode is of increasing interest
  24 to application programmers at a time when many supercomputing services
  25 are moving
  26 toward clusters of GPU accelerated nodes. The design questions which
  27 arise when developing a lattice Boltzmann code for this type of
  28 heterogeneous system are therefore worth studying. Furthermore, similar
  29 questions also recur in many other types of stencil-based algorithms.
  30
  31 The first applications of LB on GPUs were to achieve fluid-like
  32 effects in computer animation, rather than scientific applications per
  33 se.  These early works include simple fluids~\cite{wei2004}, miscible
  34 two-component flow~\cite{zhu2006}, and various image processing tasks
  35 based on the use of partial differential equations~\cite{zhao2007}.
  36 While these early works used relatively low level graphics APIs, the
  37 first CUDA runtime interface implementation was a two-dimensional
  38 simple fluid problem~\cite{toelke2010}.  Following pioneering work on
  39 clusters of GPUs coupled via MPI to study air
  40 pollution~\cite{fan2004}, more recent work has included mixed OpenMP
  41 and CUDA~\cite{myre2011}, Posix threads and CUDA~\cite{obrecht2011},
  42 and MPI and CUDA for increasingly large GPU clusters
  43 \cite{bernaschi2010,xian2011,feichtinger2011}. The heterogeneous
  44 nature of these systems has also spurred interest in approaches
  45 including automatic code generation \cite{walshsaar2012} and auto-tuning
  46 \cite{williams2011} to aid application performance.
  47
  48 Many of these authors make use of
  49 another attractive feature of LB: the ability to include fixed
  50 solid-fluid boundary conditions as a straightforward addition to
  51 the algorithm to study, for example, flow in
  52 porous media. This points to an important
  53 application area for LB methods: that of complex fluids.
  54 Complex fluids include mixtures, surfactants, liquid crystals,
  55 and particle suspensions, and typically require additional physics
  56 beyond the bare Navier-Stokes equations to provide a full
  57 description~\cite{aidun2010}. The representation of this extra
  58 physics raises additional design questions for the application
  59 programmer. Here, we consider the \textit{Ludwig} code \cite{desplat}\index{Ludwig code},
  60 an LB application developed specifically for complex fluids
  61 (\textit{Ludwig} was named for Boltzmann, 1844--1906).
  62 We will present the steps
  63 required to allow \textit{Ludwig} to exploit efficiently both a single
  64 GPU and many GPUs in parallel.  We show that
  65 \textit{Ludwig} scales excellently to at least the one thousand GPU level
  66 (the largest resource available at the time of writing) with
  67 indications that the code will scale to much larger systems as
  68 they become available. In addition, we consider the steps required
  69 to represent fully-resolved moving solid particles (usually referred
  70 to as colloids in the context of complex fluids). Such particles
  71 need to have their surface resolved on the lattice scale in order
  72 to include relevant surface physics and must be able to move, e.g.,
  73 to execute Brownian motion in response to random forces from the
  74 fluid. Standard methods are available to represent such particles
  75 (e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
  76 domain decomposition and message passing \cite{stratford2008}.
  77 We describe below the additional considerations which arise
  78 for such moving particles when developing an
  79 implementation on a GPU.
  80
  81 In the following section we provide a brief overview of the lattice
  82 Boltzmann method, and mention some of the general issues which can
  83 influence performance in the CPU context.
  84 In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
  85 which are required to exploit the GPU architecture effectively and
  86 highlight how the resulting code differs from the CPU version. In
  87 Section \ref{ch14:sec:parallelgpu}, we extend this description to
  88 include the steps required to allow exploitation of many GPUs in
  89 parallel while retaining effective scaling. We also present results for
  90 a typical benchmark for a fluid-only problem in Section
  91 \ref{ch14:sec:parallelgpu} to demonstrate the success of the approach.
  92 In Section \ref{ch14:sec:particles}, we describe the design choices
  93 which are relevant to a GPU implementation of moving particles. Finally,
  94 we include a number of general observations on software
  95 engineering and maintenance which arise from our experience.
  96 A summary is provided in Section~\ref{ch14:sec:summary}.
  97
  98
  99 \section{Background}\label{ch14:sec:background}
 100
 101 %\begin{figure}[!t]
 102 %\centering
 103 %\includegraphics[width=12cm]{Chapters/chapter14/figures/basiclattice}
 104 %\caption{The lattice Boltzmann approach discritises space into a 3D lattice (left). The fluid is represented at each lattice site (right), by the {\it distribution} vector: each component corresponding to a specific velocity direction. Here we are interested in the {\it D3Q19} model, in which there are 3 spatial directions and 19 velocity components (including a zero component).}
 105 %\label{ch14:fig:basiclattice}
 106 %\end{figure}
 107
 108
 109 For a general complex fluid problem the starting point is the
 110 fluid velocity field $\mathbf{u}(\mathbf{r})$, whose evolution
 111 obeys the Navier-Stokes equations describing the conservation of mass
 112 (or density $\rho$), and momentum:
 113 \begin{equation}
 114 \rho  [ \partial_t \mathbf{u} + (\mathbf{u}.\mathbf{\nabla})\mathbf{u} ]
 115 = -\nabla p + \eta \nabla^2 \mathbf{u} + \mathbf{f}(\mathbf{r}),
 116 \end{equation}
 117 where $p$ is the isotropic pressure and $\eta$ is the viscosity.
 118 A local force $\mathbf{f}(\mathbf{r})$ provides a means for coupling
 119 to other complex fluid constituents, e.g., it might represent the force
 120 exerted on the fluid by a curved interface between different phases or
 121 components.
 122
 123 The LB approach makes use of a regular three-dimensional
 124 lattice (see Figure \ref{ch14:fig:decomphalo}) with discrete spacing
 125 $\Delta r$. It also makes use of a
 126 discrete velocity space $\mathbf{c}_i$, where the $\mathbf{c}_i$
 127 are chosen to capture the correct symmetries of the Navier-Stokes
 128 equations. A typical choice, used here, is the so-called D3Q19
 129 basis in three dimensions where there is one velocity such that
 130 $\mathbf{c} \Delta t$ is zero, along with six extending to the nearest
 131 neighbor
 132 lattice sites, and twelve extending to the next-nearest neighbor sites
 133 ($\Delta t$ being the discrete time step). The fundamental object
 134 in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
 135 moments are related to the local hydrodynamic quantities: the fluid
 136 density, momentum, and stress. The time evolution of the distribution
 137 function is described by a discrete Boltzmann equation
 138 \begin{equation}
 139 f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t)
 140 = - {\cal L}_{ij} f_j(\mathbf{r};t).
 141 \end{equation}
 142 It is convenient to think of this in two stages. First, the right-hand
 143 side represents the action of a collision operator ${\cal L}_{ij}$,
 144 which is local to each lattice site and relaxes the distribution toward
 145 a local equilibrium at a rate ultimately related to the fluid viscosity.
 146 Second, the left-hand side represents a propagation step (sometimes referred
 147 to as streaming step), in which each element $i$ of the distribution is
 148 displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
 149 appropriate direction per discrete time step.
 150
 151 More specifically, we store a vector of 19 double-precision floating
 152 point values at each lattice site for the distribution function
 153 $f_i(\mathbf{r};t)$.
 154 The collision operation ${\cal L}_{ij}$, which is local at each lattice
 155 site, may be thought of as follows. A matrix-vector multiplication
 156 ${\cal M}_{ij}f_j$ is used to transform the distributions into the
 157 hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
 158 matrix related to the choice of
 159 $\mathbf{c}_i$. The nonconserved hydrodynamic quantities are then
 160 relaxed toward their (known) equilibrium values and are transformed
 161 back to new postcollision distributions via the inverse transformation
 162 ${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
 163 floating point multiplications per lattice site. (Additional operations are
 164 required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
 165 In contrast, the
 166 propagation stage consists solely of moving the distribution values
 167 between lattice sites and involves no floating point operations.
 168
 169
 170 \begin{figure}[tbp]
 171 \centering
 172 \includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
 173 \caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
 174   clarity we show a 2D decomposition of a 3D lattice, but in practice
 175   we decompose in all 3 dimensions. Halo cells are added to each
 176   subdomain (as shown on the upper right for a single slice) which store
 177   data retrieved from remote neighbors in the halo exchange. Lower
 178   right: the D3Q19 velocity set resident on a lattice site;
 179   highlighted are the 5 ``outgoing'' elements to be transferred in a
 180   specific direction.}
 181 \label{ch14:fig:decomphalo}
 182 \end{figure}
 183
 184
 185 In the CPU version, the \textit{Ludwig} implementation stores one time
 186 level of distribution values $f_i(\mathbf{r}; t)$. This distribution
 187 is stored as a flat array of C data type \texttt{double} and laid out
 188 so that all elements of the velocity are contiguous at a given site
 189 (often referred to as ``array-of-structures'' order). This is
 190 appropriate for sums over the distribution required locally at the
 191 collision stage as illustrated schematically in
 192 Listing~\ref{ch14:listing1}: the fact that consecutive loads are from
 193 consecutive memory addresses allows the prefetcher to engage fully.
 194 (The temporary scalar \texttt{a\_tmp} allows caching of the
 195 intermediate accumulated value in the innermost loop.)  A variety of
 196 standard sequential optimisations are relevant for the collision stage
 197 (loop unrolling, inclusion of SIMD operations, and so on
 198 \cite{wellein2006}).  For example, the collision stage in
 199 \textit{Ludwig} includes explicit SIMD code, which is useful if the
 200 compiler on a given platform cannot identify it.  The propagation
 201 stage is separate and is organised as a ``pull'' (again, various
 202 optimizations have been considered, e.g.,
 203 \cite{pohl2003,mattila2007,wittmann2012}).  No further optimization is
 204 done here beyond ensuring that the ordering of the discrete velocities
 205 allows memory access to be as efficient as possible. While these
 206 optimizations are important, it should be remembered that for some
 207 complex fluid problems, the hydrodynamics embodied in the LB
 208 calculation is a relatively small part of the total computational cost
 209 (at the 10\%level in some cases).  This means optimization effort may
 210 be better concentrated elsewhere.
 211
 212
 213
 214 \begin{lstlisting}[float, label=ch14:listing1,
 215 caption = collision schematic for CPU.]
 216 /* loop over lattice sites */
 217 for (is = 0; is < nsites; is++) {
 218   ...
 219   /* load distribution from ``array-of-structures'' */
 220   for (i = 0; i < 19; i++)
 221     f[i] = f_aos[19*is + i]
 222   ...
 223   /* perform matrix-vector multiplication */
 224   for (i = 0; i < 19; i++) {
 225     a_tmp = 0.0;
 226     for (j = 0; j < 19; j++) {
 227       a_tmp += f[j]*M[i][j];
 228     }
 229     a[i] = a_tmp;
 230   }
 231   ...
 232 }
 233 \end{lstlisting}
 234
 235
 236
 237
 238
 239 The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
 240 Each local subdomain is surrounded by a halo, or ghost, region of width one
 241 lattice site. While the collision is local, elements of the distribution
 242 must be exchanged at the edges of the domains to facilitate the propagation.
 243 To achieve the full 3D halo exchange, the standard approach of shifting the
 244 relevant data in each coordinate direction in turn is adopted. This
 245 requires appropriate synchronization, i.e., a receive in the the first
 246 coordinate direction must be complete before a send in the second direction
 247 involving relevant data can take place, and so on. We note that only
 248 ``outgoing'' elements of the distribution need to be sent at each edge.
 249 For the D3Q19 model, this reduces the volume of data traffic from 19 to
 250 5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
 251 version, the necessary transfers are implemented in place using
 252 a vector of appropriately strided MPI datatypes for each direction.
 253
 254
 255 \section{Single GPU implementation}\label{ch14:sec:singlegpu}
 256
 257
 258 In this section we describe the steps taken to enable \textit{Ludwig}
 259 for the GPU. There are a number of crucial issues: first, the
 260 minimization of data traffic between host and device; second, the
 261 optimal mapping of available parallelism onto the architecture; and
 262 third, the issue of memory coalescing. We discuss each of these in
 263 turn.
 264
 265 While the most important section of the LB in terms of
 266 floating-point performance is the collision stage, this cannot be the
 267 only consideration for a GPU implementation. It is essential to
 268 offload all computational activity which involves the main data
 269 structures (such as the distribution) to the GPU. This includes
 270 kernels with relatively low computational demand, such as the
 271 propagation stage. All relevant data then remain resident on the GPU,
 272 to avoid expensive host-device data transfer at each iteration of
 273 the algorithm. Such transfers would negate any benefit of GPU
 274 acceleration.  We note that for a complex fluid code, this requirement
 275 can extend to a considerable number of kernels, although we limit the
 276 discussion to collision and propagation for brevity.
 277
 278 To achieve optimal performance, it is vital to exploit fully the
 279 parallelism inherent in the GPU architecture, particularly for
 280 those matrix-vector operations within the collision kernel.
 281 The GPU architecture features a hierarchy of parallelism. At the
 282 lowest level, groups of 32 threads (warps) operate in
 283 lock-step on different data elements: this is SIMD-style vector-level
 284 parallelism. Multiple warps are combined into a thread block (in which
 285 communication and synchronization are possible), and multiple blocks can
 286 run concurrently across the streaming multiprocessors in the GPU
 287 (with no communication or
 288 synchronization possible across blocks).  To decompose on the GPU, we
 289 must choose which part of the collision to assign to which level of
 290 parallelism.
 291
 292 While there exists parallelism within the matrix-vector operation, the
 293 length of each vector (here, 19) is smaller than the warp size and typical
 294 thread block sizes. So we simply decompose the loop over lattice sites
 295 to all levels of parallelism, i.e., we use a separate CUDA thread for
 296 every lattice site, and each thread performs the full matrix-vector
 297 operation. We find that a block size of 256 performs well on current
 298 devices: we therefore decompose
 299 the lattice into groups of 256 sites and assign each group to a block
 300 of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
 301 assigned to the fast {\it constant} on-chip device memory.
 302
 303 For the propagation stage, the GPU implementation adds a second time
 304 level of distribution values. The datadependencies inherent in the
 305 propagation mean that the in-place propagation of the CPU version
 306 cannot be parallelized effectively without the additional time level.
 307 As both time levels may remain resident on the GPU, this is not a
 308 significant overhead.
 309
 310 An architectural
 311 constraint of GPUs means that optimal global memory bandwidth
 312 is only achieved when data are structured such that threads within a
 313 {\it half-warp} (a group of 16 threads) load data from the same memory
 314 segment in a single transaction: this is memory coalescing. It can be
 315 seen that the array-of-structures ordering used for the distribution
 316 in the CPU code would not be suitable for coalescing; in fact, it would
 317 result in serialized memory accesses and relative poor performance.
 318 To meet the coalescing criteria and allow consecutive threads to read
 319 consecutive memory addresses on the GPU, we transpose the layout of the
 320 distribution so that, for each velocity component, consecutive sites
 321 are contiguous in memory (``structure-of-arrays'' order). A schematic of
 322 the GPU collision code is given in Listing~\ref{ch14:listing2}.
 323
 324 \begin{lstlisting}[float, label=ch14:listing2,
 325 caption = collision schematic for GPU.]
 326 /* compute current site index 'is' from CUDA thread and block */
 327 /* indices and load distribution from "structure-of-arrays" */
 328
 329 for (i = 0; i < 19; i++)
 330   f[i] = f_soa[nsites*i + is]
 331
 332 /* perform matrix-vector multiplication as before */
 333 ...
 334 \end{lstlisting}
 335
 336
 337 \textit{Ludwig} was modified to allow a choice of distribution data layout
 338 at compilation time depending on the target architecture: CPU or GPU. We
 339 defer some further comments on software engineering aspects of the code to
 340 the summary.
 341
 342
 343 \section{Multiple GPU implementation}\label{ch14:sec:parallelgpu}
 344
 345 To execute on multiple GPUs, we use the same domain decomposition
 346 and message passing framework as the CPU version. Within each
 347 subdomain (allocated to one MPI task) the GPU implementation
 348 proceeds as described in the previous section. The only additional
 349 complication is that halo transfers between GPUs must be staged
 350 through the host (in the future, direct GPU-to-GPU data transfers via
 351 MPI may be possible, obviating the need for these steps). This
 352 means host MPI sends must be preceded by appropriate device-to-host
 353 transfers and host MPI receives must be followed by corresponding host-to-device transfers.
 354
 355 %To execute on multiple GPUs in parallel we decompose the problem using
 356 %the existing high-level parallelisaton framework: each MPI task acts
 357 %as a host to a separate GPU, with each GPU kernel operating on the
 358 %subset of the lattice volume corresponding to the partition assigned
 359 %to that task.  As described in Section \ref{ch14:sec:background}, halo
 360 %data exchanges are required at each timestep.  Significant
 361 %developments are made for the GPU adaptation of this communication
 362 %phase to achieve good inter-GPU communication performance and allow
 363 %scaling to many GPUs.
 364
 365 %The CPU code uses strided MPI datatypes to define the six
 366 %planes of lattice data to be sent or received, and performs the
 367 %communications {\it in-place}, i.e.  there is no explicit buffering of
 368 %data in the application.  For the GPU adaptation, the halo planes must
 369 %be exchanged between GPUs, on which MPI functionality is not
 370 %available, so transfers must be staged through the host CPUs. Buffer
 371 %packing/unpacking of the halo planes is explicitly performed using
 372 %CUDA kernels, and CUDA memory copy calls are used to transfer data
 373 %from a device to its host (using pinned host memory) through the PCI-e
 374 %bus (and vice versa). MPI calls are used to transfer the buffers
 375 %between hosts. The CPU version of Ludwig has a mechanism, using MPI built-in
 376 %data types, to reduce the amount of data communicated by around a
 377 %factor of four by only sending those elements of the distribution
 378 %propagating outward from a local domain. For the GPU adaptation, this
 379 %filtering of data is explicitly performed (again since MPI
 380 %functionality is not available on the GPU) in the buffer
 381 %packing/unpacking kernels.  Special care had to be taken for those
 382 %lattice sites on the corners of the halo planes, which had to be sent
 383 %in more than one direction (since the propagation includes accesses to
 384 %diagonal neighbours).
 385
 386 In practice, this data movement requires additional GPU kernels to
 387 pack and unpack the relevant data before and after corresponding MPI
 388 calls. However, the standard shift algorithm, in which each coordinate
 389 direction is treated in turn, does provide some scope for the
 390 overlapping of different operations.
 391 For example, after the data for the first coordinate direction have been
 392 retrieved by the host, these can be exchanged using MPI between
 393 hosts at the same time as kernels for packing and retrieving of
 394 data for the second coordinate direction are
 395 executed. This
 396 overlapping must respect the synchronisation required to ensure
 397 that data values at the corners of the subdomain are transferred
 398 correctly.
 399 We use a separate CUDA stream for each coordinate direction:
 400 this allows some of the host-device communication time to be
 401 effectively ``hidden'' behind the host-host MPI communication,
 402 resulting in an overall speedup.
 403 The improvement is more pronounced for the smaller local lattice size,
 404 perhaps because of less CPU memory bandwidth contention. The
 405 overlapping is then a particularly valuable aid to strong scaling,
 406 where the local system size decreases as the number of GPUs increases.
 407
 408
 409 %\section{Performance}\label{ch14:sec:performance}
 410
 411 To demonstrate the effectiveness of our approach, we compare the
 412 performance of both CPU and GPU versions of \textit{Ludwig}. To
 413 test the complex fluid nature of the code, the problem is
 414 actually an immiscible fluid mixture which uses a second distribution
 415 function to introduce a composition variable. The interested reader
 416 is referred to \cite{ch14:stratford-jsp2005} for further details. The
 417 largest total problem size used is $2548\times 1764\times 1568$.
 418 The CPU system features a Cray XE6 architecture with 2 16-core AMD Opteron
 419 CPUs per node, and with nodes interconnected using Cray Gemini technology.
 420 For GPU results, a Cray XK6 system is used: this is very similar to the
 421 XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
 422 node in the GPU system therefore features a single Opteron CPU acting
 423 as a host to a single GPU. The internode interconnect architecture is
 424 the same as for the Cray XE6. The GPU performance tests use a prototype
 425 Cray XK6 system with 936 nodes (the largest available at the time of writing).
 426 To provide a fair comparison, we compare scaling on a \textit{per node}
 427 basis. That is, we compare 1 fully occupied 32-core CPU node
 428 (running 32 MPI tasks) with 1 GPU node (host running 1 MPI task,
 429 and 1 device). We believe this is representative of the true ``cost'' of
 430 a simulation in terms of accounting budgets, or electricity.
 431
 432
 433 %\begin{figure}[!t]
 434 %\centering
 435 %\includegraphics[width=10cm]{Chapters/chapter14/figures/length_scale}
 436 %\caption{The dependence of the characteristic length scale of a binary
 437 %  fluid mixture on the simulation timestep number. Superimposed are
 438 %  three snapshots showing increased mixture separation with time:
 439 %  these are obtained through the order parameter and each show one
 440 %  eighth of the total simulation size, which is $1548\times 1764\times 1568$. }\label{ch14:fig:length_scale}
 441 %\end{figure}
 442
 443 \begin{figure}[!t]
 444 \centering
 445 \includegraphics[width=10cm]{Chapters/chapter14/figures/two_graphs_ud}
 446 \caption[The weak (top) and strong (bottom) scaling of \textit{Ludwig}.]{The weak (top) and strong (bottom) scaling of \textit{Ludwig}.  Closed
 447   shapes denote results using the CPU version run on the Cray
 448   XE6 (using two 16-core AMD Interlagos CPUs per node), while open
 449   shapes denote results using the GPU version on the Cray XK6 (using a
 450   single NVIDIA X2090 GPU per node). Top: the benchmark time is shown
 451   where the problem size per node is constant. Bottom: performance is
 452   shown where, for each of the four problem sizes presented, the
 453   results are scaled by the lattice volume, and all results are
 454   normalized by the same arbitrary constant.  }
 455 \label{ch14:fig:scaling}
 456 \end{figure}
 457
 458
 459 %The top of Figure \ref{ch14:fig:scaling} shows the benchmark time,
 460 %where the problem size is increased with the number of nodes. This
 461 %shows weak scaling: the time taken by a perfectly scaling code would
 462 %remain constant. The figure plots the time against the number of {\it
 463 %  nodes}: for the GPU version we utilize the one GPU per Cray XK6 node
 464 %whilst for the CPU version we fully utilize two 16-core CPUs
 465 %per Cray XE6 node, i.e. one GPU is compared with 32 CPU cores.  We use
 466 %a single MPI task per Cray XK6 node to host the GPU, and 32 MPI tasks
 467 %to fully utilize the CPUs in the Cray XE6 node.  Note that our CPU
 468 %version is highly optimised: in particular it has been tuned for
 469 %optimal utilisation of the SIMD vector units in each CPU core.
 470
 471 Figure \ref{ch14:fig:scaling} shows the results of both weak and
 472 strong scaling tests (top and bottom panels, respectively). For
 473 weak scaling, where the local subdomain size is kept fixed
 474 (here a relatively large 196$^3$ lattice), the time taken by
 475 an ideal code would remain constant. It can be seen that while
 476 scaling of the CPU version shows a pronounced upward slope, the
 477 GPU version scales almost perfectly. The advantage of the GPU
 478 version over the CPU version is the fact that the GPU version
 479 has a factor of 32 fewer MPI tasks per node, so communications
 480 require a significantly smaller number of larger data transfers.
 481 The performance
 482 advantage of the GPU version ranges from a factor of around 1.5 to
 483 around 1.8. Careful studies by other authors \cite{williams2011} have
 484 found the absolute performance (in terms of floating-point
 485 operations per second, or floating-point operations per Watt) to
 486 be remarkably similar between architectures.
 487
 488 Perhaps of more interest is the strong scaling picture (lower
 489 panel in Figure \ref{ch14:fig:scaling}) where the performance
 490 as a function of the number of nodes is measured for fixed problem
 491 size. We consider four different fixed problem sizes on both CPU
 492 (up to 512 nodes shown) and GPU (up to 768 nodes). To allow comparison,
 493 the results are scaled by total system size in each case. For strong
 494 scaling, the disparity in the number of MPI tasks is clearly revealed
 495 in the failure of the CPU version to provide any significant benefit
 496 beyond a modest number of nodes as communication overheads dominate.
 497 In contrast, the GPU version shows
 498 reasonably robust scaling even for the smaller system sizes and good
 499 scaling for the larger systems.
 500
 501 \section{Moving solid particles}\label{ch14:sec:particles}
 502
 503 \begin{figure}[t]
 504 \centering
 505 %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
 506 \includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new}
 507 \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
 508 A two-dimensional schematic picture of spherical particles on the lattice.
 509 Left: a particle is allowed
 510 to move continuously across the lattice, and the position of the
 511 surface defines fluid lattice sites (light blue) and solid lattice
 512 sites (dark red). The discrete surface is defined by links
 513 where propagation would intersect the surface (arrows). Note the
 514 discrete shape of the two particles is different. Right: post-collision
 515 distributions are reversed at the surface by the process of bounce-back
 516 on links, which replaces the propagation.
 517 %A 2D illustration of colloidal particles (grey circles)
 518 %  interacting with the lattice, where each lattice point is
 519 %  represented by a cross. Left: each particle spans multiple sites.
 520 %  Right: the particle-facing distribution components are moved inside
 521 %  the particle with the direction reversed, such that the ``bounce
 522 %  back'' will be completed when the fluid propagates.
 523 }
 524 \label{ch14:fig:bbl}
 525 \end{figure}
 526
 527
 528 %\begin{figure}[t]
 529 %\centering
 530 %\includegraphics[width=6cm]{Chapters/chapter14/figures/particlemove}
 531 %\caption{A 2D illustration of a colloidal particle moving across the
 532 %    lattice. The old and new positions are represented by the open and
 533 %    filled circles respectively.}
 534 %\label{ch14:fig:particlemove}
 535 %\end{figure}
 536
 537 % BEING NEW SOLID TEXT
 538
 539 The introduction of moving solid particles poses an additional hurdle
 540 to efficient GPU implementation of an LB code such as \textit{Ludwig}.
 541 In this section, we give a brief overview of the essential features
 542 of the CPU implementation, and how the considerations raised in the
 543 previous sections---maximizing parallelism and minimising both
 544 host-to-device and GPU to GPU data movement---shape the design decisions
 545 for a GPU implementation. We restrict our discussion to a simple fluid
 546 in this section; additional solid-fluid boundary conditions (e.g.,
 547 wetting at a fluid-fluid-solid contact line) usually arise elsewhere
 548 in the calculation broadly independent of the hydrodynamic boundary
 549 conditions which are described in what follows.
 550
 551 Moving solid particles (here, spheres) are defined by a center position
 552 which is allowed to move continuously across the space of the lattice,
 553 and a fixed radius which is typically a few lattice spacings $\Delta r$.
 554 The surface of the particle is defined by a series of \textit{links}
 555 where a discrete velocity propagation $\mathbf{c}_i \Delta t$ would
 556 intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
 557 Hydrodynamic boundary conditions are then implemented via the standard
 558 approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
 559 relevant postcollision distribution values are reversed at the
 560 propagation stage with an appropriate correction to allow for the
 561 solid body motion. The exchange of momentum at each link must then be
 562 accumulated around the entire particle surface to provide the net
 563 hydrodynamic force
 564 and torque on the sphere. The particle motion can then be updated in a
 565 molecular dynamics-like step.
 566
 567 For the CPU implementation, a number of additional MPI communications
 568 are required: (1) to exchange the center position, radius, etc of each
 569 particle as it moves and (2), to allow the accumulation of the force
 570 and torque for each particle (the links of which may be distributed
 571 between up to 8 MPI tasks in three dimensions). Appropriate marshaling
 572 of these data can provide an effective parallelization for relatively
 573 dense particle suspensions of up to  40\% solid volume fraction
 574 \cite{stratford2008}. As a final consideration, fluid distributions
 575 must be removed and replaced consistently as a given particle moves
 576 across the lattice and changes its discrete shape.
 577
 578 With these features in mind, we can identify a number of competing
 579 concerns which are relevant to a GPU implementation:
 580 \begin{enumerate}
 581 \item
 582 Minimisation of host-device data transfer would argue for moving the
 583 entire particle code to the GPU. However, the code in question involves
 584 largely conditional logic (e.g., identifying cut surface links) and
 585 irregular memory accesses (e.g., access to distribution elements around
 586 a spherical particle). These operations seem poorly suited to effective
 587 parallelization on the GPU. As an additional complication, the sums
 588 required over the particle surface would involve potentially tricky
 589 and inefficient reductions in GPU memory.
 590 \item
 591 The alternative is to retain the relevant code on the CPU. While the
 592 transfer of the entire distribution $f_i(\mathbf{r};t)$ between host
 593 and device at each time step is unconscionable owing to PCIe bus bandwidth
 594 considerations, the transfer of only relevant distribution information
 595 to allow bounce-back on links is possible. At modest solid volumes
 596 only a very small fraction of the distribution data is
 597 involved in bounce-back (e.g., a 2\% solid volume fraction of particles
 598 radius 2.3$\Delta r$ would involve approximately 1\% of the distribution
 599 data). This option also has the advantage that no further host-device
 600 data transfers are necessary to allow the MPI exchanges required for
 601 particle information.
 602 \end{enumerate}
 603
 604
 605 We have implemented the second option as follows. For each subdomain,
 606 a list of boundary-cutting links is assembled on the CPU which includes
 607 the identity of the relevant element of the distribution. This list,
 608 together with the particle information required to compute the correct
 609 bounce-back term, are transferred to the GPU. The updates to the relevant
 610 elements of the distribution can then take place on the GPU. The
 611 corresponding information to compute the update of the particle dynamics
 612 is returned
 613 to the CPU, where the reduction over the surface links is computed.
 614 The change of particle shape may be dealt with in a similar manner:
 615 the relatively small number of updates required at any one time step
 616 (or however frequently the particle position is updated) can be
 617 marshaled to the GPU as necessary. This preliminary implementation
 618 is found to be effective on problems involving up to 4\%
 619 solid volume fraction.
 620
 621 We note that the CPU version actually avoids the collision calculation
 622 at solid lattice points by consulting a look-up table of solid/fluid
 623 status. On the GPU, it is perhaps preferable to perform the collision
 624 stage everywhere instead of moving the look-up table to the GPU and
 625 introducing the associated logic.
 626 Ultimately, the GPU might favor other boundary methods which treat solid and
 627 fluid on a somewhat more equal basis, for example, the immersed boundary
 628 method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method
 629 \cite{ch14:spm}.
 630 However, the approach adopted here  allows us to exploit
 631 the GPU for the intensive fluid simulation while maintaining the complex
 632 code required for particles on the CPU. Overheads of CPU-GPU transfer are
 633 minimized by transferring only those data relevant to the hydrodynamic
 634 interaction implemented via bounce-back on links.
 635
 636 % END NEW SOLID TEXT
 637
 638 %It is of interest, in many situations, to include colloidal particles
 639 %in the simulation **more from Kevin***. These are by represented by
 640 %moving spheres that interact with the fluid through transfer of
 641 %momenta: fluid coming into contact with a particle will bounce off it,
 642 %giving the particle a kick in the process. As illustrated on the left
 643 %of Figure \ref{ch14:fig:bbl} (which shows 2 dimensions only for
 644 %clarity - in practice, the particles will be represented by spheres on
 645 %a 3D lattice, and will be much larger relative to the lattice
 646 %spacing), typically each particle spans multiple lattice sites, and
 647 %therefore ``intercepts'' multiple inter-site links. Ludwig stores
 648 %particle information using nested linked lists: a list of particles,
 649 %each with a list of all the links intercepted by that particle (plus
 650 %other quantities). The fluid-particle interaction proceeds using the
 651 %so-called {\it bounce back on links} procedure during each timestep. As
 652 %illustrated in the right of Figure \ref{ch14:fig:bbl}, for each of the
 653 %intercepting links, those velocity components of the distribution that
 654 %face inwards to each particle are moved from the site
 655 %outside the particle to the site inside the particle, and
 656 %reversed in direction. This is done before the propagation stage,
 657 %which then propagates these same velocity components back out of the
 658 %particles: the velocity bounces back from the particles. The
 659 %momenta transfer from fluid to particles is also determined
 660 %from the distribution values.
 661
 662 %A simplified schematic of the implementation on the CPU is as follows,
 663 %noting that multiplicative coefficients are required:
 664 %{\footnotesize
 665 %\begin{verbatim}
 666 %For each particle:
 667 %  For each link:
 668 %    Calculate coefficient
 669 %    Bounce back fluid using coefficient
 670 %    Update particle momentum
 671 %\end{verbatim}
 672 %}
 673 %This process is relatively inexpensive on the CPU, since the total
 674 %number of particles (and intercepted links) is always relatively
 675 %small, but it does require access to distribution data which we want
 676 %to keep resident on the GPU throughout the simulation. The overhead of
 677 %transferring the full distribution to the CPU and back every timestep
 678 %would be too high. But on the other hand, the full codebase required
 679 %to manage particles is quite complex and comprehensive: completely
 680 %porting it to the GPU, such that the particles effectively also remain
 681 %resident on the GPU, would be a large effort and would cause a major
 682 %diversion of CPU and GPU codebases. Further complications would arise
 683 %when running on multiple GPUs due to the fact that particles can move
 684 %between parallel subdomains. Our solution, which minimises overhead,
 685 %is to keep the particles resident on CPU, but offload only their
 686 %interaction with the fluid to the GPU, as we describe below.
 687
 688 %On the CPU, the code in the above listing can be restructured into
 689 %three distinct stages as (introducing arrays to temporrily store
 690 %indices and data):
 691 %{\footnotesize
 692 %\begin{verbatim}
 693 %1) For each particle, for each link:
 694 %     Store site and velocity indices associated with link
 695 %     Calculate and store coefficient
 696 %2) For each particle, for each link:
 697 %     Update velocity data using stored values
 698 %     Calculate and store particle momenta updates
 699 %3) For each particle, for each link:
 700 %     Update particle momentum using stored values
 701 %\end{verbatim}
 702 %}
 703 %Stage 2 is only one that that accesses fluid data: this can therefore
 704 %be moved to the GPU, while stages 1 and 3 remain on the CPU. Stage 2
 705 %therefore becomes, for the GPU implementation
 706 %{\footnotesize
 707 %\begin{verbatim}
 708 %Transfer coefficient and link index arrays to GPU
 709 %Run CUDA kernel:
 710 %  Update fluid data using coefficient and link index arrays
 711 %  Calculate and store particle momenta updates from fluid data
 712 %Transfer particle momenta updates array back to CPU
 713 %\end{verbatim}
 714 %}
 715 %Since the total number of links involves is relatively small, these
 716 %partial data transfers have minimal overhead.  A complication arises
 717 %due to the fact that the particles move through the lattice, albeit
 718 %relatively slowly, and therefore the list of links that they intercept
 719 %occasionally changes.  When a
 720 %particle moves off a lattice site, the fluid on that site must be
 721 %reconstructed. Similarly, the fluid on the newly obstructed site must
 722 %be removed. For the same reasons given above, we want to avoid moving
 723 %all the code dealing with this complication to the GPU, whilst
 724 %minimising overheads. Since, on any one timestep, the number of
 725 %lattice sites affected is relatively low, we implement functionality
 726 %to perform partial data transfers of the distribution. Only those
 727 %sites affected are packed into a buffer on the GPU, transferred to the
 728 %CPU and unpacked into the CPU version of the distribution. Then, the
 729 %existing code can be used to update those affected sites, before the
 730 %reverse partial data transfer is performed to update the GPU version
 731 %of data.
 732
 733 It is perhaps interesting at this point to make some more general
 734 observations on the software engineering challenge presented when
 735 extending an existing CPU code to the GPU. The already complex
 736 task of maintaining the code in a portable fashion while also
 737 maintaining performance is currently formidable. To help this process,
 738 we have followed a number of basic principles. First,
 739 in order to port to the GPU in an incremental fashion,
 740 we have tried to maintain the modular structure of the CPU where
 741 possible. For each data structure, such as the distribution, a separate
 742 analogue is maintained in both the CPU and GPU memory spaces. However,
 743 the GPU copy does not include the complete CPU structure: in
 744 particular, nonintrinsic datatypes such as MPI datatypes are not
 745 required on the GPU. Functions to marshal data between CPU and GPU
 746 are provided for each data structure, abstracting the underlying
 747 CUDA implementation. (This reasonably lightweight abstraction layer
 748 could also allow an OpenCL implementation to be developed.)
 749 This makes it easy to switch between the CPU and GPU for different
 750 components in the code, which is useful in development
 751 and testing. GPU functionality can be added incrementally while
 752 retaining a code that runs correctly (albeit slowly due to data
 753 transfer overheads).  Once all relevant components are moved to
 754 the GPU, it becomes possible to remove such data transfers and
 755 keep the entire problem resident on the device.
 756
 757 %To ensure that \textit{Ludwig} remains suitable for CPU-based systems,
 758 %we augment rather than modify the original code. All new GPU-related
 759 %functionality are implemented in an additive manner, mirroring the
 760 %modular structure of the original code. The new GPU modules have
 761 %interfaces similar to the original CPU functions, and the ability to
 762 %switch between versions is achieved at compile-time via C
 763 %pre-processing directives.  All GPU compute kernels are implemented using
 764 %CUDA.  Wrapper routines are developed to specify the
 765 %decomposition, invoke the CUDA kernels and perform the necessary data
 766 %management operations: the latter are facilitated through newly
 767 %developed modules for creation and destruction of device memory
 768 %structures, the copying of data between host and device, and where
 769 %necessary the packing and buffering of data.
 770
 771
 772
 773
 774 \section{Summary}
 775 \label{ch14:sec:summary}
 776
 777 We have described the steps taken to implement, on both single and
 778 multiple GPUs, the \textit{Ludwig} code which was originally
 779 designed for complex fluid problems on a conventional CPU
 780 architecture. We have added the necessary functionality using
 781 NVIDIA CUDA and discussed the important changes to the main
 782 data structures. By retaining domain decomposition and message
 783 passing via MPI, we have demonstrated it is possible to scale
 784 complex fluid problems to large numbers of GPUs in parallel.
 785 By following the key design criteria of maximizing parallelism
 786 and minimizing host-device data transfers, we have confirmed that
 787 the mixed MPI-GPU approach is viable for scientific applications.
 788 For the more intricate problem of moving solid particles, we find
 789 it is possible to retain the more serial elements related to
 790 particle link operations on the CPU, while offloading only the
 791 parallel lattice-based operations involving the LB distribution
 792 to the GPU. Again, this minimizes host-device movement of data.
 793
 794 From the software engineering viewpoint, some duplication of code
 795 to allow efficient implementation on both host and device is
 796 currently required. This issue might be addressed by approaches
 797 such as automatic kernel generation, but may also be addressed
 798 naturally in time as GPU and CPU hardware converge. Portable
 799 abstractions and APIs, perhaps based on approaches such as OpenCL,
 800 will also facilitate the development and maintenance of portable
 801 codes which also exhibit portable performance (perhaps in conjunction
 802 with automatic tuning approaches).
 803
 804 So, while the challenges in designing portable and efficient scientific
 805 applications remain very real, this work provides some hope that large
 806 clusters of GPU machines can be used effectively for a wide range of
 807 complex fluid problems.
 808
 809
 810 %The \textit{Ludwig} LB application, which specializes
 811 %in complex fluid problems, has undergone the developments required to
 812 %use large-scale parallel GPU accelerated supercomputers effectively.
 813 %The new functionality augments the original code, so that
 814 %\textit{Ludwig} can run on either CPU-based or GPU-accelerated systems.
 815 %Using the NVIDIA CUDA programming language, all the computational
 816 %kernels were offloaded to the GPU. This included not only those
 817 %responsible for the majority of the compute time, but any that
 818 %accessed the relevant data structures. This allows data to be kept
 819 %resident on the GPU and avoids expensive data transfers. We described
 820 %steps taken to optimise the performance on the GPU architecture by
 821 %reducing off-chip memory accesses and restructuring the data layout
 822 %in memory to ensure the available memory bandwidth was exploited fully.
 823
 824 %A new halo-exchange communication phase for the code was developed to
 825 %allow efficient parallel scaling to many GPUs. The CPU version
 826 %relies on MPI datatype functionality for CPU to CPU communication; we
 827 %have explicitly written buffer packing/unpacking and transfer
 828 %operations to for GPU to GPU data exchange, staged through the host
 829 %CPUs. This includes the filtering of velocity components to only
 830 %transfer those required in a specific direction, hence reducing the
 831 %overall volume of data transferred. We used CUDA stream functionality
 832 %to overlap separate stages within the communication phase, and reduce
 833 %the overall communication time. The effect was measured to be more
 834 %profound for smaller local system sizes, i.e. it is especially useful
 835 %in aiding strong scaling.
 836
 837 %We described work to enable colloidal particles in the simulation with
 838 %minimal overhead. We implemented these in such a way that we avoid a
 839 %major diversion of the CPU and GPU codebases, whilst minimising data
 840 %transfers between the CPU and GPU at each timestep. We keep the
 841 %majority of the (relatively inexpensive) particle related code on the
 842 %CPU, while offloading only those parts responsible for the interaction
 843 %with the fluid to the GPU (where the fluid data resides).
 844
 845 %Our resulting package is able to exploit the largest of GPU
 846 %accelerated architectures to address a wide range of complex problems.
 847
 848
 849
 850 % An example of a floating figure using the graphicx package.
 851 % Note that \label must occur AFTER (or within) \caption.
 852 % For figures, \caption should occur after the \includegraphics.
 853 % Note that IEEEtran v1.7 and later has special internal code that
 854 % is designed to preserve the operation of \label within \caption
 855 % even when the captionsoff option is in effect. However, because
 856 % of issues like this, it may be the safest practice to put all your
 857 % \label just after \caption rather than within \caption{}.
 858 %
 859 % Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
 860 % option should be used if it is desired that the figures are to be
 861 % displayed while in draft mode.
 862 %
 863 %\begin{figure}[!t]
 864 %\centering
 865 %\includegraphics[width=2.5in]{myfigure}
 866 % where an .eps filename suffix will be assumed under latex,
 867 % and a .pdf suffix will be assumed for pdflatex; or what has been declared
 868 % via \DeclareGraphicsExtensions.
 869 %\caption{Simulation Results}
 870 %\label{fig_sim}
 871 %\end{figure}
 872
 873 % Note that IEEE typically puts floats only at the top, even when this
 874 % results in a large percentage of a column being occupied by floats.
 875
 876
 877 % An example of a double column floating figure using two subfigures.
 878 % (The subfig.sty package must be loaded for this to work.)
 879 % The subfigure \label commands are set within each subfloat command, the
 880 % \label for the overall figure must come after \caption.
 881 % \hfil must be used as a separator to get equal spacing.
 882 % The subfigure.sty package works much the same way, except \subfigure is
 883 % used instead of \subfloat.
 884 %
 885 %\begin{figure*}[!t]
 886 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
 887 %\label{fig_first_case}}
 888 %\hfil
 889 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
 890 %\label{fig_second_case}}}
 891 %\caption{Simulation results}
 892 %\label{fig_sim}
 893 %\end{figure*}
 894 %
 895 % Note that often IEEE papers with subfigures do not employ subfigure
 896 % captions (using the optional argument to \subfloat), but instead will
 897 % reference/describe all of them (a), (b), etc., within the main caption.
 898
 899
 900 % An example of a floating table. Note that, for IEEE style tables, the
 901 % \caption command should come BEFORE the table. Table text will default to
 902 % \footnotesize as IEEE normally uses this smaller font for tables.
 903 % The \label must come after \caption as always.
 904 %
 905 %\begin{table}[!t]
 906 %% increase table row spacing, adjust to taste
 907 %\renewcommand{\arraystretch}{1.3}
 908 % if using array.sty, it might be a good idea to tweak the value of
 909 % \extrarowheight as needed to properly center the text within the cells
 910 %\caption{An Example of a Table}
 911 %\label{table_example}
 912 %\centering
 913 %% Some packages, such as MDW tools, offer better commands for making tables
 914 %% than the plain LaTeX2e tabular which is used here.
 915 %\begin{tabular}{|c||c|}
 916 %\hline
 917 %One & Two\\
 918 %\hline
 919 %Three & Four\\
 920 %\hline
 921 %\end{tabular}
 922 %\end{table}
 923
 924
 925 % Note that IEEE does not put floats in the very first column - or typically
 926 % anywhere on the first page for that matter. Also, in-text middle ("here")
 927 % positioning is not used. Most IEEE journals/conferences use top floats
 928 % exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
 929 % footnotes above bottom floats. This can be corrected via the \fnbelowfloat
 930 % command of the stfloats package.
 931
 932
 933 % conference papers do not normally have an appendix
 934
 935
 936 % use section* for acknowledgement
 937 \section*{Acknowledgments}
 938 AG was supported by the CRESTA project, which has received
 939 funding from the European Community's Seventh Framework Programme
 940 (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported
 941 by UK EPSRC under grant EV/J007404/1.
 942
 943
 944 % trigger a \newpage just before the given reference
 945 % number - used to balance the columns on the last page
 946 % adjust value as needed - may need to be readjusted if
 947 % the document is modified later
 948 %\IEEEtriggeratref{8}
 949 % The "triggered" command can be changed if desired:
 950 %\IEEEtriggercmd{\enlargethispage{-5in}}
 951
 952 % references section
 953
 954 % can use a bibliography generated by BibTeX as a .bbl file
 955 % BibTeX documentation can be easily obtained at:
 956 % http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
 957 % The IEEEtran BibTeX style support page is at:
 958 % http://www.michaelshell.org/tex/ieeetran/bibtex/
 959 %\bibliographystyle{IEEEtran}
 960 % argument is your BibTeX string definitions and bibliography database(s)
 961 %\bibliography{IEEEabrv,../bib/paper}
 962 %
 963 % <OR> manually copy in the resultant .bbl file
 964 % set second argument of \begin to the number of references
 965 % (used to reserve space for the reference number labels box)
 966
 967 \putbib[Chapters/chapter14/biblio14]
 968 %\begin{thebibliography}{1}
 969
 970 %\bibitem{IEEEhowto:kopka}
 971 %H.~Kopka and P.~W. Daly, \emph{A Guide to \LaTeX}, 3rd~ed.\hskip 1em plus
 972 %  0.5em minus 0.4em\relax Harlow, England: Addison-Wesley, 1999.
 973
 974
 975
 976 %% \bibitem{succi-book}
 977 %% S. Succi, \textit{The lattice Boltzmann equation and beyond},
 978 %% Oxford University Press, Oxford, 2001.
 979
 980 %\bibitem{mpi-standard}
 981 %Message Passing Interface Forum, http://www.mpi-forum.org
 982
 983 %% \bibitem{desplat}
 984 %% Desplat, J.-C., I. Pagonabarraga, and P. Bladon,
 985 %% \textit{LUDWIG: A parallel lattice-Boltzmann code for complex fluids}.
 986 %% Comput. Phys. Comms., \textbf{134}, 273, 2001.
 987
 988 %% \bibitem{aidun2010}
 989 %% C.K. Aidun and J.R. Clausen,
 990 %% \textit{Lattice Boltzmann method for complex flows},
 991 %% Ann. Rev. Fluid Mech., \textbf{42} 439--472 (2010).
 992
 993 %\bibitem{bray1994}
 994 %A.J. Bray,
 995 %\textit{Theory of phase-ordering kinetics},
 996 %Adv. Phys., \textbf{43} 357--459 (1994).
 997
 998 %\bibitem{kendon2001}
 999 %V.M. Kendon, M.E. Cates, I. Pangonabarraga, J.-C. Desplat, and P. Bladon,
1000 %\textit{Inertial effects in the three-dimensional spinodal decomposition of a
1001 %symmetric binary fluid mixture; a lattice Boltzmann study},
1002 %J. Fluid Mech., \textbf{440}, 147--203 (2001).
1003
1004 %\bibitem{swift1996}
1005 %M.R. Swift, E. Orlandini, W.R. Osborn, and J.M. Yeomans,
1006 %\textit{Lattice Boltzmann simulations of liquid-gas and binary
1007 %fluid systems},
1008 %Phys. Rev. E, \textbf{54}, 5041--5052 (1996).
1009
1010 %% \bibitem{stratford2008}
1011 %% K. Stratford and I. Pagonabarraga,
1012 %% Parallel domain decomposition for lattice Boltzmann with moving particles,
1013 %% \textit{Comput. Math. with Applications} \textbf{55}, 1585 (2008).
1014
1015 %\bibitem{xe6}
1016 %Cray XE6 Product Brochure, available from
1017 %http://www.cray.com/Products/XE/Resources.aspx (2010)
1018
1019 %\bibitem{hector} HECToR home page, http://www.hector.ac.uk
1020
1021 %\bibitem{xk6}
1022 %Cray XK6 Product Brochure, available from
1023 %http://www.cray.com/Products/XK6/XK6.aspx (2011)
1024
1025
1026 %% \bibitem{wei2004}
1027 %% X. Wei, W. Li, K. M\"uller, and A.E. Kaufman,
1028 %% \textit{The lattice Boltzmann method for simulating gaseous phenomena},
1029 %% IEEE Transactions on Visualization and Computer Graphics,
1030 %% \textbf{10}, 164--176 (2004).
1031 %% Apparently first LB via GPU; a serious contribution using single fluid
1032 %%d3q19 single relaxation time.
1033
1034 %% \bibitem{zhu2006}
1035 %% H. Zhu, X. Liu, Y. Liu, and E. Wu,
1036 %% \textit{Simulation of miscible binary mixtures based on lattice Boltzmann method},
1037 %% Comp. Anim. Virtual Worlds, \textbf{17}, 403--410 (2006).
1038 %% Single relaxation (MRT mentioned) time d3q19 apparently sound although
1039 %%not a lot of detail. Pre-cuda so graphics code.
1040
1041 %% \bibitem{zhao2007}
1042 %% Y. Zhao,
1043 %% \textit{Lattice Boltzmann based PDE solver on the GPU},
1044 %% Visual Comput., doi 10.1007/s00371-0070191-y (2007).
1045
1046 %% \bibitem{toelke2010}
1047 %% J. T\"olke,
1048 %% Implementation of a lattice Boltzmann kernel using the compute unified
1049 %% device architecture developed by nVIDIA,
1050 %% Comput. Visual Sci. 13 29--39 (2010).
1051
1052 %% \bibitem{fan2004}
1053 %% Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover,
1054 %% \textit{GPU cluster for high performance computing},
1055 %% Proceedings of ACM/IEEE Supercomputing Conference, pp. 47--59,
1056 %% IEEE Computer Society Press, Pittsburgh, PA (2004).
1057
1058 %% \bibitem{myre2011}
1059 %% J. Myre, S.D.C. Walsh, D. Lilja, and M.O. Saar,
1060 %% \textit{Performance analysis of single-phase, multiphase, and multicomponent
1061 %% lattice Boltzmann fluid flow simulations on GPU clusters},
1062 %% Concurrency Computat.: Pract. Exper., \textbf{23}, 332--350 (2011).
1063
1064 %% \bibitem{obrecht2011}
1065 %% C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux,
1066 %% \textit{Multi-GPU implementation of the lattice Boltzmann method},
1067 %% Comput. Math. with Applications,
1068 %% doi:10.1016/j.camwa.2011.02.020 (2011).
1069
1070 %% \bibitem{bernaschi2010}
1071 %% M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras,
1072 %% \textit{A flexible high-performance lattice Boltzmann GPU code for the
1073 %% simulations of fluid flow in complex geometries},
1074 %% Concurrency Computat.: Pract. Exper., \textbf{22}, 1--14 (2010).
1075
1076 %% \bibitem{xian2011}
1077 %% W. Xian and A. Takayuki,
1078 %% \textit{Multi-GPU performance of incompressible flow computation by
1079 %% lattice Boltzmann method on GPU cluster},
1080 %% Parallel Comput., doi:10.1016/j.parco.2011.02.007 (2011).
1081
1082 %% \bibitem{feichtinger2011}
1083 %% C. Feichtinger, J. Habich, H. K\"ostler, G. Hager, U. R\"ude, and
1084 %% G. Wellein,
1085 %% A flexible patch-based lattice Boltzmann parallelization approach
1086 %% for heterogeneous GPU-CPU clusters,
1087 %% \textit{Parallel Computing} \textbf{37} 536--549 (2011).
1088
1089 %% \bibitem{wellein2006}
1090 %% G. Wellein, T. Zeiser, G Hager, and S. Donath,
1091 %% On the single processor performance of simple lattice Boltzmann kernels,
1092 %% \textit{Computers and Fluids}, \textbf{35}, 910--919 (2006).
1093
1094 %% \bibitem{pohl2003}
1095 %% T. Pohl, M. Kowarschik, J. Wilke, K. Igelberger, and U. R\"ude,
1096 %% Optimization and profiling of the cache performance of parallel
1097 %% lattice Boltzmann code,
1098 %% \textit{Parallel Process Lett.} \textit{13} 549--560 (2003).
1099
1100 %% \bibitem{mattila2007}
1101 %% K. Mattila, J. Hyv\"aluoma, T. Rossi, M. Aspn\"as and J. Westerholm,
1102 %% An efficient swap algorithm for the lattice Boltzmann method,
1103 %% \textit{Comput. Phys. Comms.} \textit{176} 200-210 (2007).
1104
1105 %% \bibitem{wittmann2012}
1106 %% M. Wittmann, T. Zeiser, G. Hager, and G. Wellein,
1107 %% Comparison of different propagation steps for lattice Boltzmann methods,
1108 %% \textit{Comput. Math with Appl.} doi:10.1016/j.camwa.2012.05.002 (2012).
1109
1110 %% \bibitem{walshsaar2012}
1111 %% S.D.C. Walsh and M.O. Saar,
1112 %% Developing extensible lattice Boltzmann simulators for general-purpose
1113 %% graphics-processing units,
1114 %% \textit{Comm. Comput. Phys.}, \textbf{13} 867--879 (2013).
1115
1116
1117 %% \bibitem{williams2011}
1118 %% S. Williams, L. Oliker, J. Carter, and J Shalf,
1119 %% Extracting ultra-scale lattice Boltzmann performance via
1120 %% hierarchical and distributed auto-tuning,
1121 %% \textit{Proc. SC2011}.
1122
1123
1124 %% \bibitem{ch14:stratford-jsp2005}
1125 %% K. Stratford, R. Adhikari, I. Pagonabarraga, and J.-C. Desplat,
1126 %% \textit{Lattice Boltzmann for Binary Fluids with Suspended Colloids},
1127 %% J. Stat. Phys. \textbf{121}, 163 (2005).
1128
1129 %% \bibitem{ladd1994}
1130 %% A.J.C. Ladd,
1131 %% Numerical simulations of particle suspensions via a discretized
1132 %% Boltzmann equation. Part 1. Theoretical foundation,
1133 %% \textit{J. Fluid Mech.} \textbf{271} 285--309 (1994);
1134 %% Part II. Numerical results,
1135 %% \textit{ibid.} \textbf{271} 311--339 (1994).
1136
1137
1138 %% \bibitem{nguyen2002}
1139 %% N.-Q. Nguyen and A.J.C. Ladd,
1140 %% Lubrication corrections for lattice Boltzmann simulations of particle
1141 %% suspensions,
1142 %% \textit{Phys. Rev. E} \textbf{66} 046708 (2002).
1143
1144 %% \bibitem{ch14:immersed}
1145 %% C.S. Peskin,
1146 %% Flow patterns around heart valves; a numerical method,
1147 %% \textit{J. Comp. Phys.}, \textbf{10}, 252--271 (1972);
1148 %% C.S. Peskin,
1149 %% The immersed boundary method,
1150 %% \textit{Acta Nummerica} \textbf{11} 479--517 (2002).
1151
1152 %% \bibitem{ch14:immersed-lb}
1153 %% Z.-G. Feng and E.E. Michaelides,
1154 %% The immersed boundary-lattice Boltzmann method for solving
1155 %% fluid-particles interaction problem,
1156 %% \textit{J. Comp. Phys.}, \textbf{195} 602--628 (2004).
1157
1158 %% \bibitem{ch14:spm}
1159 %% Y. Nakayama and R. Yammamoto,
1160 %% Simulation method to resolve hydrodynamic interactions in colloidal
1161 %% dispersions,
1162 %% \textit{Phys. Rev. E}, \textbf{71} 036707 (2005).
1163
1164
1165 %\end{thebibliography}
1166
1167
1168
1169
1170
1171
1172
1173 % LocalWords:  CRESTA EPSRC Succi Desplat Pagonabarraga