BookGPU/Chapters/chapter14/ch14.tex

   1 \chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh, United Kingdom}
   2
   3 \chapter[Ludwig: multiple GPUs for a fluid lattice Boltzmann
   4 application]{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
   5 application}
   6
   7 %\putbib[biblio]
   8
   9
  10 \section{Introduction}
  11 The lattice Boltzmann (LB) method \index{lattice Boltzmann method} (for an overview see, e.g.,
  12 \cite{succi-book}) has become a popular approach to a variety of fluid
  13 dynamics problems.  It provides a way to solve the incompressible
  14 isothermal Navier-Stokes equations and has the attractive features of
  15 being both explicit in time and local in space. This makes the LB
  16 method well suited to parallel computation. Many efficient parallel
  17 implementations of the LB method have been undertaken, typically using
  18 a combination of distributed domain decomposition and the Message
  19 Passing Interface (MPI). However, the potential
  20 performance benefits offered by GPUs has motivated a new ``mixed-mode''
  21 approach to address very large problems. Here, fine-grained
  22 parallelism is implemented on the GPU, while MPI is reserved for
  23 larger scale parallelism.  This mixed mode is of increasing interest
  24 to application programmers at a time when many supercomputing services
  25 are moving
  26 toward clusters of GPU accelerated nodes. The design questions which
  27 arise when developing a lattice Boltzmann code for this type of
  28 heterogeneous system are therefore worth studying. Furthermore, similar
  29 questions also recur in many other types of stencil-based algorithms.
  30
  31 The first applications of LB on GPUs were to achieve fluid-like
  32 effects in computer animation, rather than scientific applications per
  33 se.  These early works include simple fluids~\cite{wei2004}, miscible
  34 two-component flow~\cite{zhu2006}, and various image processing tasks
  35 based on the use of partial differential equations~\cite{zhao2007}.
  36 While these early works used relatively low level graphics APIs, the
  37 first CUDA runtime interface implementation was a two-dimensional
  38 simple fluid problem~\cite{toelke2010}.  Following pioneering work on
  39 clusters of GPUs coupled via MPI to study air
  40 pollution~\cite{fan2004}, more recent work has included mixed OpenMP
  41 and CUDA~\cite{myre2011}, Posix threads and CUDA~\cite{obrecht2011},
  42 and MPI and CUDA for increasingly large GPU clusters
  43 \cite{bernaschi2010,xian2011,feichtinger2011}. The heterogeneous
  44 nature of these systems has also spurred interest in approaches
  45 including automatic code generation \cite{walshsaar2012} and auto-tuning
  46 \cite{williams2011} to aid application performance.
  47
  48 Many of these authors make use of
  49 another attractive feature of LB: the ability to include fixed
  50 solid-fluid boundary conditions as a straightforward addition to
  51 the algorithm to study, for example, flow in
  52 porous media. This points to an important
  53 application area for LB methods: that of complex fluids.
  54 Complex fluids include mixtures, surfactants, liquid crystals,
  55 and particle suspensions, and typically require additional physics
  56 beyond the bare Navier-Stokes equations to provide a full
  57 description~\cite{aidun2010}. The representation of this extra
  58 physics raises additional design questions for the application
  59 programmer. Here, we consider the \textit{Ludwig} code \cite{desplat}\index{Ludwig code},
  60 an LB application developed specifically for complex fluids
  61 (\textit{Ludwig} was named for Boltzmann, 1844--1906).
  62 We will present the steps
  63 required to allow \textit{Ludwig} to exploit efficiently both a single
  64 GPU and many GPUs in parallel.  We show that
  65 \textit{Ludwig} scales excellently to at least the one thousand GPU level
  66 (the largest resource available at the time of writing) with
  67 indications that the code will scale to much larger systems as
  68 they become available. In addition, we consider the steps required
  69 to represent fully-resolved moving solid particles (usually referred
  70 to as colloids in the context of complex fluids). Such particles
  71 need to have their surface resolved on the lattice scale in order
  72 to include relevant surface physics and must be able to move, e.g.,
  73 to execute Brownian motion in response to random forces from the
  74 fluid. Standard methods are available to represent such particles
  75 (e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
  76 domain decomposition and message passing \cite{stratford2008}.
  77 We describe below the additional considerations which arise
  78 for such moving particles when developing an
  79 implementation on a GPU.
  80
  81 In the following section we provide a brief overview of the lattice
  82 Boltzmann method, and mention some of the general issues which can
  83 influence performance in the CPU context.
  84 In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
  85 which are required to exploit the GPU architecture effectively and
  86 highlight how the resulting code differs from the CPU version. In
  87 Section \ref{ch14:sec:parallelgpu}, we extend this description to
  88 include the steps required to allow exploitation of many GPUs in
  89 parallel while retaining effective scaling. We also present results for
  90 a typical benchmark for a fluid-only problem in Section
  91 \ref{ch14:sec:parallelgpu} to demonstrate the success of the approach.
  92 In Section \ref{ch14:sec:particles}, we describe the design choices
  93 which are relevant to a GPU implementation of moving particles. Finally,
  94 we include a number of general observations on software
  95 engineering and maintenance which arise from our experience.
  96 A summary is provided in Section~\ref{ch14:sec:summary}.
  97
  98
  99 \section{Background}\label{ch14:sec:background}
 100
 101 %\begin{figure}[!t]
 102 %\centering
 103 %\includegraphics[width=12cm]{Chapters/chapter14/figures/basiclattice}
 104 %\caption{The lattice Boltzmann approach discritises space into a 3D lattice (left). The fluid is represented at each lattice site (right), by the {\it distribution} vector: each component corresponding to a specific velocity direction. Here we are interested in the {\it D3Q19} model, in which there are 3 spatial directions and 19 velocity components (including a zero component).}
 105 %\label{ch14:fig:basiclattice}
 106 %\end{figure}
 107
 108
 109 For a general complex fluid problem the starting point is the
 110 fluid velocity field $\mathbf{u}(\mathbf{r})$, whose evolution
 111 obeys the Navier-Stokes equations describing the conservation of mass
 112 (or density $\rho$), and momentum:
 113 \begin{equation}
 114 \rho  [ \partial_t \mathbf{u} + (\mathbf{u}.\mathbf{\nabla})\mathbf{u} ]
 115 = -\nabla p + \eta \nabla^2 \mathbf{u} + \mathbf{f}(\mathbf{r}),
 116 \end{equation}
 117 where $p$ is the isotropic pressure and $\eta$ is the viscosity.
 118 A local force $\mathbf{f}(\mathbf{r})$ provides a means for coupling
 119 to other complex fluid constituents, e.g., it might represent the force
 120 exerted on the fluid by a curved interface between different phases or
 121 components.
 122
 123 The LB approach makes use of a regular three-dimensional
 124 lattice (see Figure \ref{ch14:fig:decomphalo}) with discrete spacing
 125 $\Delta r$. It also makes use of a
 126 discrete velocity space $\mathbf{c}_i$, where the $\mathbf{c}_i$
 127 are chosen to capture the correct symmetries of the Navier-Stokes
 128 equations. A typical choice, used here, is the so-called D3Q19
 129 basis in three dimensions where there is one velocity such that
 130 $\mathbf{c} \Delta t$ is zero, along with six extending to the nearest
 131 neighbor
 132 lattice sites, and twelve extending to the next-nearest neighbor sites
 133 ($\Delta t$ being the discrete time step). The fundamental object
 134 in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
 135 moments are related to the local hydrodynamic quantities: the fluid
 136 density, momentum, and stress. The time evolution of the distribution
 137 function is described by a discrete Boltzmann equation
 138 \begin{equation}
 139 f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t)
 140 = - {\cal L}_{ij} f_j(\mathbf{r};t).
 141 \end{equation}
 142 It is convenient to think of this in two stages. First, the right-hand
 143 side represents the action of a collision operator ${\cal L}_{ij}$,
 144 which is local to each lattice site and relaxes the distribution toward
 145 a local equilibrium at a rate ultimately related to the fluid viscosity.
 146 Second, the left-hand side represents a propagation step (sometimes referred
 147 to as streaming step), in which each element $i$ of the distribution is
 148 displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
 149 appropriate direction per discrete time step.
 150
 151 More specifically, we store a vector of 19 double-precision floating
 152 point values at each lattice site for the distribution function
 153 $f_i(\mathbf{r};t)$.
 154 The collision operation ${\cal L}_{ij}$, which is local at each lattice
 155 site, may be thought of as follows. A matrix-vector multiplication
 156 ${\cal M}_{ij}f_j$ is used to transform the distributions into the
 157 hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
 158 matrix related to the choice of
 159 $\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
 160 relaxed toward their (known) equilibrium values and are transformed
 161 back to new post-collision distributions via the inverse transformation
 162 ${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$
 163 floating point multiplications per lattice site. (Additional operations are
 164 required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
 165 In contrast, the
 166 propagation stage consists solely of moving the distribution values
 167 between lattice sites and involves no floating point operations.
 168
 169
 170 \begin{figure}[tbp]
 171 \centering
 172 \includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
 173 \caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
 174   clarity we show a 2D decomposition of a 3D lattice, but in practice
 175   we decompose in all 3 dimensions. Halo cells are added to each
 176   subdomain (as shown on the upper right for a single slice) which store
 177   data retrieved from remote neighbors in the halo exchange. Lower
 178   right: the D3Q19 velocity set resident on a lattice site;
 179   highlighted are the 5 ``outgoing'' elements to be transferred in a
 180   specific direction.}
 181 \label{ch14:fig:decomphalo}
 182 \end{figure}
 183
 184
 185 In the CPU version, the \textit{Ludwig} implementation stores one time
 186 level of distribution values $f_i(\mathbf{r}; t)$. This distribution
 187 is stored as a flat array of C data type \texttt{double} and laid out
 188 so that all elements of the velocity are contiguous at a given site
 189 (often referred to as ``array-of-structures'' order). This is
 190 appropriate for sums over the distribution required locally at the
 191 collision stage as illustrated schematically in
 192 Listing~\ref{ch14:listing1}: the fact that consecutive loads are from
 193 consecutive memory addresses allows the prefetcher to engage fully.
 194 (The temporary scalar \texttt{a\_tmp} allows caching of the
 195 intermediate accumulated value in the innermost loop.)  A variety of
 196 standard sequential optimizations are relevant for the collision stage
 197 (loop unrolling, inclusion of SIMD operations, and so on
 198 \cite{wellein2006}).  For example, the collision stage in
 199 \textit{Ludwig} includes explicit SIMD code, which is useful if the
 200 compiler on a given platform cannot identify the appropriate vectorization.
 201 The propagation
 202 stage is separate and is organized as a ``pull'' (again, various
 203 optimizations have been considered, e.g.,
 204 \cite{pohl2003,mattila2007,wittmann2012}).  No further optimization is
 205 done here beyond ensuring that the ordering of the discrete velocities
 206 allows memory access to be as efficient as possible. While these
 207 optimizations are important, it should be remembered that for some
 208 complex fluid problems, the hydrodynamics embodied in the LB
 209 calculation is a relatively small part of the total computational cost
 210 (at the 10\% level in some cases).  This means optimization effort may
 211 be better concentrated elsewhere.
 212
 213
 214
 215 \begin{lstlisting}[float, label=ch14:listing1,
 216 caption = collision schematic for CPU.]
 217 /* loop over lattice sites */
 218 for (is = 0; is < nsites; is++) {
 219   ...
 220   /* load distribution from ``array-of-structures'' */
 221   for (i = 0; i < 19; i++)
 222     f[i] = f_aos[19*is + i]
 223   ...
 224   /* perform matrix-vector multiplication */
 225   for (i = 0; i < 19; i++) {
 226     a_tmp = 0.0;
 227     for (j = 0; j < 19; j++) {
 228       a_tmp += f[j]*M[i][j];
 229     }
 230     a[i] = a_tmp;
 231   }
 232   ...
 233 }
 234 \end{lstlisting}
 235
 236
 237
 238
 239
 240 The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
 241 Each local subdomain is surrounded by a halo, or ghost, region of width one
 242 lattice site. While the collision is local, elements of the distribution
 243 must be exchanged at the edges of the domains to facilitate the propagation.
 244 To achieve the full 3D halo exchange, the standard approach of shifting the
 245 relevant data in each coordinate direction in turn is adopted. This
 246 requires appropriate synchronization, i.e., a receive in the the first
 247 coordinate direction must be complete before a send in the second direction
 248 involving relevant data can take place, and so on. We note that only
 249 ``outgoing'' elements of the distribution need to be sent at each edge.
 250 For the D3Q19 model, this reduces the volume of data traffic from 19 to
 251 5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
 252 version, the necessary transfers are implemented in place using
 253 a vector of MPI datatypes with appropriate stride for each direction.
 254
 255
 256 \section{Single GPU implementation}\label{ch14:sec:singlegpu}
 257
 258
 259 In this section we describe the steps taken to enable \textit{Ludwig}
 260 for the GPU. There are a number of crucial issues: first, the
 261 minimization of data traffic between host and device; second, the
 262 optimal mapping of available parallelism onto the architecture; and
 263 third, the issue of memory coalescing. We discuss each of these in
 264 turn.
 265
 266 While the most important section of the LB in terms of
 267 floating-point performance is the collision stage, this cannot be the
 268 only consideration for a GPU implementation. It is essential to
 269 offload all computational activity which involves the main data
 270 structures (such as the distribution) to the GPU. This includes
 271 kernels with relatively low computational demand, such as the
 272 propagation stage. All relevant data then remain resident on the GPU,
 273 to avoid expensive host-device data transfer at each iteration of
 274 the algorithm. Such transfers would negate any benefit of GPU
 275 acceleration.  We note that for a complex fluid code, this requirement
 276 can extend to a considerable number of kernels, although we limit the
 277 discussion to collision and propagation for brevity.
 278
 279 To achieve optimal performance, it is vital to exploit fully the
 280 parallelism inherent in the GPU architecture, particularly for
 281 those matrix-vector operations within the collision kernel.
 282 The GPU architecture features a hierarchy of parallelism. At the
 283 lowest level, groups of 32 threads (warps) operate in
 284 lock-step on different data elements: this is SIMD-style vector-level
 285 parallelism. Multiple warps are combined into a thread block (in which
 286 communication and synchronization are possible), and multiple blocks can
 287 run concurrently across the streaming multiprocessors in the GPU
 288 (with no communication or
 289 synchronization possible across blocks).  To decompose on the GPU, we
 290 must choose which part of the collision to assign to which level of
 291 parallelism.
 292
 293 While there exists parallelism within the matrix-vector operation, the
 294 length of each vector (here, 19) is smaller than the warp size and typical
 295 thread block sizes. So we simply decompose the loop over lattice sites
 296 to all levels of parallelism, i.e., we use a separate CUDA thread for
 297 every lattice site, and each thread performs the full matrix-vector
 298 operation. We find that a block size of 256 performs well on current
 299 devices: we therefore decompose
 300 the lattice into groups of 256 sites and assign each group to a block
 301 of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
 302 assigned to the fast {\it constant} on-chip device memory.
 303
 304 For the propagation stage, the GPU implementation adds a second time
 305 level of distribution values. The data dependencies inherent in the
 306 propagation mean that the in-place propagation of the CPU version
 307 cannot be parallelized effectively without the additional time level.
 308 As both time levels may remain resident on the GPU, this is not a
 309 significant overhead.
 310
 311 An architectural
 312 constraint of GPUs means that optimal global memory bandwidth
 313 is only achieved when data are structured such that threads within a
 314 {\it half-warp} (a group of 16 threads) load data from the same memory
 315 segment in a single transaction: this is memory coalescing. It can be
 316 seen that the array-of-structures ordering used for the distribution
 317 in the CPU code would not be suitable for coalescing; in fact, it would
 318 result in serialized memory accesses and relative poor performance.
 319 To meet the coalescing criteria and allow consecutive threads to read
 320 consecutive memory addresses on the GPU, we transpose the layout of the
 321 distribution so that, for each velocity component, consecutive sites
 322 are contiguous in memory (``structure-of-arrays'' order). A schematic of
 323 the GPU collision code is given in Listing~\ref{ch14:listing2}.
 324
 325 \begin{lstlisting}[float, label=ch14:listing2,
 326 caption = collision schematic for GPU.]
 327 /* compute current site index 'is' from CUDA thread and block */
 328 /* indices and load distribution from "structure-of-arrays" */
 329
 330 for (i = 0; i < 19; i++)
 331   f[i] = f_soa[nsites*i + is]
 332
 333 /* perform matrix-vector multiplication as before */
 334 ...
 335 \end{lstlisting}
 336
 337
 338 \textit{Ludwig} was modified to allow a choice of distribution data layout
 339 at compilation time depending on the target architecture: CPU or GPU. We
 340 defer some further comments on software engineering aspects of the code to
 341 the summary.
 342
 343
 344 \section{Multiple GPU implementation}\label{ch14:sec:parallelgpu}
 345
 346 To execute on multiple GPUs, we use the same domain decomposition
 347 and message passing framework as the CPU version. Within each
 348 subdomain (allocated to one MPI task) the GPU implementation
 349 proceeds as described in the previous section. The only additional
 350 complication is that halo transfers between GPUs must be staged
 351 through the host (in the future, direct GPU-to-GPU data transfers via
 352 MPI may be possible, obviating the need for these steps). This
 353 means host MPI sends must be preceded by appropriate device-to-host
 354 transfers and host MPI receives must be followed by corresponding
 355 host-to-device transfers.
 356
 357 %To execute on multiple GPUs in parallel we decompose the problem using
 358 %the existing high-level parallelisaton framework: each MPI task acts
 359 %as a host to a separate GPU, with each GPU kernel operating on the
 360 %subset of the lattice volume corresponding to the partition assigned
 361 %to that task.  As described in Section \ref{ch14:sec:background}, halo
 362 %data exchanges are required at each timestep.  Significant
 363 %developments are made for the GPU adaptation of this communication
 364 %phase to achieve good inter-GPU communication performance and allow
 365 %scaling to many GPUs.
 366
 367 %The CPU code uses strided MPI datatypes to define the six
 368 %planes of lattice data to be sent or received, and performs the
 369 %communications {\it in-place}, i.e.  there is no explicit buffering of
 370 %data in the application.  For the GPU adaptation, the halo planes must
 371 %be exchanged between GPUs, on which MPI functionality is not
 372 %available, so transfers must be staged through the host CPUs. Buffer
 373 %packing/unpacking of the halo planes is explicitly performed using
 374 %CUDA kernels, and CUDA memory copy calls are used to transfer data
 375 %from a device to its host (using pinned host memory) through the PCI-e
 376 %bus (and vice versa). MPI calls are used to transfer the buffers
 377 %between hosts. The CPU version of Ludwig has a mechanism, using MPI built-in
 378 %data types, to reduce the amount of data communicated by around a
 379 %factor of four by only sending those elements of the distribution
 380 %propagating outward from a local domain. For the GPU adaptation, this
 381 %filtering of data is explicitly performed (again since MPI
 382 %functionality is not available on the GPU) in the buffer
 383 %packing/unpacking kernels.  Special care had to be taken for those
 384 %lattice sites on the corners of the halo planes, which had to be sent
 385 %in more than one direction (since the propagation includes accesses to
 386 %diagonal neighbours).
 387
 388 In practice, this data movement requires additional GPU kernels to
 389 pack and unpack the relevant data before and after corresponding MPI
 390 calls. However, the standard shift algorithm, in which each coordinate
 391 direction is treated in turn, does provide some scope for the
 392 overlapping of different operations.
 393 For example, after the data for the first coordinate direction have been
 394 retrieved by the host, these can be exchanged using MPI between
 395 hosts at the same time as kernels for packing and retrieving of
 396 data for the second coordinate direction are
 397 executed. This
 398 overlapping must respect the synchronization required to ensure
 399 that data values at the corners of the subdomain are transferred
 400 correctly.
 401 We use a separate CUDA stream for each coordinate direction:
 402 this allows some of the host-device communication time to be
 403 effectively ``hidden'' behind the host-host MPI communication,
 404 resulting in an overall speedup.
 405 The improvement is more pronounced for the smaller local lattice size,
 406 perhaps because of less CPU memory bandwidth contention. The
 407 overlapping is then a particularly valuable aid to strong scaling,
 408 where the local system size decreases as the number of GPUs increases.
 409
 410
 411 %\section{Performance}\label{ch14:sec:performance}
 412
 413 To demonstrate the effectiveness of our approach, we compare the
 414 performance of both CPU and GPU versions of \textit{Ludwig}. To
 415 test the complex fluid nature of the code, the problem is
 416 actually an immiscible fluid mixture which uses a second distribution
 417 function to introduce a composition variable. The interested reader
 418 is referred to \cite{ch14:stratford-jsp2005} for further details. The
 419 largest total problem size used is $2548\times 1764\times 1568$.
 420 The CPU system features a Cray XE6 architecture with 2 16-core AMD Opteron
 421 CPUs per node, and with nodes interconnected using Cray Gemini technology.
 422 For GPU results, a Cray XK6 system is used: this is very similar to the
 423 XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
 424 node in the GPU system therefore features a single Opteron CPU acting
 425 as a host to a single GPU. The internode interconnect architecture is
 426 the same as for the Cray XE6. The GPU performance tests use a prototype
 427 Cray XK6 system with 936 nodes (the largest available at the time of writing).
 428 To provide a fair comparison, we compare scaling on a \textit{per node}
 429 basis. That is, we compare 1 fully occupied 32-core CPU node
 430 (running 32 MPI tasks) with 1 GPU node (host running 1 MPI task,
 431 and 1 device). We believe this is representative of the true ``cost'' of
 432 a simulation in terms of accounting budgets, or electricity.
 433
 434
 435 %\begin{figure}[!t]
 436 %\centering
 437 %\includegraphics[width=10cm]{Chapters/chapter14/figures/length_scale}
 438 %\caption{The dependence of the characteristic length scale of a binary
 439 %  fluid mixture on the simulation timestep number. Superimposed are
 440 %  three snapshots showing increased mixture separation with time:
 441 %  these are obtained through the order parameter and each show one
 442 %  eighth of the total simulation size, which is $1548\times 1764\times 1568$. }\label{ch14:fig:length_scale}
 443 %\end{figure}
 444
 445 \begin{figure}[!t]
 446 \centering
 447 \includegraphics[width=10cm]{Chapters/chapter14/figures/two_graphs_ud}
 448 \caption[The weak (top) and strong (bottom) scaling of \textit{Ludwig}.]{The weak (top) and strong (bottom) scaling of \textit{Ludwig}.  Closed
 449   shapes denote results using the CPU version run on the Cray
 450   XE6 (using two 16-core AMD Interlagos CPUs per node), while open
 451   shapes denote results using the GPU version on the Cray XK6 (using a
 452   single NVIDIA X2090 GPU per node). Top: the benchmark time is shown
 453   where the problem size per node is constant. Bottom: performance is
 454   shown where, for each of the four problem sizes presented, the
 455   results are scaled by the lattice volume, and all results are
 456   normalized by the same arbitrary constant.  }
 457 \label{ch14:fig:scaling}
 458 \end{figure}
 459
 460
 461 %The top of Figure \ref{ch14:fig:scaling} shows the benchmark time,
 462 %where the problem size is increased with the number of nodes. This
 463 %shows weak scaling: the time taken by a perfectly scaling code would
 464 %remain constant. The figure plots the time against the number of {\it
 465 %  nodes}: for the GPU version we utilize the one GPU per Cray XK6 node
 466 %whilst for the CPU version we fully utilize two 16-core CPUs
 467 %per Cray XE6 node, i.e. one GPU is compared with 32 CPU cores.  We use
 468 %a single MPI task per Cray XK6 node to host the GPU, and 32 MPI tasks
 469 %to fully utilize the CPUs in the Cray XE6 node.  Note that our CPU
 470 %version is highly optimised: in particular it has been tuned for
 471 %optimal utilisation of the SIMD vector units in each CPU core.
 472
 473 Figure \ref{ch14:fig:scaling} shows the results of both weak and
 474 strong scaling tests (top and bottom panels, respectively). For
 475 weak scaling, where the local subdomain size is kept fixed
 476 (here a relatively large 196$^3$ lattice), the time taken by
 477 an ideal code would remain constant. It can be seen that while
 478 scaling of the CPU version shows a pronounced upward slope, the
 479 GPU version scales almost perfectly. The advantage of the GPU
 480 version over the CPU version is the fact that the GPU version
 481 has a factor of 32 fewer MPI tasks per node, so communications
 482 require a significantly smaller number of larger data transfers.
 483 The performance
 484 advantage of the GPU version ranges from a factor of around 1.5 to
 485 around 1.8. Careful studies by other authors \cite{williams2011} have
 486 found the absolute performance (in terms of floating-point
 487 operations per second, or floating-point operations per Watt) to
 488 be remarkably similar between architectures.
 489
 490 Perhaps of more interest is the strong scaling picture (lower
 491 panel in Figure \ref{ch14:fig:scaling}) where the performance
 492 as a function of the number of nodes is measured for fixed problem
 493 size. We consider four different fixed problem sizes on both CPU
 494 (up to 512 nodes shown) and GPU (up to 768 nodes). To allow comparison,
 495 the results are scaled by total system size in each case. For strong
 496 scaling, the disparity in the number of MPI tasks is clearly revealed
 497 in the failure of the CPU version to provide any significant benefit
 498 beyond a modest number of nodes as communication overheads dominate.
 499 In contrast, the GPU version shows
 500 reasonably robust scaling even for the smaller system sizes and good
 501 scaling for the larger systems.
 502
 503 \section{Moving solid particles}\label{ch14:sec:particles}
 504
 505 \begin{figure}[t]
 506 \centering
 507 %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
 508 \includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new_gray}
 509 \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
 510 A two-dimensional schematic picture of spherical particles on the lattice.
 511 Left: a particle is allowed
 512 to move continuously across the lattice, and the position of the
 513 surface defines fluid lattice sites (gray) and solid lattice
 514 sites (black). The discrete surface is defined by links
 515 where propagation would intersect the surface (arrows). Note the
 516 discrete shape of the two particles is different. Right: post-collision
 517 distributions are reversed at the surface by the process of bounce-back
 518 on links, which replaces the propagation.
 519 %A 2D illustration of colloidal particles (grey circles)
 520 %  interacting with the lattice, where each lattice point is
 521 %  represented by a cross. Left: each particle spans multiple sites.
 522 %  Right: the particle-facing distribution components are moved inside
 523 %  the particle with the direction reversed, such that the ``bounce
 524 %  back'' will be completed when the fluid propagates.
 525 }
 526 \label{ch14:fig:bbl}
 527 \end{figure}
 528
 529
 530 %\begin{figure}[t]
 531 %\centering
 532 %\includegraphics[width=6cm]{Chapters/chapter14/figures/particlemove}
 533 %\caption{A 2D illustration of a colloidal particle moving across the
 534 %    lattice. The old and new positions are represented by the open and
 535 %    filled circles respectively.}
 536 %\label{ch14:fig:particlemove}
 537 %\end{figure}
 538
 539 % BEING NEW SOLID TEXT
 540
 541 The introduction of moving solid particles poses an additional hurdle
 542 to efficient GPU implementation of an LB code such as \textit{Ludwig}.
 543 In this section, we give a brief overview of the essential features
 544 of the CPU implementation, and how the considerations raised in the
 545 previous sections---maximizing parallelism and minimising both
 546 host-to-device and GPU-to-GPU data movement---shape the design decisions
 547 for a GPU implementation. We restrict our discussion to a simple fluid
 548 in this section; additional solid-fluid boundary conditions (e.g.,
 549 wetting at a fluid-fluid-solid contact line) usually arise elsewhere
 550 in the calculation and are broadly independent of the hydrodynamic boundary
 551 conditions which are described in what follows.
 552
 553 Moving solid particles (here, spheres) are defined by a center position
 554 which is allowed to move continuously across the space of the lattice,
 555 and a fixed radius which is typically a few lattice spacings $\Delta r$.
 556 The surface of the particle is defined by a series of \textit{links}
 557 where a discrete velocity propagation $\mathbf{c}_i \Delta t$ would
 558 intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
 559 Hydrodynamic boundary conditions are then implemented via the standard
 560 approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
 561 relevant post-collision distribution values are reversed at the
 562 propagation stage with an appropriate correction to allow for the
 563 solid body motion. The exchange of momentum at each link must then be
 564 accumulated around the entire particle surface to provide the net
 565 hydrodynamic force
 566 and torque on the sphere. The particle motion can then be updated in a
 567 molecular dynamics-like step.
 568
 569 For the CPU implementation, a number of additional MPI communications
 570 are required: (1) to exchange the center position, radius, etc.\ of each
 571 particle as it moves and (2) to allow the accumulation of the force
 572 and torque for each particle (the links of which may be distributed
 573 between up to 8 MPI tasks in three dimensions). Appropriate marshaling
 574 of these data can provide an effective parallelization for relatively
 575 dense particle suspensions of up to  40\% solid volume fraction
 576 \cite{stratford2008}. As a final consideration, fluid distributions
 577 must be removed and replaced consistently as a given particle moves
 578 across the lattice and changes its discrete shape.
 579
 580 With these features in mind, we can identify a number of competing
 581 concerns which are relevant to a GPU implementation:
 582 \begin{enumerate}
 583 \item
 584 Minimization of host-device data transfer would argue for moving the
 585 entire particle code to the GPU. However, the code in question involves
 586 largely conditional logic (e.g., identifying cut surface links) and
 587 irregular memory accesses (e.g., access to distribution elements around
 588 a spherical particle). These operations seem poorly suited to effective
 589 parallelization on the GPU. As an additional complication, the sums
 590 required over the particle surface would involve potentially tricky
 591 and inefficient reductions in GPU memory.
 592 \item
 593 The alternative is to retain the relevant code on the CPU. While the
 594 transfer of the entire distribution $f_i(\mathbf{r};t)$ between host
 595 and device at each time step is unconscionable owing to PCIe bus bandwidth
 596 considerations, the transfer of only relevant distribution information
 597 to allow bounce-back on links is possible. At modest solid volumes
 598 only a very small fraction of the distribution data is
 599 involved in bounce-back (e.g., a 2\% solid volume fraction of particles
 600 radius 2.3$\Delta r$ would involve approximately 1\% of the distribution
 601 data). This option also has the advantage that no further host-device
 602 data transfers are necessary to allow the MPI exchanges required for
 603 particle information.
 604 \end{enumerate}
 605
 606
 607 We have implemented the second option as follows. For each subdomain,
 608 a list of boundary-cutting links is assembled on the CPU; the list includes
 609 the identity of the relevant element of the distribution in each case.
 610 This list,
 611 together with the particle information required to compute the correct
 612 bounce-back term, is transferred to the GPU. The updates to the relevant
 613 elements of the distribution can then take place on the GPU. The
 614 corresponding information to compute the update of the particle dynamics
 615 is returned
 616 to the CPU, where the reduction over the surface links is computed.
 617 The change of particle shape may be dealt with in a similar manner:
 618 the relatively small number of updates required at any one time step
 619 (or however frequently the particle position is updated) can be
 620 marshaled to the GPU as necessary. This preliminary implementation
 621 is found to be effective on problems involving up to 4\%
 622 solid volume fraction.
 623
 624 We note that the CPU version actually avoids the collision calculation
 625 at solid lattice points by consulting a look-up table of solid/fluid
 626 status. On the GPU, it is perhaps preferable to perform the collision
 627 stage everywhere instead of moving the look-up table to the GPU and
 628 introducing the associated logic.
 629 Ultimately, the GPU might favor other boundary methods which treat solid and
 630 fluid on a somewhat more equal basis, for example, the immersed boundary
 631 method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method
 632 \cite{ch14:spm}.
 633 However, the approach adopted here  allows us to exploit
 634 the GPU for the intensive fluid simulation while maintaining the complex
 635 code required for particles on the CPU. Overheads of CPU-GPU transfer are
 636 minimized by transferring only those data relevant to the hydrodynamic
 637 interaction implemented via bounce-back on links.
 638
 639 % END NEW SOLID TEXT
 640
 641 %It is of interest, in many situations, to include colloidal particles
 642 %in the simulation **more from Kevin***. These are by represented by
 643 %moving spheres that interact with the fluid through transfer of
 644 %momenta: fluid coming into contact with a particle will bounce off it,
 645 %giving the particle a kick in the process. As illustrated on the left
 646 %of Figure \ref{ch14:fig:bbl} (which shows 2 dimensions only for
 647 %clarity - in practice, the particles will be represented by spheres on
 648 %a 3D lattice, and will be much larger relative to the lattice
 649 %spacing), typically each particle spans multiple lattice sites, and
 650 %therefore ``intercepts'' multiple inter-site links. Ludwig stores
 651 %particle information using nested linked lists: a list of particles,
 652 %each with a list of all the links intercepted by that particle (plus
 653 %other quantities). The fluid-particle interaction proceeds using the
 654 %so-called {\it bounce back on links} procedure during each timestep. As
 655 %illustrated in the right of Figure \ref{ch14:fig:bbl}, for each of the
 656 %intercepting links, those velocity components of the distribution that
 657 %face inwards to each particle are moved from the site
 658 %outside the particle to the site inside the particle, and
 659 %reversed in direction. This is done before the propagation stage,
 660 %which then propagates these same velocity components back out of the
 661 %particles: the velocity bounces back from the particles. The
 662 %momenta transfer from fluid to particles is also determined
 663 %from the distribution values.
 664
 665 %A simplified schematic of the implementation on the CPU is as follows,
 666 %noting that multiplicative coefficients are required:
 667 %{\footnotesize
 668 %\begin{verbatim}
 669 %For each particle:
 670 %  For each link:
 671 %    Calculate coefficient
 672 %    Bounce back fluid using coefficient
 673 %    Update particle momentum
 674 %\end{verbatim}
 675 %}
 676 %This process is relatively inexpensive on the CPU, since the total
 677 %number of particles (and intercepted links) is always relatively
 678 %small, but it does require access to distribution data which we want
 679 %to keep resident on the GPU throughout the simulation. The overhead of
 680 %transferring the full distribution to the CPU and back every timestep
 681 %would be too high. But on the other hand, the full codebase required
 682 %to manage particles is quite complex and comprehensive: completely
 683 %porting it to the GPU, such that the particles effectively also remain
 684 %resident on the GPU, would be a large effort and would cause a major
 685 %diversion of CPU and GPU codebases. Further complications would arise
 686 %when running on multiple GPUs due to the fact that particles can move
 687 %between parallel subdomains. Our solution, which minimises overhead,
 688 %is to keep the particles resident on CPU, but offload only their
 689 %interaction with the fluid to the GPU, as we describe below.
 690
 691 %On the CPU, the code in the above listing can be restructured into
 692 %three distinct stages as (introducing arrays to temporrily store
 693 %indices and data):
 694 %{\footnotesize
 695 %\begin{verbatim}
 696 %1) For each particle, for each link:
 697 %     Store site and velocity indices associated with link
 698 %     Calculate and store coefficient
 699 %2) For each particle, for each link:
 700 %     Update velocity data using stored values
 701 %     Calculate and store particle momenta updates
 702 %3) For each particle, for each link:
 703 %     Update particle momentum using stored values
 704 %\end{verbatim}
 705 %}
 706 %Stage 2 is only one that that accesses fluid data: this can therefore
 707 %be moved to the GPU, while stages 1 and 3 remain on the CPU. Stage 2
 708 %therefore becomes, for the GPU implementation
 709 %{\footnotesize
 710 %\begin{verbatim}
 711 %Transfer coefficient and link index arrays to GPU
 712 %Run CUDA kernel:
 713 %  Update fluid data using coefficient and link index arrays
 714 %  Calculate and store particle momenta updates from fluid data
 715 %Transfer particle momenta updates array back to CPU
 716 %\end{verbatim}
 717 %}
 718 %Since the total number of links involves is relatively small, these
 719 %partial data transfers have minimal overhead.  A complication arises
 720 %due to the fact that the particles move through the lattice, albeit
 721 %relatively slowly, and therefore the list of links that they intercept
 722 %occasionally changes.  When a
 723 %particle moves off a lattice site, the fluid on that site must be
 724 %reconstructed. Similarly, the fluid on the newly obstructed site must
 725 %be removed. For the same reasons given above, we want to avoid moving
 726 %all the code dealing with this complication to the GPU, whilst
 727 %minimising overheads. Since, on any one timestep, the number of
 728 %lattice sites affected is relatively low, we implement functionality
 729 %to perform partial data transfers of the distribution. Only those
 730 %sites affected are packed into a buffer on the GPU, transferred to the
 731 %CPU and unpacked into the CPU version of the distribution. Then, the
 732 %existing code can be used to update those affected sites, before the
 733 %reverse partial data transfer is performed to update the GPU version
 734 %of data.
 735
 736 It is perhaps interesting at this point to make some more general
 737 observations on the software engineering challenge presented when
 738 extending an existing CPU code to the GPU. The already complex
 739 task of maintaining the code in a portable fashion while also
 740 maintaining performance is currently formidable. To help this process,
 741 we have followed a number of basic principles. First,
 742 in order to port to the GPU in an incremental fashion,
 743 we have tried to maintain the modular structure of the CPU where
 744 possible. For each data structure, such as the distribution, a separate
 745 analogue is maintained in both the CPU and GPU memory spaces. However,
 746 the GPU copy does not include the complete CPU structure: in
 747 particular, non-intrinsic datatypes such as MPI datatypes are not
 748 required on the GPU. Functions to marshal data between CPU and GPU
 749 are provided for each data structure, abstracting the underlying
 750 CUDA implementation. (This reasonably lightweight abstraction layer
 751 could also allow an OpenCL implementation to be developed.)
 752 This makes it easy to switch between the CPU and GPU for different
 753 components in the code, which is useful in development
 754 and testing. GPU functionality can be added incrementally while
 755 retaining a code that runs correctly (albeit slowly due to data
 756 transfer overheads).  Once all relevant components are moved to
 757 the GPU, it becomes possible to remove such data transfers and
 758 keep the entire problem resident on the device.
 759
 760 %To ensure that \textit{Ludwig} remains suitable for CPU-based systems,
 761 %we augment rather than modify the original code. All new GPU-related
 762 %functionality are implemented in an additive manner, mirroring the
 763 %modular structure of the original code. The new GPU modules have
 764 %interfaces similar to the original CPU functions, and the ability to
 765 %switch between versions is achieved at compile-time via C
 766 %pre-processing directives.  All GPU compute kernels are implemented using
 767 %CUDA.  Wrapper routines are developed to specify the
 768 %decomposition, invoke the CUDA kernels and perform the necessary data
 769 %management operations: the latter are facilitated through newly
 770 %developed modules for creation and destruction of device memory
 771 %structures, the copying of data between host and device, and where
 772 %necessary the packing and buffering of data.
 773
 774
 775
 776
 777 \section{Summary}
 778 \label{ch14:sec:summary}
 779
 780 We have described the steps taken to implement, on both single and
 781 multiple GPUs, the \textit{Ludwig} code which was originally
 782 designed for complex fluid problems on a conventional CPU
 783 architecture. We have added the necessary functionality using
 784 NVIDIA CUDA and discussed the important changes to the main
 785 data structures. By retaining domain decomposition and message
 786 passing via MPI, we have demonstrated it is possible to scale
 787 complex fluid problems to large numbers of GPUs in parallel.
 788 By following the key design criteria of maximizing parallelism
 789 and minimizing host-device data transfers, we have confirmed that
 790 the mixed MPI-GPU approach is viable for scientific applications.
 791 For the more intricate problem of moving solid particles, we find
 792 it is possible to retain the more serial elements related to
 793 particle link operations on the CPU, while offloading only the
 794 parallel lattice-based operations involving the LB distribution
 795 to the GPU. Again, this minimizes host-device movement of data.
 796
 797 From the software engineering viewpoint, some duplication of code
 798 to allow efficient implementation on both host and device is
 799 currently required. This issue might be addressed by approaches
 800 such as automatic kernel generation, but may also be addressed
 801 naturally in time as GPU and CPU hardware converge. Portable
 802 abstractions and APIs, perhaps based on approaches such as OpenCL,
 803 will also facilitate the development and maintenance of portable
 804 codes which also exhibit portable performance (perhaps in conjunction
 805 with automatic tuning approaches).
 806
 807 So, while the challenges in designing portable and efficient scientific
 808 applications remain very real, this work provides some hope that large
 809 clusters of GPU machines can be used effectively for a wide range of
 810 complex fluid problems.
 811
 812
 813 %The \textit{Ludwig} LB application, which specializes
 814 %in complex fluid problems, has undergone the developments required to
 815 %use large-scale parallel GPU accelerated supercomputers effectively.
 816 %The new functionality augments the original code, so that
 817 %\textit{Ludwig} can run on either CPU-based or GPU-accelerated systems.
 818 %Using the NVIDIA CUDA programming language, all the computational
 819 %kernels were offloaded to the GPU. This included not only those
 820 %responsible for the majority of the compute time, but any that
 821 %accessed the relevant data structures. This allows data to be kept
 822 %resident on the GPU and avoids expensive data transfers. We described
 823 %steps taken to optimise the performance on the GPU architecture by
 824 %reducing off-chip memory accesses and restructuring the data layout
 825 %in memory to ensure the available memory bandwidth was exploited fully.
 826
 827 %A new halo-exchange communication phase for the code was developed to
 828 %allow efficient parallel scaling to many GPUs. The CPU version
 829 %relies on MPI datatype functionality for CPU to CPU communication; we
 830 %have explicitly written buffer packing/unpacking and transfer
 831 %operations to for GPU to GPU data exchange, staged through the host
 832 %CPUs. This includes the filtering of velocity components to only
 833 %transfer those required in a specific direction, hence reducing the
 834 %overall volume of data transferred. We used CUDA stream functionality
 835 %to overlap separate stages within the communication phase, and reduce
 836 %the overall communication time. The effect was measured to be more
 837 %profound for smaller local system sizes, i.e. it is especially useful
 838 %in aiding strong scaling.
 839
 840 %We described work to enable colloidal particles in the simulation with
 841 %minimal overhead. We implemented these in such a way that we avoid a
 842 %major diversion of the CPU and GPU codebases, whilst minimising data
 843 %transfers between the CPU and GPU at each timestep. We keep the
 844 %majority of the (relatively inexpensive) particle related code on the
 845 %CPU, while offloading only those parts responsible for the interaction
 846 %with the fluid to the GPU (where the fluid data resides).
 847
 848 %Our resulting package is able to exploit the largest of GPU
 849 %accelerated architectures to address a wide range of complex problems.
 850
 851
 852
 853 % An example of a floating figure using the graphicx package.
 854 % Note that \label must occur AFTER (or within) \caption.
 855 % For figures, \caption should occur after the \includegraphics.
 856 % Note that IEEEtran v1.7 and later has special internal code that
 857 % is designed to preserve the operation of \label within \caption
 858 % even when the captionsoff option is in effect. However, because
 859 % of issues like this, it may be the safest practice to put all your
 860 % \label just after \caption rather than within \caption{}.
 861 %
 862 % Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
 863 % option should be used if it is desired that the figures are to be
 864 % displayed while in draft mode.
 865 %
 866 %\begin{figure}[!t]
 867 %\centering
 868 %\includegraphics[width=2.5in]{myfigure}
 869 % where an .eps filename suffix will be assumed under latex,
 870 % and a .pdf suffix will be assumed for pdflatex; or what has been declared
 871 % via \DeclareGraphicsExtensions.
 872 %\caption{Simulation Results}
 873 %\label{fig_sim}
 874 %\end{figure}
 875
 876 % Note that IEEE typically puts floats only at the top, even when this
 877 % results in a large percentage of a column being occupied by floats.
 878
 879
 880 % An example of a double column floating figure using two subfigures.
 881 % (The subfig.sty package must be loaded for this to work.)
 882 % The subfigure \label commands are set within each subfloat command, the
 883 % \label for the overall figure must come after \caption.
 884 % \hfil must be used as a separator to get equal spacing.
 885 % The subfigure.sty package works much the same way, except \subfigure is
 886 % used instead of \subfloat.
 887 %
 888 %\begin{figure*}[!t]
 889 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
 890 %\label{fig_first_case}}
 891 %\hfil
 892 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
 893 %\label{fig_second_case}}}
 894 %\caption{Simulation results}
 895 %\label{fig_sim}
 896 %\end{figure*}
 897 %
 898 % Note that often IEEE papers with subfigures do not employ subfigure
 899 % captions (using the optional argument to \subfloat), but instead will
 900 % reference/describe all of them (a), (b), etc., within the main caption.
 901
 902
 903 % An example of a floating table. Note that, for IEEE style tables, the
 904 % \caption command should come BEFORE the table. Table text will default to
 905 % \footnotesize as IEEE normally uses this smaller font for tables.
 906 % The \label must come after \caption as always.
 907 %
 908 %\begin{table}[!t]
 909 %% increase table row spacing, adjust to taste
 910 %\renewcommand{\arraystretch}{1.3}
 911 % if using array.sty, it might be a good idea to tweak the value of
 912 % \extrarowheight as needed to properly center the text within the cells
 913 %\caption{An Example of a Table}
 914 %\label{table_example}
 915 %\centering
 916 %% Some packages, such as MDW tools, offer better commands for making tables
 917 %% than the plain LaTeX2e tabular which is used here.
 918 %\begin{tabular}{|c||c|}
 919 %\hline
 920 %One & Two\\
 921 %\hline
 922 %Three & Four\\
 923 %\hline
 924 %\end{tabular}
 925 %\end{table}
 926
 927
 928 % Note that IEEE does not put floats in the very first column - or typically
 929 % anywhere on the first page for that matter. Also, in-text middle ("here")
 930 % positioning is not used. Most IEEE journals/conferences use top floats
 931 % exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
 932 % footnotes above bottom floats. This can be corrected via the \fnbelowfloat
 933 % command of the stfloats package.
 934
 935
 936 % conference papers do not normally have an appendix
 937
 938
 939 % use section* for acknowledgement
 940 \section{Acknowledgments}
 941 AG was supported by the CRESTA project, which has received
 942 funding from the European Community's Seventh Framework Programme
 943 (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported
 944 by UK EPSRC under grant EV/J007404/1.
 945
 946
 947 % trigger a \newpage just before the given reference
 948 % number - used to balance the columns on the last page
 949 % adjust value as needed - may need to be readjusted if
 950 % the document is modified later
 951 %\IEEEtriggeratref{8}
 952 % The "triggered" command can be changed if desired:
 953 %\IEEEtriggercmd{\enlargethispage{-5in}}
 954
 955 % references section
 956
 957 % can use a bibliography generated by BibTeX as a .bbl file
 958 % BibTeX documentation can be easily obtained at:
 959 % http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
 960 % The IEEEtran BibTeX style support page is at:
 961 % http://www.michaelshell.org/tex/ieeetran/bibtex/
 962 %\bibliographystyle{IEEEtran}
 963 % argument is your BibTeX string definitions and bibliography database(s)
 964 %\bibliography{IEEEabrv,../bib/paper}
 965 %
 966 % <OR> manually copy in the resultant .bbl file
 967 % set second argument of \begin to the number of references
 968 % (used to reserve space for the reference number labels box)
 969
 970 \putbib[Chapters/chapter14/biblio14]
 971 %\begin{thebibliography}{1}
 972
 973 %\bibitem{IEEEhowto:kopka}
 974 %H.~Kopka and P.~W. Daly, \emph{A Guide to \LaTeX}, 3rd~ed.\hskip 1em plus
 975 %  0.5em minus 0.4em\relax Harlow, England: Addison-Wesley, 1999.
 976
 977
 978
 979 %% \bibitem{succi-book}
 980 %% S. Succi, \textit{The lattice Boltzmann equation and beyond},
 981 %% Oxford University Press, Oxford, 2001.
 982
 983 %\bibitem{mpi-standard}
 984 %Message Passing Interface Forum, http://www.mpi-forum.org
 985
 986 %% \bibitem{desplat}
 987 %% Desplat, J.-C., I. Pagonabarraga, and P. Bladon,
 988 %% \textit{LUDWIG: A parallel lattice-Boltzmann code for complex fluids}.
 989 %% Comput. Phys. Comms., \textbf{134}, 273, 2001.
 990
 991 %% \bibitem{aidun2010}
 992 %% C.K. Aidun and J.R. Clausen,
 993 %% \textit{Lattice Boltzmann method for complex flows},
 994 %% Ann. Rev. Fluid Mech., \textbf{42} 439--472 (2010).
 995
 996 %\bibitem{bray1994}
 997 %A.J. Bray,
 998 %\textit{Theory of phase-ordering kinetics},
 999 %Adv. Phys., \textbf{43} 357--459 (1994).
1000
1001 %\bibitem{kendon2001}
1002 %V.M. Kendon, M.E. Cates, I. Pangonabarraga, J.-C. Desplat, and P. Bladon,
1003 %\textit{Inertial effects in the three-dimensional spinodal decomposition of a
1004 %symmetric binary fluid mixture; a lattice Boltzmann study},
1005 %J. Fluid Mech., \textbf{440}, 147--203 (2001).
1006
1007 %\bibitem{swift1996}
1008 %M.R. Swift, E. Orlandini, W.R. Osborn, and J.M. Yeomans,
1009 %\textit{Lattice Boltzmann simulations of liquid-gas and binary
1010 %fluid systems},
1011 %Phys. Rev. E, \textbf{54}, 5041--5052 (1996).
1012
1013 %% \bibitem{stratford2008}
1014 %% K. Stratford and I. Pagonabarraga,
1015 %% Parallel domain decomposition for lattice Boltzmann with moving particles,
1016 %% \textit{Comput. Math. with Applications} \textbf{55}, 1585 (2008).
1017
1018 %\bibitem{xe6}
1019 %Cray XE6 Product Brochure, available from
1020 %http://www.cray.com/Products/XE/Resources.aspx (2010)
1021
1022 %\bibitem{hector} HECToR home page, http://www.hector.ac.uk
1023
1024 %\bibitem{xk6}
1025 %Cray XK6 Product Brochure, available from
1026 %http://www.cray.com/Products/XK6/XK6.aspx (2011)
1027
1028
1029 %% \bibitem{wei2004}
1030 %% X. Wei, W. Li, K. M\"uller, and A.E. Kaufman,
1031 %% \textit{The lattice Boltzmann method for simulating gaseous phenomena},
1032 %% IEEE Transactions on Visualization and Computer Graphics,
1033 %% \textbf{10}, 164--176 (2004).
1034 %% Apparently first LB via GPU; a serious contribution using single fluid
1035 %%d3q19 single relaxation time.
1036
1037 %% \bibitem{zhu2006}
1038 %% H. Zhu, X. Liu, Y. Liu, and E. Wu,
1039 %% \textit{Simulation of miscible binary mixtures based on lattice Boltzmann method},
1040 %% Comp. Anim. Virtual Worlds, \textbf{17}, 403--410 (2006).
1041 %% Single relaxation (MRT mentioned) time d3q19 apparently sound although
1042 %%not a lot of detail. Pre-cuda so graphics code.
1043
1044 %% \bibitem{zhao2007}
1045 %% Y. Zhao,
1046 %% \textit{Lattice Boltzmann based PDE solver on the GPU},
1047 %% Visual Comput., doi 10.1007/s00371-0070191-y (2007).
1048
1049 %% \bibitem{toelke2010}
1050 %% J. T\"olke,
1051 %% Implementation of a lattice Boltzmann kernel using the compute unified
1052 %% device architecture developed by nVIDIA,
1053 %% Comput. Visual Sci. 13 29--39 (2010).
1054
1055 %% \bibitem{fan2004}
1056 %% Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover,
1057 %% \textit{GPU cluster for high performance computing},
1058 %% Proceedings of ACM/IEEE Supercomputing Conference, pp. 47--59,
1059 %% IEEE Computer Society Press, Pittsburgh, PA (2004).
1060
1061 %% \bibitem{myre2011}
1062 %% J. Myre, S.D.C. Walsh, D. Lilja, and M.O. Saar,
1063 %% \textit{Performance analysis of single-phase, multiphase, and multicomponent
1064 %% lattice Boltzmann fluid flow simulations on GPU clusters},
1065 %% Concurrency Computat.: Pract. Exper., \textbf{23}, 332--350 (2011).
1066
1067 %% \bibitem{obrecht2011}
1068 %% C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux,
1069 %% \textit{Multi-GPU implementation of the lattice Boltzmann method},
1070 %% Comput. Math. with Applications,
1071 %% doi:10.1016/j.camwa.2011.02.020 (2011).
1072
1073 %% \bibitem{bernaschi2010}
1074 %% M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras,
1075 %% \textit{A flexible high-performance lattice Boltzmann GPU code for the
1076 %% simulations of fluid flow in complex geometries},
1077 %% Concurrency Computat.: Pract. Exper., \textbf{22}, 1--14 (2010).
1078
1079 %% \bibitem{xian2011}
1080 %% W. Xian and A. Takayuki,
1081 %% \textit{Multi-GPU performance of incompressible flow computation by
1082 %% lattice Boltzmann method on GPU cluster},
1083 %% Parallel Comput., doi:10.1016/j.parco.2011.02.007 (2011).
1084
1085 %% \bibitem{feichtinger2011}
1086 %% C. Feichtinger, J. Habich, H. K\"ostler, G. Hager, U. R\"ude, and
1087 %% G. Wellein,
1088 %% A flexible patch-based lattice Boltzmann parallelization approach
1089 %% for heterogeneous GPU-CPU clusters,
1090 %% \textit{Parallel Computing} \textbf{37} 536--549 (2011).
1091
1092 %% \bibitem{wellein2006}
1093 %% G. Wellein, T. Zeiser, G Hager, and S. Donath,
1094 %% On the single processor performance of simple lattice Boltzmann kernels,
1095 %% \textit{Computers and Fluids}, \textbf{35}, 910--919 (2006).
1096
1097 %% \bibitem{pohl2003}
1098 %% T. Pohl, M. Kowarschik, J. Wilke, K. Igelberger, and U. R\"ude,
1099 %% Optimization and profiling of the cache performance of parallel
1100 %% lattice Boltzmann code,
1101 %% \textit{Parallel Process Lett.} \textit{13} 549--560 (2003).
1102
1103 %% \bibitem{mattila2007}
1104 %% K. Mattila, J. Hyv\"aluoma, T. Rossi, M. Aspn\"as and J. Westerholm,
1105 %% An efficient swap algorithm for the lattice Boltzmann method,
1106 %% \textit{Comput. Phys. Comms.} \textit{176} 200-210 (2007).
1107
1108 %% \bibitem{wittmann2012}
1109 %% M. Wittmann, T. Zeiser, G. Hager, and G. Wellein,
1110 %% Comparison of different propagation steps for lattice Boltzmann methods,
1111 %% \textit{Comput. Math with Appl.} doi:10.1016/j.camwa.2012.05.002 (2012).
1112
1113 %% \bibitem{walshsaar2012}
1114 %% S.D.C. Walsh and M.O. Saar,
1115 %% Developing extensible lattice Boltzmann simulators for general-purpose
1116 %% graphics-processing units,
1117 %% \textit{Comm. Comput. Phys.}, \textbf{13} 867--879 (2013).
1118
1119
1120 %% \bibitem{williams2011}
1121 %% S. Williams, L. Oliker, J. Carter, and J Shalf,
1122 %% Extracting ultra-scale lattice Boltzmann performance via
1123 %% hierarchical and distributed auto-tuning,
1124 %% \textit{Proc. SC2011}.
1125
1126
1127 %% \bibitem{ch14:stratford-jsp2005}
1128 %% K. Stratford, R. Adhikari, I. Pagonabarraga, and J.-C. Desplat,
1129 %% \textit{Lattice Boltzmann for Binary Fluids with Suspended Colloids},
1130 %% J. Stat. Phys. \textbf{121}, 163 (2005).
1131
1132 %% \bibitem{ladd1994}
1133 %% A.J.C. Ladd,
1134 %% Numerical simulations of particle suspensions via a discretized
1135 %% Boltzmann equation. Part 1. Theoretical foundation,
1136 %% \textit{J. Fluid Mech.} \textbf{271} 285--309 (1994);
1137 %% Part II. Numerical results,
1138 %% \textit{ibid.} \textbf{271} 311--339 (1994).
1139
1140
1141 %% \bibitem{nguyen2002}
1142 %% N.-Q. Nguyen and A.J.C. Ladd,
1143 %% Lubrication corrections for lattice Boltzmann simulations of particle
1144 %% suspensions,
1145 %% \textit{Phys. Rev. E} \textbf{66} 046708 (2002).
1146
1147 %% \bibitem{ch14:immersed}
1148 %% C.S. Peskin,
1149 %% Flow patterns around heart valves; a numerical method,
1150 %% \textit{J. Comp. Phys.}, \textbf{10}, 252--271 (1972);
1151 %% C.S. Peskin,
1152 %% The immersed boundary method,
1153 %% \textit{Acta Nummerica} \textbf{11} 479--517 (2002).
1154
1155 %% \bibitem{ch14:immersed-lb}
1156 %% Z.-G. Feng and E.E. Michaelides,
1157 %% The immersed boundary-lattice Boltzmann method for solving
1158 %% fluid-particles interaction problem,
1159 %% \textit{J. Comp. Phys.}, \textbf{195} 602--628 (2004).
1160
1161 %% \bibitem{ch14:spm}
1162 %% Y. Nakayama and R. Yammamoto,
1163 %% Simulation method to resolve hydrodynamic interactions in colloidal
1164 %% dispersions,
1165 %% \textit{Phys. Rev. E}, \textbf{71} 036707 (2005).
1166
1167
1168 %\end{thebibliography}
1169
1170
1171
1172
1173
1174
1175
1176 % LocalWords:  CRESTA EPSRC Succi Desplat Pagonabarraga