BookGPU/Chapters/chapter14/ch14.tex

   1 \chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh}
   2
   3 \chapter{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann
   4 application}
   5
   6 %\putbib[biblio]
   7
   8
   9 \section{Introduction}
  10 The lattice Boltzmann (LB) method (for an overview see, e.g.,
  11 \cite{succi-book}) has become a popular approach to a variety of fluid
  12 dynamics problems.  It provides a way to solve the incompressible,
  13 isothermal Navier-Stokes equations and has the attractive features of
  14 being both explicit in time and local in space. This makes the LB
  15 method well-suited to parallel computation. Many efficient parallel
  16 implementations of the LB method have been undertaken, typically using
  17 a combination of distributed domain decomposition and the Message
  18 Passing Interface (MPI). However, the potential
  19 performance benefits offered by GPUs has motivated a new `mixed-mode'
  20 approach to address very large problems. Here, fine-grained
  21 parallelism is implemented on the GPU, while MPI is reserved for
  22 larger-scale parallelism.  This mixed mode is of increasing interest
  23 to application programmers at a time when many supercomputing services
  24 are moving
  25 toward clusters of GPU accelerated nodes. The design questions which
  26 arise when developing a lattice Boltzmann code for this type of
  27 heterogeneous system are therefore worth studying. Furthermore, similar
  28 questions also recur in many other types of stencil-based algorithms.
  29
  30 The first applications of LB on GPUs were to achieve fluid-like
  31 effects in computer animation, rather than scientific applications per
  32 se.  These early works include simple fluids~\cite{wei2004}, miscible
  33 two-component flow~\cite{zhu2006}, and various image processing tasks
  34 based on the use of partial differential equations~\cite{zhao2007}.
  35 While these early works used relatively low level graphics APIs, the
  36 first CUDA runtime interface implementation was a two-dimensional
  37 simple fluid problem~\cite{toelke2010}.  Following pioneering work on
  38 clusters of GPUs coupled via MPI to study air
  39 pollution~\cite{fan2004}, more recent work has included mixed OpenMP
  40 and CUDA~\cite{myre2011}, Posix threads and CUDA~\cite{obrecht2011},
  41 and MPI and CUDA for increasingly large GPU clusters
  42 \cite{bernaschi2010,xian2011,feichtinger2011}. The heterogeneous
  43 nature of these systems has also spurred interest in approaches
  44 including automatic code generation \cite{walshsaar2012} and auto-tuning
  45 \cite{williams2011} to aid application performance.
  46
  47 Many of these authors make use of
  48 another attractive feature of LB: the ability to include fixed
  49 solid-fluid boundary conditions as a straightforward addition to
  50 the algorithm to study, for example, flow in
  51 porous media. This points to an important
  52 application area for LB methods: that of complex fluids.
  53 Complex fluids include mixtures, surfactants, liquid crystals,
  54 and particle suspensions, and typically require additional physics
  55 beyond the bare Navier-Stokes equations to provide a full
  56 description~\cite{aidun2010}. The representation of this extra
  57 physics raises additional design questions for the application
  58 programmer. Here, we consider the \textit{Ludwig} code \cite{desplat},
  59 an LB application developed specifically for complex fluids
  60 (\textit{Ludwig} was named for Boltzmann, 1844--1906).
  61 We will present the steps
  62 required to allow \textit{Ludwig} to exploit efficiently both a single
  63 GPU, and also many GPUs in parallel.  We show that
  64 \textit{Ludwig} scales excellently to at least the one thousand GPU level
  65 (the largest resource available at the time of writing) with
  66 indications that the code will scale to much larger systems as
  67 they become available. In addition, we consider the steps required
  68 to represent fully-resolved moving solid particles (usually referred
  69 to as colloids in the context of complex fluids). Such particles
  70 need to have their surface resolved on the lattice scale in order
  71 to include relevant surface physics, and must be able to move: e.g.,
  72 to execute Brownian motion in response to random forces from the
  73 fluid. Standard methods are available to represent such particles
  74 (e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective
  75 domain decomposition and message passing \cite{stratford2008}.
  76 We describe below the additional considerations which arise
  77 for such moving particles when developing an
  78 implementation on a GPU.
  79
  80 In the following section we provide a brief overview of the lattice
  81 Boltzmann method, and mention some of the general issues which can
  82 influence performance in the CPU context.
  83 In Section \ref{ch14:sec:singlegpu}, we then describe the alterations
  84 which are required to exploit the GPU architecture effectively, and
  85 highlight how the resulting code differs from the CPU version. In
  86 Section \ref{ch14:sec:parallelgpu}, we extend this description to
  87 include the steps required to allow exploitation of many GPUs in
  88 parallel while retaining effective scaling. We also present results for
  89 a typical benchmark for a fluid-only problem in Section
  90 \ref{ch14:sec:parallelgpu} to demonstrate the success of the approach.
  91 In Section \ref{ch14:sec:particles}, we describe the design choices
  92 which are relevant to a GPU implementation of moving particles. Finally,
  93 we include a number of general observations on software
  94 engineering and maintenance which arise from our experience.
  95 A summary is provided in Section~\ref{ch14:sec:summary}.
  96
  97
  98 \section{Background}\label{ch14:sec:background}
  99
 100 %\begin{figure}[!t]
 101 %\centering
 102 %\includegraphics[width=12cm]{Chapters/chapter14/figures/basiclattice}
 103 %\caption{The lattice Boltzmann approach discritises space into a 3D lattice (left). The fluid is represented at each lattice site (right), by the {\it distribution} vector: each component corresponding to a specific velocity direction. Here we are interested in the {\it D3Q19} model, in which there are 3 spatial directions and 19 velocity components (including a zero component).}
 104 %\label{ch14:fig:basiclattice}
 105 %\end{figure}
 106
 107
 108 For a general complex fluid problem the starting point is the
 109 fluid velocity field $\mathbf{u}(\mathbf{r})$, whose evolution
 110 obeys the Navier-Stokes equations describing the conservation of mass
 111 (or density $\rho$), and momentum:
 112 \begin{equation}
 113 \rho  [ \partial_t \mathbf{u} + (\mathbf{u}.\mathbf{\nabla})\mathbf{u} ]
 114 = -\nabla p + \eta \nabla^2 \mathbf{u} + \mathbf{f}(\mathbf{r}),
 115 \end{equation}
 116 where $p$ is the isotropic pressure and $\eta$ is the viscosity.
 117 A local force $\mathbf{f}(\mathbf{r})$ provides a means for coupling
 118 to other complex fluid constituents, e.g., it might represent the force
 119 exerted on the fluid by a curved interface between different phases or
 120 components.
 121
 122 The LB approach makes use of a regular three-dimensional
 123 lattice (see Figure \ref{ch14:fig:decomphalo}) with discrete spacing
 124 $\Delta r$. It also makes use of a
 125 discrete velocity space $\mathbf{c}_i$, where the $\mathbf{c}_i$
 126 are chosen to capture the correct symmetries of the Navier-Stokes
 127 equations. A typical choice, used here, is the so-called D3Q19
 128 basis in three dimensions where there is one velocity such that
 129 $\mathbf{c} \Delta t$ is zero, along with six extending to the nearest
 130 neighbour
 131 lattice sites, and twelve extending to the next-nearest neighbour sites
 132 ($\Delta t$ being the discrete time step). The fundamental object
 133 in LB is then the distribution function $f_i (\mathbf{r};t)$ whose
 134 moments are related to the local hydrodynamic quantities: the fluid
 135 density, momentum, and stress. The time evolution of the distribution
 136 function is described by a discrete Boltzmann equation
 137 \begin{equation}
 138 f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t)
 139 = - {\cal L}_{ij} f_j(\mathbf{r};t).
 140 \end{equation}
 141 It is convenient to think of this in two stages. First, the right hand
 142 side represents the action of a collision operator ${\cal L}_{ij}$,
 143 which is local to each lattice site and relaxes the distribution toward
 144 a local equilibrium at a rate ultimately related to the fluid viscosity.
 145 Second, the left hand side represents a propagation step (sometimes referred
 146 to as streaming step), in which each element $i$ of the distribution is
 147 displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the
 148 appropriate direction per discrete time step.
 149
 150 More specifically, we store a vector of 19 double precision floating
 151 point values at each lattice site for the distribution function
 152 $f_i(\mathbf{r};t)$.
 153 The collision operation ${\cal L}_{ij}$, which is local at each lattice
 154 site, may be thought of as follows. A matrix-vector multiplication
 155 ${\cal M}_{ij}f_j$ is used to transform the distributions into the
 156 hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19
 157 matrix related to the choice of
 158 $\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then
 159 relaxed toward their (known) equilibrium values, and are transformed
 160 back to new post-collision distributions via the inverse transformation
 161 ${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of 2x19$^2$
 162 floating point multiplications per lattice site. (Additional operations are
 163 required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.)
 164 In contrast, the
 165 propagation stage consists solely of moving the distribution values
 166 between lattice sites, and involves no floating point operations.
 167
 168 In the CPU version, the \textit{Ludwig} implementation stores one time
 169 level of distribution values $f_i(\mathbf{r}; t)$. This distribution
 170 is stored as a flat array of C data type \texttt{double}, and laid out
 171 so that all elements of the velocity are contiguous at a given site
 172 (often referred to as ``array-of-structures'' order). This is
 173 appropriate for sums over the distribution required locally at the
 174 collision stage as illustrated schematically in
 175 Listing~\ref{ch14:listing1}: the fact that consecutive loads are from
 176 consecutive memory addresses allows the prefetcher to engage fully.
 177 (The temporary scalar \texttt{a\_tmp} allows caching of the
 178 intermediate accumulated value in the innermost loop.)  A variety of
 179 standard sequential optimisations are relevant for the collision stage
 180 (loop unrolling, inclusion of SIMD operations, and so on
 181 \cite{wellein2006}).  For example, the collision stage in
 182 \textit{Ludwig} includes explicit SIMD code, which is useful if the
 183 compiler on a given platform cannot identify it.  The propagation
 184 stage is separate, and is organised as a ``pull'' (again, various
 185 optimisation have been considered, e.g.,
 186 \cite{pohl2003,mattila2007,wittmann2012}).  No further optimisation is
 187 done here beyond ensuring that the ordering of the discrete velocities
 188 allows memory access to be as efficient as possible. While these
 189 optimisations are important, it should be remembered that for some
 190 complex fluid problems, the hydrodynamics embodied in the LB
 191 calculation is a relatively small part of the total computational cost
 192 (at the 10\%-level in some cases).  This means optimisation effort may
 193 be better concentrated elsewhere.
 194
 195 \begin{lstlisting}[float, label=ch14:listing1,
 196 caption = Collision schematic for CPU.]
 197 /* loop over lattice sites */
 198 for (is = 0; is < nsites; is++) {
 199   ...
 200   /* load distribution from ``array-of-structures'' */
 201   for (i = 0; i < 19; i++)
 202     f[i] = f_aos[19*is + i]
 203   ...
 204   /* perform matrix-vector multiplication */
 205   for (i = 0; i < 19; i++) {
 206     a_tmp = 0.0;
 207     for (j = 0; j < 19; j++) {
 208       a_tmp += f[j]*M[i][j];
 209     }
 210     a[i] = a_tmp;
 211   }
 212   ...
 213 }
 214 \end{lstlisting}
 215
 216
 217 \begin{figure}[!t]
 218 \centering
 219 \includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo}
 220 \caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For
 221   clarity we show a 2D decomposition of a 3D lattice, but in practice
 222   we decompose in all 3 dimensions. Halo cells are added to each
 223   sub-domain (as shown on the upper right for a single slice) which store
 224   data retrieved from remote neighbours in the halo exchange. Lower
 225   right: the D3Q19 velocity set resident on a lattice site;
 226   highlighted are the 5 ``outgoing'' elements to be transferred in a
 227   specific direction.}
 228 \label{ch14:fig:decomphalo}
 229 \end{figure}
 230
 231
 232 The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}.
 233 Each local sub-domain is surrounded by a halo, or ghost, region of width one
 234 lattice site. While the collision is local, elements of the distribution
 235 must be exchanged at the edges of the domains to facilitate the propagation.
 236 To achieve the full 3D halo exchange, the standard approach of shifting the
 237 relevant data in each co-ordinate direction in turn is adopted. This
 238 requires appropriate synchronisation, i.e., a receive in the the first
 239 co-ordinate direction must be complete before a send in the second direction
 240 involving relevant data can take place, and so on. We note that only
 241 ``outgoing'' elements of the distribution need to be sent at each edge.
 242 For the D3Q19 model, this reduces the volume of data traffic from 19 to
 243 5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU
 244 version, the necessary transfers are implemented in place using
 245 a vector of appropriately strided MPI datatypes for each direction.
 246
 247
 248 \section{Single GPU Implementation}\label{ch14:sec:singlegpu}
 249
 250
 251 In this section we describe the steps taken to enable \textit{Ludwig}
 252 for the GPU. There are a number of crucial issues: first, the
 253 minimisation of data traffic between host and device; second, the
 254 optimal mapping of available parallelism onto the architecture and
 255 third, the issue of memory coalescing. We discuss each of these in
 256 turn.
 257
 258 While the most important section of the LB in terms of
 259 floating-point performance is the collision stage, this cannot be the
 260 only consideration for a GPU implementation. It is essential to
 261 offload all computational activity which involves the main data
 262 structures (such as the distribution) to the GPU. This includes
 263 kernels with relatively low computational demand, such as the
 264 propagation stage. All relevant data then remain resident on the GPU,
 265 to avoid expensive host-device data transfer at each iteration of
 266 the algorithm. Such transfers would negate any benefit of GPU
 267 acceleration.  We note that for a complex fluid code, this requirement
 268 can extend to a considerable number of kernels, although we limit the
 269 discussion to collision and propagation for brevity.
 270
 271 To achieve optimal performance, it is vital to exploit fully the
 272 parallelism inherent in the GPU architecture, particularly for
 273 those matrix-vector operations within the collision kernel.
 274 The GPU architecture features a hierarchy of parallelism. At the
 275 lowest level, groups of 32 threads (warps) operate in
 276 lock-step on different data elements: this is SIMD style vector-level
 277 parallelism. Multiple warps are combined into a thread block (in which
 278 communication and synchronisation are possible), and multiple blocks can
 279 run concurrently across the streaming multiprocessors in the GPU
 280 (with no communication or
 281 synchronisation possible across blocks).  To decompose on the GPU, we
 282 must choose which part of the collision to assign to which level of
 283 parallelism.
 284
 285 While there exists parallelism within the matrix-vector operation, the
 286 length of each vector (here, 19) is smaller than the warp size and typical
 287 thread block sizes. So we simply decompose the loop over lattice sites
 288 to all levels of parallelism, i.e., we use a separate CUDA thread for
 289 every lattice site, and each thread performs the full matrix-vector
 290 operation. We find that a block size of 256 performs well on current
 291 devices: we therefore decompose
 292 the lattice into groups of 256 sites, and assign each group to a block
 293 of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is
 294 assigned to the fast {\it constant} on-chip device memory.
 295
 296 For the propagation stage, the GPU implementation adds a second time
 297 level of distribution values. The data-dependencies inherent in the
 298 propagation mean that the in-place propagation of the CPU version
 299 cannot be parallelised effectively without the additional time level.
 300 As both time levels may remain resident on the GPU, this is not a
 301 significant overhead.
 302
 303 An architectural
 304 constraint of GPUs means that optimal global memory bandwidth
 305 is only achieved when data are structured such that threads within a
 306 {\it half-warp} (a group of 16 threads) load data from the same memory
 307 segment in a single transaction: this is memory coalescing. It can be
 308 seen that the array-of-structures ordering used for the distribution
 309 in the CPU code would not be suitable for coalescing; in fact, it would
 310 result in serialised memory accesses and relative poor performance.
 311 To meet the coalescing criteria and allow consecutive threads to read
 312 consecutive memory addresses on the GPU, we transpose the layout of the
 313 distribution so that, for each velocity component, consecutive sites
 314 are contiguous in memory (``structure-of-arrays'' order). A schematic of
 315 the GPU collision code is given in Listing~\ref{ch14:listing2}.
 316
 317 \begin{lstlisting}[float, label=ch14:listing2,
 318 caption = Collision schematic for GPU.]
 319 /* compute current site index 'is' from CUDA thread and block */
 320 /* indices and load distribution from "structure-of-arrays" */
 321
 322 for (i = 0; i < 19; i++)
 323   f[i] = f_soa[nsites*i + is]
 324
 325 /* perform matrix-vector multiplication as before */
 326 ...
 327 \end{lstlisting}
 328
 329
 330 \textit{Ludwig} was modified to allow a choice of distribution data layout
 331 at compilation time depending on the target architecture: CPU or GPU. We
 332 defer some further comments on software engineering aspects of the code to
 333 the summary.
 334
 335
 336 \section{Multiple GPU Implementation}\label{ch14:sec:parallelgpu}
 337
 338 To execute on multiple GPUs, we use the same domain decomposition
 339 and message passing framework as the CPU version. Within each
 340 sub-domain (allocated to one MPI task) the GPU implementation
 341 proceeds as described in the previous section. The only additional
 342 complication is that halo transfers between GPUs must be staged
 343 through the host (in future, direct GPU to GPU data transfers via
 344 MPI may be possible, obviating the need for these steps). This
 345 means host MPI sends must be preceded by appropriate device to host
 346 transfers and host MPI receives must be followed by corresponding host
 347 to device transfers.
 348
 349 %To execute on multiple GPUs in parallel we decompose the problem using
 350 %the existing high-level parallelisaton framework: each MPI task acts
 351 %as a host to a separate GPU, with each GPU kernel operating on the
 352 %subset of the lattice volume corresponding to the partition assigned
 353 %to that task.  As described in Section \ref{ch14:sec:background}, halo
 354 %data exchanges are required at each timestep.  Significant
 355 %developments are made for the GPU adaptation of this communication
 356 %phase to achieve good inter-GPU communication performance and allow
 357 %scaling to many GPUs.
 358
 359 %The CPU code uses strided MPI datatypes to define the six
 360 %planes of lattice data to be sent or received, and performs the
 361 %communications {\it in-place}, i.e.  there is no explicit buffering of
 362 %data in the application.  For the GPU adaptation, the halo planes must
 363 %be exchanged between GPUs, on which MPI functionality is not
 364 %available, so transfers must be staged through the host CPUs. Buffer
 365 %packing/unpacking of the halo planes is explicitly performed using
 366 %CUDA kernels, and CUDA memory copy calls are used to transfer data
 367 %from a device to its host (using pinned host memory) through the PCI-e
 368 %bus (and vice versa). MPI calls are used to transfer the buffers
 369 %between hosts. The CPU version of Ludwig has a mechanism, using MPI built-in
 370 %data types, to reduce the amount of data communicated by around a
 371 %factor of four by only sending those elements of the distribution
 372 %propagating outward from a local domain. For the GPU adaptation, this
 373 %filtering of data is explicitly performed (again since MPI
 374 %functionality is not available on the GPU) in the buffer
 375 %packing/unpacking kernels.  Special care had to be taken for those
 376 %lattice sites on the corners of the halo planes, which had to be sent
 377 %in more than one direction (since the propagation includes accesses to
 378 %diagonal neighbours).
 379
 380 In practice, this data movement requires additional GPU kernels to
 381 pack and unpack the relevant data before and after corresponding MPI
 382 calls. However, the standard shift algorithm, in which each co-ordinate
 383 direction is treated in turn, does provide some scope for the
 384 overlapping of different operations.
 385 For example, after the data for the first co-ordinate direction have been
 386 retrieved by the host, these can be exchanged using MPI between
 387 hosts at the same time as kernels for packing and retrieving of
 388 data for the second co-ordinate direction are
 389 executed. This
 390 overlapping must respect the synchronisation required to ensure
 391 that data values at the corners of the sub-domain are transferred
 392 correctly.
 393 We use a separate CUDA stream for each co-ordinate direction:
 394 this allows some of the host-device communication time to be
 395 effectively ``hidden'' behind the host-host MPI communication,
 396 resulting in an overall speedup.
 397 The improvement is more pronounced for the smaller local lattice size,
 398 perhaps because of less CPU memory bandwidth contention. The
 399 overlapping is then a particularly valuable aid to strong scaling,
 400 where the local system size decreases as the number of GPUs increases.
 401
 402
 403 %\section{Performance}\label{ch14:sec:performance}
 404
 405 To demonstrate the effectiveness of our approach, we compare the
 406 performance of both CPU and GPU versions of \textit{Ludwig}. To
 407 test the complex fluid nature of the code, the problem is
 408 actually an immiscible fluid mixture which uses a second distribution
 409 function to introduce a composition variable. The interested reader
 410 is referred to \cite{ch14:stratford-jsp2005} for further details. The
 411 largest total problem size used is $2548\times 1764\times 1568$.
 412 The CPU system features a Cray XE6 architecture with 2 16-core AMD Opteron
 413 CPUs per node, and with nodes interconnected using Cray Gemini technology.
 414 For GPU results, a Cray XK6 system is used: this is very similar to the
 415 XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each
 416 node in the GPU system therefore features a single Opteron CPU acting
 417 as a host to a single GPU. The inter-node interconnect architecture is
 418 the same as for the Cray XE6. The GPU performance tests use a prototype
 419 Cray XK6 system with 936 nodes (the largest available at the time of writing).
 420 To provide a fair comparison, we compare scaling on a \textit{per node}
 421 basis. That is, we compare 1 fully occupied 32-core CPU node
 422 (running 32 MPI tasks) with 1 GPU node (host running 1 MPI task,
 423 and 1 device). We believe this is representative of the true ``cost'' of
 424 a simulation in terms of accounting budgets, or electricity.
 425
 426
 427 %\begin{figure}[!t]
 428 %\centering
 429 %\includegraphics[width=10cm]{Chapters/chapter14/figures/length_scale}
 430 %\caption{The dependence of the characteristic length scale of a binary
 431 %  fluid mixture on the simulation timestep number. Superimposed are
 432 %  three snapshots showing increased mixture separation with time:
 433 %  these are obtained through the order parameter and each show one
 434 %  eighth of the total simulation size, which is $1548\times 1764\times 1568$. }\label{ch14:fig:length_scale}
 435 %\end{figure}
 436
 437 \begin{figure}[!t]
 438 \centering
 439 \includegraphics[width=10cm]{Chapters/chapter14/figures/two_graphs_ud}
 440 \caption[The weak (top) and strong (bottom) scaling of \textit{Ludwig}.]{The weak (top) and strong (bottom) scaling of \textit{Ludwig}.  Closed
 441   shapes denote results using the CPU version run on the Cray
 442   XE6 (using two 16-core AMD Interlagos CPUs per node), while open
 443   shapes denote results using the GPU version on the Cray XK6 (using a
 444   single NVIDIA X2090 GPU per node). Top: the benchmark time is shown
 445   where the problem size per node is constant. Bottom: performance is
 446   shown where, for each of the four problem sizes presented, the
 447   results are scaled by the lattice volume, and all results are
 448   normalized by the same arbitrary constant.  }
 449 \label{ch14:fig:scaling}
 450 \end{figure}
 451
 452
 453 %The top of Figure \ref{ch14:fig:scaling} shows the benchmark time,
 454 %where the problem size is increased with the number of nodes. This
 455 %shows weak scaling: the time taken by a perfectly scaling code would
 456 %remain constant. The figure plots the time against the number of {\it
 457 %  nodes}: for the GPU version we utilize the one GPU per Cray XK6 node
 458 %whilst for the CPU version we fully utilize two 16-core CPUs
 459 %per Cray XE6 node, i.e. one GPU is compared with 32 CPU cores.  We use
 460 %a single MPI task per Cray XK6 node to host the GPU, and 32 MPI tasks
 461 %to fully utilize the CPUs in the Cray XE6 node.  Note that our CPU
 462 %version is highly optimised: in particular it has been tuned for
 463 %optimal utilisation of the SIMD vector units in each CPU core.
 464
 465 Figure \ref{ch14:fig:scaling} shows the results of both weak and
 466 strong scaling tests (top and bottom panels, respectively). For
 467 weak scaling, where the local sub-domain size is kept fixed
 468 (here a relatively large 196$^3$ lattice), the time taken by
 469 an ideal code would remain constant. It can be seen that while
 470 scaling of the CPU version shows a pronounced upward slope, the
 471 GPU version scales almost perfectly. The advantage of the GPU
 472 version over the CPU version is the fact that the GPU version
 473 has a factor of 32 fewer MPI tasks per node, so communications
 474 require a significantly smaller number of larger data transfers.
 475 The performance
 476 advantage of the GPU version ranges from a factor of around 1.5 to
 477 around 1.8. Careful study by other authors \cite{williams2011} have
 478 found the absolute performance (in terms of floating point
 479 operations per second, or floating point operations per Watt) to
 480 be remarkably similar between architectures.
 481
 482 Perhaps of more interest is the strong scaling picture (lower
 483 panel in Figure \ref{ch14:fig:scaling}) where the performance
 484 as a function of the number of nodes is measured for fixed problem
 485 size. We consider four different fixed problem sizes on both CPU
 486 (up to 512 nodes shown) and GPU (up to 768 nodes). To allow comparison,
 487 the results are scaled by total system size in each case. For strong
 488 scaling, the disparity in the number of MPI tasks is clearly revealed
 489 in the failure of the CPU version to provide any significant benefit
 490 beyond a modest number of nodes as communication overheads dominate.
 491 In contrast, the GPU version shows
 492 reasonably robust scaling even for the smaller system sizes and good
 493 scaling for the larger systems.
 494
 495 \section{Moving Solid Particles}\label{ch14:sec:particles}
 496
 497 \begin{figure}[t]
 498 \centering
 499 %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl}
 500 \includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new}
 501 \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{
 502 A two-dimensional schematic picture of spherical particles on the lattice.
 503 Left: a particle is allowed
 504 to move continuously across the lattice, and the position of the
 505 surface defines fluid lattice sites (light blue) and solid lattice
 506 sites (dark red). The discrete surface is defined by links
 507 where propagation would intersect the surface (arrows). Note the
 508 discrete shape of the two particles is different. Right: post-collision
 509 distributions are reversed at the surface by the process of bounce-back
 510 on links, which replaces the propagation.
 511 %A 2D illustration of colloidal particles (grey circles)
 512 %  interacting with the lattice, where each lattice point is
 513 %  represented by a cross. Left: each particle spans multiple sites.
 514 %  Right: the particle-facing distribution components are moved inside
 515 %  the particle with the direction reversed, such that the ``bounce
 516 %  back'' will be completed when the fluid propagates.
 517 }
 518 \label{ch14:fig:bbl}
 519 \end{figure}
 520
 521
 522 %\begin{figure}[t]
 523 %\centering
 524 %\includegraphics[width=6cm]{Chapters/chapter14/figures/particlemove}
 525 %\caption{A 2D illustration of a colloidal particle moving across the
 526 %    lattice. The old and new positions are represented by the open and
 527 %    filled circles respectively.}
 528 %\label{ch14:fig:particlemove}
 529 %\end{figure}
 530
 531 % BEING NEW SOLID TEXT
 532
 533 The introduction of moving solid particles poses an additional hurdle
 534 to efficient GPU implementation of an LB code such as \textit{Ludwig}.
 535 In this section, we give a brief overview of the essential features
 536 of the CPU implementation, and how the considerations raised in the
 537 previous sections --- maximising parallelism and minimising both
 538 host to device and GPU to GPU data movement --- shape the design decisions
 539 for a GPU implementation. We restrict our discussion to a simple fluid
 540 in this section; additional solid-fluid boundary conditions (e.g.,
 541 wetting at a fluid-fluid-solid contact line) usually arise elsewhere
 542 in the calculation and broadly independent of the hydrodynamic boundary
 543 conditions which are described in what follows.
 544
 545 Moving solid particles (here, spheres) are defined by a centre position
 546 which is allowed to move continuously across the space of the lattice,
 547 and a fixed radius which is typically a few lattice spacings $\Delta r$.
 548 The surface of the particle is defined by a series of \textit{links}
 549 where a discrete velocity propagation $\mathbf{c}_i \Delta t$ would
 550 intercept or cut the spherical shell (see Fig.~\ref{ch14:fig:bbl}).
 551 Hydrodynamic boundary conditions are then implemented via the standard
 552 approach of bounce-back on links \cite{ladd1994, nguyen2002}, where the
 553 relevant post-collision distribution values are reversed at the
 554 propagation stage with an appropriate correction to allow for the
 555 solid body motion. The exchange of momentum at each link must then be
 556 accumulated around the entire particle surface to provide the net
 557 hydrodynamic force
 558 and torque on the sphere. The particle motion can then be updated in a
 559 molecular dynamics-like step.
 560
 561 For the CPU implementation, a number of additional MPI communications
 562 are required: (1) to exchange the centre position, radius, etc of each
 563 particle as it moves and (2), to allow the accumulation of the force
 564 and torque for each particle (the links of which may be distributed
 565 between up to 8 MPI tasks in three dimensions). Appropriate marshaling
 566 of these data can provide an effective parallelisation for relatively
 567 dense particle suspensions of up to  40\% solid volume fraction
 568 \cite{stratford2008}. As a final consideration, fluid distributions
 569 must be removed and replaced consistently as a given particle moves
 570 across the lattice and changes its discrete shape.
 571
 572 With these features in mind, we can identify a number of competing
 573 concerns which are relevant to a GPU implementation:
 574 \begin{enumerate}
 575 \item
 576 Minimisation of host-device data transfer would argue for moving the
 577 entire particle code to the GPU. However, the code in question involves
 578 largely conditional logic (e.g., identifying cut surface links) and
 579 irregular memory accesses (e.g., access to distribution elements around
 580 a spherical particle). These operations seem poorly suited to effective
 581 parallelisation on the GPU. As an additional complication, the sums
 582 required over the particle surface would involve potentially tricky
 583 and inefficient reductions in GPU memory.
 584 \item
 585 The alternative is to retain the relevant code on the CPU. While the
 586 transfer of the entire distribution $f_i(\mathbf{r};t)$ between host
 587 and device at each time step is unconscionable owing to PCIe bus bandwidth
 588 considerations, the transfer of only relevant distribution information
 589 to allow bounce-back on links is possible. At modest solid volumes
 590 only a very small fraction of the distribution data is
 591 involved in bounce-back (e.g., a 2\% solid volume fraction of particles
 592 radius 2.3$\Delta r$ would involve approximately 1\% of the distribution
 593 data). This option also has the advantage that no further host-device
 594 data transfers are necessary to allow the MPI exchanges required for
 595 particle information.
 596 \end{enumerate}
 597
 598
 599 We have implemented the second option as follows. For each sub-domain,
 600 a list of boundary-cutting links is assembled on the CPU which includes
 601 the identity of the relevant element of the distribution. This list,
 602 together with the particle information required to compute the correct
 603 bounce-back term, are transferred to the GPU. The updates to the relevant
 604 elements of the distribution can then take place on the GPU. The
 605 corresponding information to compute the update of the particle dynamics
 606 is returned
 607 to the CPU, where the reduction over the surface links is computed.
 608 The change of particle shape may be dealt with in a similar manner:
 609 the relatively small number of updates required at any one time step
 610 (or however frequently the particle position is updated) can be
 611 marshaled to the GPU as necessary. This preliminary implementation
 612 is found to be effective on problems involving up to 4\%
 613 solid volume fraction.
 614
 615 We note that the CPU version actually avoids the collision calculation
 616 at solid lattice points by consulting a look-up table of solid/fluid
 617 status. On the GPU, it is perhaps preferable to perform the collision
 618 stage everywhere instead of moving the look-up table to the GPU and
 619 introducing the associated logic.
 620 Ultimately, the GPU might favour other boundary methods which treat solid and
 621 fluid on a somewhat more equal basis, for example, the immersed boundary
 622 method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method
 623 \cite{ch14:spm}.
 624 However, the approach adopted here  allows us to exploit
 625 the GPU for the intensive fluid simulation whilst maintaining the complex
 626 code required for particles on the CPU. Overheads of CPU-GPU transfer are
 627 minimised by transferring only those data relevant to the hydrodynamic
 628 interaction implemented via bounce-back on links.
 629
 630 % END NEW SOLID TEXT
 631
 632 %It is of interest, in many situations, to include colloidal particles
 633 %in the simulation **more from Kevin***. These are by represented by
 634 %moving spheres that interact with the fluid through transfer of
 635 %momenta: fluid coming into contact with a particle will bounce off it,
 636 %giving the particle a kick in the process. As illustrated on the left
 637 %of Figure \ref{ch14:fig:bbl} (which shows 2 dimensions only for
 638 %clarity - in practice, the particles will be represented by spheres on
 639 %a 3D lattice, and will be much larger relative to the lattice
 640 %spacing), typically each particle spans multiple lattice sites, and
 641 %therefore ``intercepts'' multiple inter-site links. Ludwig stores
 642 %particle information using nested linked lists: a list of particles,
 643 %each with a list of all the links intercepted by that particle (plus
 644 %other quantities). The fluid-particle interaction proceeds using the
 645 %so-called {\it bounce back on links} procedure during each timestep. As
 646 %illustrated in the right of Figure \ref{ch14:fig:bbl}, for each of the
 647 %intercepting links, those velocity components of the distribution that
 648 %face inwards to each particle are moved from the site
 649 %outside the particle to the site inside the particle, and
 650 %reversed in direction. This is done before the propagation stage,
 651 %which then propagates these same velocity components back out of the
 652 %particles: the velocity bounces back from the particles. The
 653 %momenta transfer from fluid to particles is also determined
 654 %from the distribution values.
 655
 656 %A simplified schematic of the implementation on the CPU is as follows,
 657 %noting that multiplicative coefficients are required:
 658 %{\footnotesize
 659 %\begin{verbatim}
 660 %For each particle:
 661 %  For each link:
 662 %    Calculate coefficient
 663 %    Bounce back fluid using coefficient
 664 %    Update particle momentum
 665 %\end{verbatim}
 666 %}
 667 %This process is relatively inexpensive on the CPU, since the total
 668 %number of particles (and intercepted links) is always relatively
 669 %small, but it does require access to distribution data which we want
 670 %to keep resident on the GPU throughout the simulation. The overhead of
 671 %transferring the full distribution to the CPU and back every timestep
 672 %would be too high. But on the other hand, the full codebase required
 673 %to manage particles is quite complex and comprehensive: completely
 674 %porting it to the GPU, such that the particles effectively also remain
 675 %resident on the GPU, would be a large effort and would cause a major
 676 %diversion of CPU and GPU codebases. Further complications would arise
 677 %when running on multiple GPUs due to the fact that particles can move
 678 %between parallel subdomains. Our solution, which minimises overhead,
 679 %is to keep the particles resident on CPU, but offload only their
 680 %interaction with the fluid to the GPU, as we describe below.
 681
 682 %On the CPU, the code in the above listing can be restructured into
 683 %three distinct stages as (introducing arrays to temporrily store
 684 %indices and data):
 685 %{\footnotesize
 686 %\begin{verbatim}
 687 %1) For each particle, for each link:
 688 %     Store site and velocity indices associated with link
 689 %     Calculate and store coefficient
 690 %2) For each particle, for each link:
 691 %     Update velocity data using stored values
 692 %     Calculate and store particle momenta updates
 693 %3) For each particle, for each link:
 694 %     Update particle momentum using stored values
 695 %\end{verbatim}
 696 %}
 697 %Stage 2 is only one that that accesses fluid data: this can therefore
 698 %be moved to the GPU, while stages 1 and 3 remain on the CPU. Stage 2
 699 %therefore becomes, for the GPU implementation
 700 %{\footnotesize
 701 %\begin{verbatim}
 702 %Transfer coefficient and link index arrays to GPU
 703 %Run CUDA kernel:
 704 %  Update fluid data using coefficient and link index arrays
 705 %  Calculate and store particle momenta updates from fluid data
 706 %Transfer particle momenta updates array back to CPU
 707 %\end{verbatim}
 708 %}
 709 %Since the total number of links involves is relatively small, these
 710 %partial data transfers have minimal overhead.  A complication arises
 711 %due to the fact that the particles move through the lattice, albeit
 712 %relatively slowly, and therefore the list of links that they intercept
 713 %occasionally changes.  When a
 714 %particle moves off a lattice site, the fluid on that site must be
 715 %reconstructed. Similarly, the fluid on the newly obstructed site must
 716 %be removed. For the same reasons given above, we want to avoid moving
 717 %all the code dealing with this complication to the GPU, whilst
 718 %minimising overheads. Since, on any one timestep, the number of
 719 %lattice sites affected is relatively low, we implement functionality
 720 %to perform partial data transfers of the distribution. Only those
 721 %sites affected are packed into a buffer on the GPU, transferred to the
 722 %CPU and unpacked into the CPU version of the distribution. Then, the
 723 %existing code can be used to update those affected sites, before the
 724 %reverse partial data transfer is performed to update the GPU version
 725 %of data.
 726
 727 It is perhaps interesting at this point to make some more general
 728 observations on the software engineering challenge presented when
 729 extending an existing CPU code to the GPU. The already complex
 730 task of maintaining the code in a portable fashion while also
 731 maintaining performance is currently formidable. To help this process,
 732 we have followed a number of basic principles. First,
 733 in order to port to the GPU in an incremental fashion,
 734 we have tried to maintain the modular structure of the CPU where
 735 possible. For each data structure, such as the distribution, a separate
 736 analogue is maintained in both the CPU and GPU memory spaces. However,
 737 the GPU copy does not include the complete CPU structure: in
 738 particular, non-intrinsic datatypes such as MPI datatypes are not
 739 required on the GPU. Functions to marshal data between CPU and GPU
 740 are provided for each data structure, abstracting the underlying
 741 CUDA implementation. (This reasonably lightweight abstraction layer
 742 could also allow an OpenCL implementation to be developed.)
 743 This makes it easy to switch between the CPU and GPU for different
 744 components in the code, which is useful in development
 745 and testing. GPU functionality can be added incrementally while
 746 retaining a code that runs correctly (albeit slowly due to data
 747 transfer overheads).  Once all relevant components are moved to
 748 the GPU, it becomes possible to remove such data transfers and
 749 keep the entire problem resident on the device.
 750
 751 %To ensure that \textit{Ludwig} remains suitable for CPU-based systems,
 752 %we augment rather than modify the original code. All new GPU-related
 753 %functionality are implemented in an additive manner, mirroring the
 754 %modular structure of the original code. The new GPU modules have
 755 %interfaces similar to the original CPU functions, and the ability to
 756 %switch between versions is achieved at compile-time via C
 757 %pre-processing directives.  All GPU compute kernels are implemented using
 758 %CUDA.  Wrapper routines are developed to specify the
 759 %decomposition, invoke the CUDA kernels and perform the necessary data
 760 %management operations: the latter are facilitated through newly
 761 %developed modules for creation and destruction of device memory
 762 %structures, the copying of data between host and device, and where
 763 %necessary the packing and buffering of data.
 764
 765
 766
 767
 768 \section{Summary}
 769 \label{ch14:sec:summary}
 770
 771 We have described the steps take to implement, on both single and
 772 multiple GPUs, the \textit{Ludwig} code which was originally
 773 designed for complex fluid problems on a conventional CPU
 774 architecture. We have added the necessary functionality using
 775 NVIDIA CUDA and discussed the important changes to the main
 776 data structures. By retaining domain decomposition and message
 777 passing via MPI, we have demonstrated it is possible to scale
 778 complex fluid problems to large numbers of GPUs in parallel.
 779 By following the key design criteria of maximising parallelism
 780 and minimising host-device data transfers, we have confirmed
 781 the mixed MPI-GPU approach is viable for scientific applications.
 782 For the more intricate problem of moving solid particles, we find
 783 it is possible to retain the more serial elements related to
 784 particle link operations on the CPU, while offloading only the
 785 parallel lattice-based operations involving the LB distribution
 786 to the GPU. Again, this minimises host-device movement of data.
 787
 788 From the software engineering viewpoint, some duplication of code
 789 to allow efficient implementation on both host and device is
 790 currently required. This issue might be addressed by approaches
 791 such as automatic kernel generation, but may also be addressed
 792 naturally in time as GPU and CPU hardware converge. Portable
 793 abstractions and APIs, perhaps based on approaches such as OpenCL,
 794 will also facilitate the development and maintenance of portable
 795 codes which also exhibit portable performance (perhaps in conjunction
 796 with automatic tuning approaches).
 797
 798 So, while the challenges in designing portable and efficient scientific
 799 applications remain very real, this work provides some hope the large
 800 clusters of GPU machines can be used effectively for a wide range of
 801 complex fluid problems.
 802
 803
 804 %The \textit{Ludwig} LB application, which specializes
 805 %in complex fluid problems, has undergone the developments required to
 806 %use large-scale parallel GPU accelerated supercomputers effectively.
 807 %The new functionality augments the original code, so that
 808 %\textit{Ludwig} can run on either CPU-based or GPU-accelerated systems.
 809 %Using the NVIDIA CUDA programming language, all the computational
 810 %kernels were offloaded to the GPU. This included not only those
 811 %responsible for the majority of the compute time, but any that
 812 %accessed the relevant data structures. This allows data to be kept
 813 %resident on the GPU and avoids expensive data transfers. We described
 814 %steps taken to optimise the performance on the GPU architecture by
 815 %reducing off-chip memory accesses and restructuring the data layout
 816 %in memory to ensure the available memory bandwidth was exploited fully.
 817
 818 %A new halo-exchange communication phase for the code was developed to
 819 %allow efficient parallel scaling to many GPUs. The CPU version
 820 %relies on MPI datatype functionality for CPU to CPU communication; we
 821 %have explicitly written buffer packing/unpacking and transfer
 822 %operations to for GPU to GPU data exchange, staged through the host
 823 %CPUs. This includes the filtering of velocity components to only
 824 %transfer those required in a specific direction, hence reducing the
 825 %overall volume of data transferred. We used CUDA stream functionality
 826 %to overlap separate stages within the communication phase, and reduce
 827 %the overall communication time. The effect was measured to be more
 828 %profound for smaller local system sizes, i.e. it is especially useful
 829 %in aiding strong scaling.
 830
 831 %We described work to enable colloidal particles in the simulation with
 832 %minimal overhead. We implemented these in such a way that we avoid a
 833 %major diversion of the CPU and GPU codebases, whilst minimising data
 834 %transfers between the CPU and GPU at each timestep. We keep the
 835 %majority of the (relatively inexpensive) particle related code on the
 836 %CPU, while offloading only those parts responsible for the interaction
 837 %with the fluid to the GPU (where the fluid data resides).
 838
 839 %Our resulting package is able to exploit the largest of GPU
 840 %accelerated architectures to address a wide range of complex problems.
 841
 842
 843
 844 % An example of a floating figure using the graphicx package.
 845 % Note that \label must occur AFTER (or within) \caption.
 846 % For figures, \caption should occur after the \includegraphics.
 847 % Note that IEEEtran v1.7 and later has special internal code that
 848 % is designed to preserve the operation of \label within \caption
 849 % even when the captionsoff option is in effect. However, because
 850 % of issues like this, it may be the safest practice to put all your
 851 % \label just after \caption rather than within \caption{}.
 852 %
 853 % Reminder: the "draftcls" or "draftclsnofoot", not "draft", class
 854 % option should be used if it is desired that the figures are to be
 855 % displayed while in draft mode.
 856 %
 857 %\begin{figure}[!t]
 858 %\centering
 859 %\includegraphics[width=2.5in]{myfigure}
 860 % where an .eps filename suffix will be assumed under latex,
 861 % and a .pdf suffix will be assumed for pdflatex; or what has been declared
 862 % via \DeclareGraphicsExtensions.
 863 %\caption{Simulation Results}
 864 %\label{fig_sim}
 865 %\end{figure}
 866
 867 % Note that IEEE typically puts floats only at the top, even when this
 868 % results in a large percentage of a column being occupied by floats.
 869
 870
 871 % An example of a double column floating figure using two subfigures.
 872 % (The subfig.sty package must be loaded for this to work.)
 873 % The subfigure \label commands are set within each subfloat command, the
 874 % \label for the overall figure must come after \caption.
 875 % \hfil must be used as a separator to get equal spacing.
 876 % The subfigure.sty package works much the same way, except \subfigure is
 877 % used instead of \subfloat.
 878 %
 879 %\begin{figure*}[!t]
 880 %\centerline{\subfloat[Case I]\includegraphics[width=2.5in]{subfigcase1}%
 881 %\label{fig_first_case}}
 882 %\hfil
 883 %\subfloat[Case II]{\includegraphics[width=2.5in]{subfigcase2}%
 884 %\label{fig_second_case}}}
 885 %\caption{Simulation results}
 886 %\label{fig_sim}
 887 %\end{figure*}
 888 %
 889 % Note that often IEEE papers with subfigures do not employ subfigure
 890 % captions (using the optional argument to \subfloat), but instead will
 891 % reference/describe all of them (a), (b), etc., within the main caption.
 892
 893
 894 % An example of a floating table. Note that, for IEEE style tables, the
 895 % \caption command should come BEFORE the table. Table text will default to
 896 % \footnotesize as IEEE normally uses this smaller font for tables.
 897 % The \label must come after \caption as always.
 898 %
 899 %\begin{table}[!t]
 900 %% increase table row spacing, adjust to taste
 901 %\renewcommand{\arraystretch}{1.3}
 902 % if using array.sty, it might be a good idea to tweak the value of
 903 % \extrarowheight as needed to properly center the text within the cells
 904 %\caption{An Example of a Table}
 905 %\label{table_example}
 906 %\centering
 907 %% Some packages, such as MDW tools, offer better commands for making tables
 908 %% than the plain LaTeX2e tabular which is used here.
 909 %\begin{tabular}{|c||c|}
 910 %\hline
 911 %One & Two\\
 912 %\hline
 913 %Three & Four\\
 914 %\hline
 915 %\end{tabular}
 916 %\end{table}
 917
 918
 919 % Note that IEEE does not put floats in the very first column - or typically
 920 % anywhere on the first page for that matter. Also, in-text middle ("here")
 921 % positioning is not used. Most IEEE journals/conferences use top floats
 922 % exclusively. Note that, LaTeX2e, unlike IEEE journals/conferences, places
 923 % footnotes above bottom floats. This can be corrected via the \fnbelowfloat
 924 % command of the stfloats package.
 925
 926
 927 % conference papers do not normally have an appendix
 928
 929
 930 % use section* for acknowledgement
 931 \section*{Acknowledgments}
 932 AG was supported by the CRESTA project, which has received
 933 funding from the European Community's Seventh Framework Programme
 934 (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported
 935 by UK EPSRC under grant EV/J007404/1.
 936
 937
 938 % trigger a \newpage just before the given reference
 939 % number - used to balance the columns on the last page
 940 % adjust value as needed - may need to be readjusted if
 941 % the document is modified later
 942 %\IEEEtriggeratref{8}
 943 % The "triggered" command can be changed if desired:
 944 %\IEEEtriggercmd{\enlargethispage{-5in}}
 945
 946 % references section
 947
 948 % can use a bibliography generated by BibTeX as a .bbl file
 949 % BibTeX documentation can be easily obtained at:
 950 % http://www.ctan.org/tex-archive/biblio/bibtex/contrib/doc/
 951 % The IEEEtran BibTeX style support page is at:
 952 % http://www.michaelshell.org/tex/ieeetran/bibtex/
 953 %\bibliographystyle{IEEEtran}
 954 % argument is your BibTeX string definitions and bibliography database(s)
 955 %\bibliography{IEEEabrv,../bib/paper}
 956 %
 957 % <OR> manually copy in the resultant .bbl file
 958 % set second argument of \begin to the number of references
 959 % (used to reserve space for the reference number labels box)
 960
 961 \putbib[Chapters/chapter14/biblio14]
 962 %\begin{thebibliography}{1}
 963
 964 %\bibitem{IEEEhowto:kopka}
 965 %H.~Kopka and P.~W. Daly, \emph{A Guide to \LaTeX}, 3rd~ed.\hskip 1em plus
 966 %  0.5em minus 0.4em\relax Harlow, England: Addison-Wesley, 1999.
 967
 968
 969
 970 %% \bibitem{succi-book}
 971 %% S. Succi, \textit{The lattice Boltzmann equation and beyond},
 972 %% Oxford University Press, Oxford, 2001.
 973
 974 %\bibitem{mpi-standard}
 975 %Message Passing Interface Forum, http://www.mpi-forum.org
 976
 977 %% \bibitem{desplat}
 978 %% Desplat, J.-C., I. Pagonabarraga, and P. Bladon,
 979 %% \textit{LUDWIG: A parallel lattice-Boltzmann code for complex fluids}.
 980 %% Comput. Phys. Comms., \textbf{134}, 273, 2001.
 981
 982 %% \bibitem{aidun2010}
 983 %% C.K. Aidun and J.R. Clausen,
 984 %% \textit{Lattice Boltzmann method for complex flows},
 985 %% Ann. Rev. Fluid Mech., \textbf{42} 439--472 (2010).
 986
 987 %\bibitem{bray1994}
 988 %A.J. Bray,
 989 %\textit{Theory of phase-ordering kinetics},
 990 %Adv. Phys., \textbf{43} 357--459 (1994).
 991
 992 %\bibitem{kendon2001}
 993 %V.M. Kendon, M.E. Cates, I. Pangonabarraga, J.-C. Desplat, and P. Bladon,
 994 %\textit{Inertial effects in the three-dimensional spinodal decomposition of a
 995 %symmetric binary fluid mixture; a lattice Boltzmann study},
 996 %J. Fluid Mech., \textbf{440}, 147--203 (2001).
 997
 998 %\bibitem{swift1996}
 999 %M.R. Swift, E. Orlandini, W.R. Osborn, and J.M. Yeomans,
1000 %\textit{Lattice Boltzmann simulations of liquid-gas and binary
1001 %fluid systems},
1002 %Phys. Rev. E, \textbf{54}, 5041--5052 (1996).
1003
1004 %% \bibitem{stratford2008}
1005 %% K. Stratford and I. Pagonabarraga,
1006 %% Parallel domain decomposition for lattice Boltzmann with moving particles,
1007 %% \textit{Comput. Math. with Applications} \textbf{55}, 1585 (2008).
1008
1009 %\bibitem{xe6}
1010 %Cray XE6 Product Brochure, available from
1011 %http://www.cray.com/Products/XE/Resources.aspx (2010)
1012
1013 %\bibitem{hector} HECToR home page, http://www.hector.ac.uk
1014
1015 %\bibitem{xk6}
1016 %Cray XK6 Product Brochure, available from
1017 %http://www.cray.com/Products/XK6/XK6.aspx (2011)
1018
1019
1020 %% \bibitem{wei2004}
1021 %% X. Wei, W. Li, K. M\"uller, and A.E. Kaufman,
1022 %% \textit{The lattice Boltzmann method for simulating gaseous phenomena},
1023 %% IEEE Transactions on Visualization and Computer Graphics,
1024 %% \textbf{10}, 164--176 (2004).
1025 %% Apparently first LB via GPU; a serious contribution using single fluid
1026 %%d3q19 single relaxation time.
1027
1028 %% \bibitem{zhu2006}
1029 %% H. Zhu, X. Liu, Y. Liu, and E. Wu,
1030 %% \textit{Simulation of miscible binary mixtures based on lattice Boltzmann method},
1031 %% Comp. Anim. Virtual Worlds, \textbf{17}, 403--410 (2006).
1032 %% Single relaxation (MRT mentioned) time d3q19 apparently sound although
1033 %%not a lot of detail. Pre-cuda so graphics code.
1034
1035 %% \bibitem{zhao2007}
1036 %% Y. Zhao,
1037 %% \textit{Lattice Boltzmann based PDE solver on the GPU},
1038 %% Visual Comput., doi 10.1007/s00371-0070191-y (2007).
1039
1040 %% \bibitem{toelke2010}
1041 %% J. T\"olke,
1042 %% Implementation of a lattice Boltzmann kernel using the compute unified
1043 %% device architecture developed by nVIDIA,
1044 %% Comput. Visual Sci. 13 29--39 (2010).
1045
1046 %% \bibitem{fan2004}
1047 %% Z. Fan, F. Qiu, A. Kaufman, and S. Yoakum-Stover,
1048 %% \textit{GPU cluster for high performance computing},
1049 %% Proceedings of ACM/IEEE Supercomputing Conference, pp. 47--59,
1050 %% IEEE Computer Society Press, Pittsburgh, PA (2004).
1051
1052 %% \bibitem{myre2011}
1053 %% J. Myre, S.D.C. Walsh, D. Lilja, and M.O. Saar,
1054 %% \textit{Performance analysis of single-phase, multiphase, and multicomponent
1055 %% lattice Boltzmann fluid flow simulations on GPU clusters},
1056 %% Concurrency Computat.: Pract. Exper., \textbf{23}, 332--350 (2011).
1057
1058 %% \bibitem{obrecht2011}
1059 %% C. Obrecht, F. Kuznik, B. Tourancheau, and J.-J. Roux,
1060 %% \textit{Multi-GPU implementation of the lattice Boltzmann method},
1061 %% Comput. Math. with Applications,
1062 %% doi:10.1016/j.camwa.2011.02.020 (2011).
1063
1064 %% \bibitem{bernaschi2010}
1065 %% M. Bernaschi, M. Fatica, S. Melchionna, S. Succi, and E. Kaxiras,
1066 %% \textit{A flexible high-performance lattice Boltzmann GPU code for the
1067 %% simulations of fluid flow in complex geometries},
1068 %% Concurrency Computat.: Pract. Exper., \textbf{22}, 1--14 (2010).
1069
1070 %% \bibitem{xian2011}
1071 %% W. Xian and A. Takayuki,
1072 %% \textit{Multi-GPU performance of incompressible flow computation by
1073 %% lattice Boltzmann method on GPU cluster},
1074 %% Parallel Comput., doi:10.1016/j.parco.2011.02.007 (2011).
1075
1076 %% \bibitem{feichtinger2011}
1077 %% C. Feichtinger, J. Habich, H. K\"ostler, G. Hager, U. R\"ude, and
1078 %% G. Wellein,
1079 %% A flexible patch-based lattice Boltzmann parallelization approach
1080 %% for heterogeneous GPU-CPU clusters,
1081 %% \textit{Parallel Computing} \textbf{37} 536--549 (2011).
1082
1083 %% \bibitem{wellein2006}
1084 %% G. Wellein, T. Zeiser, G Hager, and S. Donath,
1085 %% On the single processor performance of simple lattice Boltzmann kernels,
1086 %% \textit{Computers and Fluids}, \textbf{35}, 910--919 (2006).
1087
1088 %% \bibitem{pohl2003}
1089 %% T. Pohl, M. Kowarschik, J. Wilke, K. Igelberger, and U. R\"ude,
1090 %% Optimization and profiling of the cache performance of parallel
1091 %% lattice Boltzmann code,
1092 %% \textit{Parallel Process Lett.} \textit{13} 549--560 (2003).
1093
1094 %% \bibitem{mattila2007}
1095 %% K. Mattila, J. Hyv\"aluoma, T. Rossi, M. Aspn\"as and J. Westerholm,
1096 %% An efficient swap algorithm for the lattice Boltzmann method,
1097 %% \textit{Comput. Phys. Comms.} \textit{176} 200-210 (2007).
1098
1099 %% \bibitem{wittmann2012}
1100 %% M. Wittmann, T. Zeiser, G. Hager, and G. Wellein,
1101 %% Comparison of different propagation steps for lattice Boltzmann methods,
1102 %% \textit{Comput. Math with Appl.} doi:10.1016/j.camwa.2012.05.002 (2012).
1103
1104 %% \bibitem{walshsaar2012}
1105 %% S.D.C. Walsh and M.O. Saar,
1106 %% Developing extensible lattice Boltzmann simulators for general-purpose
1107 %% graphics-processing units,
1108 %% \textit{Comm. Comput. Phys.}, \textbf{13} 867--879 (2013).
1109
1110
1111 %% \bibitem{williams2011}
1112 %% S. Williams, L. Oliker, J. Carter, and J Shalf,
1113 %% Extracting ultra-scale lattice Boltzmann performance via
1114 %% hierarchical and distributed auto-tuning,
1115 %% \textit{Proc. SC2011}.
1116
1117
1118 %% \bibitem{ch14:stratford-jsp2005}
1119 %% K. Stratford, R. Adhikari, I. Pagonabarraga, and J.-C. Desplat,
1120 %% \textit{Lattice Boltzmann for Binary Fluids with Suspended Colloids},
1121 %% J. Stat. Phys. \textbf{121}, 163 (2005).
1122
1123 %% \bibitem{ladd1994}
1124 %% A.J.C. Ladd,
1125 %% Numerical simulations of particle suspensions via a discretized
1126 %% Boltzmann equation. Part 1. Theoretical foundation,
1127 %% \textit{J. Fluid Mech.} \textbf{271} 285--309 (1994);
1128 %% Part II. Numerical results,
1129 %% \textit{ibid.} \textbf{271} 311--339 (1994).
1130
1131
1132 %% \bibitem{nguyen2002}
1133 %% N.-Q. Nguyen and A.J.C. Ladd,
1134 %% Lubrication corrections for lattice Boltzmann simulations of particle
1135 %% suspensions,
1136 %% \textit{Phys. Rev. E} \textbf{66} 046708 (2002).
1137
1138 %% \bibitem{ch14:immersed}
1139 %% C.S. Peskin,
1140 %% Flow patterns around heart valves; a numerical method,
1141 %% \textit{J. Comp. Phys.}, \textbf{10}, 252--271 (1972);
1142 %% C.S. Peskin,
1143 %% The immersed boundary method,
1144 %% \textit{Acta Nummerica} \textbf{11} 479--517 (2002).
1145
1146 %% \bibitem{ch14:immersed-lb}
1147 %% Z.-G. Feng and E.E. Michaelides,
1148 %% The immersed boundary-lattice Boltzmann method for solving
1149 %% fluid-particles interaction problem,
1150 %% \textit{J. Comp. Phys.}, \textbf{195} 602--628 (2004).
1151
1152 %% \bibitem{ch14:spm}
1153 %% Y. Nakayama and R. Yammamoto,
1154 %% Simulation method to resolve hydrodynamic interactions in colloidal
1155 %% dispersions,
1156 %% \textit{Phys. Rev. E}, \textbf{71} 036707 (2005).
1157
1158
1159 %\end{thebibliography}
1160
1161
1162
1163
1164
1165
1166
1167 % LocalWords:  CRESTA EPSRC Succi Desplat Pagonabarraga