X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/ac699bb490942cea02be7b088a0cfa4e0dcdd105..cdcc8368664a64aed4169db27a06df9d0c9d6527:/BookGPU/Chapters/chapter14/ch14.tex diff --git a/BookGPU/Chapters/chapter14/ch14.tex b/BookGPU/Chapters/chapter14/ch14.tex index bf3cd66..2ce9331 100755 --- a/BookGPU/Chapters/chapter14/ch14.tex +++ b/BookGPU/Chapters/chapter14/ch14.tex @@ -1,4 +1,4 @@ -\chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh} +\chapterauthor{Alan Gray and Kevin Stratford}{EPCC, The University of Edinburgh, United Kingdom} \chapter[Ludwig: multiple GPUs for a fluid lattice Boltzmann application]{Ludwig: multiple GPUs for a complex fluid lattice Boltzmann @@ -10,17 +10,17 @@ application} \section{Introduction} The lattice Boltzmann (LB) method \index{Lattice Boltzmann method} (for an overview see, e.g., \cite{succi-book}) has become a popular approach to a variety of fluid -dynamics problems. It provides a way to solve the incompressible, +dynamics problems. It provides a way to solve the incompressible isothermal Navier-Stokes equations and has the attractive features of being both explicit in time and local in space. This makes the LB -method well-suited to parallel computation. Many efficient parallel +method well suited to parallel computation. Many efficient parallel implementations of the LB method have been undertaken, typically using a combination of distributed domain decomposition and the Message Passing Interface (MPI). However, the potential -performance benefits offered by GPUs has motivated a new `mixed-mode' +performance benefits offered by GPUs has motivated a new ``mixed-mode'' approach to address very large problems. Here, fine-grained parallelism is implemented on the GPU, while MPI is reserved for -larger-scale parallelism. This mixed mode is of increasing interest +larger scale parallelism. This mixed mode is of increasing interest to application programmers at a time when many supercomputing services are moving toward clusters of GPU accelerated nodes. The design questions which @@ -61,7 +61,7 @@ an LB application developed specifically for complex fluids (\textit{Ludwig} was named for Boltzmann, 1844--1906). We will present the steps required to allow \textit{Ludwig} to exploit efficiently both a single -GPU, and also many GPUs in parallel. We show that +GPU and many GPUs in parallel. We show that \textit{Ludwig} scales excellently to at least the one thousand GPU level (the largest resource available at the time of writing) with indications that the code will scale to much larger systems as @@ -69,7 +69,7 @@ they become available. In addition, we consider the steps required to represent fully-resolved moving solid particles (usually referred to as colloids in the context of complex fluids). Such particles need to have their surface resolved on the lattice scale in order -to include relevant surface physics, and must be able to move: e.g., +to include relevant surface physics and must be able to move, e.g., to execute Brownian motion in response to random forces from the fluid. Standard methods are available to represent such particles (e.g., \cite{ladd1994,nguyen2002}) which are amenable to effective @@ -82,7 +82,7 @@ In the following section we provide a brief overview of the lattice Boltzmann method, and mention some of the general issues which can influence performance in the CPU context. In Section \ref{ch14:sec:singlegpu}, we then describe the alterations -which are required to exploit the GPU architecture effectively, and +which are required to exploit the GPU architecture effectively and highlight how the resulting code differs from the CPU version. In Section \ref{ch14:sec:parallelgpu}, we extend this description to include the steps required to allow exploitation of many GPUs in @@ -128,8 +128,8 @@ are chosen to capture the correct symmetries of the Navier-Stokes equations. A typical choice, used here, is the so-called D3Q19 basis in three dimensions where there is one velocity such that $\mathbf{c} \Delta t$ is zero, along with six extending to the nearest -neighbour -lattice sites, and twelve extending to the next-nearest neighbour sites +neighbor +lattice sites, and twelve extending to the next-nearest neighbor sites ($\Delta t$ being the discrete time step). The fundamental object in LB is then the distribution function $f_i (\mathbf{r};t)$ whose moments are related to the local hydrodynamic quantities: the fluid @@ -139,16 +139,16 @@ function is described by a discrete Boltzmann equation f_i(\mathbf{r} + \mathbf{c}_i \Delta t; t) - f_i(\mathbf{r}; t) = - {\cal L}_{ij} f_j(\mathbf{r};t). \end{equation} -It is convenient to think of this in two stages. First, the right hand +It is convenient to think of this in two stages. First, the right-hand side represents the action of a collision operator ${\cal L}_{ij}$, which is local to each lattice site and relaxes the distribution toward a local equilibrium at a rate ultimately related to the fluid viscosity. -Second, the left hand side represents a propagation step (sometimes referred +Second, the left-hand side represents a propagation step (sometimes referred to as streaming step), in which each element $i$ of the distribution is displaced $\mathbf{c}_i \Delta t$, i.e., one lattice spacing in the appropriate direction per discrete time step. -More specifically, we store a vector of 19 double precision floating +More specifically, we store a vector of 19 double-precision floating point values at each lattice site for the distribution function $f_i(\mathbf{r};t)$. The collision operation ${\cal L}_{ij}$, which is local at each lattice @@ -157,18 +157,34 @@ ${\cal M}_{ij}f_j$ is used to transform the distributions into the hydrodynamic quantities, where ${\cal M}_{ij}$ is a constant 19x19 matrix related to the choice of $\mathbf{c}_i$. The non-conserved hydrodynamic quantities are then -relaxed toward their (known) equilibrium values, and are transformed +relaxed toward their (known) equilibrium values and are transformed back to new post-collision distributions via the inverse transformation -${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of 2x19$^2$ +${\cal M}^{-1}_{ij}$. This gives rise to the need for a minimum of $2\times 19^2$ floating point multiplications per lattice site. (Additional operations are required to implement, for example, the force $\mathbf{f}(\mathbf{r})$.) In contrast, the propagation stage consists solely of moving the distribution values -between lattice sites, and involves no floating point operations. +between lattice sites and involves no floating point operations. + + +\begin{figure}[tbp] +\centering +\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo} +\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For + clarity we show a 2D decomposition of a 3D lattice, but in practice + we decompose in all 3 dimensions. Halo cells are added to each + subdomain (as shown on the upper right for a single slice) which store + data retrieved from remote neighbors in the halo exchange. Lower + right: the D3Q19 velocity set resident on a lattice site; + highlighted are the 5 ``outgoing'' elements to be transferred in a + specific direction.} +\label{ch14:fig:decomphalo} +\end{figure} + In the CPU version, the \textit{Ludwig} implementation stores one time level of distribution values $f_i(\mathbf{r}; t)$. This distribution -is stored as a flat array of C data type \texttt{double}, and laid out +is stored as a flat array of C data type \texttt{double} and laid out so that all elements of the velocity are contiguous at a given site (often referred to as ``array-of-structures'' order). This is appropriate for sums over the distribution required locally at the @@ -177,24 +193,27 @@ Listing~\ref{ch14:listing1}: the fact that consecutive loads are from consecutive memory addresses allows the prefetcher to engage fully. (The temporary scalar \texttt{a\_tmp} allows caching of the intermediate accumulated value in the innermost loop.) A variety of -standard sequential optimisations are relevant for the collision stage +standard sequential optimizations are relevant for the collision stage (loop unrolling, inclusion of SIMD operations, and so on \cite{wellein2006}). For example, the collision stage in \textit{Ludwig} includes explicit SIMD code, which is useful if the -compiler on a given platform cannot identify it. The propagation -stage is separate, and is organised as a ``pull'' (again, various -optimisation have been considered, e.g., -\cite{pohl2003,mattila2007,wittmann2012}). No further optimisation is +compiler on a given platform cannot identify the appropriate vectorization. +The propagation +stage is separate and is organized as a ``pull'' (again, various +optimizations have been considered, e.g., +\cite{pohl2003,mattila2007,wittmann2012}). No further optimization is done here beyond ensuring that the ordering of the discrete velocities allows memory access to be as efficient as possible. While these -optimisations are important, it should be remembered that for some +optimizations are important, it should be remembered that for some complex fluid problems, the hydrodynamics embodied in the LB calculation is a relatively small part of the total computational cost -(at the 10\%-level in some cases). This means optimisation effort may +(at the 10\% level in some cases). This means optimization effort may be better concentrated elsewhere. + + \begin{lstlisting}[float, label=ch14:listing1, -caption = Collision schematic for CPU.] +caption = collision schematic for CPU.] /* loop over lattice sites */ for (is = 0; is < nsites; is++) { ... @@ -215,44 +234,32 @@ for (is = 0; is < nsites; is++) { \end{lstlisting} -\begin{figure}[!t] -\centering -\includegraphics[width=12cm]{Chapters/chapter14/figures/decomphalo} -\caption[The lattice is decomposed between MPI tasks.]{Left: the lattice is decomposed between MPI tasks. For - clarity we show a 2D decomposition of a 3D lattice, but in practice - we decompose in all 3 dimensions. Halo cells are added to each - sub-domain (as shown on the upper right for a single slice) which store - data retrieved from remote neighbours in the halo exchange. Lower - right: the D3Q19 velocity set resident on a lattice site; - highlighted are the 5 ``outgoing'' elements to be transferred in a - specific direction.} -\label{ch14:fig:decomphalo} -\end{figure} + The regular 3D decomposition is illustrated in Fig.~\ref{ch14:fig:decomphalo}. -Each local sub-domain is surrounded by a halo, or ghost, region of width one +Each local subdomain is surrounded by a halo, or ghost, region of width one lattice site. While the collision is local, elements of the distribution must be exchanged at the edges of the domains to facilitate the propagation. To achieve the full 3D halo exchange, the standard approach of shifting the -relevant data in each co-ordinate direction in turn is adopted. This -requires appropriate synchronisation, i.e., a receive in the the first -co-ordinate direction must be complete before a send in the second direction +relevant data in each coordinate direction in turn is adopted. This +requires appropriate synchronization, i.e., a receive in the the first +coordinate direction must be complete before a send in the second direction involving relevant data can take place, and so on. We note that only ``outgoing'' elements of the distribution need to be sent at each edge. For the D3Q19 model, this reduces the volume of data traffic from 19 to 5 of the $f_i(\mathbf{r};t)$ per lattice site at each edge. In the CPU version, the necessary transfers are implemented in place using -a vector of appropriately strided MPI datatypes for each direction. +a vector of MPI datatypes with appropriate stride for each direction. -\section{Single GPU Implementation}\label{ch14:sec:singlegpu} +\section{Single GPU implementation}\label{ch14:sec:singlegpu} In this section we describe the steps taken to enable \textit{Ludwig} for the GPU. There are a number of crucial issues: first, the -minimisation of data traffic between host and device; second, the -optimal mapping of available parallelism onto the architecture and +minimization of data traffic between host and device; second, the +optimal mapping of available parallelism onto the architecture; and third, the issue of memory coalescing. We discuss each of these in turn. @@ -274,12 +281,12 @@ parallelism inherent in the GPU architecture, particularly for those matrix-vector operations within the collision kernel. The GPU architecture features a hierarchy of parallelism. At the lowest level, groups of 32 threads (warps) operate in -lock-step on different data elements: this is SIMD style vector-level +lock-step on different data elements: this is SIMD-style vector-level parallelism. Multiple warps are combined into a thread block (in which -communication and synchronisation are possible), and multiple blocks can +communication and synchronization are possible), and multiple blocks can run concurrently across the streaming multiprocessors in the GPU (with no communication or -synchronisation possible across blocks). To decompose on the GPU, we +synchronization possible across blocks). To decompose on the GPU, we must choose which part of the collision to assign to which level of parallelism. @@ -290,14 +297,14 @@ to all levels of parallelism, i.e., we use a separate CUDA thread for every lattice site, and each thread performs the full matrix-vector operation. We find that a block size of 256 performs well on current devices: we therefore decompose -the lattice into groups of 256 sites, and assign each group to a block +the lattice into groups of 256 sites and assign each group to a block of CUDA threads. As the matrix ${\cal M}_{ij}$ is constant, it is assigned to the fast {\it constant} on-chip device memory. For the propagation stage, the GPU implementation adds a second time -level of distribution values. The data-dependencies inherent in the +level of distribution values. The data dependencies inherent in the propagation mean that the in-place propagation of the CPU version -cannot be parallelised effectively without the additional time level. +cannot be parallelized effectively without the additional time level. As both time levels may remain resident on the GPU, this is not a significant overhead. @@ -308,7 +315,7 @@ is only achieved when data are structured such that threads within a segment in a single transaction: this is memory coalescing. It can be seen that the array-of-structures ordering used for the distribution in the CPU code would not be suitable for coalescing; in fact, it would -result in serialised memory accesses and relative poor performance. +result in serialized memory accesses and relative poor performance. To meet the coalescing criteria and allow consecutive threads to read consecutive memory addresses on the GPU, we transpose the layout of the distribution so that, for each velocity component, consecutive sites @@ -316,7 +323,7 @@ are contiguous in memory (``structure-of-arrays'' order). A schematic of the GPU collision code is given in Listing~\ref{ch14:listing2}. \begin{lstlisting}[float, label=ch14:listing2, -caption = Collision schematic for GPU.] +caption = collision schematic for GPU.] /* compute current site index 'is' from CUDA thread and block */ /* indices and load distribution from "structure-of-arrays" */ @@ -334,18 +341,18 @@ defer some further comments on software engineering aspects of the code to the summary. -\section{Multiple GPU Implementation}\label{ch14:sec:parallelgpu} +\section{Multiple GPU implementation}\label{ch14:sec:parallelgpu} To execute on multiple GPUs, we use the same domain decomposition and message passing framework as the CPU version. Within each -sub-domain (allocated to one MPI task) the GPU implementation +subdomain (allocated to one MPI task) the GPU implementation proceeds as described in the previous section. The only additional complication is that halo transfers between GPUs must be staged -through the host (in future, direct GPU to GPU data transfers via +through the host (in the future, direct GPU-to-GPU data transfers via MPI may be possible, obviating the need for these steps). This -means host MPI sends must be preceded by appropriate device to host -transfers and host MPI receives must be followed by corresponding host -to device transfers. +means host MPI sends must be preceded by appropriate device-to-host +transfers and host MPI receives must be followed by corresponding +host-to-device transfers. %To execute on multiple GPUs in parallel we decompose the problem using %the existing high-level parallelisaton framework: each MPI task acts @@ -380,18 +387,18 @@ to device transfers. In practice, this data movement requires additional GPU kernels to pack and unpack the relevant data before and after corresponding MPI -calls. However, the standard shift algorithm, in which each co-ordinate +calls. However, the standard shift algorithm, in which each coordinate direction is treated in turn, does provide some scope for the overlapping of different operations. -For example, after the data for the first co-ordinate direction have been +For example, after the data for the first coordinate direction have been retrieved by the host, these can be exchanged using MPI between hosts at the same time as kernels for packing and retrieving of -data for the second co-ordinate direction are +data for the second coordinate direction are executed. This -overlapping must respect the synchronisation required to ensure -that data values at the corners of the sub-domain are transferred +overlapping must respect the synchronization required to ensure +that data values at the corners of the subdomain are transferred correctly. -We use a separate CUDA stream for each co-ordinate direction: +We use a separate CUDA stream for each coordinate direction: this allows some of the host-device communication time to be effectively ``hidden'' behind the host-host MPI communication, resulting in an overall speedup. @@ -415,7 +422,7 @@ CPUs per node, and with nodes interconnected using Cray Gemini technology. For GPU results, a Cray XK6 system is used: this is very similar to the XE6, but has one CPU per node replaced with an NVIDIA X2090 GPU. Each node in the GPU system therefore features a single Opteron CPU acting -as a host to a single GPU. The inter-node interconnect architecture is +as a host to a single GPU. The internode interconnect architecture is the same as for the Cray XE6. The GPU performance tests use a prototype Cray XK6 system with 936 nodes (the largest available at the time of writing). To provide a fair comparison, we compare scaling on a \textit{per node} @@ -465,7 +472,7 @@ a simulation in terms of accounting budgets, or electricity. Figure \ref{ch14:fig:scaling} shows the results of both weak and strong scaling tests (top and bottom panels, respectively). For -weak scaling, where the local sub-domain size is kept fixed +weak scaling, where the local subdomain size is kept fixed (here a relatively large 196$^3$ lattice), the time taken by an ideal code would remain constant. It can be seen that while scaling of the CPU version shows a pronounced upward slope, the @@ -475,9 +482,9 @@ has a factor of 32 fewer MPI tasks per node, so communications require a significantly smaller number of larger data transfers. The performance advantage of the GPU version ranges from a factor of around 1.5 to -around 1.8. Careful study by other authors \cite{williams2011} have -found the absolute performance (in terms of floating point -operations per second, or floating point operations per Watt) to +around 1.8. Careful studies by other authors \cite{williams2011} have +found the absolute performance (in terms of floating-point +operations per second, or floating-point operations per Watt) to be remarkably similar between architectures. Perhaps of more interest is the strong scaling picture (lower @@ -493,18 +500,18 @@ In contrast, the GPU version shows reasonably robust scaling even for the smaller system sizes and good scaling for the larger systems. -\section{Moving Solid Particles}\label{ch14:sec:particles} +\section{Moving solid particles}\label{ch14:sec:particles} \begin{figure}[t] \centering %\includegraphics[width=10cm]{Chapters/chapter14/figures/bbl} -\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new} +\includegraphics[width=10cm]{Chapters/chapter14/figures/colloid_new_gray} \caption[A two-dimensional schematic picture of spherical particles on the lattice.]{ A two-dimensional schematic picture of spherical particles on the lattice. Left: a particle is allowed to move continuously across the lattice, and the position of the -surface defines fluid lattice sites (light blue) and solid lattice -sites (dark red). The discrete surface is defined by links +surface defines fluid lattice sites (gray) and solid lattice +sites (black). The discrete surface is defined by links where propagation would intersect the surface (arrows). Note the discrete shape of the two particles is different. Right: post-collision distributions are reversed at the surface by the process of bounce-back @@ -535,15 +542,15 @@ The introduction of moving solid particles poses an additional hurdle to efficient GPU implementation of an LB code such as \textit{Ludwig}. In this section, we give a brief overview of the essential features of the CPU implementation, and how the considerations raised in the -previous sections --- maximising parallelism and minimising both -host to device and GPU to GPU data movement --- shape the design decisions +previous sections---maximizing parallelism and minimising both +host-to-device and GPU-to-GPU data movement---shape the design decisions for a GPU implementation. We restrict our discussion to a simple fluid in this section; additional solid-fluid boundary conditions (e.g., wetting at a fluid-fluid-solid contact line) usually arise elsewhere -in the calculation and broadly independent of the hydrodynamic boundary +in the calculation and are broadly independent of the hydrodynamic boundary conditions which are described in what follows. -Moving solid particles (here, spheres) are defined by a centre position +Moving solid particles (here, spheres) are defined by a center position which is allowed to move continuously across the space of the lattice, and a fixed radius which is typically a few lattice spacings $\Delta r$. The surface of the particle is defined by a series of \textit{links} @@ -560,11 +567,11 @@ and torque on the sphere. The particle motion can then be updated in a molecular dynamics-like step. For the CPU implementation, a number of additional MPI communications -are required: (1) to exchange the centre position, radius, etc of each -particle as it moves and (2), to allow the accumulation of the force +are required: (1) to exchange the center position, radius, etc.\ of each +particle as it moves and (2) to allow the accumulation of the force and torque for each particle (the links of which may be distributed between up to 8 MPI tasks in three dimensions). Appropriate marshaling -of these data can provide an effective parallelisation for relatively +of these data can provide an effective parallelization for relatively dense particle suspensions of up to 40\% solid volume fraction \cite{stratford2008}. As a final consideration, fluid distributions must be removed and replaced consistently as a given particle moves @@ -574,12 +581,12 @@ With these features in mind, we can identify a number of competing concerns which are relevant to a GPU implementation: \begin{enumerate} \item -Minimisation of host-device data transfer would argue for moving the +Minimization of host-device data transfer would argue for moving the entire particle code to the GPU. However, the code in question involves largely conditional logic (e.g., identifying cut surface links) and irregular memory accesses (e.g., access to distribution elements around a spherical particle). These operations seem poorly suited to effective -parallelisation on the GPU. As an additional complication, the sums +parallelization on the GPU. As an additional complication, the sums required over the particle surface would involve potentially tricky and inefficient reductions in GPU memory. \item @@ -597,11 +604,12 @@ particle information. \end{enumerate} -We have implemented the second option as follows. For each sub-domain, -a list of boundary-cutting links is assembled on the CPU which includes -the identity of the relevant element of the distribution. This list, +We have implemented the second option as follows. For each subdomain, +a list of boundary-cutting links is assembled on the CPU; the list includes +the identity of the relevant element of the distribution in each case. +This list, together with the particle information required to compute the correct -bounce-back term, are transferred to the GPU. The updates to the relevant +bounce-back term, is transferred to the GPU. The updates to the relevant elements of the distribution can then take place on the GPU. The corresponding information to compute the update of the particle dynamics is returned @@ -618,14 +626,14 @@ at solid lattice points by consulting a look-up table of solid/fluid status. On the GPU, it is perhaps preferable to perform the collision stage everywhere instead of moving the look-up table to the GPU and introducing the associated logic. -Ultimately, the GPU might favour other boundary methods which treat solid and +Ultimately, the GPU might favor other boundary methods which treat solid and fluid on a somewhat more equal basis, for example, the immersed boundary method \cite{ch14:immersed1,ch14:immersed2,ch14:immersed-lb} or smoothed profile method \cite{ch14:spm}. However, the approach adopted here allows us to exploit -the GPU for the intensive fluid simulation whilst maintaining the complex +the GPU for the intensive fluid simulation while maintaining the complex code required for particles on the CPU. Overheads of CPU-GPU transfer are -minimised by transferring only those data relevant to the hydrodynamic +minimized by transferring only those data relevant to the hydrodynamic interaction implemented via bounce-back on links. % END NEW SOLID TEXT @@ -769,7 +777,7 @@ keep the entire problem resident on the device. \section{Summary} \label{ch14:sec:summary} -We have described the steps take to implement, on both single and +We have described the steps taken to implement, on both single and multiple GPUs, the \textit{Ludwig} code which was originally designed for complex fluid problems on a conventional CPU architecture. We have added the necessary functionality using @@ -777,14 +785,14 @@ NVIDIA CUDA and discussed the important changes to the main data structures. By retaining domain decomposition and message passing via MPI, we have demonstrated it is possible to scale complex fluid problems to large numbers of GPUs in parallel. -By following the key design criteria of maximising parallelism -and minimising host-device data transfers, we have confirmed +By following the key design criteria of maximizing parallelism +and minimizing host-device data transfers, we have confirmed that the mixed MPI-GPU approach is viable for scientific applications. For the more intricate problem of moving solid particles, we find it is possible to retain the more serial elements related to particle link operations on the CPU, while offloading only the parallel lattice-based operations involving the LB distribution -to the GPU. Again, this minimises host-device movement of data. +to the GPU. Again, this minimizes host-device movement of data. From the software engineering viewpoint, some duplication of code to allow efficient implementation on both host and device is @@ -797,7 +805,7 @@ codes which also exhibit portable performance (perhaps in conjunction with automatic tuning approaches). So, while the challenges in designing portable and efficient scientific -applications remain very real, this work provides some hope the large +applications remain very real, this work provides some hope that large clusters of GPU machines can be used effectively for a wide range of complex fluid problems. @@ -929,7 +937,7 @@ complex fluid problems. % use section* for acknowledgement -\section*{Acknowledgments} +\section{Acknowledgments} AG was supported by the CRESTA project, which has received funding from the European Community's Seventh Framework Programme (ICT-2011.9.13) under Grant Agreement no. 287703. KS was supported