1 \chapterauthor{Stefan L. Glimberg, Allan P. Engsig-Karup, Allan S. Nielsen, and Bernd Dammann}{Technical University of Denmark}
2 %\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark}
3 %\chapterauthor{Allan S. Nielsen}{Technical University of Denmark}
4 %\chapterauthor{Bernd Dammann}{Technical University of Denmark}
6 \chapter[Software components for heterogeneous many-core architectures]{Development of software components for heterogeneous many-core architectures}\label{ch5}
10 %\item Performance portable tuning techniques via a modern parallel programming model (CUDA) on emerging GPU architectures.
11 %\item distributed heterogenous computing via MPI+CUDA/OpenCL with domain decomposition methods for PDE solvers.
12 %\item efficient and scalable iterative methods for solution of high-order numerical methods and strategies for efficient implementations on desktop architectures.
15 \section{Software development for heterogeneous architectures}
16 %Our library facilitates massively parallelization through GPU computing and contains components for various iterative strategies such as DC, CG, CGNR, BiCGSTAB and GMRES(m) for solution of large linear systems along with support for preconditioning strategies. The goal is to create a reusable library and framework which provide general components with performance similar to that of a dedicated solver. Preliminary results show that performance overhead can be kept minimal in this new software framework [8].
19 %Within the past decade we have therefore seen that development of applications based on high-performance parallel computing, is expanding from academic and industrial companies with financial support to invest in parallel computers, into any application developer with access to modern hardware. This shift has not only been motivated by the successful development of programmable GPUs, but by the architectural changes from single-core to multi-core architectures in general. Rapidly increasing power and heating problems have forced the continuous evolution of single core architectures to stop. The solution from the chip manufacturers has been to fit multiple processors into the same unit, in order to maintain the same overall peak performance rates. Though these new parallel architectures can be highly effective compute units, they often pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main- and co-processor can be so significant that they call for completely different optimization strategies. The cache hierarchy and management of CPUs and GPUs being an evident example hereof.
22 %using a library makes it easier for researchers to implement their problem without having to think about the complexity of the underlying computer hardware. (keyword: Opacity, [Berkeley dwarf paper])
24 % a library can support several architectures, and thus the high-level code does not have to be changed when moving to a new architecture, e.g. Thrust's support for CUDA, OpenMP and TBB.
26 % there is a price/penalty to pay when using libraries, i.e. one doesn't necessarily get the best performance, because the architectural details are not visible to the programmer (keyword: Visibility [Berkeley dwarf paper]). However, if the library permits to write own add-ons/kernels (flexibility), this is no longer an issue.
28 Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast amount of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are now programmable and can be highly effective computing units, they still pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main and coprocessor can be so significant that they require completely different optimization strategies. The cache hierarchy management of CPUs and GPUs are an evident example hereof. In the past, industrial companies were able to boost application performance solely by upgrading their hardware systems, with an overt balance between investment and performance speedup. Today, the picture is different; not only do they have to invest in new hardware, but they also must account for the adaption and training of their software developers. What traditionally used to be a hardware problem, addressed by the chip manufacturers, has now become a software problem for application developers.
30 Software libraries\index{software library}\index{library|see{software library}} can be a tremendous help for developers as they make it easier to implement an application, without requiring special knowledge of the underlying computer architecture and hardware. A library may be referred to as \emph{opaque} when it automatically utilize the available resources, without requiring specific details from the developer\cite{ch5:Asanovic:EECS-2006-183}. The ultimate goal for a successful library is to simplify the process of writing new software and thus to increase developer productivity. Since programmable heterogeneous CPU/GPU systems are a rather new phenomenon, there is yet a limited number of established software libraries that take full advantage of such heterogeneous high performance systems, and there are no de facto design standards for such systems either. Some existing libraries for conventional homogeneous systems have already added support for offloading computationally intense operations onto coprocessing GPUs. However, this approach comes at the cost of frequent memory transfers across the low bandwidth PCIe bus.
32 In this chapter, we focus on the use of a software library to help application developers achieve their goals without spending an immense amount of time on optimization details, while still offering close-to-optimal performance. A good library provides performance-portable implementations with intuitive interfaces, that hide the complexity of underlaying hardware optimizations. Unfortunately, opaqueness sometimes comes at a price, as one does not necessarily get the best performance when the architectural details are not \emph{visible} to the programmer~\cite{ch5:Asanovic:EECS-2006-183}. If, however, the library is flexible enough and permits developers to supply their own low-level implementations as well, this does not need to be an issue. These are some of the considerations library developers should take into account, and what we will try to address in this chapter.
35 For demonstrative purposes we present details from an in-house generic CUDA-based C++ library for fast assembling of partial differential equation (PDE) solvers, utilizing the computational resources of GPUs. This library has been developed as part of research activities associated with the GPUlab, at the Technical University of Denmark and, therefore, is referred to as the {\em GPUlab library}. It falls into the category of computational libraries, as categorized by Hoefler and Snir~\cite{ch5:Hoefler2011}. Memory allocation and basic algebraic operations are supported via object-oriented components, without the user having to write CUDA specific kernels. As a back-end vector class, the parallel CUDA Thrust template-based library is used, enabling easy memory allocation and a high-level interface for vector manipulation~\cite{ch5:Bell2011}. Inspirations for \emph{good library design}, some of which we will present in this chapter, originate from guidelines proposed throughout the literature~\cite{ch5:Hoefler2011,ch5:Gamma1995,ch5:Skjellum1994}. An identification of desirable properties, which any library should strive to achieve, is pointed out by Korson and McGregor~\cite{ch5:Korson1992}. In particular we mention being easy-to-use, extensible, and intuitive.
38 %We rely on template-based implementations to allow full flexibility to change and assemble types~\cite{ch5:Vandevoorde2002}
40 %The library presented in this chapter falls into the category of computational libraries, as defined by Hoefler and Snir~\cite{ch5:Hoefler2011}, and seeks to fulfil the relevant objectives presented, such as designing components around data structures (classes), utilizing inheritance for component reuse, and hiding irrelevant information from the end users. An identification of desirable attributes for library development is also pointed out by Korson and McGregor~\cite{ch5:Korson1992}, in particular the library is designed to be easy-to-use, extensibility, and intuitive.
42 The library is designed to be effective and scalable for fast prototyping of PDE solvers, (primarily) based on matrix-free implementations of finite difference (stencil) approximations on logically structured grids. It offers functionalities that will help assemble PDE solvers that automatically exploit heterogeneous architectures much faster than manually having to manage GPU memory allocation, memory transfers, kernel launching, etc.
44 In the following sections we demonstrate how software components that play important roles in scientific applications can be designed to fit a simple framework that will run efficiently on heterogeneous systems. One example is finite difference approximations, commonly used to find numerical solutions to differential equations. Matrix-free implementations minimize both memory consumption and memory access, two important features for efficient GPU utilization and for enabling the solution of large-scale problems. The bottleneck problem for many PDE applications is to solve large sparse linear systems, arising from the discretization. In order to help solve these systems, the library includes a set of iterative solvers. All iterative solvers are template-based, such that vector and matrix classes, along with their underlying implementations, can be freely interchanged. New solvers can also be implemented without much coding effort. The generic nature of the library, along with a predefined set of interface rules, allows assembling components into PDE solvers. The use of parameterized-type binding allows the user to assemble PDE solvers at a high abstraction level, without having to change the remaining implementation.
46 Since this chapter is mostly dedicated to the discussion of software development for high performance heterogeneous systems, the focus will be more on the development and usage of an in-house GPU-based library, than on specific scientific applications. We demonstrate how to use the library on two elementary model problems and refer the reader to Chapter \ref{ch7} for a detailed description of an advanced application tool for free surface water wave simulations. These examples are assembled using the library components presented in this chapter.
48 % Make sure that we present a way to implement PDE solver components, and that our library is an example that demonstrate the use.
50 %Well designed standards for message passing across multiple platforms have been successfully developed, with MPI being the most well known \cite{ch5:Gropp1999,ch5:Gropp1999a}.
52 %\stefan{Add more motivation, power wall + memory wall + ILP (instruction-level parallelism) wall = brick wall.. }
54 %- Motivation for using libraries in general. Emphasize the computational power of GPUs and how they can be used for high-performance scientific computing.
56 %- Motivate our library, how we fit in.
58 %- Introduce the model problem that we use the demonstrate the use of our library.
60 %- There is a tendency to that the more performance is needed, the more low-level and difficult the programming gets. We strive to achieve such good performance, but without compromising too much easiness.
62 %- The following sections are ordered as follows...
64 \subsection{Test environments}\label{ch5:sec:testenvironments}
65 Throughout the chapter we use three different test environments: two high-end desktop computers located at the GPUlab--Technical University of Denmark, and a GPU cluster located at the Center for Computing and Visualization, Brown University, USA. Hardware details for the two systems are as follows:
67 \item[Test environment 1.] Desktop computer, Linux Ubuntu, Intel Xeon E5620 (2.4GHz) quad-core Westmere processor, 12GB of DDR-3 memory (1066 MHz), 2x NVIDIA GeForce GTX590 GPU with 3GB DDR5 memory, PCIe 2.0.
68 \item[Test environment 2.] Desktop computer, Linux Ubuntu, Intel Core i7-3820 (3.60GHz) Sandy Bridge processors, 32GB RAM, 2x NVIDIA Tesla K20 GPUs with 5GB DDR5 memory, PCIe 2.0.
69 \item[Test environment 3.] GPU cluster, Linux, up to 44 compute nodes based on dual Intel Xeon E5540 (2.53GHz) quad-core Nehalem processors, 24 GB of DDR-3 memory (1333MHz), 2x NVIDIA Tesla M2050 GPUs with 3GB GDDR5 memory, 40 Gb/s Quad-Data-Rate (QDR) InfiniBand interconnect.
73 \section{Heterogeneous library design for PDE solvers}
74 A generic CUDA-based C++ library has been developed to ease the assembling of PDE solvers. The template-based\index{templates} design allows users to assemble solver parts easily and to supply their own implementation of problem specific parts. In the following, we present an overview of the library and the supported features, introduce the concepts of the library components, and give short code examples to ease understanding. The library is a starting point for fast assembling of GPU-based PDE solvers, developed mainly to support finite difference operations on regular grids. However, this is not a limitation, since existing vector objects could be used as base classes for extending to other discretization methods or grid types as well.
76 \subsection{Component and concept design}
77 The library is grouped into component classes. Each component should fulfill a set of simple interface and template rules, called concepts\index{concept}, in order to guarantee compatibility with the rest of the library. In the context of PDE solving, we present five component classes: vectors, matrices, iterative solvers for linear system of equations, preconditioners for the iterative solvers, and time integrators. Figure \ref{ch5:fig:componentdesign} lists the five components along with a subset of the type definitions they should provide and the methods they should implement. It is possible to extend the implementation of these components with more functionality that relate to specific problems, but this is the minimum requirement for compatibility with the remaining library. With these concept rules fulfilled, components can rely on other components to have their respective functions implemented.
81 \input{Chapters/chapter5/figures/component_design.tikz}
82 \caption[Schematic representation of the five main components, their type definitions, and member functions.]{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known beforehand. The concepts ensure compliance among components.}\label{ch5:fig:componentdesign}
85 A component is implemented as a generic C++ class, and normally takes as a template arguments the same types that it offers through type definitions: a matrix takes a vector as template argument, and a vector takes the working precision type. The matrix can then access the working precision through the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to reduce the number of arguments and maintain code simplicity. We will demonstrate use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}.
87 The generic configuration allows the developer to define and assemble solver parts at the very beginning of the program using type definitions. Changing PDE parts at a later time is then only a matter of changing type definitions. We will give two model examples of how to assemble PDE solvers in Section \ref{ch5:sec:modelproblems}.
89 \subsection{A matrix-free finite difference component}\index{finite difference}\index{matrix-free}
90 Common vector operations, such as memory allocation, element-wise assignments, and basic algebraic transformations, require many lines of codes for a purely CUDA-based implementation. These CUDA-specific operations and kernels are hidden from the user behind library implementations, to ensure a high abstraction level. The vector class inherits from the CUDA-based Thrust library and therefore offer the same level of abstraction that enhances developer productivity and enables performance portability. Creating and allocating device (GPU) memory for two vectors can be done in a simple and intuitive way using the GPUlab library, as shown in Listing \ref{ch5:lst:vector} where two vectors are added together.
92 \lstinputlisting[label=ch5:lst:vector,caption={allocating, initializing, and adding together two vectors on the GPU: first example uses pure CUDA C; second example uses the built-in library template-based vector class}]{Chapters/chapter5/code/ex1.cu}
94 The vector class (and derived classes hereof) is compliant with the rest of the library components. Matrix-vector multiplications are usually what makes PDE-based applications different from each other, and the need to write a user specific implementation of the matrix-vector product is essential when solving specific PDE problems. The PDE and the choice of discretization method determine the structure and sparsity of the resulting matrix. Spatial discretization is supported by the library with finite difference approximations, and it offers an efficient low-storage (matrix-free) flexible order\index{flexible order} implementation to help developers tailor their custom codes. These matrix-free operators are feasible for problems where the matrix structure is known in advance and can be exploited, such that the matrix values can be either precomputed or computed on the fly. Furthermore, the low constant memory requirement makes them perfect in the context of solving large scale problems, whereas traditional sparse matrix formats require increasingly more memory, see e.g., ~\cite{ch5:Bell2009} for details on GPU sparse matrix formats.
96 Finite differences approximate the derivative of some function $u(x)$ as a weighted sum of neighboring elements. In compact notation we write
97 \begin{align}\label{ch5:eq:fdstencil}
98 \frac{\partial^q u(x_i)}{\partial x^q} \approx \sum_{n=-\alpha}^{\beta}c_n u(x_{i+n}),
100 where $q$ is the order of the derivative, $c_n$ is a set of finite difference coefficients, and $\alpha$ plus $\beta$ define the number of coefficients that are used for the approximation. The total set of contributing elements is called the stencil,\index{stencil} and the size of the stencil is called the rank, given as $\alpha + \beta + 1$. The stencil coefficients $c_n$ can be derived from a Taylor expansion based on the values of $\alpha$ and $\beta$, and $q$, using the method of undetermined coefficients~\cite{ch5:LeVeque2007}.
101 An example of a three-point finite difference matrix that approximates the first ($q=1$) or second ($q=2$) derivative of a one-dimensional uniformly distributed vector $u$ of length $8$ is given here:
102 \begin{eqnarray}\label{ch5:eq:stencilmatrix}
103 \left[\begin{array}{c c c c c c c c}
104 c_{00} & c_{01} & c_{02} & 0 & 0 & 0 & 0 & 0 \\
105 c_{10} & c_{11} & c_{12} & 0 & 0 & 0 & 0 & 0 \\
106 0 & c_{10} & c_{11} & c_{12} & 0 & 0 & 0 & 0 \\
107 0 & 0 & c_{10} & c_{11} & c_{12} & 0 & 0 & 0 \\
108 0 & 0 & 0 & c_{10} & c_{11} & c_{12} & 0 & 0 \\
109 0 & 0 & 0 & 0 & c_{10} & c_{11} & c_{12} & 0 \\
110 0 & 0 & 0 & 0 & 0 & c_{10} & c_{11} & c_{12} \\
111 0 & 0 & 0 & 0 & 0 & c_{20} & c_{21} & c_{22}
113 \left[\begin{array}{c}
124 \left[\begin{array}{c}
135 It is clear from this example that the matrix is sparse and that the same coefficients are repeated for all centered rows. The coefficients differ only near the boundaries, where off-centered stencils are used. It is natural to pack this information into a stencil operator that stores only the unique set of coefficients:
136 \begin{eqnarray}\label{ch5:eq:stencilcoeffs}
138 \left[\begin{array}{c c c}
139 c_{00} & c_{01} & c_{02} \\
140 c_{10} & c_{11} & c_{12} \\
141 c_{20} & c_{21} & c_{22} \\
144 Matrix components precompute these compact stencil coefficients and provides member functions that computes the finite difference approximation of input vectors. Unit scaled coefficients (assuming grid spacing is one) are computed and stored to be accessible via both CPU and GPU memory. On the GPU, the constant memory space is used for faster memory access~\cite{ch5:cudaguide}. In order to apply a stencil on a non unit-spaced grid, with grid space $\Delta x$, the scale factor $1/(\Delta x)^q$ will have to be multiplied by the finite difference sum, i.e., $(c_{00}u_0 + c_{01}u_1 + c_{02}u_2)/(\Delta x)^q \approx u^{(q)}_0$ as in the first row of \eqref{ch5:eq:stencilmatrix}.
146 Setting up a two-dimensional grid of size $N_x \times N_y$ in the unit square and computing the first derivative hereof is illustrated in Listing~\ref{ch5:lst:stencil}. The grid is a vector component, derived from the vector class. It is by default treated as a device object and memory is automatically allocated on the device to fit the grid size. The finite difference approximation as in \eqref{ch5:eq:fdstencil}, is performed via a CUDA kernel behind the scenes during the calls to \texttt{mult} and \texttt{diff\_x}, utilizing the memory hierarchy as the CUDA guidelines prescribe~\cite{ch5:cudaguide,ch5:cudapractice}. To increase developer productivity, kernel launch configurations have default settings, based on CUDA guidelines, principles, and experiences from performance testings, such that the user does not have to explicitly specify them. For problem-specific finite difference approximations, where the built-in stencil operators are insufficient, a pointer to the coefficient matrix \eqref{ch5:eq:stencilcoeffs} can be accessed as demonstrated in Listing \ref{ch5:lst:stencil} and passed to customized kernels.
148 \lstinputlisting[label=ch5:lst:stencil,caption={two-dimensional finite difference stencil example: computing the first derivative using five points ($\alpha=\beta=2$) per dimension, a total nine-point stencil}]{Chapters/chapter5/code/ex2.cu}
150 In the following sections we demonstrate how to go from an initial value problem (IVP) or a boundary value problem (BVP) to a working application solver by combining existing library components along with new custom-tailored components. We also demonstrate how to apply spatial and temporal domain decomposition strategies that can make existing solvers take advantage of systems equipped with multiple GPUs. Next section demonstrates how to rapidly assemble a PDE solver using library components.
152 \section{Model problems}\label{ch5:sec:modelproblems}
154 %Physical phenomenons are at all times present around us. Many of them are well defined via the laws of physics and can be described via differential equations. Though many differential equations are well defined, their solutions are not, and the need for computers to approximate and simulate solutions are inevitable. The accuracy of these simulations is critical for the quality of scientific applications within all fields of engineering. Improved accuracy often comes with the price of increased computational times, that can even be too overwhelming for a single computer to run. Fortunately, the fundamental change in chip-design led development of single core systems into multi- and many core high-performance architectures, that perfectly fits the engineering needs for increased computational resources. The massively parallel architecture of GPUs is a particular good match for accurate methods with high computational intensity\cite{ch5:Kloeckner2011}.
156 We present two elementary PDE model problems, to demonstrate how to assemble PDE solvers, using library components that follow the guidelines described above. The first model problem is the unsteady parabolic heat conduction equation; the second model problem is the elliptic Poisson equation. The two model problems consist of elements that play important roles in solving a broad range of more advanced PDE problems.
158 We refer the reader to Chapter \ref{ch7} for an example of a scientific application relevant for coastal and maritime engineering analysis that has been assembled using customized library components similar to those presented in the following.
160 \subsection{Heat conduction equation}\index{heat conduction}
161 First, we consider a two-dimensional heat conduction problem defined on a unit square. The heat conduction equation is a parabolic partial differential diffusion equation, including both spatial and temporal derivatives. It describes how the diffusion of heat in a medium changes with time. Diffusion equations are of great importance in many fields of sciences, e.g., fluid dynamics, where the fluid motion is uniquely described by the Navier-Stokes equations, which include a diffusive viscous term~\cite{ch5:chorin1993,ch5:Ferziger1996}.%, or in financial science where diffusive terms are present in the Black-Scholes equations for estimation of option price trends~\cite{}.
163 The heat problem is an IVP \index{initial value problem}, it describes how the heat distribution evolves from a specified initial state. Together with homogeneous Dirichlet boundary conditions\index{boundary condition}, the heat problem in the unit square is given as
164 \begin{subequations}\begin{align}
165 \frac{\partial u}{\partial t} - \kappa\nabla^2u = 0, & \qquad (x,y)\in \Omega([0,1]\times[0,1]),\quad t\geq 0, \label{ch5:eq:heateqdt}\\
166 u = 0, & \qquad (x,y) \in \partial\Omega,\label{ch5:eq:heateqbc}
167 \end{align}\label{ch5:eq:heateq}\end{subequations}
168 where $u(x,y,t)$ is the unknown heat distribution defined within the domain $\Omega$, $t$ is the time, $\kappa$ is a heat conductivity constant (let $\kappa=1$), and $\nabla^2$ is the two-dimensional Laplace\index{Laplace operator} differential operator $(\partial_{xx}+\partial_{yy})$. We use the following initial condition:
169 \begin{align}\label{ch5:eq:heatinit}
170 u(x,y,t_0) = \sin(\pi x)\,\sin(\pi y), & \qquad (x,y) \in \Omega,
172 because it has a known analytic solution over the entire time span, and it satisfies the homogeneous boundary condition given by \eqref{ch5:eq:heateqbc}. An illustrative example of the numerical solution to the heat problem, using \eqref{ch5:eq:heatinit} as the initial condition, is given in Figure \ref{ch5:fig:heatsolution}.
173 \begin{figure}[!htbp]
176 \setlength\figurewidth{0.26\textwidth} %
177 \setlength\figureheight{0.26\textwidth} %
178 \subfigure[$t=0.00s$]{\input{Chapters/chapter5/figures/HeatSolution0.tikz}}%
179 %{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_conv.pdf}}
180 \subfigure[$t=0.05s$]{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}}%
181 %{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_049307_conv.pdf}}
182 \subfigure[$t=0.10s$]{\input{Chapters/chapter5/figures/HeatSolution0.099723.tikz}}%
183 %{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_099723_conv.pdf}}
184 \caption{Discrete solution, at times $t=0s$ and $t=0.05s$, using \eqref{ch5:eq:heatinit} as the initial condition and a small $20\times20$ numerical grid.}\label{ch5:fig:heatsolution}
187 We use a Method of Lines (MoL)\index{method of lines} approach to solve \eqref{ch5:eq:heateq}. Thus, the spatial derivatives are replaced with finite difference approximations, leaving only the temporal derivative as unknown. The spatial derivatives are approximated from $\myvec{u}^n$, where $\myvec{u}^n$ represents the approximate solution to $u(t_n)$ at a given time $t_n$ with time step size $\delta t$ such that $t_n=n\delta t$ for $n=0,1,\ldots$. The finite difference approximation\index{finite difference} can be interpreted as a matrix-vector product as sketched in \eqref{ch5:eq:stencilmatrix}, and so the semi-discrete heat conduction problem becomes
188 \begin{align}\label{ch5:eq:discreteheateq}
189 \frac{\partial u}{\partial t} = \mymat{A}\myvec{u}, \qquad \mymat{A} \in \mathbb{R}^{N\times N}, \quad \myvec{u} \in \mathbb{R}^{N},
191 where $\mymat{A}$ is the sparse finite difference matrix and $N$ is the number of unknowns in the discrete system. The temporal derivative is now free to be approximated by any suitable choice of a time-integration method\index{time integration}. The most simple integration scheme would be the first-order accurate explicit forward Euler method\index{forward Euler},
192 \begin{align}\label{ch5:eq:forwardeuler}
193 \myvec{u}^{n+1} = \myvec{u}^n + \delta t\,\mymat{A}\myvec{u}^n,
195 where $n+1$ refers to the solution at the next time step. The forward Euler method can be exchanged with alternative high-order accurate time integration methods, such as Runge-Kutta methods or linear multistep methods, if numerical instability becomes an issue, see, e.g., \cite{ch5:LeVeque2007} for details on numerical stability analysis. For demonstrative purpose, we simply use conservative time step sizes to avoid stability issues. However, the component-based library design provides exactly the flexibility for the application developer to select or change PDE solver parts, such as the time integrator, with little coding effort. A generic implementation of the forward Euler method that satisfies the library concept rules is illustrated in Listing~\ref{ch5:lst:euler}. According to the component guidelines in Figure \ref{ch5:fig:componentdesign}, a time integrator is basically a functor, which means that it implements the parenthesis operator, taking five template arguments: a right hand side operator, the state vector, integration start time, integration end time, and a time step size. The method takes as many time steps necessary to integrate from the start to the end, continuously updating the state vector according to \eqref{ch5:eq:forwardeuler}. Notice, that nothing in Listing~\ref{ch5:lst:euler} indicates wether GPUs are used or not. However, it is likely that the underlying implementation of the right hand side functor and the \texttt{axpy} vector function, do rely on fast GPU kernels. However, it is not something that the developer of the component has to account for. For this reason, the template-based approach, along with simple interface concepts, make it easy to create new components that will fit well into a generic library.
198 The basic numerical approach to solve the heat conduction problem has now been outlined, and we are ready to assemble the PDE solver.
201 \lstinputlisting[label=ch5:lst:euler,caption={generic implementation of explicit first-order forward Euler integration}]{Chapters/chapter5/code/ex3.cu}
204 %More accurate methods, like variations of Runge-Kutta methods or multi-step methods might be desirable, particulary if the accuracy of the spatial approximation is also high, see e.g. LeVeque for examples hereof~\cite{ch5:LeVeque2007}.
206 %The following section demonstrates how to rapidly assemble a solver for the heat problem just presented, with configurable order of accuracy for both the spatial and temporal approximations. Section \ref{ch5:sec:results} presents performance measures and Section \ref{ch5:sec:multigpu} provides details for further optimization strategies using multiple GPUs.
210 \subsubsection{Assembling the heat conduction solver}
211 Before we are able to numerically solve the discrete heat conduction problem \eqref{ch5:eq:heateq}, we need implementations to handle the the following items:
213 \item[Grid.] A discrete numerical grid to represent the two-dimensional heat distribution domain and the arithmetical working precision ($32$-bit single precision or $64$-bit double precision).
214 \item[RHS.] A right-hand side operator for \eqref{ch5:eq:discreteheateq} that approximates the second-order spatial derivatives (matrix-vector product).
215 \item[Boundary conditions.] A strategy that ensures that the Dirichlet conditions are satisfied on the boundary.
216 \item[Time integrator.] A time integration scheme, that approximates the time derivative from \eqref{ch5:eq:discreteheateq}.
218 All items are either directly available in the library or can be designed from components herein. The built-in stencil operator may assist in implementing the matrix-vector product, but we need to explicitly ensure that the Dirichlet boundary conditions are satisfied. We demonstrated in Listing \ref{ch5:lst:stencil} how to approximate the derivative using flexible order finite difference stencils. However, from \eqref{ch5:eq:heateqbc} we know that boundary values are zero. Therefore, we extend the stencil operator with a simple kernel call that assigns zero to the entire boundary. Listing \ref{ch5:lst:laplaceimpl} shows the code for the two-dimensional Laplace right-hand side operator. The constructor takes as an argument the stencil half size $\alpha$ and assumes $\alpha=\beta$. Thus, the total two-dimensional stencil rank will be $4\alpha+1$. For simplicity we also assume that the grid is uniformly distributed, $N_x=N_y$. Performance optimizations for the stencil kernel, such as shared memory utilization, are handled in the underlying implementation, accordingly to CUDA guidelines~\cite{ch5:cudaguide,ch5:cudapractice}. The macros, \texttt{BLOCK1D} and \texttt{GRID1D}, are used to help set up kernel configurations based on grid sizes and \texttt{RAW\_PTR} is used to cast the vector object to a valid device memory pointer.
220 \lstinputlisting[label=ch5:lst:laplaceimpl,caption={the right-hand side Laplace operator: the built-in stencil approximates the two dimensional spatial derivatives, while the custom set\_dirichlet\_bc kernel takes care of satisfying boundary conditions}]{Chapters/chapter5/code/laplacian.cu}
222 With the right-hand side operator in place, we are ready to implement the solver. For this simple PDE problem we compute all necessary initial data in the body of the main function and use the forward Euler time integrator to compute the solution until $t=t_{end}$. For more advanced solvers, a built-in \texttt{ode\_solver} class is defined that helps take care of initialization and storage of multiple state variables. Declaring type definitions for all components at the beginning of the main file gives a good overview of the solver composition. In this way, it will be easy to control or change solver components at later times. Listing~\ref{ch5:lst:heattypedefs} lists the type definitions\index{type definitions} that are used to assemble the heat conduction solver. %\stefan{elaborate on the choices}.
224 \lstset{label=ch5:lst:heattypedefs,caption={type definitions for all the heat conduction solver components used throughout the remaining code}}
226 typedef double value_type;
227 typedef laplacian<value_type> rhs_type;
228 typedef gpulab::grid<value_type> vector_type;
229 typedef vector_type::property_type property_type;
230 typedef gpulab::integration::forward_euler time_integrator_type;
232 The grid is by default treated as a device object, and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a non periodic domain of size $N\times N$ within the unit square with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly.
234 \lstset{label=ch5:lst:gridsetup,caption={creating a two-dimensional grid of size \texttt{N} times \texttt{N} and physical dimension $0$ to $1$}}
236 // Setup discrete and physical dimensions
237 gpulab::grid_dim<int> dim(N,N,1);
238 gpulab::grid_dim<value_type> p0(0,0);
239 gpulab::grid_dim<value_type> p1(1,1);
240 property_type props(dim,p0,p1);
243 vector_type u(props);
245 Hereafter the vector \texttt{u} can be initialized accordingly to \eqref{ch5:eq:heatinit}. Finally we need to instantiate the right-hand side Laplacian operator from Listing \ref{ch5:lst:laplaceimpl} and the forward Euler time integrator in order to integrate from $t_0$ until $t_{end}$.
247 \lstset{label=ch5:lst:timeintegrator,caption=creating a time integrator and the right-hand side Laplacian operator.}
249 rhs_type rhs(alpha); // Create right-hand side operator
250 time_integrator_type solver; // Create time integrator
251 solver(&rhs,u,0.0f,tend,dt); // Integrate from 0 to tend using dt
253 The last line invokes the forward Euler time integration scheme defined in Listing \ref{ch5:lst:heattypedefs}. If the developer decides to change the integrator into another explicit scheme, only the time integrator type definition in Listing \ref{ch5:lst:heattypedefs} needs to be changed. The heat conduction solver is now complete.%The simple model problem illustrated, that a well designed library upholds developer productivity, by minimizing the
255 % how few CUDA implementations that were needed, and thereby upholding developer productivity.
257 %\todo{(How) do we give access to the entire file?}
259 \subsubsection{Numerical solutions to the heat conduction problem}
261 Solution time for the heat conduction problem is in itself not very interesting, as it is only a simple model problem. What is interesting for GPU kernels, such as the finite differences kernel, is that increased computational work often comes with a very small price, because the fast computations can be hidden by the relatively slower memory fetches. Therefore, we are able to improve the accuracy of the numerical solution via more accurate finite differences (larger stencil sizes), while improving the computational performance in terms of floating point operations per second (flops). Figure \ref{ch5:fig:stencilperformance} confirms, that larger stencils improve the kernel performance. Notice that even though these performance results are favorable compared to single core systems ($\sim 10$ GFlops double precision on a $2.5$-GHz processor), they are still far from their peak performance, e.g., $\sim 2.4$ TFlops single precision for the GeForce GTX590. The reason is that the kernel is bandwidth bound, i.e., performance is limited by the time it takes to move memory between the global GPU memory and the chip. The Tesla K20 performs better than the GeForce GTX590 because it obtains the highest bandwidth. Being bandwidth bound is a general limitation for matrix-vector-like operations that arise from the discretization of PDE problems. Only matrix-matrix multiplications, which have a high ratio of computations versus memory transactions, are able to reach near-optimal performance results~\cite{ch5:Kirk2010}. These kinds of operators are, however, rarely used to solve PDE problems.
264 \setlength\figureheight{0.35\textwidth}
265 \setlength\figurewidth{0.4\textwidth}
268 \subfigure[GeForce GTX590, test environment 1.]{
269 \input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz}%
271 \subfigure[Tesla K20c, test environment 2.]{
272 \input{Chapters/chapter5/figures/AlphaPerformanceTeslaK20c_N16777216.tikz}
274 %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}}
276 \caption[Single- and double-precision floating point operations per second for a two-dimensional stencil operator on a numerical grid of size $4096^2$.]{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils.}\label{ch5:fig:stencilperformance}
282 \subsection{Poisson equation}\index{Poisson equation}
283 The Poisson equation is a second-order elliptic differential equation, often encountered in applications within scientific fields such as electrostatics and mechanics. We consider the two- dimensional BVP \index{boundary volume problem} defined in terms of Poisson's equation with homogeneous Dirichlet boundary conditions on the form
284 \begin{subequations}\begin{align}
285 \nabla^2 u = f(x,y),& \qquad (x,y) \in \Omega([0,1]\times[0,1]), \\
286 u = 0,& \qquad (x,y) \in \partial\Omega.
287 \end{align}\label{ch5:eq:poissoneq}\end{subequations}
288 Notice the similarities to the heat conduction equation \eqref{ch5:eq:heateq}. In fact, \eqref{ch5:eq:poissoneq} could be a steady-state solution to the heat equation, when there is no temporal change $\frac{\partial u}{\partial t}=0$, but a source term $f(x,y)$. Since the Laplace operator and the boundary conditions are the same for both problems, we are able to reuse the same implementation with few modifications.
290 Opposite to the heat equation, there are no initial conditions. Instead, we seek some $u(x,y)$ that satisfies \eqref{ch5:eq:poissoneq}, given a source term $f(x,y)$, on the right-hand side. For simplicity, assume that we know the exact solution, $u_{\textrm{true}}$, corresponding to \eqref{ch5:eq:heatinit}. Then we use the method of manufactured solutions to derive an expression for the corresponding right-hand side $f(x,y)$:
292 f(x,y) = \nabla^2 u_{\textrm{true}} = -2\pi^2\,\sin(\pi x)\,\sin(\pi y). \label{ch5:eq:poissonrhs}
294 The spatial derivative in \eqref{ch5:eq:poissoneq} is again approximated with finite differences\index{finite difference}, similar to the example in \eqref{ch5:eq:stencilmatrix}, except boundary values are explicitly set to zero. The discrete form of the system can now be written as a sparse linear system of equations:
296 \mymat{A}\myvec{u}=\myvec{f}, \qquad \myvec{u},\myvec{f} \in \mathbb{R}^{N}, \quad \mymat{A} \in \mathbb{R}^{N\times N}, \label{ch5:eq:poissonsystem}
298 where $\mymat{A}$ is the sparse matrix formed by finite difference coefficients, $N$ is the number of unknowns, and $\myvec{f}$ is given by \eqref{ch5:eq:poissonrhs}. Equation \eqref{ch5:eq:poissonsystem} can be solved in numerous ways, but a few observations may help do it more efficiently. Direct solvers based on Gaussian elimination are accurate and use a finite number of operations for a constant problem size. However, the arithmetic complexity grows with the problem size by as much as $\mathcal{O}(N^3)$ and does not exploit the sparsity of $\mymat{A}$. Direct solvers are therefore mostly feasible for dense systems of limited sizes. Sparse direct solvers exist, but they are often difficult to parallelize, or applicable for only certain types of matrices. Regardless of the discretization technique, the discretization of an elliptic PDE into a linear system as in \eqref{ch5:eq:poissonsystem} yields a very sparse matrix $\mymat{A}$ when $N$ is large. Iterative methods\index{iterative methods} for solving large sparse linear systems find broad use in scientific applications, because they require only an implementation of the matrix-vector product, and they often use a limited amount of additional memory. Comprehensive introductions to iterative methods may be found in any of~\cite{ch5:Saad2003,ch5:Kelley1995,ch5:Barrett1994}.
300 One benefit of the high abstraction level and the template-based library design is to allow developers to implement their own components, such as iterative methods for solving sparse linear systems. The library includes three popular iterative methods: conjugate gradient, defect correction\index{defect correction}, and geometric multigrid. The conjugate gradient method is applicable only to systems with symmetric positive definite matrices. This is true for the two-dimensional Poisson problem, when it is discretized with a five-point finite difference stencil, because then there will be no off-centered approximations near the boundary. For high-order approximations ($\alpha>1$), we use the defect correction method with multigrid preconditioning. See e.g., \cite{ch5:Trottenberg2001} for details on multigrid methods.
302 We will not present the implementation details for all three methods but briefly demonstrate the simplicity of implementing the body of such an iterative solver, given a textbook recipe or mathematical formulation. The defect correction method iteratively improves the solution to $\mymat{A}\myvec{x}=\myvec{b}$, given an initial start guess $\myvec{x}^0$, by continuously solving a preconditioned error equation. The defect correction\index{defect correction!iteration} iteration can be written as
304 \myvec{x}^{k+1} = \myvec{x}^{k} + \mymat{M}^{-1}(\myvec{b}-\mymat{A}\myvec{x}^{k}), \quad \mymat{A},\mathcal{M}\in\mathbb{R}^{N\times N}, \quad {\bf x},{\bf b}\in\mathbb{R}^N, \label{ch5:eq:dc}
306 where $k$ is the iteration number and $\mymat{M}$ is the preconditioner\index{preconditioning} which should be an approximation to the original coefficient matrix $\mymat{A}$. To achieve fast numerical convergence\index{convergence}, applying the preconditioner should be a computationally inexpensive operation compared to solving the original system. How to implement \eqref{ch5:eq:dc} within the library context is illustrated in Listing \ref{ch5:lst:dc}. The host CPU traverses each line in Listing \ref{ch5:lst:dc} and tests for convergence, while the computationally expensive matrix-vector operation and preconditioning, can be executed on the GPU, if GPU-based components are used. The defect correction method has two attractive properties. First, global reduction is required to monitor convergence only once per iteration during convergence evaluation, which reduces communication requirements and provides a basis for efficient and scalable parallelization. Second, it has a minimal constant memory footprint, making it a suitable method for solving very large systems.
308 \lstset{label=ch5:lst:dc,caption={main loop for the iterative defect correction solver: the solver is instantiated with template argument types for the matrix and vector classes, allowing underlying implementations to be based on GPU kernels}}
310 while(r.nrm2() > tol)
312 // Calculate residual
316 // Reset initial guess
322 // Defect correction update
327 In the following section we demonstrate how to assemble a solver for the discrete Poisson problem, using one of the three iterative methods to efficiently solve \eqref{ch5:eq:poissonsystem}.
329 \subsubsection{Assembling the Poisson solver}
331 Assembling the Poisson solver follows almost the same procedure as the heat conduction solver, except the time integration part is exchanged with an iterative method to solve the system of linear equations \eqref{ch5:eq:poissonsystem}. For the discrete matrix-vector product we reuse the Laplace operator from the heat conduction problem in Listing \ref{ch5:lst:laplaceimpl} with few modifications. The Laplace operator is now a matrix component, so to be compatible with the component interface rules in Figure \ref{ch5:fig:componentdesign}, a \texttt{mult} function taking two vector arguments is implemented instead of the parentheses operator. We leave out this code example as it almost identical to the one in Listing \ref{ch5:lst:laplaceimpl}.
333 At the beginning of the solver implementation we list the type definitions for the Poisson solver that will be used throughout the implementation. Here we use a geometric multigrid\index{multigrid} method as a preconditioner for the defect correction method. Therefore the multigrid solver is assembled first, so that it can be used in the assembling of the defect correction solver. Listing \ref{ch5:lst:poissontypedefs} defines the types for the vector, the matrix, the multigrid preconditioner, and the defect correction solver. The geometric multigrid method needs two additional template arguments that are specific for multigrid, namely, a smoother and a grid restriction/interpolation operator. These arguments are free to be implemented and supplied by the developer if special care is required, e.g., for a custom grid structure. For the Poisson problem on a regular grid, the library contains built-in restriction and interpolation operators, and a red-black Gauss-Seidel smoother. We refer the reader to \cite{ch5:Trottenberg2001} for extensive details on multigrid methods. The monitor and config types that appear in Listing \ref{ch5:lst:poissontypedefs} are used for convergence monitoring within the iterative solver and to control runtime parameters, such as tolerances and iteration limits.
335 \lstset{label=ch5:lst:poissontypedefs,caption={type definitions for the Laplacian matrix component and the multigrid preconditioned iterative defect correction\index{defect correction} solver}}
337 typedef double value_type;
338 typedef gpulab::grid<value_type> vector_type;
339 typedef laplacian<vector_type> matrix_type;
341 // MULTIGRID solver types
342 typedef gpulab::solvers::multigrid_types<
345 , gpulab::solvers::gauss_seidel_rb_2d
346 , gpulab::solvers::grid_handler_2d> mg_types;
347 typedef gpulab::solvers::multigrid<mg_types> mg_solver_type;
348 typedef mg_solver_type::monitor_type monitor_type;
349 typedef monitor_type::config_type config_type;
352 typedef gpulab::solvers::defect_correction_types<
356 , mg_solver_type> dc_types;
357 typedef gpulab::solvers::defect_correction<dc_types> dc_solver_type;
360 With the type definitions set up, the implementation for the Poisson solver follows in Listing \ref{ch5:lst:poissonsolver}. Some of the initializations are left out, as they follow the same procedure as for the heat conduction example. The defect correction and geometric multigrid solvers are initialized and then multigrid is set as a preconditioner to the defect correction method. Finally the system is solved via a call to \texttt{solve()}.
363 \lstset{label=ch5:lst:poissonsolver,caption={initializing the preconditioned defect correction solver to approximate the solution to $\mymat{A}\myvec{u}=\myvec{f}$}}
365 matrix_type A(alpha); // High-order matrix
366 matrix_type M(1); // Low-order matrix
368 /* Omitted: create and init vectors u, f */
370 config_type config; // Create configuration
371 config.set("iter",30); // Set max iteration count
372 config.set("rtol",1e-10); // Set relative tolerance
373 monitor_type monitor(config); // Create monitor
374 dc_solver_type solver(A,monitor); // Create DC solver
375 mg_solver_type precond(M,monitor); // Create MG preconditioner
376 solver.set_preconditioner(precond); // Set preconditioner
377 solver.solve(u,f); // Solve M^-1(Au = f)
378 if(monitor.converged())
383 \subsubsection{Numerical solutions to the Poisson problem}\label{ch5:sec:poissonresults}
386 The discrete Poisson problem \eqref{ch5:eq:poissonsystem} has been solved using the three iterative methods presented above. Convergence histories for the conjugate gradient method and geometric multigrid method, using two different resolutions, are illustrated in Figure \ref{ch5:fig:poissonconvergence:a}. Multigrid methods are very robust and algorithmic efficient, independent of the problem size. Figure \ref{ch5:fig:poissonconvergence:a} confirms that the rate of convergence for the multigrid method is unchanged for both problem sizes. Only the attainable accuracy is slightly worsened, as a consequence of a more ill-conditioned system for large problem sizes.
388 Defect correction in combination with multigrid preconditioning enables efficient solution of high-order approximations of the Poisson problem, illustrated in Figure \ref{ch5:fig:poissonconvergence:b}. The multigrid preconditioning matrix $\mymat{M}$ is based on a low-order approximation to \eqref{ch5:eq:poissonsystem}, whereas matrix $\mymat{A}$ is a high-order approximation. When $\mymat{M}$ is a close approximation to $\mymat{A}$, defect correction converges most rapidly. This is the effect that can be seen between the three convergence lines in Figure \ref{ch5:fig:poissonconvergence:b}.
393 \subfigure[Convergence histories for the conjugate gradient (CG) and multigrid (MG) methods, for two different problem sizes.]{\label{ch5:fig:poissonconvergence:a}
394 %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}}
395 {\includegraphics[width=0.45\textwidth]{Chapters/chapter5/figures/ConvergenceMGvsCG_conv.pdf}}
398 \subfigure[Defect correction convergence history for three different stencil sizes.]{\label{ch5:fig:poissonconvergence:b}
399 %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}}
400 {\includegraphics[width=0.45\textwidth]{Chapters/chapter5/figures/ConvergenceDC_conv.pdf}}
402 \caption{Algorithmic performance for the conjugate gradient, multigrid, and defect correction methods, measured in terms of the relative residual per iteration.}\label{ch5:fig:poissonconvergence}
405 %Numerical and algorithmic efficiency...
409 \section{Optimization strategies for multi-GPU systems}\label{ch5:sec:multigpu}\index{multi-GPU}
411 CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip performance. However, the role as a separate coprocessor to the CPU can be a limiting factor for large scale scientific applications, because the GPU memory capacity is fixed and is only in the range of a few gigabytes. In comparison, it is not unusual for a high-end workstation to be equipped with $\sim32$GB of main memory, plus a terabyte hard disk capacity for secondary storage. Therefore, large scale scientific applications that process gigabytes of data, require distributed computations on multiple GPU devices. Multi-GPU desktop computers and clusters can have a very attractive peak performance, but the addition of multiple devices introduces the potential performance bottleneck of slow data transfers across PCIe busses and network interconnections, as illustrated in Figure \ref{ch5:fig:gpu2gputransfer}. The ratio between data transfers and computational work has a significant impact on the possibility for latency hiding and thereby overall application performance.
415 %\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/GPU2GPU.png}
416 \input{Chapters/chapter5/figures/GPU2GPU.tikz}
418 \caption[Message passing between two GPUs involves several memory transfers across lower bandwidth connections.]{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. The kernel call is required if the data is not already sequentially stored in device memory. Recent generations of NVIDIA GPUs, CUDA, and MPI support direct transfers without explicitely transfering data to the host first.}\label{ch5:fig:gpu2gputransfer}
422 Developing applications that exploit the full computational capabilities of modern clusters--GPU-based or not--is no trivial matter. Developers are faced with the complexity of distributing and coordinating computations on nodes consisting of many-core CPUs, GPUs and potentially other types of accelerators as well. These complexities give rise to challenges in finding numerical algorithms, that are well suited for such systems, forcing developers to search for novel methods that utilize concurrency.
424 To ease software development, we use MPI-2 for message passing and ensure a safe and private communication space by creation of a communicator private to the library during initialization, as recommended by Hoefler and Snir~\cite{ch5:Hoefler2011}. With the addition of remote direct memory access (RDMA) for GPUDirect it is possible to make direct memory transfers between recent generation of GPUs (Kepler), eliminating CPU overhead. Unfortunately there are some strict system and driver requirements to enable these features. Therefore, in the following examples, device memory is first transferred to the CPU main memory before invoking any MPI calls. The library provides device-to-device transfers via template-based routines that work directly with GPU vector objects. This hides the complexity of message passing from the developer and helps developers design new components for multi-GPU execution.
426 %(GPUDirect v2 and Unified Virtual Addressing can help). Effort is being made to create GPU aware MPI implementations, that will allow direct memory transfers(cite: openMPI, MVAPICH2). However, only 64-bit systems with Unified Virtual Addressing (UVA) from CUDA v4.0 and Tesla 20-serie (Kepler) GPUs are capable of direct transfers. Therefore only a limited number of today's GPU clusters will be able to utilize these features.
427 %For now, the simple solution is to first transfer data to the CPU main memory before invoking the appropriate MPI calls.
429 In the following sections we present two very different methods for distributed computing based on spatial and temporal decomposition. Each method has its own characteristic, which makes the method attractive for various types of PDE problems and for different problem sizes.
433 \subsection{Spatial domain decomposition}\label{ch5:sec:dd}\index{domain decomposition}
435 Domain decomposition methods can be used for distributing computational work in the numerical solution of boundary value problems~\cite{ch5:Smith1996}. These methods add parallelism by splitting the spatial dimensions, on which the boundary values are defined, into a number of smaller boundary value problems and then coordinating the solution between adjacent subdomains. Domain decomposition techniques, such as the classical overlapping Schwarz methods, may be considered as preconditioners \index{preconditioning} to the system of equations that arise from the discretization of PDEs, e.g., as for the Poisson problem in \eqref{ch5:eq:poissonsystem}. The algorithmic efficiency of such methods depends on the size of the domain overlaps, while an additional coarse grid correction method is sometimes necessary to maintain fast global convergence.
436 %A detailed study and analysis of convergence criterions, overlapping zones, local domain solvers etc. can lead to parallel efficiencies that scale linearly with the number of processors for large problem sizes, see e.g. \cite{ch5:Cai2005}.
438 An alternative to the preconditioning strategy is to have each subdomain query information from adjacent subdomains whenever needed. For PDEs that are discretized onto regular shaped grids, this can be an attractive strategy, as the decomposition of subdomains, and thereby the communication topology\index{topology}, is straightforward. The library supports decomposition of regular grids by either pre- or user-defined topologies. A topology in this context, is a description of connectivity between processors that share the same grid along with information about the local and global discretization. Layers of ghost points are inserted at the artificially introduced boundaries to account for grid points that reside in adjacent subdomains. The ghost point values can be updated upon request from the user. An illustrative example is given in Figure \ref{ch5:fig:dd2d}; the arrows indicate message passing between adjacent subdomains, when updating grid points within the ghost layers.
441 \input{Chapters/chapter5/figures/dd2d.tikz}
443 \caption[Domain distribution of a two-dimensional grid into three subdomains.]{Domain distribution of a two-dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points, respectively.}\label{ch5:fig:dd2d}
446 Topologies are introduced via an extra template argument to the grid class. A grid is by default not decomposed, because the default template argument is based on a non distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, which makes sure that all ghost points are updated according to the grid topology. The library contains topologies based on one-dimensional and two-dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
448 If grid ghost layers are updated whenever information from adjacent subdomains is needed, e.g., before a stencil operation, all interior points will be exactly the same as they would be for the non distributed setup. Therefore, one advantage of this approach is that the algorithmic efficiency of an application can be preserved, if grid updates are consistently invoked at the proper times.
450 Distributed performance for the finite difference stencil operation is illustrated in Figure \ref{ch5:fig:multigpu}. The timings include the compute time for the finite difference approximation and the time for updating ghost layers via message passing. It is obvious from Figure \ref{ch5:fig:multigpu:a} that communication overhead dominates for the smallest problem sizes, where the non distributed grid (1 GPU) is fastest. However, communication overhead does not grow as rapidly as computation times, due to the surface-to-volume ratio. Therefore message passing becomes less influential for large problems, where reasonable performance speedups are obtained. Figure \ref{ch5:fig:multigpu:b} demonstrates how the computational performance on multi-GPU systems can be significantly improved for various stencil sizes. With this simple domain decomposition technique, developers are able to implement applications based on heterogeneous distributed computing, without explicitly dealing with message passing and it is still possible to provide user specific implementations of the topology class for customized grid updates.
452 % TODO: Should we put in the DD algebra?
455 \setlength\figureheight{0.39\textwidth}
456 \setlength\figurewidth{0.39\textwidth}
458 \subfigure[Absolute timings, $\alpha=3$.]{
459 {\scriptsize\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}}
460 %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050_conv.pdf}}
461 \label{ch5:fig:multigpu:a}
463 \subfigure[Performance at $N=4069^2$, single precision.]{
464 {\scriptsize\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}}
465 %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216_conv.pdf}}
466 \label{ch5:fig:multigpu:b}
468 \caption{Performance timings for distributed stencil operations, including communication and computation times. Executed on test environment 3.}\label{ch5:fig:multigpu}
471 \subsection{Parareal--parallel time integration}\label{ch5:sec:parareal}\index{parareal}
473 The use of spatial domain decomposition methods is widespread, as they have proven to be efficient on a wide range of problems. Unfortunately, applications that are concerned with limited numerical problem sizes can rapidly reach a speedup limit for a low number of processors due to scalability degradation when the number of processors becomes large, as this leads to an increasingly unfavourable communication-to-compute ratio. This issue is continuously worsened by the fact that communication speed has been increasing at a far slower pace than compute speed for the past several years, and this trend is expected to continue for years to come. It is often referred to as \emph{the memory wall}~\cite{ch5:Asanovic:EECS-2006-183}, one of the grand challenges facing development and architectural design of future high-performance systems~\cite{ch5:Keyes2011,ch5:ScientificGrandChallenges2010}. Also, there are applications based on ordinary differential equations, where classical domain decomposition methods are not even applicable~\cite{ch5:YMTR08}. For these type of applications, a method of adding parallelism in the temporal integration is of great interest. Contrary to space however, time is--by its very nature--sequential, which precludes a straightforward implementation of a parallel approach.
475 One method that introduces concurrency to the solution of evolution problems is the parareal algorithm. Parareal is an iterative method imposed on a time decomposition. Gander and Vandewalle showed in \cite{ch5:MS07} that the algorithm can be written both as a multiple shooting method and as a two-level multigrid-in-time approach, even though the leading idea came from spatial domain decomposition. The method has many exciting features: it is fault tolerant and has different communication characteristics than those of the classical domain decomposition methods. It has also been demonstrated to work effectively on a wide range of problems, and most importantly, once the proper distribution infrastructure is in place, it can easily be wrapped around any type of numerical integrator, for any type of initial value problem.
478 \subsubsection{The parareal algorithm}
480 The parareal algorithm was first presented in 2001, in a paper by Lions et al.~\cite{ch5:LMT01}, and later introduced in a slightly revised predictor-corrector form in 2002 by Baffico et al.~\cite{ch5:LSY02}. The parareal-in-time approach proposes to break the global problem of time evolution into a series of independent evolution problems on smaller intervals, see Figure \ref{ch5:fig:ParallelInTime}. %
484 \input{Chapters/chapter5/figures/ParallelInTime.tikz}
486 \caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time subdomain to compute the initial value problem. Consistency at the time subdomain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm}\label{ch5:fig:ParallelInTime}
488 Initial states for these problems are needed and supplied by a simple, less accurate, but computationally cheap sequential integrator. The smaller independent evolution problems can then be solved in parallel. The information, generated during the concurrent solution of the independent evolution problems with accurate propagators and inaccurate initial states, is used in a predictor-corrector fashion in conjunction with the coarse integrator to propagate the solution faster, now using the information generated in parallel. We define the decomposition into $N$ intervals, that is,
490 & T_{0}<T_{1}<\cdots<T_{n}=n\Delta T<T_{n+1}<T_{N},
492 where $\Delta T$ is the size of the time intervals and $n=0,1,\ldots,N$. The general initial value problem on the decomposed time domain is defined as
493 \begin{align}\label{ch5:eq:difproblem}
494 \frac{\partial u}{dt}+\mathcal{A}u=0, \qquad u(T_{0})=u^{0}, \quad t\in\left[T_{0},T_{N}\right],
496 where $\mathcal{A}$ is an operator from one Hilbert space to another. To solve the differential problem \eqref{ch5:eq:difproblem} we define an operator $\mathcal{F}_{\Delta T}$ that operates on some initial state $U_{n} \approx u(T_{n})$ and returns an approximate solution to \eqref{ch5:eq:difproblem}, at time $T_{n}+\Delta T$. Such an operator is achieved by the implementation of a numerical time integrator, using some small time-step $\delta t\ll\Delta T$ in the integration. The numerical solution to \eqref{ch5:eq:difproblem} can then be obtained by applying the fine propagator sequentially for $n=1,2,\ldots,N$.
497 \begin{align}\label{ch5:eq:fineOp}
498 \hat{U}_{n}=\mathcal{F}_{\Delta T}\left(T_{n-1},\hat{U}_{n-1}\right), \qquad\hat{U}_{0}=u^{0}.
501 For the purpose of parallel acceleration of the otherwise purely sequential process of obtaining $\mathcal{F}_{\Delta T}^{N} u^{0} \approx u(T_{N})$, we define the coarse propagator $\mathcal{G}_{\Delta T}$. $\mathcal{G}_{\Delta T}$ also operates on some initial state $U_{n}$, propagating the solution over the time interval $\Delta T$, but now using a time step $\delta T$. Typically $\delta t<\delta T<\Delta T$. For the parareal algorithm to be effective, the coarse propagator $\mathcal{G}_{\Delta T}$ has to be substantially faster to evaluate than the fine propagator $\mathcal{F}_{\Delta T}$. There are many ways of constructing the coarse propagator, the simplest one being to apply the same numerical integrator as for the fine propagator, but using a coarser time discretization. We refer the reader to \cite{ch5:ASNP12} for an introduction to other methods. The coarse operator reads
502 \begin{align}\label{ch5:eq:coarseOp}
503 \tilde{U}_{n}=\mathcal{\mathcal{G}}_{\Delta T}\left(T_{n-1},\tilde{U}_{n-1}\right),\qquad\tilde{U_{0}}=u^{0}.
505 Using the defined $\mathcal{F}_{\Delta T}$ and $\mathcal{G}_{\Delta T}$ operators, the predictor-corrector form of the parareal algorithm can be written in a single line as
506 \begin{equation}\label{ch5:eq:PARAREAL}
507 U_{n}^{k+1}=\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k+1}\right)+\mathcal{\mathcal{F}}_{\Delta T}\left(U_{n-1}^{k}\right)-\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k}\right),\quad U_{0}^{k}=u^{0},
509 with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. $N$ being the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time-integration component, using a fully distributed work scheduling model, as proposed by Aubanel~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer. It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time-integration using forward Euler for coarse propagation and fourth order Runge-Kutta for fine propagation could then be defined as in Listings \ref{ch5:lst:parareal}. The number of GPUs used for parallelization depends on the number of MPI processes executing the application.
510 \lstset{label=ch5:lst:parareal,caption={assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation}}
512 typedef gpulab::integration::forward_euler coarse;
513 typedef gpulab::integration::ERK4 fine;
514 typedef gpulab::integration::parareal<coarse,fine> integrator;
518 %Each application of $\mathcal{\mathcal{G}}_{\Delta T}$ or $\mathcal{\mathcal{F}}_{\Delta T}$ to a state $\tilde{U}_{n}^{k}$ is performed on a GPU while the corrections themselves are performed on CPUs. In addition to an efficient distribution model, a stopping strategy is needed as parareal is an iterative algorithm. Various choices has been implemented in the library, these are presented in section \ref{ch5:subsec:Lib_impl}.
523 \input{Chapters/chapter5/figures/FullyDistributed.tikz}
525 \caption[Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel.]{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel~\cite{ch5:EA10}. Each GPU is responsible for computing the solution on a single time subdomain. The computation is initiated at rank $0$ and cascades through to rank $N$ where the final solution can be fetched.}
528 \subsubsection{Computational complexity}
530 In the analysis of the computational complexity, we first recognize that both the coarse and the fine propagators, regardless of the type of discretization scheme, involve a complexity that is proportional to the number of time steps being used. Let us define two scalar values $\mathcal{C}_\mathcal{F}$ and $\mathcal{C}_\mathcal{G}$ as the computational cost of performing a single step with the fine and coarse propagators. The computational complexity of a propagator integrating over an interval $\Delta T$ is then given by $\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}$ and $\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}$, respectively. $R$ is introduced as the relation between the two; that is, $R$ is a measure of how much faster the coarse propagator is compared to the fine propagator in integrating the time interval $\Delta T$. The total computational cost for parareal over $N$ intervals is then proportional to
532 \left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+kN\mathcal{C_{F}}\frac{\Delta T}{\delta t}.
534 Recognizing that the second term can be distributed over $N$ processors, we are left with
535 \begin{align}\label{ch5:eq:ComComplexPara}
536 \left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+k\mathcal{C_{F}}\frac{\Delta T}{\delta t}.
538 The above should be compared to the computational complexity of a purely sequential propagation, using only the fine operator,
540 \frac{T_{N}-T_{0}}{\delta t}\mathcal{C}_\mathcal{F}=N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}.
542 We can now estimate the speedup, here denoted $\psi$, as the ratio between the computational complexity of the purely sequential solution, and the complexity of the solution obtained by the parareal algorithm \eqref{ch5:eq:ComComplexPara}. Neglecting the influence of communication speed and correction time, we are left with the estimate
543 \begin{align}\label{ch5:eq:EstiSpeedBasicPara}
544 \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}.
546 If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity.
549 \setlength\figureheight{0.32\textwidth}
550 \setlength\figurewidth{0.35\textwidth}
552 \subfigure[The relative error after one parareal iteration ($k=1$).]{
553 {\small\input{Chapters/chapter5/figures/pararealK1ErrvsRvsGPUs.tikz}}
554 \label{ch5:fig:pararealRvsGPUs:a}
556 \subfigure[Iterations $K$ needed to obtain a relative error less than $10^{-5}$.]{
557 {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
558 \label{ch5:fig:pararealRvsGPUs:b}
561 \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
565 \setlength\figureheight{0.32\textwidth}
566 \setlength\figurewidth{0.34\textwidth}
568 \subfigure[Measured parallel speedup]{
569 {\small\input{Chapters/chapter5/figures/pararealSpeedupvsRvsGPUs.tikz}}
570 \label{ch5:fig:pararealGPUs:a}
572 \subfigure[Measured parallel efficiency]{
573 {\small\input{Chapters/chapter5/figures/pararealEfficiencyvsRvsGPUs.tikz}}
574 \label{ch5:fig:pararealGPUs:b}
577 \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}
580 %TODO: Do we make this into a subsubsection:
581 %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl}
582 %Describe library features. Stopping criteria. Usage of different numerical integrators. Examples including c++ code. Speed-up measurements on test cases.
584 %\begin{figure}[!htb]
585 % \setlength\figureheight{0.32\textwidth}
586 % \setlength\figurewidth{0.35\textwidth}
588 % \subfigure[Parareal iterations $k$ needed for converge to a relative error of $10^{-6}$]{
589 % {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
590 % \label{ch5:fig:pararealGPUs:a}
592 % \subfigure[Measured speed-up]{
593 % {\small\input{Chapters/chapter5/figures/pararealSpeedupvsRvsGPUs.tikz}}
594 % \label{ch5:fig:pararealGPUs:b}
597 % \caption{Parareal performance and convergence properties as a function of $R$ and number GPUs used. $R$ is regulated by the time discretization in the coarse propagator. }\label{ch5:fig:pararealGPUs}
600 \section{Conclusion and outlook}
602 Massively parallel heterogeneous systems continue to enter the consumer market, and there has been no identification that this trend will stop for years to come. However, these parallel architectures require software vendors to adjust to new programming models and optimization strategies. Good software libraries are important tools for reducing the time and complexity of adjusting to new architectures, and they provide the user with an intuitive programming interface.
604 We have presented ideas for a generic GPU-based library for fast assembling of PDE solvers. A high-level interface and simple implementation concepts were created with the objective of enhancing developer productivity and to ensure performance portability. Two elementary PDE model problems were used to demonstrate assembling of both spatial and temporal approximation techniques. Results confirmed that numerical methods for solution of PDE systems are bandwidth limited on GPU systems. Therefore higher-order methods can be advantageous, when extra computational work can be performed during memory transfer idle times, leading to increased flop performance.
606 Decomposition strategies for spatial and temporal parallelization on multi-GPU systems have been presented, with reasonable performance speedups. Novel results for the parareal algorithm on multi-GPU systems have been presented, and an example of parallel efficiency for the heat conduction problem has been demonstrated. Furthermore, a domain decomposition technique that preserves algorithmic efficiency has been presented for the Poisson problem.
608 The library has already successfully been used for development of a fast tool intended for scientific applications within maritime engineering; see Chapter \ref{ch7} for details. We intend to further extend the library, as we explore new techniques, suitable for parallelization on heterogeneous systems, that fit the scope of our applications.
610 For more information about the library and our work within the GPUlab group at the Technical University of Denmark, we encourage you to visit \texttt{http://gpulab.imm.dtu.dk}
612 \section*{Acknowledgments}
613 This work have been supported by grant no. 09-070032 %on "Desktop Scientific Computing on Consumer Graphics Cards"
614 from the Danish Research Council for Technology and Production Sciences. A special thanks goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests were done in the GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at the Center for Computing and Visualization, Brown University, USA. The NVIDIA Corporation is acknowledged for generous hardware donations to facilities of the GPUlab.
616 %-------------------------------------------------------------------------
617 %\backmatter % bibliography, index and similar things, which are not numbered.
621 %References to include
622 %\cite{ch5:Gropp1999}
623 %\cite{ch5:Gropp1999a}
624 %\cite{ch5:LeVeque2007}
625 %\cite{ch5:Gamma1995}
626 %\cite{ch5:Hoefler2011}
627 %\cite{ch5:Bajrovic2011}
628 %\cite{ch5:Skjellum1994}
629 %\cite{ch5:Glimberg2011}
630 %\cite{ch5:Engsig-Karup2011}
631 %\cite{ch5:Smith1996}
632 %\cite{ch5:Vandevoorde2002}
634 %\cite{ch5:mooreslaw1965}
636 \putbib[Chapters/chapter5/biblio5]
638 % Reset lst label and caption
639 %\lstset{label=,caption=}