1 \chapterauthor{Stefan L. Glimberg}{Technical University of Denmark}
2 \chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark}
3 \chapterauthor{Allan S. Nielsen}{Technical University of Denmark}
4 \chapterauthor{Bernd Dammann}{Technical University of Denmark}
8 \chapter{Development of software components for heterogeneous many-core architectures}\label{ch5}
13 %\item Performance portable tuning techniques via a modern parallel programming model (CUDA) on emerging GPU architectures.
14 %\item distributed heterogenous computing via MPI+CUDA/OpenCL with domain decomposition methods for PDE solvers.
15 %\item efficient and scalable iterative methods for solution of high-order numerical methods and strategies for efficient implementations on desktop architectures.
19 \section{Software development for heterogeneous architectures}
20 %Our library facilitates massively parallelization through GPU computing and contains components for various iterative strategies such as DC, CG, CGNR, BiCGSTAB and GMRES(m) for solution of large linear systems along with support for preconditioning strategies. The goal is to create a reusable library and framework which provide general components with performance similar to that of a dedicated solver. Preliminary results show that performance overhead can be kept minimal in this new software framework [8].
23 %Within the past decade we have therefore seen that development of applications based on high-performance parallel computing, is expanding from academic and industrial companies with financial support to invest in parallel computers, into any application developer with access to modern hardware. This shift has not only been motivated by the successful development of programmable GPUs, but by the architectural changes from single-core to multi-core architectures in general. Rapidly increasing power and heating problems have forced the continuous evolution of single core architectures to stop. The solution from the chip manufacturers has been to fit multiple processors into the same unit, in order to maintain the same overall peak performance rates. Though these new parallel architectures can be highly effective compute units, they often pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main- and co-processor can be so significant that they call for completely different optimization strategies. The cache hierarchy and management of CPUs and GPUs being an evident example hereof.
26 %using a library makes it easier for researchers to implement their problem without having to think about the complexity of the underlying computer hardware. (keyword: Opacity, [Berkeley dwarf paper])
28 % a library can support several architectures, and thus the high-level code does not have to be changed when moving to a new architecture, e.g. Thrust's support for CUDA, OpenMP and TBB.
30 % there is a price/penalty to pay when using libraries, i.e. one doesn't necessarily get the best performance, because the architectural details are not visible to the programmer (keyword: Visibility [Berkeley dwarf paper]). However, if the library permits to write own add-ons/kernels (flexibility), this is no longer an issue.
32 Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast amount of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are now programmable and can be highly effective compute units, they still pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main- and co-processor can be so significant, that they call for completely different optimization strategies. The cache hierarchy management of CPUs and GPUs being an evident example hereof. In the past, industrial companies were able to boost application performance solely by upgrading their hardware systems, with an overt balance between investment and performance speedup. Today, the picture is different, not only do they have to invest in new hardware, but also account for the adaption and training of their software developers. What traditionally used to be a hardware problem, addressed by the chip manufacturers, has now become a software problem for application developers.
34 Software libraries \index{software library} \index{library|see{software library}} can be a tremendous help for developers as they make it easier to implement an application, without having to know about the complexity of the underlying computer hardware, known as \emph{opacity}~\cite{ch5:Asanovic:EECS-2006-183}. The ultimate goal for a successful library is to simplify the process of writing new software and thus to increase developer productivity. Since programmable heterogeneous CPU/GPU systems are a rather new phenomenon, there is yet a limited number of established software libraries that take full advantage of such heterogeneous high performance systems, and there are no de-facto design standards for such systems either. Some existing libraries for conventional homogeneous systems have already added support for offloading computational intense operations onto co-processing GPUs. However, this approach comes with the cost of frequent memory transfers across the low bandwidth PCIe bus.
36 In this chapter, we focus on the use of a software library to help application developers achieve their goals without spending immense time on optimization details, while still offering close-to-optimal performance. A good library provides performance portable implementations with intuitive interfaces, that hide the complexity of underlaying hardware optimizations. Unfortunately, this opaqueness sometimes comes with a price, as one does not necessarily get the best performance when the architectural details are not \emph{visible} to the programmer~\cite{ch5:Asanovic:EECS-2006-183}. If, however, the library is flexible enough and permits developers to supply their own low-level implementations as well, this does not need to be an issue. These are some of the considerations library developers should make, and what we will try to address in this chapter.
39 For demonstrative purposes we present details from an in-house generic CUDA-based C++ library for fast assembling of partial differential equation (PDE) solvers, utilizing the computational resources of GPUs. It has been developed as part of research activities associated with GPUlab, at the Technical University of Denmark, and therefore referred to as the GPUlab library. It falls into the category of computational libraries, as categorized by Hoefler and Snir~\cite{ch5:Hoefler2011}. Memory allocation and basic algebraic operations are supported via object oriented components, without the user having to write CUDA specific kernels. As back-end vector class, the CUDA Thrust library is used, enabling easy memory allocation and a high-level interface for vector manipulation~\cite{ch5:Bell2011}. Inspirations for \emph{good library design}, some of which we will present in this chapter, originates from guidelines proposed throughout literature~\cite{ch5:Hoefler2011,ch5:Gamma1995,ch5:Skjellum1994}. An identification of desirable properties, that any library should strive to achieve, is pointed out by Korson and McGregor~\cite{ch5:Korson1992}, in particular we mention easy-to-use, extensible, and intuitive.
42 %We rely on template-based implementations to allow full flexibility to change and assemble types~\cite{ch5:Vandevoorde2002}
44 %The library presented in this chapter falls into the category of computational libraries, as defined by Hoefler and Snir~\cite{ch5:Hoefler2011}, and seeks to fulfil the relevant objectives presented, such as designing components around data structures (classes), utilizing inheritance for component reuse, and hiding irrelevant information from the end users. An identification of desirable attributes for library development is also pointed out by Korson and McGregor~\cite{ch5:Korson1992}, in particular the library is designed to be easy-to-use, extensibility, and intuitive.
46 The library is designed to be effective and scalable for fast prototyping of PDE solvers, (primarily) based on matrix-free implementations of finite difference (stencil) approximations on logically structured grids. It offers functionalities that will help assemble PDE solvers that automatically exploit heterogeneous architectures, much faster than manually having to manage GPU memory allocation, memory transfers, kernel launching, etc.
48 In the following sections we demonstrate how software components that play important roles in many scientific applications can be designed to fit a simple framework that will run efficiently on heterogeneous systems. One example is finite difference approximations, commonly used to find numerical solutions to differential equations. Matrix-free implementations minimize both memory consumption and main memory access, two important features for efficient GPU utilization and for enabling solution of large-scale problems. The bottleneck problem for many PDE applications, is to find the solution of a large sparse linear systems, arising from the discretization. In order to help solve these systems, the library includes a set of iterative solvers. All iterative solvers are template-based, such that vector and matrix classes, along with their underlying implementations, can be freely interchanged or whole new solvers can even be implemented without much coding effort. The generic nature of the library, along with a predefined set of interface rules, allow assembling of components into PDE solvers. The use of parameterized type binding, allows the user to control the assembling of PDE solvers at a high abstraction level, without having to change the remaining implementation.
50 Since this chapter is mostly dedicated to the discussion of software development for high performance heterogeneous systems, the focus will be merely on the development and usage of an in-house GPU-based library, more than on specific scientific applications. We demonstrate how to use the library on two elementary model problems, and refer to Chapter \ref{ch7} for a detailed description of an advanced application tool for free surface water wave simulations. These examples are assembled using the library components presented in this chapter.
52 % Make sure that we present a way to implement PDE solver components, and that our library is an example that demonstrate the use.
54 %Well designed standards for message passing across multiple platforms have been successfully developed, with MPI being the most well known \cite{ch5:Gropp1999,ch5:Gropp1999a}.
56 %\stefan{Add more motivation, power wall + memory wall + ILP (instruction-level parallelism) wall = brick wall.. }
58 %- Motivation for using libraries in general. Emphasize the computational power of GPUs and how they can be used for high-performance scientific computing.
60 %- Motivate our library, how we fit in.
62 %- Introduce the model problem that we use the demonstrate the use of our library.
64 %- There is a tendency to that the more performance is needed, the more low-level and difficult the programming gets. We strive to achieve such good performance, but without compromising too much easiness.
66 %- The following sections are ordered as follows...
68 \subsection{Test environments}\label{ch5:sec:testenvironments}
69 Throughout the chapter we have used two different test environments; a high-end desktop computer located at the GPUlab -- Technical University of Denmark, and a GPU cluster located at the Center for Computing and Visualization, Brown University, USA. Hardware details for the two systems are as follows:
71 \item[Test environment 1.] Desktop computer, Linux Ubuntu v10.04, dual Intel Xeon E5620 (2.4 GHz) quad-core Westmere processor, 12 GB of DDR-3 memory (1066 MHz), 2x Nvidia GeForce 590 GPU with 3GB DDR5 memory, PCIe 2.0.
72 \item[Test environment 2.] GPU cluster, Linux, up to 44 compute nodes based on dual Intel Xeon E5540 (2.53 GHz) quad-core Nehalem processors, 24 GB of DDR-3 memory (1333 MHz), 2x Nvidia Tesla M2050 GPUs with 3GB GDDR5 memory, 40 Gb/s Quad-Data-Rate (QDR) InfiniBand interconnect.
76 \section{Heterogenous library design for PDE solvers}
77 A generic CUDA-based C++ library has been developed to ease the assembling of PDE solvers. The template \index{templates} based design allows users to assemble solver parts easily, and to supply their own implementation of problem specific parts. In the following, we present an overview of the library and the supported features, we introduce the concepts of the library components and give short code examples to ease understanding. The library is a starting point for fast assembling of GPU-based PDE solvers, developed mainly to support finite difference operations on regular grids. However, this is not a limitation, since existing vector objects could be used as base classes for extending to other discretization methods or grid types as well.
79 \subsection{Component and concept design}
80 The library is grouped into component classes. Each component should fulfill a set of simple interface and template rules, called concepts \index{concept}, in order to guarantee compatibility with the rest of the library. In the context of PDE solving, we present five component types; vectors, matrices, iterative solvers for linear system of equations, preconditioners for the iterative solvers, and time integrators. Figure \ref{ch5:fig:componentdesign} lists the five components along with the type definitions they must provide, and the methods they should implement. It is possible to extend the implementation of these components with more functionalities, that relate to specific problems, but this is the minimum requirement for compatibility with the remaining library. With these rather concept rules fulfilled, components can rely on other components to have implementations of their member functions.
84 \input{Chapters/chapter5/figures/component_design.tikz}
86 \caption{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known in beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign}
89 A component is implemented as a generic C++ class, that normally takes as template arguments the same types that it offers through type definitions: a matrix takes a vector class as template argument, and a vector class takes a working precision value type. The matrix can then access the working precision via the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to maintain code simplicity. We will demonstrate use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}.
91 The generic configuration allows the developer to define and assemble solver parts at the very beginning of the program, using type definitions. Changing PDE solver parts at a later time is then only a matter of changing the proper definitions. We will give examples of how to assemble PDE solvers in section \ref{ch5:sec:modelproblems}.
93 \subsection{A matrix-free finite difference component}\index{finite difference}\index{matrix-free}
94 Common vector operations, such as memory allocation, element-wise assignments, and basic algebraic transformations, require many lines of codes for a purely CUDA-based implementation. These CUDA specific operations and kernels are hidden from the user behind library implementations, to ensure a high abstraction level. The vector classes inherit from the CUDA-based Thrust library and therefore offer the same level of abstraction, that enhance developer productivity and enables performance portability. Creating and allocating device (GPU) memory for two vectors can be done in a simple and intuitive way using the GPUlab library, as shown in the following example where two vectors are added together:
96 \lstinputlisting[label=ch5:lst:vector,caption={Allocating, initializing, and adding together two vectors on the GPU. First example are using pure CUDA C, second example use the built-in library template-based vector class.}]{Chapters/chapter5/code/ex1.cu}
98 The vector class (and derived classes hereof) works with the rest of the library components. Therefore we encourage to use this class as the base vector object. Matrix-vector multiplications are usually what makes PDE-based applications differ, and the need to write a user specific implementation of the matrix-vector product is essential to solve specific PDE problems. The PDE and the choice of discretization method determine the structure and sparsity of the resulting matrix. Spatial discretization is supported by the library with finite difference approximations, and it offers an efficient low-storage (matrix-free) flexible order\index{flexible order} implementation, to help developers tailor their custom codes. These matrix-free operators are feasible for problems where the matrix structure is known in advance and can be exploited, such that the matrix values can either be precomputed or computed on-the-fly. Furthermore, the low constant memory requirement makes them perfect in the context of solving large scale problems, whereas traditional sparse matrix formats require increasingly more memory~\cite{ch5:Bell2009}.
100 Finite differences approximate the derivative of some function $u(x)$ as a weighted sum of neighboring elements. In compact notation we write
101 \begin{align}\label{ch5:eq:fdstencil}
102 \frac{\partial^q u(x_i)}{\partial x^q} \approx \sum_{n=-\alpha}^{\beta}c_n u(x_{i+n}),
104 where $q$ is the order of the derivative, $c_n$ is a set of finite difference coefficients, where $\alpha$ plus $\beta$ define the number of coefficients that are used for the approximation. The total set of contributing elements is called the stencil,\index{stencil} and the size of the stencil is called the rank, given as $\alpha + \beta + 1$. The stencil coefficients $c_n$ can be derived from a Taylor expansion based on the values of $\alpha$, $\beta$, and for a specific order $q$, using the method of undetermined coefficients~\cite{ch5:LeVeque2007}.
105 An example of a three-point finite difference matrix that approximates the first ($q=1$) or second ($q=2$) derivative of a one dimensional uniformly distributed vector $u$ of length $8$ is given here
106 \begin{eqnarray}\label{ch5:eq:stencilmatrix}
107 \left[\begin{array}{c c c c c c c c}
108 c_{00} & c_{01} & c_{02} & 0 & 0 & 0 & 0 & 0 \\
109 c_{10} & c_{11} & c_{12} & 0 & 0 & 0 & 0 & 0 \\
110 0 & c_{10} & c_{11} & c_{12} & 0 & 0 & 0 & 0 \\
111 0 & 0 & c_{10} & c_{11} & c_{12} & 0 & 0 & 0 \\
112 0 & 0 & 0 & c_{10} & c_{11} & c_{12} & 0 & 0 \\
113 0 & 0 & 0 & 0 & c_{10} & c_{11} & c_{12} & 0 \\
114 0 & 0 & 0 & 0 & 0 & c_{10} & c_{11} & c_{12} \\
115 0 & 0 & 0 & 0 & 0 & c_{20} & c_{21} & c_{22}
117 \left[\begin{array}{c}
128 \left[\begin{array}{c}
139 It is clear from this example, that the matrix is sparse and that there are many repetitions of the same coefficients. Notice, that the coefficients only differ near the boundaries, where off-centered stencils are used. It is natural to pack this information into a stencil operator, that stores only the unique set of coefficients:
140 \begin{eqnarray}\label{ch5:eq:stencilcoeffs}
142 \left[\begin{array}{c c c}
143 c_{00} & c_{01} & c_{02} \\
144 c_{10} & c_{11} & c_{12} \\
145 c_{20} & c_{21} & c_{22} \\
148 Library matrix components calculate these compact stencil coefficient matrices, and implements member functions that compute the finite difference approximation to vector objects. Unit scaled coefficients (assuming grid spacing is one) are computed and stored to be accessible both via CPU and GPU memory. On the GPU, the constant memory space is used for faster memory access~\cite{ch5:cudaguide}. In order to apply a stencil on a non-unit spaced grid, with grid space $\Delta x$, a scale factor of $1/(\Delta x)^q$ will have to be multiplied to the finite difference sum, i.e. $(c_{00}u_0 + c_{01}u_1 + c_{02}u_2)/(\Delta x)^q \approx u^{(q)}_0$.
150 Setting up a two dimensional grid of size $N_x \times N_y$ in the unit square and computing the first derivative hereof, is illustrated in Listing~\ref{ch5:lst:stencil}. The grid is a vector component, derived from the vector class. It is by default treated as a device object and memory is automatically allocated on the device to fit the grid size. The finite difference approximation as in \eqref{ch5:eq:fdstencil}, are performed via a CUDA kernel behind the scenes during the calls to mult and diff\_x, utilizing the memory hierarchy as the CUDA guidelines prescribe~\cite{ch5:cudaguide,ch5:cudapractice}. To increase developer productivity, kernel launch configurations have default settings, based on CUDA guidelines principles and experiences from performance testings, such that the user does not have to explicitly specify them. For problem specific finite difference approximations, where the built-in stencil operators are insufficient, a pointer to the coefficient matrix \eqref{ch5:eq:stencilcoeffs} can be passed to customized kernels.
152 \lstinputlisting[label=ch5:lst:stencil,caption={Two dimensional finite difference stencil example. Computing the first derivative using five points ($\alpha=\beta=2$) per dimension, a total nine point stencil.}]{Chapters/chapter5/code/ex2.cu}
154 In the following sections we demonstrate how to go from a PDE posed as an initial value problem (IVP) or a boundary value problem (BVP) to a working application solver, by combining existing library components along with new custom tailored components. We also demonstrate how to apply spatial and temporal domain decomposition strategies, that can make existing solvers take advantage of systems equipped with multiple GPUs. In the following we demonstrate how to rapidly assemble a PDE solver using library components.
156 \section{Model problems}\label{ch5:sec:modelproblems}
158 %Physical phenomenons are at all times present around us. Many of them are well defined via the laws of physics and can be described via differential equations. Though many differential equations are well defined, their solutions are not, and the need for computers to approximate and simulate solutions are inevitable. The accuracy of these simulations is critical for the quality of scientific applications within all fields of engineering. Improved accuracy often comes with the price of increased computational times, that can even be too overwhelming for a single computer to run. Fortunately, the fundamental change in chip-design led development of single core systems into multi- and many core high-performance architectures, that perfectly fits the engineering needs for increased computational resources. The massively parallel architecture of GPUs is a particular good match for accurate methods with high computational intensity\cite{ch5:Kloeckner2011}.
160 We present two elementary PDE model problems, to demonstrate how to assemble PDE solvers, using library components that follow the guidelines described above. The first model problem is the unsteady parabolic heat conduction equation, the second model problem is the elliptic Poisson equation. The two model problems consist of elements that play important roles in solving a broad range of more advanced PDE problems.
162 We refer to Chapter \ref{ch7} for an example of a scientific application relevant for coastal and maritime engineering analysis, that has been assembled using customized library components similar to those presented in the following.
164 \subsection{Heat conduction equation}\index{heat conduction}
165 First, we consider a two dimensional heat conduction problem defined on a unit square. The heat conduction equation is a parabolic partial differential diffusion equation, including both spatial and temporal derivatives. It describes how the diffusion of heat in a medium changes with time. Diffusion equations find great importance in many fields of sciences, e.g. fluid dynamics, where the fluid motion is uniquely described by the Navier-Stokes equations, which include a diffusive viscous term~\cite{ch5:chorin1993,ch5:Ferziger1996}.%, or in financial science where diffusive terms are present in the Black-Scholes equations for estimation of option price trends~\cite{}.
167 The heat problem is an IVP \index{initial value problem} that describes how the heat distribution evolves from a specified initial state. Together with homogeneous Dirichlet boundary conditions\index{boundary conditions}, the heat problem is given as
168 \begin{subequations}\begin{align}
169 \frac{\partial u}{\partial t} - \kappa\nabla^2u = 0, & \qquad (x,y)\in \Omega([0,1]\times[0,1]),\label{ch5:eq:heateqdt}\\
170 u = 0, & \qquad (x,y) \in \partial\Omega,\label{ch5:eq:heateqbc}
171 \end{align}\label{ch5:eq:heateq}\end{subequations}
172 where $u(x,y,t)$ is the unknown heat distribution, $\kappa$ is a heat conductivity constant (assume $\kappa=1$ from now) and $\nabla^2$ is the two dimensional Laplace\index{Laplace operator} differential operator $(\partial_{xx}+\partial_{yy})$. We use the following initial condition, because it has a known analytical solution over the entire time span, and it satisfies the homogeneous boundary condition given by \eqref{ch5:eq:heateqbc},
173 \begin{align}\label{ch5:eq:heatinit}
174 u(x,y,t_0) = \sin(\pi x)\,\sin(\pi y), & \qquad (x,y) \in \Omega.
176 An illustrative example of the numerical solution to the heat problem, using \eqref{ch5:eq:heatinit} as the initial condition is given in Figure \ref{ch5:fig:heatsolution}.
177 \begin{figure}[!htbp]
179 \setlength\figurewidth{0.3\textwidth} %
180 \setlength\figureheight{0.3\textwidth} %
181 \subfigure[$t=0.00s$]%{\input{Chapters/chapter5/figures/HeatSolution0.tikz}}
182 {\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_conv.pdf}}
183 \subfigure[$t=0.05s$]%{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}}
184 {\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_049307_conv.pdf}}
185 %\subfigure[$t=0.10s$]{\input{Chapters/chapter5/figures/HeatSolution0.099723.tikz}}
186 {\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_099723_conv.pdf}}
188 \caption{Discrete solution at times $t=0s$ and $t=0.05s$, using \eqref{ch5:eq:heatinit} as initial condition and a small $20\times20$ numerical grid.}\label{ch5:fig:heatsolution}
191 We use a Method of Lines (MoL)\index{method of lines} approach to solve \eqref{ch5:eq:heateq}. Thus, the spatial derivatives are replaced with finite difference approximations, leaving only the temporal derivative as unknown. The spatial derivatives are approximated from $\myvec{u}^n$, where $\myvec{u}^n$ represents the approximate solution to $u(t_n)$ at a given time $t_n$ with time step size $\delta t$ such that $t_n=n\delta t$ for $n=0,1,\ldots$. The finite difference approximation\index{finite difference} can be interpreted as a matrix-vector product as sketched in \eqref{ch5:eq:stencilmatrix}, and so the semi-discrete heat conduction problem becomes
192 \begin{align}\label{ch5:eq:discreteheateq}
193 \frac{\partial u}{\partial t} = \mymat{A}\myvec{u}, \qquad \mymat{A} \in \mathbb{R}^{N\times N}, \quad \myvec{u} \in \mathbb{R}^{N},
195 where $\mymat{A}$ is the sparse finite difference matrix and $N$ is the number of unknowns in the discrete system. The temporal derivative is now free to be approximated by any suitable choice of a time integration method\index{time integration}. The most simple integration scheme would be the first order explicit forward Euler's method\index{forward Euler}, resulting in an update on the form
196 \begin{align}\label{ch5:eq:forwardeuler}
197 \myvec{u}^{n+1} = \myvec{u}^n + \delta t\,\mymat{A}\myvec{u}^n,
199 where $n+1$ refers to the solution at the next time step. The forward Euler method can be exchanged with alternative high-order accurate time integration methods, such as Runge-Kutta methods or linear multistep methods, if numerical instability becomes an issue, see e.g. \cite{ch5:LeVeque2007} for details on numerical stability analysis. For demonstrative purpose, we simply use conservative time step sizes to avoid stability issues. However, the component based library design provides exactly the flexibility for the application developer to select or change PDE solver parts, such as the time integrator, with little coding effort. A generic implementation of the forward Euler method, that satisfy the library concept rules is illustrated in Listing~\ref{ch5:lst:euler}. According to the component guidelines in Figure \ref{ch5:fig:componentdesign}, a time integrator is basically a functor, which means that it implements the parenthesis operator, taking five template arguments: a right hand side operator, the state vector, integration start time, integration end time, and a time step size. The method takes as many time steps necessary to integrate from the start till end, continuously updating the state vector according to \eqref{ch5:eq:forwardeuler}. Notice, that nothing in Listing~\ref{ch5:lst:euler} indicates wether GPUs are used or not. However, it is likely that the underlying implementation of the right hand side functor and the \texttt{axpy} vector function, that can be used for scaling and summation of vectors, do rely on fast GPU kernels. However, it is not something that the developer of the component has to account for. For this reason, the template-based approach, along with simple interface concepts, make it easy to create new components that will fit well into the library.
201 \lstinputlisting[label=ch5:lst:euler,caption={Generic implementation of explicit first order forward Euler integration.}]{Chapters/chapter5/code/ex3.cu}
203 A basic approach to solve the heat conduction problem has now been outlined, and we are ready to assemble the PDE solver.
205 %More accurate methods, like variations of Runge-Kutta methods or multi-step methods might be desirable, particulary if the accuracy of the spatial approximation is also high, see e.g. LeVeque for examples hereof~\cite{ch5:LeVeque2007}.
207 %The following section demonstrates how to rapidly assemble a solver for the heat problem just presented, with configurable order of accuracy for both the spatial and temporal approximations. Section \ref{ch5:sec:results} presents performance measures and Section \ref{ch5:sec:multigpu} provides details for further optimization strategies using multiple GPUs.
211 \subsubsection{Assembling the heat conduction solver}
212 Before we are able to numerically solve the discrete heat conduction problem \eqref{ch5:eq:heateq}, we need implementations to handle the the following items
214 \item[Grid] - A discrete numerical grid, to represent the two dimensional heat distribution domain and the arithmetical working precision ($32$-bit single precision or $64$-bit double precision).
215 \item[RHS] - A right hand side operator for \eqref{ch5:eq:discreteheateq}, that approximates the second order spatial derivatives (matrix-vector product).
216 \item[Boundary conditions] - A strategy, that ensures that the Dirichlet conditions are satisfied on the boundary.
217 \item[Time integrator] - A time integration scheme, that approximates the time derivative from equation \eqref{ch5:eq:discreteheateq}.
219 All items are either directly available in the library or can be designed from components herein. The built-in stencil operator may assist in implementing the matrix-vector product, but we need to explicitly ensure that the Dirichlet boundary conditions are satisfied. We demonstrated in Listing \ref{ch5:lst:stencil} how to approximate the derivative using flexible order finite difference stencils. However, from \eqref{ch5:eq:heateqbc} we know, that boundary values are zero. Therefore we extend the stencil operator with a simple kernel call that assigns zero to the entire boundary. Listing \ref{ch5:lst:laplaceimpl} shows the code for the two dimensional Laplace right hand side operator. The constructor takes as argument the stencil half size $\alpha$ and assumes $\alpha=\beta$. Thus the total two dimensional stencil rank will be $4\alpha+1$. For simplicity we also assume that the grid is uniformly distributed, $N_x=N_y$. Performance optimizations for the stencil kernel, such as shared memory utilization, is taken care of in the underlying implementation, accordingly to CUDA guidelines~\cite{ch5:cudaguide,ch5:cudapractice}. The two macros, BLOCK1D and GRID1D, are used to help set up kernel configurations based on grid sizes.
221 \lstinputlisting[label=ch5:lst:laplaceimpl,caption={The right hand side Laplace operator. The built-in stencil approximates the two dimensional spatial derivatives, while the custom set\_dirichlet\_bc kernel takes care of satisfying boundary conditions.}]{Chapters/chapter5/code/laplacian.cu}
223 With the right hand side operator in place, we are ready to implement the solver. For this simple PDE problem we compute all necessary initial data in the body of the main function and use the forward Euler time integrator to compute the solution till $t=t_{end}$. For more advanced PDE solvers, a built-in \texttt{ode\_solver} class is defined, that helps taking care of initialization and storage of multiple state variables. Declaring type definitions for all components at the beginning of the main file gives a good overview of the solver composition. In this way, it will be easy to control or change solver components at later times. Listing~\ref{ch5:lst:heattypedefs} lists the type definitions\index{type definitions} that are used to assemble the heat conduction solver. %\stefan{elaborate on the choices}.
225 \lstset{label=ch5:lst:heattypedefs,caption={Type definitions for all the heat conduction solver components used throughout the remaining code.}}
227 typedef double value_type;
228 typedef laplacian<value_type> rhs_type;
229 typedef gpulab::grid<value_type> vector_type;
230 typedef vector_type::property_type property_type;
231 typedef gpulab::integration::forward_euler time_integrator_type;
233 The grid is by default treated as a device object and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a non-periodic domain within the unit square and with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly.
235 \lstset{label=ch5:lst:gridsetup,caption={Create 2D grid of discrete dimension \texttt{N} times \texttt{N} and physical dimension $0$ to $1$.}}
237 // Setup discrete and physical dimensions
238 gpulab::grid_dim<int> dim(N,N,1);
239 gpulab::grid_dim<value_type> p0(0,0);
240 gpulab::grid_dim<value_type> p1(1,1);
241 property_type props(dim,p0,p1);
244 vector_type u(props);
246 Hereafter the vector \texttt{u} can be initialized accordingly to \eqref{ch5:eq:heatinit}. Finally we need to instantiate the right hand side Laplacian operator from Listing \ref{ch5:lst:laplaceimpl} and the forward Euler time integrator in order to integrate from $t_0$ till $t_{end}$.
248 \lstset{label=ch5:lst:timeintegrator,caption=Create time integrator and the right hand side Laplacian operator.}
250 rhs_type rhs(alpha); // Create right hand side operator
251 time_integrator_type solver; // Create time integrator
252 solver(&rhs,u,0.0f,tend,dt); // Integrate from to tend using dt
254 The last line invokes the forward Euler time integration scheme, defined in Listing \ref{ch5:lst:heattypedefs}. If the developer decides to change the integrator into another explicit scheme, only the time integrator type definition in Listing \ref{ch5:lst:heattypedefs} needs to be changed. The heat conduction solver is now complete.%The simple model problem illustrated, that a well designed library upholds developer productivity, by minimizing the
256 % how few CUDA implementations that were needed, and thereby upholding developer productivity.
258 %\todo{(How) do we give access to the entire file?}
260 \subsubsection{Numerical solutions to the heat conduction problem}
262 Solution times for the heat conduction problem is in itself not very interesting, as it is only a simple model problem. What is interesting for GPU kernels, like the finite differences kernel, is that increased computational work often comes with a very little price. This is because the relative fast computations can be hidden by the slower memory fetches. Therefore we are able to improve the accuracy of the numerical solution via more accurate finite differences (larger stencil sizes), while improving the computational performance in terms of floating point operations per second (flops). Figure \ref{ch5:fig:stencilperformance} confirms, that larger stencils improve the kernel performance. Notice that even though these performance results are favorable compared to single core systems ($\sim 10$ GFlops double precision on a $2.5$-GHz processor), it is still far from the peak performance of the GeForce GTX590 ($\sim 2.4$ TFlops single precision). The reason is that the kernel is bandwidth bound, i.e. performance is limited by the time it takes to move memory between the global GPU memory and the chip. However, this is a general limitation for matrix-vector-like operations that approximate solutions to discretized PDE problems. Only matrix-matrix operations, that have a high ratio of computations versus memory transactions, are able to reach near-optimal performance results~\cite{ch5:Kirk2010}. These kind of operators are, however, rarely used to solve PDE problems.
265 \setlength\figureheight{0.3\textwidth}
266 \setlength\figurewidth{0.4\textwidth}
269 %\input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz}
270 {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}}
273 \caption{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance}
279 \subsection{Poisson equation}\index{Poisson equation}
280 The Poisson equation is a second-order elliptic differential equation, often encountered in applications within scientific fields like electrostatics and mechanics. We consider the two dimensional BVP \index{boundary volume problem} defined in terms of Poisson's equation with homogeneous Dirichlet boundary conditions on the form
281 \begin{subequations}\begin{align}
282 \nabla^2 u = f(x,y),& \qquad (x,y) \in \Omega([0,1]\times[0,1]), \\
283 u = 0,& \qquad (x,y) \in \partial\Omega.
284 \end{align}\label{ch5:eq:poissoneq}\end{subequations}
285 Notice the similarities to the heat conduction equation \eqref{ch5:eq:heateq}. In fact, this could be a steady state solution to the heat equation, when there is no temporal change $\frac{\partial u}{\partial t}=0$, but a source term $f(x,y)$. Since the Laplace operator and the boundary conditions are the same for both problems, we are able to reuse the same implementation with few modifications.
287 Opposite to the heat equation, there are no initial conditions. Instead, we seek some $u(x,y)$ that satisfies \eqref{ch5:eq:poissoneq}, given a source term $f(x,y)$, on the right hand side. For simplicity, assume that we know the exact solution, $u_{\textrm{true}}$, given as \eqref{ch5:eq:heatinit}. Then we use the method of manufactured solutions to derive an expression for the corresponding right hand side $f(x,y)$:
289 f(x,y) = \nabla^2 u_{\textrm{true}} = -2\pi^2\,\sin(\pi x)\,\sin(\pi y). \label{ch5:eq:poissonrhs}
291 The spatial derivative in \eqref{ch5:eq:poissoneq} is again approximated with finite differences\index{finite difference}, similar to the example in \eqref{ch5:eq:stencilmatrix}, except boundary values are explicitly set to zero. The discrete form of the system can now be written as a sparse linear system of equations
293 \mymat{A}\myvec{u}=\myvec{f}, \quad \mymat{A} \in \mathbb{R}^{N\times N}, \label{ch5:eq:poissonsystem}
295 where $\mymat{A}$ is the sparse matrix formed by finite difference coefficients, $N$ is the number of unknowns and $\myvec{f}$ is given by \eqref{ch5:eq:poissonrhs}. Equation \eqref{ch5:eq:poissonsystem} can be solved in numerous ways, but a few observations may help us do it efficiently. Direct solvers based on Gaussian elimination are accurate and use a finite number of operations for a constant problem size. However, the arithmetic complexity grows with the problem size with as much as $\mathcal{O}(N^3)$ and do not exploit the sparsity of $\mymat{A}$. Direct solvers are therefore mostly feasible for dense systems of limited sizes. Sparse direct solvers exist, but they are often difficult to parallelize, or only applicable for certain types of matrices. Regardless of the discretization technique, the discretization of an elliptic PDE into a linear system like \eqref{ch5:eq:poissonsystem}, yields a very sparse matrix $\mymat{A}$ when $N$ is very large. Iterative methods\index{iterative methods} for solving large sparse linear systems find broad use in scientific applications, because they only require an implementation of the matrix-vector product, and often use a limited amount of additional memory. Comprehensive introductions to iterative methods may be found in any of~\cite{ch5:Saad2003,ch5:Kelley1995,ch5:Barrett1994}.
297 One benefit of the high abstraction level and the template-based library design, is to allow developers to implement their own components, such as iterative methods for solving sparse linear systems. The library includes three popular iterative methods: conjugate gradient, defect correction\index{defect correction}, and geometric multigrid. The conjugate gradient method is applicable only to systems with symmetric positive definite matrices. This is true for the two dimensional Poisson problem, when it is discretized with a five point finite difference stencil, because then there will be no off-centered approximations near the boundary. For high-order approximations ($\alpha>1$), we use the defect correction method with multigrid preconditioning. See \cite{ch5:Trottenberg2001} for details on high-order multigrid methods.
299 We will not present the implementation details for all three methods, but briefly demonstrate the simplicity of implementing the body of such an iterative solver, given a textbook recipe or mathematical formulation. The defect correction method iteratively improves the solution to $\mymat{A}\myvec{x}=\myvec{b}$, given an initial start guess $\myvec{x}^0$, by continuously solving a preconditioned error equation. One defect correction \index{defect correction!iteration} iteration can be written via mathematical notation as
301 \myvec{x}^{k+1} = \myvec{x}^{k} + \mymat{M}^{-1}(\myvec{b}-\mymat{A}\myvec{x}^{k}), \quad \mathcal{M}\in\mathbb{R}^{N\times N}, \quad {\bf x},{\bf b}\in\mathbb{R}^N \label{ch5:eq:dc}
303 where $k$ is the iteration number and $\mymat{M}$ is the preconditioner\index{preconditioning} which is based on approximation to the original coefficient matrix. To achieve efficient algebraic convergence\index{convergence}, a solve step with the preconditioner should be computational inexpensive compared to using $\mymat{A}$. How to implement \eqref{ch5:eq:dc} within the library context is illustrated in Listing \ref{ch5:lst:dc}. The host CPU is responsible for traversing through each line in Listing \ref{ch5:lst:dc} and testing for convergence, while the computationally expensive matrix-vector operation and preconditioning, can be executed on the GPU, if GPU-based components are used. The defect correction method has two attractive properties. First, global reduction is only required to monitor convergence once per iteration during convergence evaluation, which reduces communication requirements and provides a basis for efficient and scalable parallelization. Second, it has minimal memory footprint, which is good for the solution of very large systems.
305 In the following section we demonstrate how to assemble a solver for the discrete Poisson problem, using one of the tree iterative methods to efficiently solve \eqref{ch5:eq:poissonsystem}.
307 \lstset{label=ch5:lst:dc,caption={Main loop for the iterative defect correction solver. The solver is instantiated with template argument types for the matrix and vector classes, allowing underlying implementations to be based on GPU kernels.}}
309 while(r.nrm2() > tol)
311 // Calculate residual
315 // Reset initial guess
321 // Defect correction update
327 \subsubsection{Assembling the Poisson Solver}
329 Assembling the Poisson solver follows almost the same procedure as the heat conduction solver, except the time integration part is exchanged with an iterative method to to solve the system of linear equations \eqref{ch5:eq:poissonsystem}. For the discrete matrix-vector product we reuse the Laplace operator from the heat conduction problem in Listing \ref{ch5:lst:laplaceimpl} with a few modifications. The Laplace operator now appears as a matrix component, so to be compatible with the component interface rules in Figure \ref{ch5:fig:componentdesign} the parenthesis operator is renamed to \texttt{mult}, taking two vector components as argument.
331 At the beginning of the solver implementation we list the type definitions for the Poisson solver that will be used throughout the implementation. Here we use a geometric multigrid\index{multigrid} method as a preconditioner for the defect correction method. Therefore the multigrid solver is assembled first, so that it can be used in the assembling of the defect correction method. Listing \ref{ch5:lst:poissontypedefs} defines the types for the vector, the matrix, the multigrid preconditioner and the defect correction solver. The geometric multigrid method needs two additional template arguments, that are specific for multigrid, namely a smoother and a grid restriction/interpolation operator. These arguments are free to be implemented and supplied by the developer if special care is required for their specific problems, e.g. for a custom grid structure. For the Poisson problem on a regular grid, the library contains built-in restriction and interpolation operators, and a red-black Gauss-Seidel smoother. We refer to \cite{ch5:Trottenberg2001} for extensive details on multigrid methods. The monitor and config types that appear in Listing \ref{ch5:lst:poissontypedefs} are used for convergence monitoring within the iterative solver and to control run time parameters, such as tolerances, iteration limits, etc.
333 \lstset{label=ch5:lst:poissontypedefs,caption={Type definitions for the Laplacian matrix component and the multigrid preconditioned iterative defect correction\index{defect correction} solver.}}
335 typedef double value_type;
336 typedef gpulab::grid<value_type> vector_type;
337 typedef laplacian<vector_type> matrix_type;
339 // MULTIGRID solver types
340 typedef gpulab::solvers::multigrid_types<
343 , gpulab::solvers::gauss_seidel_rb_2d
344 , gpulab::solvers::grid_handler_2d_boundary> mg_types;
345 typedef gpulab::solvers::multigrid<mg_types> mg_solver_type;
346 typedef mg_solver_type::monitor_type monitor_type;
347 typedef monitor_type::config_type config_type;
350 typedef gpulab::solvers::defect_correction_types<
354 , mg_solver_type> dc_types;
355 typedef gpulab::solvers::defect_correction<dc_types> dc_solver_type;
358 With the type definitions set up, the implementation for the Poisson solver follows in Listing \ref{ch5:lst:poissonsolver}. Some of the initializations are left out, as they follow the same procedure as for the Heat conduction example. The defect correction and geometric multigrid solvers are initiated and then the multigrid is set as a preconditioner to the defect correction method. Finally the system is solved via a call to \texttt{solve()}.
360 \lstset{label=ch5:lst:poissonsolver,caption={Initializing the preconditioned defect correction solver to approximate the solution to $\mymat{A}\myvec{u}=\myvec{f}$.}}
362 matrix_type A(alpha); // High-order matrix
363 matrix_type M(1); // Low-order matrix
365 /* Omitted: create and init vectors u, f */
367 config_type config; // Create configuration
368 config.set("iter",30); // Set max iteration count
369 config.set("rtol",1e-10); // Set relative tolerance
370 monitor_type monitor(config); // Create monitor
371 dc_solver_type solver(A,monitor); // Create DC solver
372 mg_solver_type precond(M,monitor); // Create MG preconditioner
373 solver.set_preconditioner(precond); // Set preconditioner
374 solver.solve(u,f); // Solve M^-1(Au = f)
375 if(monitor.converged())
380 \subsubsection{Numerical solutions to the Poisson problem}\label{ch5:sec:poissonresults}
383 The discrete Poisson problem \eqref{ch5:eq:poissonsystem}, has been solved using the three iterative methods presented above. Convergence histories for the conjugate gradient method and geometric multigrid method, using two different resolutions are illustrated in Figure \ref{ch5:fig:poissonconvergence:a}. Multigrid methods have the attractive properties, that they are very robust and algorithmic efficient, independently of the problem size. Figure \ref{ch5:fig:poissonconvergence:a} confirms that the rate of convergence for the multigrid method is unchanged for both problem sizes. Only the attainable accuracy is slightly worsened, as a consequence of a more ill-conditioned system for large problem sizes.
385 Defect correction in combination with multigrid preconditioning, enables efficient solution of high-order approximations of the Poisson problem, illustrated in Figure \ref{ch5:fig:poissonconvergence:b}. The multigrid preconditioning matrix $\mymat{M}$ is based on a low order approximation to \eqref{ch5:eq:poissonsystem}, whereas matrix $\mymat{A}$ is a high-order approximation. When $\mymat{M}$ is a close approximation to $\mymat{A}$, defect correction converges most rapidly. This is the effect that can be seen between the three convergence lines in Figure \ref{ch5:fig:poissonconvergence:b}.
389 \setlength\figureheight{0.33\textwidth}
390 \setlength\figurewidth{0.33\textwidth}
392 \subfigure[Convergence history for the conjugate gradient and multigrid methods, for two different problem sizes.]{\label{ch5:fig:poissonconvergence:a}
393 %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}}
394 {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceMGvsCG_conv.pdf}}
398 \subfigure[Defect correction convergence history for three different stencil sizes.]{\label{ch5:fig:poissonconvergence:b}
399 %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}}
400 {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceDC_conv.pdf}}
403 \caption{Algorithmic performance for the conjugate gradient, multigrid, and defect correction methods, measured in terms of the relative residual per iteration.}\label{ch5:fig:poissonconvergence}
406 %Numerical and algorithmic efficiency...
410 \section{Optimization strategies for multi-GPU systems}\label{ch5:sec:multigpu}\index{multi-GPU}
412 CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip performance. However, the role as a separate co-processor to the CPU can be a limiting factor for large scale scientific applications, because the per GPU memory capacity is fixed and is only in the range of a few Gigabytes. In comparison, it is not unusual for a high-end workstation to be equipped with $\sim32$GB of main memory, plus a terabyte capacity hard disk for secondary storage. Therefore, large scale scientific applications that process Gigabytes of data, require distributed computations on multiple GPU devices. Multi-GPU desktop computers and clusters can have a very attractive peak performance, but the addition of multiple devices introduces the potential performance bottleneck of slow data transfers across PCIe busses and network interconnections, as illustrated in Figure \ref{ch5:fig:gpu2gputransfer}. The ratio between data transfers and computational work has a significant impact on the possibility for latency hiding and thereby overall application performance.
416 %\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/GPU2GPU.png}
417 \input{Chapters/chapter5/figures/GPU2GPU.tikz}
419 \caption{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required, if the data is not already sequentially stored in device memory.}\label{ch5:fig:gpu2gputransfer}
423 Developing applications that exploit the full computational capabilities of modern clusters -- GPU-based or not -- is no trivial matter. Developers are faced with the complexity of distributing and coordinating computations on nodes consisting of many-core CPUs, GPUs and potentially other types of accelerators as well. These complexities give rise to challenges in finding numerical algorithms, that are well suited for such systems, forcing developers to search for novel methods for introducing concurrency.
425 To ease software development, we use MPI-2 for message passing and ensure a safe and private communication space by creation of a communicator private to the library during initialization, as recommended by Hoefler and Snir~\cite{ch5:Hoefler2011}. With the addition of GPUDirect to CUDA, it is possible to make direct transfers between GPUs, eliminating CPU overhead. Unfortunately there are rather strict system requirements to enable these direct transfers, and there is no stable MPI implementation that supports message passing via pointers to device memory. Therefore device memory is first transferred to the CPU main memory before invoking any MPI calls. We do provide device to device transfers via template-based library routines, that work directly with GPU vector objects. It hides the complexity of message passing from the developer and help developers design new components for multi-GPU execution.
427 %(GPUDirect v2 and Unified Virtual Addressing can help). Effort is being made to create GPU aware MPI implementations, that will allow direct memory transfers(cite: openMPI, MVAPICH2). However, only 64-bit systems with Unified Virtual Addressing (UVA) from CUDA v4.0 and Tesla 20-serie (Kepler) GPUs are capable of direct transfers. Therefore only a limited number of today's GPU clusters will be able to utilize these features.
428 %For now, the simple solution is to first transfer data to the CPU main memory before invoking the appropriate MPI calls.
430 In the following sections we present two very different methods for distributed computing based on spatial and temporal decomposition. Each method has its own characteristic, which make them attractive for various types of PDE problems, and for different problem sizes.
434 \subsection{Spatial domain decomposition}\label{ch5:sec:dd}\index{domain decomposition}
436 Domain decomposition methods can be used for distributing computational work in the numerical solution of boundary value problems~\cite{ch5:Smith1996}. These methods add parallelism, by splitting the spatial dimensions, on which the boundary values are defined, into a number of smaller boundary value problems, and coordinating the solution between adjacent subdomains. Domain decomposition techniques, such as the classical overlapping Schwarz methods, may be considered as preconditioners \index{preconditioning} to the system of equations, that arise from discretization of the PDE, e.g. as for the Poisson problem in \eqref{ch5:eq:poissonsystem}. The algorithmic efficiency of such methods depends on the size of the domain overlaps, while an additional coarse grid correction method is sometimes necessary to maintain fast global convergence.
437 %A detailed study and analysis of convergence criterions, overlapping zones, local domain solvers etc. can lead to parallel efficiencies that scale linearly with the number of processors for large problem sizes, see e.g. \cite{ch5:Cai2005}.
439 An alternative to the preconditioning strategy is to have each subdomain query information from adjacent subdomains whenever needed. For PDEs that are discretized onto regular shaped grids, this can be an attractive strategy, as the decomposition of subdomains, and thereby the communication topology\index{topology}, is straightforward. The library supports decomposition of regular grids by either pre- or user-defined topologies. A topology in this context, is a description of connectivity between processors, that share the same grid along with information about the local and global discretization. Layers of ghost points are inserted at the artificially introduced boundaries to account for grid points that reside in adjacent subdomains. The ghost point values can be updated upon request from the user. An illustrative example is given in Figure \ref{ch5:fig:dd2d}; the arrows indicate message passing between adjacent subdomains, when updating grid points within the ghost layers.
442 \input{Chapters/chapter5/figures/dd2d.tikz}
444 \caption{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d}
447 Topologies are introduced via an extra template argument to the grid component. A grid is by default not distributed, because the default template argument is a non-distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, that makes sure that the topology instance updates all ghost points. The library contains two topology strategies, based on one dimensional and two dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
449 If these updates are performed whenever information from adjacent subdomains are needed, i.e. before a stencil operation, all distributed interior points will be exactly the same as they would be for the non-distributed setup. Therefore, one advantage of this method is, that the algorithmic efficiency of an application can be preserved, if grid updates are invoked at the proper times.
451 Distributed performance for the finite difference stencil operation is illustrated in Figure \ref{ch5:fig:multigpu}. The timings include the compute time for the finite difference approximation and the time for updating ghost layers via message passing, cf. Figure \ref{ch5:fig:gpu2gputransfer}. It is obvious from Figure \ref{ch5:fig:multigpu:a}, that communication overhead dominates for smaller problem sizes, where the non-distributed grid (1 GPU) is fastest. However, communication times do not grow as rapidly for larger problems as computation times, due to the surface to volume ratio. Therefore message passing becomes less influential for large problems, where reasonable performance speedups are obtained. Figure \ref{ch5:fig:multigpu:b} demonstrates how the computational performance on multi-GPU systems can be significantly improved for various stencil sizes. With this simple domain decomposition technique, developers are able to implement applications based on heterogeneous distributed computing, without explicitly dealing with message passing, while it is still possible to provide user specific implementations of the topology class for customized grid updates.
453 % TODO: Should we put in the DD algebra?
456 \setlength\figureheight{0.27\textwidth}
457 \setlength\figurewidth{0.55\textwidth}
459 \subfigure[Absolute timings, $\alpha=3$.]{
460 %{\small\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}}
461 {\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050_conv.pdf}}
462 \label{ch5:fig:multigpu:a}
464 \subfigure[Performance at $N=4069^2$, single precision.]{
465 % {\small\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}}
466 {\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216_conv.pdf}}
467 \label{ch5:fig:multigpu:b}
470 \caption{Performance timings for distributed stencil operations, including communication and computation times. Test environment 2.}\label{ch5:fig:multigpu}
473 \subsection{Parareal -- Parallel time integration}\label{ch5:sec:parareal}\index{parareal}
475 The use of spatial domain decomposition methods is widespread, as they have proven to be efficient on a wide range of problems. Unfortunately, applications that are concerned with limited numerical problem sizes, rapidly reach their speedup limit for a low number of processors, due to a degradation of the scalability when the number of processors become large as this leads to an increasingly unfavourable communication to compute ratio of the individual subdomains. This issue is continuously worsened by the fact that communication speed has been increasing at a far slower pace than compute speed for the past many years, and is expected to continue to do so for years to come. It is often referred to as "the memory wall"~\cite{ch5:Asanovic:EECS-2006-183}, one of the grand challenges facing development and architectural design of future high-performance systems~\cite{ch5:Keyes2011,ch5:ScientificGrandChallenges2010}. Finally, there are applications based on ordinary differential equations, where classical domain decomposition methods are not even applicable~\cite{ch5:YMTR08}. For these type of applications, a method of adding parallelism in the temporal integration is of great interest. Contrary to space however, time is -- by its very nature -- sequential, which precludes a straightforward implementation of a parallel approach.
477 One method that introduces concurrency to the solution of evolution problems is the parareal algorithm. Parareal is an iterative method imposed on a time decomposition. Gander and Vandewalle showed in \cite{ch5:MS07}, that the algorithm can be written as both a multiple shooting method and as a two level multigrid in time approach, even though the leading idea came from spatial domain decomposition. The method has many exciting features: it is fault tolerant and has different communication characteristics than those of the classical domain decomposition methods. It has also been demonstrated to work effectively on a wide range of problems, and most importantly, once the proper distribution infrastructure is in place, it can easily be wrapped around any type of numerical integrator, for any type of initial value problem.
482 \input{Chapters/chapter5/figures/ParallelInTime.tikz}
484 \caption{Time domain decomposition. A compute node is assigned to each individual time sub-main to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime}
488 \subsubsection{The Parareal algorithm}
490 The parareal algorithm was first presented in 2001, in a paper by Lions et al.~\cite{ch5:LMT01}, and later introduced in a slightly revised predictor-corrector form in 2002 by Baffico et al.~\cite{ch5:LSY02}. The parareal in time approach proposes to break the global problem of time evolution into a series of independent evolution problems on smaller intervals, see Figure \ref{ch5:fig:ParallelInTime}. Initial states for these problems are needed, and supplied by a simple, less accurate, but computationally cheap sequential integrator. The smaller independent evolution problems can then be solved in parallel. The information, generated during the concurrent solution of the independent evolution problems with accurate propagators and inaccurate initial states, is used in a predictor-corrector fashion in conjunction with the coarse integrator to propagate the solution faster, now using the information generated in parallel. We define the decomposition into $N$ intervals, that is
492 & T_{0}<T_{1}<\cdots<T_{n}=n\Delta T<T_{n+1}<T_{N},
494 where $\Delta T$ is the size of the time intervals and $n=0,1,\ldots,N$. The general initial value problem on the decomposed time domain is defined as
495 \begin{align}\label{ch5:eq:difproblem}
496 \frac{\partial u}{dt}+\mathcal{A}u=0, \qquad u(T_{0})=u^{0}, \quad t\in\left[T_{0},T_{N}\right],
498 where $\mathcal{A}$ is an operator from one Hilbert space to another. To solve the differential problem \eqref{ch5:eq:difproblem} we define an operator $\mathcal{F}_{\Delta T}$ that operates on some initial state $U_{n} \approx u(T_{n})$, and return an approximate solution to \eqref{ch5:eq:difproblem}, at time $T_{n}+\Delta T$. Such an operator is achieved by the implementation of a numerical time integrator, using some small time-step $\delta t\ll\Delta T$ in the integration. The numerical solution to \eqref{ch5:eq:difproblem} is then obtained by applying the fine propagator sequentially for $n=1,2,\ldots,N$.
499 \begin{align}\label{ch5:eq:fineOp}
500 \hat{U}_{n}=\mathcal{F}_{\Delta T}\left(T_{n-1},\hat{U}_{n-1}\right), \qquad\hat{U}_{0}=u^{0}.
502 For the purpose of parallel acceleration of the otherwise purely sequential process of obtaining $\mathcal{F}_{\Delta T}^{N} u^{0} \approx u(T_{N})$, we define the coarse propagator $\mathcal{G}_{\Delta T}$. $\mathcal{G}_{\Delta T}$ also operates on some initial state $U_{n}$, propagating the solution over the time interval $\Delta T$, but now using a time step $\delta T$. Typically $\delta t<\delta T<\Delta T$. For the parareal algorithm to be effective, the coarse propagator $\mathcal{G}_{\Delta T}$ has to be substantially faster to evaluate than the fine propagator $\mathcal{F}_{\Delta T}$. There are many ways of constructing the coarse propagator, the simplest one being to apply the same numerical integrator as for the fine propagator, but using a coarser time discretization. We refer to \cite{ch5:ASNP12} for an introduction to other methods. The coarse operator reads
503 \begin{align}\label{ch5:eq:coarseOp}
504 \tilde{U}_{n}=\mathcal{\mathcal{G}}_{\Delta T}\left(T_{n-1},\tilde{U}_{n-1}\right),\qquad\tilde{U_{0}}=u^{0}.
506 Using the defined $\mathcal{F}_{\Delta T}$ and $\mathcal{G}_{\Delta T}$ operators, the predictor-corrector form of the parareal algorithm can be written in a single line
507 \begin{equation}\label{ch5:eq:PARAREAL}
508 U_{n}^{k+1}=\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k+1}\right)+\mathcal{\mathcal{F}}_{\Delta T}\left(U_{n-1}^{k}\right)-\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k}\right),\quad U_{0}^{k}=u^{0},
510 with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. $N$ being the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time integration component, using a fully distributed work scheduling model, as proposed by Aubanel (2010)~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer. It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time integration using forward Euler for coarse propagation and fourth order Runge-Kutta for fine propagation, could then be defined as follows:
511 \lstset{label=ch5:lst:parareal,caption={Assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation.}}
513 typedef gpulab::integration::forward_euler coarse;
514 typedef gpulab::integration::ERK4 fine;
515 typedef gpulab::integration::parareal<coarse,fine> integrator;
517 The number of GPUs used for parallelization depends on the number of MPI processes that the application is initiated with.
519 %Each application of $\mathcal{\mathcal{G}}_{\Delta T}$ or $\mathcal{\mathcal{F}}_{\Delta T}$ to a state $\tilde{U}_{n}^{k}$ is performed on a GPU while the corrections themselves are performed on CPUs. In addition to an efficient distribution model, a stopping strategy is needed as parareal is an iterative algorithm. Various choices has been implemented in the library, these are presented in section \ref{ch5:subsec:Lib_impl}.
524 \input{Chapters/chapter5/figures/FullyDistributed.tikz}
526 \caption{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades trough to rank $N$ where the final solution can be fetched. }
529 \subsubsection{Computational complexity}
531 In the analysis of the computational complexity, we first recognize that both the coarse and the fine propagator, regardless of the type of discretization scheme, involves a complexity that is proportional to the number of time steps being used. Let us define two scalar values $\mathcal{C}_\mathcal{F}$ and $\mathcal{C}_\mathcal{G}$ as the computational cost of performing a single step with the fine and coarse propagators. The computational complexity of a propagator integrating over an interval $\Delta T$ is then given by $\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}$ and $\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}$ respectively. $R$ is introduced as the relation between the two, that is, $R$ is a measure of how much faster the coarse propagator is compared to the fine propagator over the time interval $\Delta T$. The total computational cost for parareal over $N$ intervals is proportional to
533 \left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+kN\mathcal{C_{F}}\frac{\Delta T}{\delta t},
535 recognising that the second term can be distributed over $N$ processors, we are left with
536 \begin{align}\label{ch5:eq:ComComplexPara}
537 \left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+k\mathcal{C_{F}}\frac{\Delta T}{\delta t}.
539 The above should be compared to the computational complexity of a purely sequential propagation, using only the fine operator, $\frac{T_{N}-T_{0}}{\delta t}\mathcal{C}_\mathcal{F}=N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}$. From the latter we can estimate the speed-up, here denoted $\psi$, as the ratio between the computational complexity of the purely sequential solution, and the complexity of the solution obtained by the parareal algorithm \eqref{ch5:eq:ComComplexPara}. Neglecting the influence of communication speed and correction time, we are left with the estimate
540 \begin{align}\label{ch5:eq:EstiSpeedBasicPara}
541 \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}.
543 If we in addition assume that the time spend on coarse propagation is negligible compared to time spend on the fine propagation, i.e. the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge pose an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio between the speed of the coarse and the fine integrator $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}$, denoted $R$. Using a slow, but more accurate coarse integrator, will lead to convergence in fewer iterations $k$, but at the same time it also makes $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}$ larger. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law also lower the upper bound on possible attainable speedup. Thus, the ratio $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}$ \emph{can not} be made arbitrarily small since the relation is inversely proportional to the iterations $k$, needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spend on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency is typically observed to be in the range of 20--50\% in the literature, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two dimensional heat problem \ref{ch5:eq:heateqbc}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method as both fine and coarse propagator is presented. R is regulated by changing the coarse time discretization. In Figure \ref{ch5:fig:pararealGPUs} speed-up and parallel efficiency measurements are presented. Notice how when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite this requiring an extra parareal iteration that increases the total computational complexity.
546 \setlength\figureheight{0.32\textwidth}
547 \setlength\figurewidth{0.35\textwidth}
549 \subfigure[The relative error after one parareal iteration ($k=1$).]{
550 {\small\input{Chapters/chapter5/figures/pararealK1ErrvsRvsGPUs.tikz}}
551 \label{ch5:fig:pararealRvsGPUs:a}
553 \subfigure[Iterations $K$ needed to obtain a relative error less than $10^{-5}$.]{
554 {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
555 \label{ch5:fig:pararealRvsGPUs:b}
558 \caption{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
562 \setlength\figureheight{0.32\textwidth}
563 \setlength\figurewidth{0.35\textwidth}
565 \subfigure[Measured parallel speed-up]{
566 {\small\input{Chapters/chapter5/figures/pararealSpeedupvsRvsGPUs.tikz}}
567 \label{ch5:fig:pararealGPUs:a}
569 \subfigure[Measured parallel efficiency]{
570 {\small\input{Chapters/chapter5/figures/pararealEfficiencyvsRvsGPUs.tikz}}
571 \label{ch5:fig:pararealGPUs:b}
574 \caption{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs}
577 %TODO: Do we make this into a subsubsection:
578 %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl}
579 %Describe library features. Stopping criteria. Usage of different numerical integrators. Examples including c++ code. Speed-up measurements on test cases.
581 %\begin{figure}[!htb]
582 % \setlength\figureheight{0.32\textwidth}
583 % \setlength\figurewidth{0.35\textwidth}
585 % \subfigure[Parareal iterations $k$ needed for converge to a relative error of $10^{-6}$]{
586 % {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
587 % \label{ch5:fig:pararealGPUs:a}
589 % \subfigure[Measured speed-up]{
590 % {\small\input{Chapters/chapter5/figures/pararealSpeedupvsRvsGPUs.tikz}}
591 % \label{ch5:fig:pararealGPUs:b}
594 % \caption{Parareal performance and convergence properties as a function of $R$ and number GPUs used. $R$ is regulated by the time discretization in the coarse propagator. }\label{ch5:fig:pararealGPUs}
597 \section{Conclusion and outlook}
599 Massively parallel heterogeneous systems continue to enter the consumer market, and there has been no identification that this trend will stop for the next years to come. However, these parallel architectures require software vendors to adjust to new programming models and optimization strategies. Good software libraries are important tools for reducing time and complexity of adjusting to new architectures and they provide the user with an intuitive programming interface.
601 We presented ideas for a generic GPU-based library for fast assembling of PDE solvers. A high-level interface and some simple implementation concepts were created with the objective of enhancing developer productivity and to ensure performance portability. Two elementary PDE model problems were used to demonstrate assembling of both spatial and temporal approximation techniques. Results confirmed that numerical methods for solution of PDE systems are bandwidth limited on GPU systems. Therefore higher-order methods can be advantageous, because extra computational work can be performed during memory transfer idle times, leading to increased flop performance.
603 Decomposition strategies for spatial and temporal parallelization on multi-GPU systems has been presented, with reasonable performance speedups. Novel results for the parareal algorithm on multi-GPU systems was presented, and an example of parallel efficiency for the heat conduction problem was demonstrated. Furthermore, a domain decomposition technique that preserves algorithmic efficiency was presented for the Poisson problem, with good parallel scalability.
605 The library has already successfully been used for development of a fast tool intended for scientific applications within maritime engineering, see Chapter \ref{ch7} for details. We intend to further extend the library, as we explore new techniques, suitable for parallelization on heterogeneous systems, that fit the scope of our applications.
607 For more information about the library and our work within the GPUlab group at the Technical University of Denmark, we encourage you to visit \texttt{http://gpulab.imm.dtu.dk}.
609 \section*{Acknowledgements}
610 This work was supported by grant no. 09-070032 %on "Desktop Scientific Computing on Consumer Graphics Cards"
611 from the Danish Research Council for Technology and Production Sciences. A special thank goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests was done in GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at Center for Computing and Visualization, Brown University, USA. Nvidia Corporation is acknowledged for generous hardware donations to facilities of GPUlab.
613 %-------------------------------------------------------------------------
614 %\backmatter % bibliography, index and similar things, which are not numbered.
618 %References to include
619 %\cite{ch5:Gropp1999}
620 %\cite{ch5:Gropp1999a}
621 %\cite{ch5:LeVeque2007}
622 %\cite{ch5:Gamma1995}
623 %\cite{ch5:Hoefler2011}
624 %\cite{ch5:Bajrovic2011}
625 %\cite{ch5:Skjellum1994}
626 %\cite{ch5:Glimberg2011}
627 %\cite{ch5:Engsig-Karup2011}
628 %\cite{ch5:Smith1996}
629 %\cite{ch5:Vandevoorde2002}
631 %\cite{ch5:mooreslaw1965}
633 \putbib[Chapters/chapter5/biblio5]
635 % Reset lst label and caption
636 \lstset{label=,caption=}