BookGPU/Chapters/chapter5/ch5.tex

   1 \chapterauthor{Stefan L. Glimberg, Allan P. Engsig-Karup, Allan S. Nielsen, and Bernd Dammann}{Technical University of Denmark}
   2 %\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark}
   3 %\chapterauthor{Allan S. Nielsen}{Technical University of Denmark}
   4 %\chapterauthor{Bernd Dammann}{Technical University of Denmark}
   5
   6 \chapter[Software components for heterogeneous many-core architectures]{Development of software components for heterogeneous many-core architectures}\label{ch5}
   7
   8 %Subjects:
   9 %\begin{itemize}
  10 %\item Performance portable tuning techniques via a modern parallel programming model (CUDA) on emerging GPU architectures.
  11 %\item distributed heterogenous computing via MPI+CUDA/OpenCL with domain decomposition methods for PDE solvers.
  12 %\item efficient and scalable iterative methods for solution of high-order numerical methods and strategies for efficient implementations on desktop architectures.
  13 %\end{itemize}
  14
  15 \clearpage
  16 \section{Software development for heterogeneous architectures}
  17 %Our library facilitates massively parallelization through GPU computing and contains components for various iterative strategies such as DC, CG, CGNR, BiCGSTAB and GMRES(m) for solution of large linear systems along with support for preconditioning strategies. The goal is to create a reusable library and framework which provide general components with performance similar to that of a dedicated solver. Preliminary results show that performance overhead can be kept minimal in this new software framework [8].
  18
  19
  20 %Within the past decade we have therefore seen that development of applications based on high-performance parallel computing, is expanding from academic and industrial companies with financial support to invest in parallel computers, into any application developer with access to modern hardware. This shift has not only been motivated by the successful development of programmable GPUs, but by the architectural changes from single-core to multi-core architectures in general. Rapidly increasing power and heating problems have forced the continuous evolution of single core architectures to stop. The solution from the chip manufacturers has been to fit multiple processors into the same unit, in order to maintain the same overall peak performance rates. Though these new parallel architectures can be highly effective compute units, they often pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main- and co-processor can be so significant that they call for completely different optimization strategies. The cache hierarchy and management of CPUs and GPUs being an evident example hereof.
  21
  22
  23 %using a library makes it easier for researchers to implement their problem without having to think about the complexity of the underlying computer hardware.  (keyword: Opacity, [Berkeley dwarf paper])
  24
  25 % a library can support several architectures, and thus the high-level code does not have to be changed when moving to a new architecture, e.g. Thrust's support for CUDA, OpenMP and TBB.
  26
  27 % there is a price/penalty to pay when using libraries, i.e. one doesn't necessarily get the best performance, because the architectural details are not visible to the programmer (keyword: Visibility  [Berkeley dwarf paper]).  However, if the library permits to write own add-ons/kernels (flexibility), this is no longer an issue.
  28
  29 Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast amount of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are now programmable and can be highly effective computing units, they still pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main and coprocessor can be so significant that they require completely different optimization strategies. The cache hierarchy management of CPUs and GPUs are an evident example hereof. In the past, industrial companies were able to boost application performance solely by upgrading their hardware systems, with an overt balance between investment and performance speedup. Today, the picture is different; not only do they have to invest in new hardware, but they also must account for the adaption and training of their software developers. What traditionally used to be a hardware problem, addressed by the chip manufacturers, has now become a software problem for application developers.
  30
  31 Software libraries\index{software library}\index{library|see{software library}} can be a tremendous help for developers as they make it easier to implement an application, without requiring special knowledge of the underlying computer architecture and hardware. A library may be referred to as \emph{opaque} when it automatically utilizes the available resources, without requiring specific details from the developer\cite{ch5:Asanovic:EECS-2006-183}. The ultimate goal for a successful library is to simplify the process of writing new software and thus to increase developer productivity. Since programmable heterogeneous CPU/GPU systems are a rather new phenomenon, there is a limited number of established software libraries that take full advantage of such heterogeneous high performance systems, and there are no de facto design standards for such systems either. Some existing libraries for conventional homogeneous systems have already added support for offloading computationally intense operations onto coprocessing GPUs. However, this approach comes at the cost of frequent memory transfers across the low bandwidth PCIe bus.
  32
  33 In this chapter, we focus on the use of a software library to help application developers achieve their goals without spending an immense amount of time on optimization details, while still offering close-to-optimal performance. A good library provides performance-portable implementations with intuitive interfaces, that hide the complexity of underlaying hardware optimizations. Unfortunately, opaqueness sometimes comes at a price, as one does not necessarily get the best performance when the architectural details are not \emph{visible} to the programmer~\cite{ch5:Asanovic:EECS-2006-183}. If, however, the library is flexible enough and permits developers to supply their own low-level implementations as well, this does not need to be an issue. These are some of the considerations library developers should take into account, and what we will try to address in this chapter.
  34
  35
  36 For demonstrative purposes we present details from an in-house generic CUDA-based C++ library for fast assembling of partial differential equation (PDE) solvers, utilizing the computational resources of GPUs. This library has been developed as part of research activities associated with the GPUlab, at the Technical University of Denmark and, therefore, is referred to as the {\em GPUlab library}. It falls into the category of computational libraries, as categorized by Hoefler and Snir~\cite{ch5:Hoefler2011}. Memory allocation and basic algebraic operations are supported via object-oriented components, without the user having to write CUDA specific kernels. As a back-end vector class, the parallel CUDA Thrust template-based library is used, enabling easy memory allocation and a high-level interface for vector manipulation~\cite{ch5:Bell2011}. Inspirations for \emph{good library design}, some of which we will present in this chapter, originate from guidelines proposed throughout the literature~\cite{ch5:Hoefler2011,ch5:Gamma1995,ch5:Skjellum1994}. An identification of desirable properties, which any library should strive to achieve, is pointed out by Korson and McGregor~\cite{ch5:Korson1992}. In particular we mention being easy-to-use, extensible, and intuitive.
  37
  38
  39 %We rely on template-based implementations to allow full flexibility to change and assemble types~\cite{ch5:Vandevoorde2002}
  40
  41 %The library presented in this chapter falls into the category of computational libraries, as defined by Hoefler and Snir~\cite{ch5:Hoefler2011}, and seeks to fulfil the relevant objectives presented, such as designing components around data structures (classes), utilizing inheritance for component reuse, and hiding irrelevant information from the end users. An identification of desirable attributes for library development is also pointed out by Korson and McGregor~\cite{ch5:Korson1992}, in particular the library is designed to be easy-to-use, extensibility, and intuitive.
  42
  43 The library is designed to be effective and scalable for fast prototyping of PDE solvers, (primarily) based on matrix-free implementations of finite difference (stencil) approximations on logically structured grids. It offers functionalities that will help assemble PDE solvers that automatically exploit heterogeneous architectures much faster than manually having to manage GPU memory allocation, memory transfers, kernel launching, etc.
  44
  45 In the following sections we demonstrate how software components that play important roles in scientific applications can be designed to fit a simple framework that will run efficiently on heterogeneous systems. One example is finite difference approximations, commonly used to find numerical solutions to differential equations. Matrix-free implementations minimize both memory consumption and memory access, two important features for efficient GPU utilization and for enabling the solution of large-scale problems. The bottleneck problem for many PDE applications is to solve large sparse linear systems, arising from the discretization. In order to help solve these systems, the library includes a set of iterative solvers. All iterative solvers are template-based, such that vector and matrix classes, along with their underlying implementations, can be freely interchanged. New solvers can also be implemented without much coding effort. The generic nature of the library, along with a predefined set of interface rules, allows assembling components into PDE solvers. The use of parameterized-type binding allows the user to assemble PDE solvers at a high abstraction level, without having to change the remaining implementation.
  46
  47 Since this chapter is mostly dedicated to the discussion of software development for high performance heterogeneous systems, the focus will be more on the development and usage of an in-house GPU-based library, than on specific scientific applications. We demonstrate how to use the library on two elementary model problems and refer the reader to Chapter \ref{ch7} for a detailed description of an advanced application tool for free surface water wave simulations. These examples are assembled using the library components presented in this chapter.
  48
  49 % Make sure that we present a way to implement PDE solver components, and that our library is an example that demonstrate the use.
  50
  51 %Well designed standards for message passing across multiple platforms have been successfully developed, with MPI being the most well known \cite{ch5:Gropp1999,ch5:Gropp1999a}.
  52
  53 %\stefan{Add more motivation, power wall + memory wall + ILP (instruction-level parallelism) wall = brick wall.. }
  54
  55 %- Motivation for using libraries in general. Emphasize the computational power of GPUs and how they can be used for high-performance scientific computing.
  56
  57 %- Motivate our library, how we fit in.
  58
  59 %- Introduce the model problem that we use the demonstrate the use of our library.
  60
  61 %- There is a tendency to that the more performance is needed, the more low-level and difficult the programming gets. We strive to achieve such good performance, but without compromising too much easiness.
  62
  63 %- The following sections are ordered as follows...
  64
  65 \subsection{Test environments}\label{ch5:sec:testenvironments}
  66 Throughout the chapter we use three different test environments: two high-end desktop computers located at the GPUlab--Technical University of Denmark, and a GPU cluster located at the Center for Computing and Visualization, Brown University, USA. Hardware details for the two systems are as follows:
  67 \begin{description}
  68 \item[Test environment 1.] Desktop computer, Linux Ubuntu, Intel Xeon E5620 (2.4GHz) quad-core Westmere processor, 12GB of DDR-3 memory (1066 MHz), 2x NVIDIA GeForce GTX590 GPU with 3GB DDR5 memory, PCIe 2.0.
  69 \item[Test environment 2.] Desktop computer, Linux Ubuntu, Intel Core i7-3820 (3.60GHz) Sandy Bridge processors, 32GB RAM, 2x NVIDIA Tesla K20 GPUs with 5GB DDR5 memory, PCIe 2.0.
  70 \item[Test environment 3.] GPU cluster, Linux, up to 44 compute nodes based on dual Intel Xeon E5540 (2.53GHz) quad-core Nehalem processors, 24 GB of DDR-3 memory (1333MHz), 2x NVIDIA Tesla M2050 GPUs with 3GB GDDR5 memory, 40 Gb/s Quad-Data-Rate (QDR) InfiniBand interconnect.
  71 \end{description}
  72
  73
  74 \section{Heterogeneous library design for PDE solvers}
  75 A generic CUDA-based C++ library has been developed to ease the assembling of PDE solvers. The template-based\index{templates} design allows users to assemble solver parts easily and to supply their own implementation of problem specific parts. In the following, we present an overview of the library and the supported features, introduce the concepts of the library components, and give short code examples to ease understanding. The library is a starting point for fast assembling of GPU-based PDE solvers, developed mainly to support finite difference operations on regular grids. However, this is not a limitation, since existing vector objects could be used as base classes for extending to other discretization methods or grid types as well.
  76
  77 \subsection{Component and concept design}
  78 The library is grouped into component classes. Each component should fulfill a set of simple interface and template rules, called concepts\index{concept}, in order to guarantee compatibility with the rest of the library. In the context of PDE solving, we present five component classes: vectors, matrices, iterative solvers for linear system of equations, preconditioners for the iterative solvers, and time integrators. Figure \ref{ch5:fig:componentdesign} lists the five components along with a subset of the type definitions they should provide and the methods they should implement. It is possible to extend the implementation of these components with more functionality that relate to specific problems, but this is the minimum requirement for compatibility with the remaining library. With these concept rules fulfilled, components can rely on other components to have their respective functions implemented.
  79
  80 \begin{figure}[!htb]
  81 \centering
  82 \input{Chapters/chapter5/figures/component_design.tikz}
  83 \caption[Schematic representation of the five main components, their type definitions, and member functions.]{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known beforehand. The concepts ensure compliance among components.}\label{ch5:fig:componentdesign}
  84 \end{figure}
  85
  86 A component is implemented as a generic C++ class, and normally takes as a template arguments the same types that it offers through type definitions: a matrix takes a vector as template argument, and a vector takes the working precision type. The matrix can then access the working precision through the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to reduce the number of arguments and maintain code simplicity. We will demonstrate use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}.
  87
  88 The generic configuration allows the developer to define and assemble solver parts at the very beginning of the program using type definitions. Changing PDE parts at a later time is then only a matter of changing type definitions. We will give two model examples of how to assemble PDE solvers in Section \ref{ch5:sec:modelproblems}.
  89
  90 \subsection{A matrix-free finite difference component}\index{finite difference}\index{matrix-free}
  91 Common vector operations, such as memory allocation, element-wise assignments, and basic algebraic transformations, require many lines of codes for a purely CUDA-based implementation. These CUDA-specific operations and kernels are hidden from the user behind library implementations, to ensure a high abstraction level. The vector class inherits from the CUDA-based Thrust library and therefore offer the same level of abstraction that enhances developer productivity and enables performance portability. Creating and allocating device (GPU) memory for two vectors can be done in a simple and intuitive way using the GPUlab library, as shown in Listing \ref{ch5:lst:vector} where two vectors are added together.
  92
  93 \lstinputlisting[label=ch5:lst:vector,caption={allocating, initializing, and adding together two vectors on the GPU: first example uses pure CUDA C; second example uses the built-in library template-based vector class}]{Chapters/chapter5/code/ex1.cu}
  94
  95 The vector class (and derived classes hereof) is compliant with the rest of the library components. Matrix-vector multiplications are usually what makes PDE-based applications different from each other, and the need to write a user specific implementation of the matrix-vector product is essential when solving specific PDE problems. The PDE and the choice of discretization method determine the structure and sparsity of the resulting matrix. Spatial discretization is supported by the library with finite difference approximations, and it offers an efficient low-storage (matrix-free) flexible order\index{flexible order} implementation to help developers tailor their custom codes. These matrix-free operators are feasible for problems where the matrix structure is known in advance and can be exploited, such that the matrix values can be either precomputed or computed on the fly. Furthermore, the low constant memory requirement makes them perfect in the context of solving large scale problems, whereas traditional sparse matrix formats require increasingly more memory, see e.g., ~\cite{ch5:Bell2009} for details on GPU sparse matrix formats.
  96
  97 Finite differences approximate the derivative of some function $u(x)$ as a weighted sum of neighboring elements. In compact notation we write
  98 \begin{align}\label{ch5:eq:fdstencil}
  99 \frac{\partial^q u(x_i)}{\partial x^q} \approx \sum_{n=-\alpha}^{\beta}c_n u(x_{i+n}),
 100 \end{align}
 101 where $q$ is the order of the derivative, $c_n$ is a set of finite difference coefficients, and $\alpha$ plus $\beta$ define the number of coefficients that are used for the approximation. The total set of contributing elements is called the stencil,\index{stencil} and the size of the stencil is called the rank, given as $\alpha + \beta + 1$. The stencil coefficients $c_n$ can be derived from a Taylor expansion based on the values of $\alpha$ and $\beta$, and $q$, using the method of undetermined coefficients~\cite{ch5:LeVeque2007}.
 102 An example of a three-point finite difference matrix that approximates the first ($q=1$) or second ($q=2$) derivative of a one-dimensional uniformly distributed vector $u$ of length $8$ is given here:
 103 \begin{flalign}\label{ch5:eq:stencilmatrix}
 104 \left[\begin{array}{c c c c c c c c}
 105 c_{00} & c_{01} & c_{02} &      0 &      0 &      0 &      0 &      0 \\
 106 c_{10} & c_{11} & c_{12} &      0 &      0 &      0 &      0 &      0 \\
 107      0 & c_{10} & c_{11} & c_{12} &      0 &      0 &      0 &      0 \\
 108      0 &      0 & c_{10} & c_{11} & c_{12} &      0 &      0 &      0 \\
 109      0 &      0 &      0 & c_{10} & c_{11} & c_{12} &      0 &      0 \\
 110      0 &      0 &      0 &      0 & c_{10} & c_{11} & c_{12} &      0 \\
 111      0 &      0 &      0 &      0 &      0 & c_{10} & c_{11} & c_{12} \\
 112      0 &      0 &      0 &      0 &      0 & c_{20} & c_{21} & c_{22}
 113 \end{array}\right]
 114 \left[\begin{array}{c}
 115 u_0 \\
 116 u_1 \\
 117 u_2 \\
 118 u_3 \\
 119 u_4 \\
 120 u_5 \\
 121 u_6 \\
 122 u_7
 123 \end{array}\right]
 124  \approx
 125 \left[\begin{array}{c}
 126 u^{(q)}_0 \\
 127 u^{(q)}_1 \\
 128 u^{(q)}_2 \\
 129 u^{(q)}_3 \\
 130 u^{(q)}_4 \\
 131 u^{(q)}_5 \\
 132 u^{(q)}_6 \\
 133 u^{(q)}_7
 134 \end{array}\right].
 135 \end{flalign}
 136 It is clear from this example that the matrix is sparse and that the same coefficients are repeated for all centered rows. The coefficients differ only near the boundaries, where off-centered stencils are used. It is natural to pack this information into a stencil operator that stores only the unique set of coefficients:
 137 \begin{eqnarray}\label{ch5:eq:stencilcoeffs}
 138 \myvec{c} & = &
 139 \left[\begin{array}{c c c}
 140 c_{00} & c_{01} & c_{02} \\
 141 c_{10} & c_{11} & c_{12} \\
 142 c_{20} & c_{21} & c_{22} \\
 143 \end{array}\right].
 144 \end{eqnarray}
 145 Matrix components precompute these compact stencil coefficients and provides member functions that computes the finite difference approximation of input vectors. Unit scaled coefficients (assuming grid spacing is one) are computed and stored to be accessible via both CPU and GPU memory. On the GPU, the constant memory space is used for faster memory access~\cite{ch5:cudaguide}. In order to apply a stencil on a non unit-spaced grid, with grid space $\Delta x$, the scale factor $1/(\Delta x)^q$ will have to be multiplied by the finite difference sum, i.e., $(c_{00}u_0 + c_{01}u_1 + c_{02}u_2)/(\Delta x)^q \approx u^{(q)}_0$ as in the first row of \eqref{ch5:eq:stencilmatrix}.
 146
 147 Setting up a two-dimensional grid of size $N_x \times N_y$ in the unit square and computing the first derivative hereof is illustrated in Listing~\ref{ch5:lst:stencil}. The grid is a vector component, derived from the vector class. It is by default treated as a device object and memory is automatically allocated on the device to fit the grid size. The finite difference approximation as in \eqref{ch5:eq:fdstencil}, is performed via a CUDA kernel behind the scenes during the calls to \texttt{mult} and \texttt{diff\_x}, utilizing the memory hierarchy as the CUDA guidelines prescribe~\cite{ch5:cudaguide,ch5:cudapractice}. To increase developer productivity, kernel launch configurations have default settings, based on CUDA guidelines, principles, and experiences from performance testings, such that the user does not have to explicitly specify them. For problem-specific finite difference approximations, where the built-in stencil operators are insufficient, a pointer to the coefficient matrix \eqref{ch5:eq:stencilcoeffs} can be accessed as demonstrated in Listing \ref{ch5:lst:stencil} and passed to customized kernels.
 148 \pagebreak
 149 \lstinputlisting[label=ch5:lst:stencil,caption={two-dimensional finite difference stencil example: computing the first derivative using five points ($\alpha=\beta=2$) per dimension, a total nine-point stencil}]{Chapters/chapter5/code/ex2.cu}
 150
 151 In the following sections we demonstrate how to go from an initial value problem (IVP) or a boundary value problem (BVP) to a working application solver by combining existing library components along with new custom-tailored components. We also demonstrate how to apply spatial and temporal domain decomposition strategies that can make existing solvers take advantage of systems equipped with multiple GPUs. Next section demonstrates how to rapidly assemble a PDE solver using library components.
 152
 153 \section{Model problems}\label{ch5:sec:modelproblems}
 154
 155 %Physical phenomenons are at all times present around us. Many of them are well defined via the laws of physics and can be described via differential equations. Though many differential equations are well defined, their solutions are not, and the need for computers to approximate and simulate solutions are inevitable. The accuracy of these simulations is critical for the quality of scientific applications within all fields of engineering. Improved accuracy often comes with the price of increased computational times, that can even be too overwhelming for a single computer to run. Fortunately, the fundamental change in chip-design led development of single core systems into multi- and many core high-performance architectures, that perfectly fits the engineering needs for increased computational resources. The massively parallel architecture of GPUs is a particular good match for accurate methods with high computational intensity\cite{ch5:Kloeckner2011}.
 156
 157 We present two elementary PDE model problems, to demonstrate how to assemble PDE solvers, using library components that follow the guidelines described above. The first model problem is the unsteady parabolic heat conduction equation; the second model problem is the elliptic Poisson equation. The two model problems consist of elements that play important roles in solving a broad range of more advanced PDE problems.
 158
 159 We refer the reader to Chapter \ref{ch7} for an example of a scientific application relevant for coastal and maritime engineering analysis that has been assembled using customized library components similar to those presented in the following.
 160
 161 \subsection{Heat conduction equation}\index{heat conduction}
 162 First, we consider a two-dimensional heat conduction problem defined on a unit square. The heat conduction equation is a parabolic partial differential diffusion equation, including both spatial and temporal derivatives. It describes how the diffusion of heat in a medium changes with time. Diffusion equations are of great importance in many fields of sciences, e.g., fluid dynamics, where the fluid motion is uniquely described by the Navier-Stokes equations, which include a diffusive viscous term~\cite{ch5:chorin1993,ch5:Ferziger1996}.%, or in financial science where diffusive terms are present in the Black-Scholes equations for estimation of option price trends~\cite{}.
 163
 164 The heat problem is an IVP \index{initial value problem}, it describes how the heat distribution evolves from a specified initial state. Together with homogeneous Dirichlet boundary conditions\index{boundary condition}, the heat problem in the unit square is given as
 165 \begin{subequations}\begin{align}
 166 \frac{\partial u}{\partial t} - \kappa\nabla^2u =  0, & \qquad (x,y)\in \Omega([0,1]\times[0,1]),\quad t\geq 0, \label{ch5:eq:heateqdt}\\
 167 u =  0, & \qquad (x,y) \in \partial\Omega,\label{ch5:eq:heateqbc}
 168 \end{align}\label{ch5:eq:heateq}\end{subequations}
 169 where $u(x,y,t)$ is the unknown heat distribution defined within the domain $\Omega$, $t$ is the time, $\kappa$ is a heat conductivity constant (let $\kappa=1$), and $\nabla^2$ is the two-dimensional Laplace\index{Laplace operator} differential operator $(\partial_{xx}+\partial_{yy})$. We use the following initial condition:
 170 \begin{align}\label{ch5:eq:heatinit}
 171 u(x,y,t_0) = \sin(\pi x)\,\sin(\pi y), & \qquad (x,y) \in \Omega,
 172 \end{align}
 173 because it has a known analytic solution over the entire time span, and it satisfies the homogeneous boundary condition given by \eqref{ch5:eq:heateqbc}. An illustrative example of the numerical solution to the heat problem, using \eqref{ch5:eq:heatinit} as the initial condition, is given in Figure \ref{ch5:fig:heatsolution}.
 174 \begin{figure}[!htbp]
 175         \scriptsize
 176     \centering
 177     \setlength\figurewidth{0.26\textwidth}        %
 178     \setlength\figureheight{0.26\textwidth}       %
 179     \subfigure[$t=0.00s$]{\input{Chapters/chapter5/figures/HeatSolution0.tikz}}%
 180 %{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_conv.pdf}}
 181     \subfigure[$t=0.05s$]{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}}%
 182 %{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_049307_conv.pdf}}
 183     \subfigure[$t=0.10s$]{\input{Chapters/chapter5/figures/HeatSolution0.099723.tikz}}%
 184 %{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_099723_conv.pdf}}
 185     \caption{Discrete solution, at times $t=0s$ and $t=0.05s$, using \eqref{ch5:eq:heatinit} as the initial condition and a small $20\times20$ numerical grid.}\label{ch5:fig:heatsolution}
 186 \end{figure}
 187
 188 We use a Method of Lines (MoL)\index{method of lines} approach to solve \eqref{ch5:eq:heateq}. Thus,  the spatial derivatives are replaced with finite difference approximations, leaving only the temporal derivative as unknown. The spatial derivatives are approximated from $\myvec{u}^n$, where $\myvec{u}^n$ represents the approximate solution to $u(t_n)$ at a given time $t_n$ with time step size $\delta t$ such that $t_n=n\delta t$ for $n=0,1,\ldots$ The finite difference approximation\index{finite difference} can be interpreted as a matrix-vector product as sketched in \eqref{ch5:eq:stencilmatrix}, and so the semi-discrete heat conduction problem becomes
 189 \begin{align}\label{ch5:eq:discreteheateq}
 190 \frac{\partial u}{\partial t} = \mymat{A}\myvec{u}, \qquad \mymat{A} \in \mathbb{R}^{N\times N}, \quad \myvec{u} \in \mathbb{R}^{N},
 191 \end{align}
 192 where $\mymat{A}$ is the sparse finite difference matrix and $N$ is the number of unknowns in the discrete system. The temporal derivative is now free to be approximated by any suitable choice of a time-integration method\index{time integration}. The most simple integration scheme would be the first-order accurate explicit forward Euler method\index{Euler!forward Euler},
 193 \begin{align}\label{ch5:eq:forwardeuler}
 194 \myvec{u}^{n+1} = \myvec{u}^n + \delta t\,\mymat{A}\myvec{u}^n,
 195 \end{align}
 196 where $n+1$ refers to the solution at the next time step. The forward Euler method can be exchanged with alternative high-order accurate time integration methods, such as Runge-Kutta methods or linear multistep methods, if numerical instability becomes an issue, see, e.g., \cite{ch5:LeVeque2007} for details on numerical stability analysis. For demonstrative purpose, we simply use conservative time step sizes to avoid stability issues. However, the component-based library design provides exactly the flexibility for the application developer to select or change PDE solver parts, such as the time integrator, with little coding effort. A generic implementation of the forward Euler method that satisfies the library concept rules is illustrated in Listing~\ref{ch5:lst:euler}. According to the component guidelines in Figure \ref{ch5:fig:componentdesign}, a time integrator is basically a functor, which means that it implements the parenthesis operator, taking five template arguments: a right hand side operator, the state vector, integration start time, integration end time, and a time step size. The method takes as many time steps necessary to integrate from the start to the end, continuously updating the state vector according to \eqref{ch5:eq:forwardeuler}. Notice, that nothing in Listing~\ref{ch5:lst:euler} indicates wether GPUs are used or not. However, it is likely that the underlying implementation of the right hand side functor and the \texttt{axpy} vector function, do rely on fast GPU kernels. However, it is not something that the developer of the component has to account for. For this reason, the template-based approach, along with simple interface concepts, make it easy to create new components that will fit well into a generic library.
 197
 198
 199 The basic numerical approach to solve the heat conduction problem has now been outlined, and we are ready to assemble the PDE solver.
 200
 201 \pagebreak
 202 \lstinputlisting[label=ch5:lst:euler,caption={generic implementation of explicit first-order forward Euler integration}]{Chapters/chapter5/code/ex3.cu}
 203
 204
 205 %More accurate methods, like variations of Runge-Kutta methods or multi-step methods might be desirable, particulary if the accuracy of the spatial approximation is also high, see e.g. LeVeque for examples hereof~\cite{ch5:LeVeque2007}.
 206
 207 %The following section demonstrates how to rapidly assemble a solver for the heat problem just presented, with configurable order of accuracy for both the spatial and temporal approximations. Section \ref{ch5:sec:results} presents performance measures and Section \ref{ch5:sec:multigpu} provides details for further optimization strategies using multiple GPUs.
 208
 209
 210
 211 \subsubsection{Assembling the heat conduction solver}
 212 Before we are able to numerically solve the discrete heat conduction problem \eqref{ch5:eq:heateq}, we need implementations to handle the the following items:
 213 \begin{description}
 214 \item[Grid]-- A discrete numerical grid to represent the two-dimensional heat distribution domain and the arithmetical working precision ($32$-bit single-precision or $64$-bit double-precision).
 215 \item[RHS]-- A right-hand side operator for \eqref{ch5:eq:discreteheateq} that approximates the second-order spatial derivatives (matrix-vector product).
 216 \item[Boundary conditions]-- A strategy that ensures that the Dirichlet conditions are satisfied on the boundary.
 217 \item[Time integrator]-- A time integration scheme, that approximates the time derivative from \eqref{ch5:eq:discreteheateq}.
 218 \end{description}
 219 All items are either directly available in the library or can be designed from components herein. The built-in stencil operator may assist in implementing the matrix-vector product, but we need to explicitly ensure that the Dirichlet boundary conditions are satisfied. We demonstrated in Listing \ref{ch5:lst:stencil} how to approximate the derivative using flexible order finite difference stencils. However, from \eqref{ch5:eq:heateqbc} we know that boundary values are zero. Therefore, we extend the stencil operator with a simple kernel call that assigns zero to the entire boundary. Listing \ref{ch5:lst:laplaceimpl} shows the code for the two-dimensional Laplace right-hand side operator. The constructor takes as an argument the stencil half size $\alpha$ and assumes $\alpha=\beta$. Thus, the total two-dimensional stencil rank will be $4\alpha+1$. For simplicity we also assume that the grid is uniformly distributed, $N_x=N_y$. Performance optimizations for the stencil kernel, such as shared memory utilization, are handled in the underlying implementation, accordingly to CUDA guidelines~\cite{ch5:cudaguide,ch5:cudapractice}. The macros, \texttt{BLOCK1D} and \texttt{GRID1D}, are used to help set up kernel configurations based on grid sizes and \texttt{RAW\_PTR} is used to cast the vector object to a valid device memory pointer.
 220
 221 \lstinputlisting[label=ch5:lst:laplaceimpl,caption={the right-hand side Laplace operator: the built-in stencil approximates the two-dimensional spatial derivatives, while the custom set\_dirichlet\_bc kernel takes care of satisfying boundary conditions}]{Chapters/chapter5/code/laplacian.cu}
 222
 223 With the right-hand side operator in place, we are ready to implement the solver. For this simple PDE problem we compute all necessary initial data in the body of the main function and use the forward Euler time integrator to compute the solution until $t=t_{end}$. For more advanced solvers, a built-in \texttt{ode\_solver} class is defined that helps take care of initialization and storage of multiple state variables. Declaring type definitions for all components at the beginning of the main file gives a good overview of the solver composition. In this way, it will be easy to control or change solver components at later times. Listing~\ref{ch5:lst:heattypedefs} lists the type definitions\index{type definitions} that are used to assemble the heat conduction solver. %\stefan{elaborate on the choices}.
 224
 225 \lstset{label=ch5:lst:heattypedefs,caption={type definitions for all the heat conduction solver components used throughout the remaining code}}
 226 \begin{lstlisting}
 227 typedef double                              value_type;
 228 typedef laplacian<value_type>               rhs_type;
 229 typedef gpulab::grid<value_type>            vector_type;
 230 typedef vector_type::property_type          property_type;
 231 typedef gpulab::integration::forward_euler  time_integrator_type;
 232 \end{lstlisting}
 233 The grid is by default treated as a device object, and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a non periodic domain of size $N\times N$ within the unit square with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly.
 234
 235 \lstset{label=ch5:lst:gridsetup,caption={creating a two-dimensional grid of size \texttt{N} times \texttt{N} and physical dimension $0$ to $1$}}
 236 \begin{lstlisting}
 237 // Setup discrete and physical dimensions
 238 gpulab::grid_dim<int>          dim(N,N,1);
 239 gpulab::grid_dim<value_type>   p0(0,0);
 240 gpulab::grid_dim<value_type>   p1(1,1);
 241 property_type                  props(dim,p0,p1);
 242
 243 // Initialize vector
 244 vector_type                    u(props);
 245 \end{lstlisting}
 246 Hereafter the vector \texttt{u} can be initialized accordingly to \eqref{ch5:eq:heatinit}. Finally we need to instantiate the right-hand side Laplacian operator from Listing \ref{ch5:lst:laplaceimpl} and the forward Euler time integrator in order to integrate from $t_0$ until $t_{end}$.
 247
 248 \lstset{label=ch5:lst:timeintegrator,caption=creating a time integrator and the right-hand side Laplacian operator}
 249 \begin{lstlisting}
 250 rhs_type rhs(alpha);           // Create right-hand side operator
 251 time_integrator_type solver;   // Create time integrator
 252 solver(&rhs,u,0.0f,tend,dt);   // Integrate from 0 to tend using dt
 253 \end{lstlisting}
 254 The last line invokes the forward Euler time integration scheme defined in Listing \ref{ch5:lst:heattypedefs}. If the developer decides to change the integrator into another explicit scheme, only the time integrator type definition in Listing \ref{ch5:lst:heattypedefs} needs to be changed. The heat conduction solver is now complete.%The simple model problem illustrated, that a well designed library upholds developer productivity, by minimizing the
 255
 256 % how few CUDA implementations that were needed, and thereby upholding developer productivity.
 257
 258 %\todo{(How) do we give access to the entire file?}
 259
 260 \subsubsection{Numerical solutions to the heat conduction problem}
 261
 262 Solution time for the heat conduction problem is in itself not very interesting, as it is only a simple model problem. What is interesting for GPU kernels, such as the finite differences kernel, is that increased computational work often comes with a very small price, because the fast computations can be hidden by the relatively slower memory fetches. Therefore, we are able to improve the accuracy of the numerical solution via more accurate finite differences (larger stencil sizes), while improving the computational performance in terms of floating point operations per second (flops). Figure \ref{ch5:fig:stencilperformance} confirms, that larger stencils improve the kernel performance. Notice that even though these performance results are favorable compared to single core systems ($\sim 10$ GFlops double-precision on a $2.5$-GHz processor), they are still far from their peak performance, e.g., $\sim 2.4$ TFlops single-precision for the GeForce GTX590. The reason is that the kernel is bandwidth bound, i.e., performance is limited by the time it takes to move memory between the global GPU memory and the chip. The Tesla K20 performs better than the GeForce GTX590 because it obtains the highest bandwidth. Being bandwidth bound is a general limitation for matrix-vector-like operations that arise from the discretization of PDE problems. Only matrix-matrix multiplications, which have a high ratio of computations versus memory transactions, are able to reach near-optimal performance results~\cite{ch5:Kirk2010}. These kinds of operators are, however, rarely used to solve PDE problems.
 263
 264 \begin{figure}[!htb]
 265 \setlength\figureheight{0.35\textwidth}
 266 \setlength\figurewidth{0.4\textwidth}
 267 \centering
 268 {\scriptsize
 269 \subfigure[GeForce GTX590, test environment 1.]{
 270         \input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz}%
 271 }%
 272 \subfigure[Tesla K20c, test environment 2.]{
 273         \input{Chapters/chapter5/figures/AlphaPerformanceTeslaK20c_N16777216.tikz}
 274 }%
 275 %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}}
 276 }
 277 \caption[Single- and double-precision floating point operations per second for a two-dimensional stencil operator on a numerical grid of size $4096^2$.]{Single- and double-precision floating point operations per second for a two-dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils.}\label{ch5:fig:stencilperformance}
 278 \end{figure}
 279
 280
 281
 282
 283 \subsection{Poisson equation}\index{Poisson equation}
 284 The Poisson equation is a second-order elliptic differential equation, often encountered in applications within scientific fields such as electrostatics and mechanics. We consider the two- dimensional BVP \index{boundary volume problem} defined in terms of Poisson's equation with homogeneous Dirichlet boundary conditions on the form
 285 \begin{subequations}\begin{align}
 286 \nabla^2 u = f(x,y),& \qquad (x,y) \in \Omega([0,1]\times[0,1]), \\
 287 u = 0,& \qquad (x,y) \in \partial\Omega.
 288 \end{align}\label{ch5:eq:poissoneq}\end{subequations}
 289 Notice the similarities to the heat conduction equation \eqref{ch5:eq:heateq}. In fact, \eqref{ch5:eq:poissoneq} could be a steady-state solution to the heat equation, when there is no temporal change $\frac{\partial u}{\partial t}=0$, but a source term $f(x,y)$. Since the Laplace operator and the boundary conditions are the same for both problems, we are able to reuse the same implementation with few modifications.
 290
 291 Opposite to the heat equation, there are no initial conditions. Instead, we seek some $u(x,y)$ that satisfies \eqref{ch5:eq:poissoneq}, given a source term $f(x,y)$, on the right-hand side. For simplicity, assume that we know the exact solution, $u_{\textrm{true}}$, corresponding to \eqref{ch5:eq:heatinit}. Then we use the method of manufactured solutions to derive an expression for the corresponding right-hand side $f(x,y)$:
 292 \begin{align}
 293 f(x,y) = \nabla^2 u_{\textrm{true}} = -2\pi^2\,\sin(\pi x)\,\sin(\pi y). \label{ch5:eq:poissonrhs}
 294 \end{align}
 295 The spatial derivative in \eqref{ch5:eq:poissoneq} is again approximated with finite differences\index{finite difference}, similar to the example in \eqref{ch5:eq:stencilmatrix}, except boundary values are explicitly set to zero. The discrete form of the system can now be written as a sparse linear system of equations:
 296 \begin{align}
 297 \mymat{A}\myvec{u}=\myvec{f}, \qquad \myvec{u},\myvec{f} \in \mathbb{R}^{N}, \quad \mymat{A} \in \mathbb{R}^{N\times N}, \label{ch5:eq:poissonsystem}
 298 \end{align}
 299 where $\mymat{A}$ is the sparse matrix formed by finite difference coefficients, $N$ is the number of unknowns, and $\myvec{f}$ is given by \eqref{ch5:eq:poissonrhs}. Equation \eqref{ch5:eq:poissonsystem} can be solved in numerous ways, but a few observations may help do it more efficiently. Direct solvers based on Gaussian elimination are accurate and use a finite number of operations for a constant problem size. However, the arithmetic complexity grows with the problem size by as much as $\mathcal{O}(N^3)$ and does not exploit the sparsity of $\mymat{A}$. Direct solvers are therefore mostly feasible for dense systems of limited sizes. Sparse direct solvers exist, but they are often difficult to parallelize, or applicable for only certain types of matrices. Regardless of the discretization technique, the discretization of an elliptic PDE into a linear system as in \eqref{ch5:eq:poissonsystem} yields a very sparse matrix $\mymat{A}$ when $N$ is large. Iterative methods\index{iterative method} for solving large sparse linear systems find broad use in scientific applications, because they require only an implementation of the matrix-vector product, and they often use a limited amount of additional memory. Comprehensive introductions to iterative methods may be found in any of~\cite{ch5:Saad2003,ch5:Kelley1995,ch5:Barrett1994}.
 300
 301 One benefit of the high abstraction level and the template-based library design is to allow developers to implement their own components, such as iterative methods for solving sparse linear systems. The library includes three popular iterative methods: conjugate gradient, defect correction\index{defect correction}, and geometric multigrid. The conjugate gradient method is applicable only to systems with symmetric positive definite matrices. This is true for the two-dimensional Poisson problem, when it is discretized with a five-point finite difference stencil, because then there will be no off-centered approximations near the boundary. For high-order approximations ($\alpha>1$), we use the defect correction method with multigrid preconditioning. See e.g., \cite{ch5:Trottenberg2001} for details on multigrid methods.
 302
 303 We will not present the implementation details for all three methods but briefly demonstrate the simplicity of implementing the body of such an iterative solver, given a textbook recipe or mathematical formulation. The defect correction method iteratively improves the solution to $\mymat{A}\myvec{x}=\myvec{b}$, given an initial start guess $\myvec{x}^0$, by continuously solving a preconditioned error equation. The defect correction\index{defect correction!iteration} iteration can be written as
 304 \begin{align}
 305 \myvec{x}^{k+1} = \myvec{x}^{k} + \mymat{M}^{-1}(\myvec{b}-\mymat{A}\myvec{x}^{k}), \quad \mymat{A},\mathcal{M}\in\mathbb{R}^{N\times N}, \quad {\bf x},{\bf b}\in\mathbb{R}^N, \label{ch5:eq:dc}
 306 \end{align}
 307 where $k$ is the iteration number and $\mymat{M}$ is the preconditioner\index{preconditioning} which should be an approximation to the original coefficient matrix $\mymat{A}$. To achieve fast numerical convergence\index{convergence}, applying the preconditioner should be a computationally inexpensive operation compared to solving the original system. How to implement \eqref{ch5:eq:dc} within the library context is illustrated in Listing \ref{ch5:lst:dc}. The host CPU traverses each line in Listing \ref{ch5:lst:dc} and tests for convergence, while the computationally expensive matrix-vector operation and preconditioning, can be executed on the GPU, if GPU-based components are used. The defect correction method has two attractive properties. First, global reduction is required to monitor convergence only once per iteration during convergence evaluation, which reduces communication requirements and provides a basis for efficient and scalable parallelization. Second, it has a minimal constant memory footprint, making it a suitable method for solving very large systems.
 308
 309 \lstset{label=ch5:lst:dc,caption={main loop for the iterative defect correction solver: the solver is instantiated with template argument types for the matrix and vector classes, allowing underlying implementations to be based on GPU kernels}}
 310 \begin{lstlisting}
 311 while(r.nrm2() > tol)
 312 {
 313     // Calculate residual
 314         A.mult(x,r);
 315         r.axpby(1, -1, b);
 316
 317     // Reset initial guess
 318         d.fill(0);
 319
 320     // Solve M*d=r
 321         M(d,r);
 322
 323     // Defect correction update
 324         x.axpy(1, d);
 325 }
 326 \end{lstlisting}
 327
 328 In the following section we demonstrate how to assemble a solver for the discrete Poisson problem, using one of the three iterative methods to efficiently solve \eqref{ch5:eq:poissonsystem}.
 329
 330 \subsubsection{Assembling the Poisson solver}
 331
 332 Assembling the Poisson solver follows almost the same procedure as the heat conduction solver, except the time integration part is exchanged with an iterative method to solve the system of linear equations \eqref{ch5:eq:poissonsystem}. For the discrete matrix-vector product we reuse the Laplace operator from the heat conduction problem in Listing \ref{ch5:lst:laplaceimpl} with few modifications. The Laplace operator is now a matrix component, so to be compatible with the component interface rules in Figure \ref{ch5:fig:componentdesign}, a \texttt{mult} function taking two vector arguments is implemented instead of the parentheses operator. We leave out this code example as it almost identical to the one in Listing \ref{ch5:lst:laplaceimpl}.
 333
 334 At the beginning of the solver implementation we list the type definitions for the Poisson solver that will be used throughout the implementation. Here we use a geometric multigrid\index{multigrid} method as a preconditioner for the defect correction method. Therefore the multigrid solver is assembled first, so that it can be used in the assembling of the defect correction solver. Listing \ref{ch5:lst:poissontypedefs} defines the types for the vector, the matrix, the multigrid preconditioner, and the defect correction solver. The geometric multigrid method needs two additional template arguments that are specific for multigrid, namely, a smoother and a grid restriction/interpolation operator. These arguments are free to be implemented and supplied by the developer if special care is required, e.g., for a custom grid structure. For the Poisson problem on a regular grid, the library contains built-in restriction and interpolation operators, and a red-black Gauss-Seidel smoother. We refer the reader to \cite{ch5:Trottenberg2001} for extensive details on multigrid methods. The monitor and config types that appear in Listing \ref{ch5:lst:poissontypedefs} are used for convergence monitoring within the iterative solver and to control runtime parameters, such as tolerances and iteration limits.
 335
 336 \lstset{label=ch5:lst:poissontypedefs,caption={type definitions for the Laplacian matrix component and the multigrid preconditioned iterative defect correction\index{defect correction} solver}}
 337 \begin{lstlisting}
 338 typedef double                                   value_type;
 339 typedef gpulab::grid<value_type>                 vector_type;
 340 typedef laplacian<vector_type>                   matrix_type;
 341
 342 // MULTIGRID solver types
 343 typedef gpulab::solvers::multigrid_types<
 344           vector_type
 345         , matrix_type
 346         , gpulab::solvers::gauss_seidel_rb_2d
 347         , gpulab::solvers::grid_handler_2d>           mg_types;
 348 typedef gpulab::solvers::multigrid<mg_types>     mg_solver_type;
 349 typedef mg_solver_type::monitor_type             monitor_type;
 350 typedef monitor_type::config_type                config_type;
 351
 352 // DC solver types
 353 typedef gpulab::solvers::defect_correction_types<
 354           vector_type
 355         , matrix_type
 356         , monitor_type
 357         , mg_solver_type>                                 dc_types;
 358 typedef gpulab::solvers::defect_correction<dc_types> dc_solver_type;
 359 \end{lstlisting}
 360
 361 With the type definitions set up, the implementation for the Poisson solver follows in Listing \ref{ch5:lst:poissonsolver}. Some of the initializations are left out, as they follow the same procedure as for the heat conduction example. The defect correction and geometric multigrid solvers are initialized and then multigrid is set as a preconditioner to the defect correction method. Finally the system is solved via a call to \texttt{solve()}.
 362
 363 \pagebreak
 364 \lstset{label=ch5:lst:poissonsolver,caption={initializing the preconditioned defect correction solver to approximate the solution to $\mymat{A}\myvec{u}=\myvec{f}$}}
 365 \begin{lstlisting}
 366 matrix_type A(alpha);               // High-order matrix
 367 matrix_type M(1);                   // Low-order matrix
 368
 369 /* Omitted: create and init vectors u, f */
 370
 371 config_type config;                 // Create configuration
 372 config.set("iter",30);              // Set max iteration count
 373 config.set("rtol",1e-10);           // Set relative tolerance
 374 monitor_type monitor(config);       // Create monitor
 375 dc_solver_type solver(A,monitor);   // Create DC solver
 376 mg_solver_type precond(M,monitor);  // Create MG preconditioner
 377 solver.set_preconditioner(precond); // Set preconditioner
 378 solver.solve(u,f);                  // Solve M^-1(Au = f)
 379 if(monitor.converged())
 380   printf("SUCCESS\n");
 381 \end{lstlisting}
 382
 383
 384 \subsubsection{Numerical solutions to the Poisson problem}\label{ch5:sec:poissonresults}
 385
 386
 387 The discrete Poisson problem \eqref{ch5:eq:poissonsystem} has been solved using the three iterative methods presented above. Convergence histories for the conjugate gradient method and geometric multigrid method, using two different resolutions, are illustrated in Figure \ref{ch5:fig:poissonconvergence:a}. Multigrid methods are very robust and algorithmic efficient, independent of the problem size. Figure \ref{ch5:fig:poissonconvergence:a} confirms that the rate of convergence for the multigrid method is unchanged for both problem sizes. Only the attainable accuracy is slightly worsened, as a consequence of a more ill-conditioned system for large problem sizes.
 388
 389 Defect correction in combination with multigrid preconditioning enables efficient solution of high-order approximations of the Poisson problem, illustrated in Figure \ref{ch5:fig:poissonconvergence:b}. The multigrid preconditioning matrix $\mymat{M}$ is based on a low-order approximation to \eqref{ch5:eq:poissonsystem}, whereas matrix $\mymat{A}$ is a high-order approximation. When $\mymat{M}$ is a close approximation to $\mymat{A}$, defect correction converges most rapidly. This is the effect that can be seen between the three convergence lines in Figure \ref{ch5:fig:poissonconvergence:b}.
 390
 391
 392 \begin{figure}[!htb]
 393 \centering
 394 \subfigure[Convergence histories for the conjugate gradient (CG) and multigrid (MG) methods, for two different problem sizes.]{\label{ch5:fig:poissonconvergence:a}
 395     %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}}
 396     {\includegraphics[width=0.45\textwidth]{Chapters/chapter5/figures/ConvergenceMGvsCG_conv.pdf}}
 397 }%
 398 \hspace{0.5cm}%
 399 \subfigure[Defect correction convergence history for three different stencil sizes.]{\label{ch5:fig:poissonconvergence:b}
 400     %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}}
 401  {\includegraphics[width=0.45\textwidth]{Chapters/chapter5/figures/ConvergenceDC_conv.pdf}}
 402 }
 403 \caption{Algorithmic performance for the conjugate gradient, multigrid, and defect correction methods, measured in terms of the relative residual per iteration.}\label{ch5:fig:poissonconvergence}
 404 \end{figure}
 405
 406 %Numerical and algorithmic efficiency...
 407
 408 %\todo{Make this}
 409
 410 \section{Optimization strategies for multi-GPU systems}\label{ch5:sec:multigpu}\index{multi-GPU}
 411
 412 CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip performance. However, the role as a separate coprocessor to the CPU can be a limiting factor for large scale scientific applications, because the GPU memory capacity is fixed and is only in the range of a few gigabytes. In comparison, it is not unusual for a high-end workstation to be equipped with $\sim32$GB of main memory, plus a terabyte hard disk capacity for secondary storage. Therefore, large scale scientific applications that process gigabytes of data, require distributed computations on multiple GPU devices. Multi-GPU desktop computers and clusters can have a very attractive peak performance, but the addition of multiple devices introduces the potential performance bottleneck of slow data transfers across PCIe busses and network interconnections, as illustrated in Figure \ref{ch5:fig:gpu2gputransfer}. The ratio between data transfers and computational work has a significant impact on the possibility for latency hiding and thereby overall application performance.
 413
 414 \begin{figure}[!htb]
 415 \begin{center}
 416 %\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/GPU2GPU.png}
 417 \input{Chapters/chapter5/figures/GPU2GPU.tikz}
 418 \end{center}
 419 \caption[Message passing between two GPUs involves several memory transfers across lower bandwidth connections.]{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. The kernel call is required if the data is not already sequentially stored in device memory. Recent generations of NVIDIA GPUs, CUDA, and MPI support direct transfers without explicitely transfering data to the host first.}\label{ch5:fig:gpu2gputransfer}
 420 \end{figure}
 421
 422
 423 Developing applications that exploit the full computational capabilities of modern clusters--GPU-based or not--is no trivial matter. Developers are faced with the complexity of distributing and coordinating computations on nodes consisting of many-core CPUs, GPUs and potentially other types of accelerators as well. These complexities give rise to challenges in finding numerical algorithms, that are well suited for such systems, forcing developers to search for novel methods that utilize concurrency.
 424
 425 To ease software development, we use MPI-2 for message passing and ensure a safe and private communication space by creation of a communicator private to the library during initialization, as recommended by Hoefler and Snir~\cite{ch5:Hoefler2011}. With the addition of remote direct memory access (RDMA) for GPUDirect it is possible to make direct memory transfers between recent generation of GPUs (Kepler), eliminating CPU overhead. Unfortunately there are some strict system and driver requirements to enable these features. Therefore, in the following examples, device memory is first transferred to the CPU main memory before invoking any MPI calls. The library provides device-to-device transfers via template-based routines that work directly with GPU vector objects. This hides the complexity of message passing from the developer and helps developers design new components for multi-GPU execution.
 426
 427 %(GPUDirect v2 and Unified Virtual Addressing can help). Effort is being made to create GPU aware MPI implementations, that will allow direct memory transfers(cite: openMPI, MVAPICH2). However, only 64-bit systems with Unified Virtual Addressing (UVA) from CUDA v4.0 and Tesla 20-serie (Kepler) GPUs are capable of direct transfers. Therefore only a limited number of today's GPU clusters will be able to utilize these features.
 428 %For now, the simple solution is to first transfer data to the CPU main memory before invoking the appropriate MPI calls.
 429
 430 In the following sections we present two very different methods for distributed computing based on spatial and temporal decomposition. Each method has its own characteristic, which makes the method attractive for various types of PDE problems and for different problem sizes.
 431
 432
 433
 434 \subsection{Spatial domain decomposition}\label{ch5:sec:dd}\index{domain decomposition}
 435
 436 Domain decomposition methods can be used for distributing computational work in the numerical solution of boundary value problems~\cite{ch5:Smith1996}. These methods add parallelism by splitting the spatial dimensions, on which the boundary values are defined, into a number of smaller boundary value problems and then coordinating the solution between adjacent subdomains. Domain decomposition techniques, such as the classical overlapping Schwarz methods, may be considered as preconditioners \index{preconditioning} to the system of equations that arise from the discretization of PDEs, e.g., as for the Poisson problem in \eqref{ch5:eq:poissonsystem}. The algorithmic efficiency of such methods depends on the size of the domain overlaps, while an additional coarse grid correction method is sometimes necessary to maintain fast global convergence.
 437 %A detailed study and analysis of convergence criterions, overlapping zones, local domain solvers etc. can lead to parallel efficiencies that scale linearly with the number of processors for large problem sizes, see e.g. \cite{ch5:Cai2005}.
 438
 439 An alternative to the preconditioning strategy is to have each subdomain query information from adjacent subdomains whenever needed. For PDEs that are discretized onto regular shaped grids, this can be an attractive strategy, as the decomposition of subdomains, and thereby the communication topology\index{topology}, is straightforward. The library supports decomposition of regular grids by either pre- or user-defined topologies. A topology in this context, is a description of connectivity between processors that share the same grid along with information about the local and global discretization. Layers of ghost points are inserted at the artificially introduced boundaries to account for grid points that reside in adjacent subdomains. The ghost point values can be updated upon request from the user. An illustrative example is given in Figure \ref{ch5:fig:dd2d}; the arrows indicate message passing between adjacent subdomains, when updating grid points within the ghost layers.
 440 \begin{figure}[!htb]
 441 \begin{center}
 442 \input{Chapters/chapter5/figures/dd2d.tikz}
 443 \end{center}
 444 \caption[Domain distribution of a two-dimensional grid into three subdomains.]{Domain distribution of a two-dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points, respectively.}\label{ch5:fig:dd2d}
 445 \end{figure}
 446
 447 Topologies are introduced via an extra template argument to the grid class. A grid is by default not decomposed, because the default template argument is based on a non distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, which makes sure that all ghost points are updated according to the grid topology. The library contains topologies based on one-dimensional and two-dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
 448
 449 If grid ghost layers are updated whenever information from adjacent subdomains is needed, e.g., before a stencil operation, all interior points will be exactly the same as they would be for the non distributed setup. Therefore, one advantage of this approach is that the algorithmic efficiency of an application can be preserved, if grid updates are consistently invoked at the proper times.
 450
 451 Distributed performance for the finite difference stencil operation is illustrated in Figure \ref{ch5:fig:multigpu}. The timings include the compute time for the finite difference approximation and the time for updating ghost layers via message passing. It is obvious from Figure \ref{ch5:fig:multigpu:a} that communication overhead dominates for the smallest problem sizes, where the non distributed grid (1 GPU) is fastest. However, communication overhead does not grow as rapidly as computation times, due to the surface-to-volume ratio. Therefore message passing becomes less influential for large problems, where reasonable performance speedups are obtained. Figure \ref{ch5:fig:multigpu:b} demonstrates how the computational performance on multi-GPU systems can be significantly improved for various stencil sizes. With this simple domain decomposition technique, developers are able to implement applications based on heterogeneous distributed computing, without explicitly dealing with message passing and it is still possible to provide user specific implementations of the topology class for customized grid updates.
 452
 453 % TODO: Should we put in the DD algebra?
 454
 455 \begin{figure}[!htb]
 456     \setlength\figureheight{0.39\textwidth}
 457     \setlength\figurewidth{0.39\textwidth}
 458 %       \centering
 459     \subfigure[Absolute timings, $\alpha=3$.]{
 460     {\scriptsize\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}}
 461     %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050_conv.pdf}}
 462     \label{ch5:fig:multigpu:a}
 463     }%
 464     \subfigure[Performance at $N=4069^2$, single-precision.]{
 465     {\scriptsize\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}}
 466         %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216_conv.pdf}}
 467     \label{ch5:fig:multigpu:b}
 468     }
 469     \caption{Performance timings for distributed stencil operations, including communication and computation times. Executed on test environment 3.}\label{ch5:fig:multigpu}
 470 \end{figure}
 471
 472 \subsection{Parareal--parallel time integration}\label{ch5:sec:parareal}\index{parareal}
 473
 474 The use of spatial domain decomposition methods is widespread, as they have proven to be efficient on a wide range of problems. Unfortunately, applications that are concerned with limited numerical problem sizes can rapidly reach a speedup limit for a low number of processors due to scalability degradation when the number of processors becomes large, as this leads to an increasingly unfavourable communication-to-compute ratio. This issue is continuously worsened by the fact that communication speed has been increasing at a far slower pace than compute speed for the past several years, and this trend is expected to continue for years to come. It is often referred to as \emph{the memory wall}~\cite{ch5:Asanovic:EECS-2006-183}, one of the grand challenges facing development and architectural design of future high-performance systems~\cite{ch5:Keyes2011,ch5:ScientificGrandChallenges2010}. Also, there are applications based on ordinary differential equations, where classical domain decomposition methods are not even applicable~\cite{ch5:YMTR08}. For these type of applications, a method of adding parallelism in the temporal integration is of great interest. Contrary to space however, time is--by its very nature--sequential, which precludes a straightforward implementation of a parallel approach.
 475
 476 One method that introduces concurrency to the solution of evolution problems is the parareal algorithm. Parareal is an iterative method imposed on a time decomposition. Gander and Vandewalle showed in \cite{ch5:MS07} that the algorithm can be written both as a multiple shooting method and as a two-level multigrid-in-time approach, even though the leading idea came from spatial domain decomposition. The method has many exciting features: it is fault tolerant and has different communication characteristics than those of the classical domain decomposition methods. It has also been demonstrated to work effectively on a wide range of problems, and most importantly, once the proper distribution infrastructure is in place, it can easily be wrapped around any type of numerical integrator, for any type of initial value problem.
 477 \label{ch5:parareal}
 478
 479 \subsubsection{The parareal algorithm}
 480
 481 The parareal algorithm was first presented in 2001, in a paper by Lions et al.~\cite{ch5:LMT01}, and later introduced in a slightly revised predictor-corrector form in 2002 by Baffico et al.~\cite{ch5:LSY02}. The parareal-in-time approach proposes to break the global problem of time evolution into a series of independent evolution problems on smaller intervals, see Figure \ref{ch5:fig:ParallelInTime}. %
 482 \begin{figure}[!htb]
 483 \def\FigureSize{0.6}
 484 \begin{center}
 485 \input{Chapters/chapter5/figures/ParallelInTime.tikz}
 486 \end{center}
 487 \caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time subdomain to compute the initial value problem. Consistency at the time subdomain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm}\label{ch5:fig:ParallelInTime}
 488 \end{figure}%
 489 Initial states for these problems are needed and supplied by a simple, less accurate, but computationally cheap sequential integrator. The smaller independent evolution problems can then be solved in parallel. The information, generated during the concurrent solution of the independent evolution problems with accurate propagators and inaccurate initial states, is used in a predictor-corrector fashion in conjunction with the coarse integrator to propagate the solution faster, now using the information generated in parallel. We define the decomposition into $N$ intervals, that is,
 490 \begin{align}
 491  & T_{0}<T_{1}<\cdots<T_{n}=n\Delta T<T_{n+1}<T_{N},
 492 \end{align}
 493 where $\Delta T$ is the size of the time intervals and $n=0,1,\ldots,N$. The general initial value problem on the decomposed time domain is defined as
 494 \begin{align}\label{ch5:eq:difproblem}
 495  \frac{\partial u}{dt}+\mathcal{A}u=0, \qquad u(T_{0})=u^{0}, \quad t\in\left[T_{0},T_{N}\right],
 496 \end{align}
 497 where $\mathcal{A}$ is an operator from one Hilbert space to another. To solve the differential problem \eqref{ch5:eq:difproblem} we define an operator $\mathcal{F}_{\Delta T}$ that operates on some initial state $U_{n} \approx u(T_{n})$ and returns an approximate solution to \eqref{ch5:eq:difproblem}, at time $T_{n}+\Delta T$. Such an operator is achieved by the implementation of a numerical time integrator, using some small time-step $\delta t\ll\Delta T$ in the integration. The numerical solution to \eqref{ch5:eq:difproblem} can then be obtained by applying the fine propagator sequentially for $n=1,2,\ldots,N$.
 498 \begin{align}\label{ch5:eq:fineOp}
 499 \hat{U}_{n}=\mathcal{F}_{\Delta T}\left(T_{n-1},\hat{U}_{n-1}\right), \qquad\hat{U}_{0}=u^{0}.
 500 \end{align}
 501
 502 For the purpose of parallel acceleration of the otherwise purely sequential process of obtaining $\mathcal{F}_{\Delta T}^{N} u^{0} \approx u(T_{N})$, we define the coarse propagator $\mathcal{G}_{\Delta T}$. $\mathcal{G}_{\Delta T}$ also operates on some initial state $U_{n}$, propagating the solution over the time interval $\Delta T$, but now using a time step $\delta T$. Typically $\delta t<\delta T<\Delta T$. For the parareal algorithm to be effective, the coarse propagator $\mathcal{G}_{\Delta T}$ has to be substantially faster to evaluate than the fine propagator $\mathcal{F}_{\Delta T}$. There are many ways of constructing the coarse propagator, the simplest one being to apply the same numerical integrator as for the fine propagator, but using a coarser time discretization. We refer the reader to \cite{ch5:ASNP12} for an introduction to other methods. The coarse operator reads
 503 \begin{align}\label{ch5:eq:coarseOp}
 504 \tilde{U}_{n}=\mathcal{\mathcal{G}}_{\Delta T}\left(T_{n-1},\tilde{U}_{n-1}\right),\qquad\tilde{U_{0}}=u^{0}.
 505 \end{align}
 506 Using the defined $\mathcal{F}_{\Delta T}$ and $\mathcal{G}_{\Delta T}$ operators, the predictor-corrector form of the parareal algorithm can be written in a single line as
 507 \begin{equation}\label{ch5:eq:PARAREAL}
 508 U_{n}^{k+1}=\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k+1}\right)+\mathcal{\mathcal{F}}_{\Delta T}\left(U_{n-1}^{k}\right)-\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k}\right),\quad U_{0}^{k}=u^{0},
 509 \end{equation}
 510 with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. $N$ being the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time-integration component, using a fully distributed work scheduling model, as proposed by Aubanel~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer.  It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time-integration using forward Euler for coarse propagation and fourth order Runge-Kutta for fine propagation could then be defined as in Listings \ref{ch5:lst:parareal}. The number of GPUs used for parallelization depends on the number of MPI processes executing the application.
 511 \lstset{label=ch5:lst:parareal,caption={assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation}}
 512 \begin{lstlisting}
 513 typedef gpulab::integration::forward_euler          coarse;
 514 typedef gpulab::integration::ERK4                   fine;
 515 typedef gpulab::integration::parareal<coarse,fine>  integrator;
 516 \end{lstlisting}
 517
 518
 519 %Each application of $\mathcal{\mathcal{G}}_{\Delta T}$ or $\mathcal{\mathcal{F}}_{\Delta T}$ to a state $\tilde{U}_{n}^{k}$ is performed on a GPU while the corrections themselves are performed on CPUs. In addition to an efficient distribution model, a stopping strategy is needed as parareal is an iterative algorithm. Various choices has been implemented in the library, these are presented in section \ref{ch5:subsec:Lib_impl}.
 520
 521 \begin{figure}[!htb]
 522 \def\FigureSize{0.6}
 523 \begin{center}
 524 \input{Chapters/chapter5/figures/FullyDistributed.tikz}
 525 \end{center}
 526 \caption[Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel.]{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel~\cite{ch5:EA10}. Each GPU is responsible for computing the solution on a single time subdomain. The computation is initiated at rank $0$ and cascades through to rank $N$ where the final solution can be fetched.}
 527 \end{figure}
 528
 529 \subsubsection{Computational complexity}
 530
 531 In the analysis of the computational complexity, we first recognize that both the coarse and the fine propagators, regardless of the type of discretization scheme, involve a complexity that is proportional to the number of time steps being used. Let us define two scalar values $\mathcal{C}_\mathcal{F}$ and $\mathcal{C}_\mathcal{G}$ as the computational cost of performing a single step with the fine and coarse propagators. The computational complexity of a propagator integrating over an interval $\Delta T$ is then given by $\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}$ and $\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}$, respectively. $R$ is introduced as the relation between the two; that is, $R$ is a measure of how much faster the coarse propagator is compared to the fine propagator in integrating the time interval $\Delta T$. The total computational cost for parareal over $N$ intervals is then proportional to
 532 \begin{align}
 533 \left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+kN\mathcal{C_{F}}\frac{\Delta T}{\delta t}.
 534 \end{align}
 535 Recognizing that the second term can be distributed over $N$ processors, we are left with
 536 \begin{align}\label{ch5:eq:ComComplexPara}
 537 \left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+k\mathcal{C_{F}}\frac{\Delta T}{\delta t}.
 538 \end{align}
 539 The above should be compared to the computational complexity of a purely sequential propagation, using only the fine operator,
 540 \begin{align}
 541 \frac{T_{N}-T_{0}}{\delta t}\mathcal{C}_\mathcal{F}=N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}.
 542 \end{align}
 543 We can now estimate the speedup, here denoted $\psi$, as the ratio between the computational complexity of the purely sequential solution, and the complexity of the solution obtained by the parareal algorithm \eqref{ch5:eq:ComComplexPara}. Neglecting the influence of communication speed and correction time, we are left with the estimate
 544 \begin{align}\label{ch5:eq:EstiSpeedBasicPara}
 545 \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}.
 546 \end{align}
 547 If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity.
 548
 549 \begin{figure}[!htb]
 550     \setlength\figureheight{0.32\textwidth}
 551     \setlength\figurewidth{0.35\textwidth}
 552     \begin{center}
 553     \subfigure[The relative error after one parareal iteration ($k=1$).]{
 554     {\small\input{Chapters/chapter5/figures/pararealK1ErrvsRvsGPUs.tikz}}
 555     \label{ch5:fig:pararealRvsGPUs:a}
 556     }
 557     \subfigure[Iterations $K$ needed to obtain a relative error less than $10^{-5}$.]{
 558     {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
 559     \label{ch5:fig:pararealRvsGPUs:b}
 560     }
 561     \end{center}
 562     \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
 563 \end{figure}
 564
 565 \begin{figure}[!htb]
 566     \setlength\figureheight{0.32\textwidth}
 567     \setlength\figurewidth{0.34\textwidth}
 568     \begin{center}
 569     \subfigure[Measured parallel speedup]{
 570     {\small\input{Chapters/chapter5/figures/pararealSpeedupvsRvsGPUs.tikz}}
 571     \label{ch5:fig:pararealGPUs:a}
 572     }
 573     \subfigure[Measured parallel efficiency]{
 574     {\small\input{Chapters/chapter5/figures/pararealEfficiencyvsRvsGPUs.tikz}}
 575     \label{ch5:fig:pararealGPUs:b}
 576     }
 577     \end{center}
 578     \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}
 579 \end{figure}
 580
 581 %TODO: Do we make this into a subsubsection:
 582 %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl}
 583 %Describe library features. Stopping criteria. Usage of different numerical integrators. Examples including c++ code. Speed-up measurements on test cases.
 584
 585 %\begin{figure}[!htb]
 586 %    \setlength\figureheight{0.32\textwidth}
 587 %    \setlength\figurewidth{0.35\textwidth}
 588 %    \begin{center}
 589 %    \subfigure[Parareal iterations $k$ needed for converge to a relative error of $10^{-6}$]{
 590 %    {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
 591 %    \label{ch5:fig:pararealGPUs:a}
 592 %    }
 593 %    \subfigure[Measured speed-up]{
 594 %    {\small\input{Chapters/chapter5/figures/pararealSpeedupvsRvsGPUs.tikz}}
 595 %    \label{ch5:fig:pararealGPUs:b}
 596 %    }
 597 %    \end{center}
 598 %    \caption{Parareal performance and convergence properties as a function of $R$ and number GPUs used. $R$ is regulated by the time discretization in the coarse propagator. }\label{ch5:fig:pararealGPUs}
 599 %\end{figure}
 600
 601 \section{Conclusion and outlook}
 602
 603 Massively parallel heterogeneous systems continue to enter the consumer market, and there has been no identification that this trend will stop for years to come. However, these parallel architectures require software vendors to adjust to new programming models and optimization strategies. Good software libraries are important tools for reducing the time and complexity of adjusting to new architectures, and they provide the user with an intuitive programming interface.
 604
 605 We have presented ideas for a generic GPU-based library for fast assembling of PDE solvers. A high-level interface and simple implementation concepts were created with the objective of enhancing developer productivity and to ensure performance portability. Two elementary PDE model problems were used to demonstrate assembling of both spatial and temporal approximation techniques. Results confirmed that numerical methods for solution of PDE systems are bandwidth limited on GPU systems. Therefore higher-order methods can be advantageous, when extra computational work can be performed during memory transfer idle times, leading to increased flop performance.
 606
 607 Decomposition strategies for spatial and temporal parallelization on multi-GPU systems have been presented, with reasonable performance speedups. Novel results for the parareal algorithm on multi-GPU systems have been presented, and an example of parallel efficiency for the heat conduction problem has been demonstrated. Furthermore, a domain decomposition technique that preserves algorithmic efficiency has been presented for the Poisson problem.
 608
 609 The library has already successfully been used for development of a fast tool intended for scientific applications within maritime engineering; see Chapter \ref{ch7} for details. We intend to further extend the library, as we explore new techniques, suitable for parallelization on heterogeneous systems, that fit the scope of our applications.
 610
 611 For more information about the library and our work within the GPUlab group at the Technical University of Denmark, we encourage you to visit \texttt{http://gpulab.imm.dtu.dk}
 612
 613 \section*{Acknowledgments}
 614 This work have been supported by grant no. 09-070032 %on "Desktop Scientific Computing on Consumer Graphics Cards"
 615 from the Danish Research Council for Technology and Production Sciences. A special thanks goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests were done in the GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at the Center for Computing and Visualization, Brown University, USA. The NVIDIA Corporation is acknowledged for generous hardware donations to facilities of the GPUlab.
 616
 617 %-------------------------------------------------------------------------
 618 %\backmatter % bibliography, index and similar things, which are not numbered.
 619 %\include{glossary}
 620 %\include{notat}
 621 % TODO:
 622 %References to include
 623 %\cite{ch5:Gropp1999}
 624 %\cite{ch5:Gropp1999a}
 625 %\cite{ch5:LeVeque2007}
 626 %\cite{ch5:Gamma1995}
 627 %\cite{ch5:Hoefler2011}
 628 %\cite{ch5:Bajrovic2011}
 629 %\cite{ch5:Skjellum1994}
 630 %\cite{ch5:Glimberg2011}
 631 %\cite{ch5:Engsig-Karup2011}
 632 %\cite{ch5:Smith1996}
 633 %\cite{ch5:Vandevoorde2002}
 634 %\cite{ch5:Bell2011}
 635 %\cite{ch5:mooreslaw1965}
 636
 637 \putbib[Chapters/chapter5/biblio5]
 638
 639 % Reset lst label and caption
 640 %\lstset{label=,caption=}