new ch5 reread

[book_gpu.git] / BookGPU / Chapters / chapter5 / ch5.tex
diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex

index ff24f8ce9254b73b2b1811f89278f33344431b6b..f7eabb40b73e179a4441664b461099c2f8e06610 100644 (file)
--- a/BookGPU/Chapters/chapter5/ch5.tex
+++ b/BookGPU/Chapters/chapter5/ch5.tex
@@ -1,12 +1,9 @@
-\chapterauthor{Stefan L. Glimberg}{Technical University of Denmark}
-\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark}
-\chapterauthor{Allan S. Nielsen}{Technical University of Denmark}
-\chapterauthor{Bernd Dammann}{Technical University of Denmark}
-
-
-
-\chapter{Development of software components for heterogeneous many-core architectures}\label{ch5}
+\chapterauthor{Stefan L. Glimberg, Allan P. Engsig-Karup, Allan S. Nielsen, and Bernd Dammann}{Technical University of Denmark}
+%\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark}
+%\chapterauthor{Allan S. Nielsen}{Technical University of Denmark}
+%\chapterauthor{Bernd Dammann}{Technical University of Denmark}
  
+\chapter[Software components for heterogeneous manycore architectures]{Development of software components for heterogeneous manycore architectures}\label{ch5}
  
  %Subjects:
  %\begin{itemize}
@@ -15,7 +12,6 @@
  %\item efficient and scalable iterative methods for solution of high-order numerical methods and strategies for efficient implementations on desktop architectures.
  %\end{itemize}
  
-
  \section{Software development for heterogeneous architectures}
  %Our library facilitates massively parallelization through GPU computing and contains components for various iterative strategies such as DC, CG, CGNR, BiCGSTAB and GMRES(m) for solution of large linear systems along with support for preconditioning strategies. The goal is to create a reusable library and framework which provide general components with performance similar to that of a dedicated solver. Preliminary results show that performance overhead can be kept minimal in this new software framework [8].
  
@@ -29,25 +25,25 @@
  
  % there is a price/penalty to pay when using libraries, i.e. one doesn't necessarily get the best performance, because the architectural details are not visible to the programmer (keyword: Visibility  [Berkeley dwarf paper]).  However, if the library permits to write own add-ons/kernels (flexibility), this is no longer an issue.
  
-Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast amount of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are now programmable and can be highly effective compute units, they still pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main- and co-processor can be so significant, that they call for completely different optimization strategies. The cache hierarchy management of CPUs and GPUs being an evident example hereof. In the past, industrial companies were able to boost application performance solely by upgrading their hardware systems, with an overt balance between investment and performance speedup. Today, the picture is different, not only do they have to invest in new hardware, but also account for the adaption and training of their software developers. What traditionally used to be a hardware problem, addressed by the chip manufacturers, has now become a software problem for application developers.
+Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast amount of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are now programmable and can be highly effective computing units, they still pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main and coprocessor can be so significant that they require completely different optimization strategies. The cache hierarchy management of CPUs and GPUs are an evident example hereof. In the past, industrial companies were able to boost application performance solely by upgrading their hardware systems, with an overt balance between investment and performance speedup. Today, the picture is different; not only do they have to invest in new hardware, but they also must account for the adaption and training of their software developers. What traditionally used to be a hardware problem, addressed by the chip manufacturers, has now become a software problem for application developers.
  
-Software libraries \index{software library} \index{library|see{software library}} can be a tremendous help for developers as they make it easier to implement an application, without having to know about the complexity of the underlying computer hardware, known as \emph{opacity}~\cite{ch5:Asanovic:EECS-2006-183}. The ultimate goal for a successful library is to simplify the process of writing new software and thus to increase developer productivity. Since programmable heterogeneous CPU/GPU systems are a rather new phenomenon, there is yet a limited number of established software libraries that take full advantage of such heterogeneous high performance systems, and there are no de-facto design standards for such systems either. Some existing libraries for conventional homogeneous systems have already added support for offloading computational intense operations onto co-processing GPUs. However, this approach comes with the cost of frequent memory transfers across the low bandwidth PCIe bus.
+Software libraries\index{software library}\index{library|see{software library}} can be a tremendous help for developers as they make it easier to implement an application, without requiring special knowledge of the underlying computer architecture and hardware. A library may be referred to as \emph{opaque} when it automatically utilize the available resources, without requiring specific details from the developer\cite{ch5:Asanovic:EECS-2006-183}. The ultimate goal for a successful library is to simplify the process of writing new software and thus to increase developer productivity. Since programmable heterogeneous CPU/GPU systems are a rather new phenomenon, there is yet a limited number of established software libraries that take full advantage of such heterogeneous high performance systems, and there are no de facto design standards for such systems either. Some existing libraries for conventional homogeneous systems have already added support for offloading computationally intense operations onto coprocessing GPUs. However, this approach comes at the cost of frequent memory transfers across the low bandwidth PCIe bus.
  
-In this chapter, we focus on the use of a software library to help application developers achieve their goals without spending immense time on optimization details, while still offering close-to-optimal performance. A good library provides performance portable implementations with intuitive interfaces, that hide the complexity of underlaying hardware optimizations. Unfortunately, this opaqueness sometimes comes with a price, as one does not necessarily get the best performance when the architectural details are not \emph{visible} to the programmer~\cite{ch5:Asanovic:EECS-2006-183}. If, however, the library is flexible enough and permits developers to supply their own low-level implementations as well, this does not need to be an issue. These are some of the considerations library developers should make, and what we will try to address in this chapter.
+In this chapter, we focus on the use of a software library to help application developers achieve their goals without spending an immense amount of time on optimization details, while still offering close-to-optimal performance. A good library provides performance-portable implementations with intuitive interfaces, that hide the complexity of underlaying hardware optimizations. Unfortunately, opaqueness sometimes comes at a price, as one does not necessarily get the best performance when the architectural details are not \emph{visible} to the programmer~\cite{ch5:Asanovic:EECS-2006-183}. If, however, the library is flexible enough and permits developers to supply their own low-level implementations as well, this does not need to be an issue. These are some of the considerations library developers should take into account, and what we will try to address in this chapter.
  
  
-For demonstrative purposes we present details from an in-house generic CUDA-based C++ library for fast assembling of partial differential equation (PDE) solvers, utilizing the computational resources of GPUs. It has been developed as part of research activities associated with GPUlab, at the Technical University of Denmark, and therefore referred to as the GPUlab library. It falls into the category of computational libraries, as categorized by Hoefler and Snir~\cite{ch5:Hoefler2011}. Memory allocation and basic algebraic operations are supported via object oriented components, without the user having to write CUDA specific kernels. As back-end vector class, the CUDA Thrust library is used, enabling easy memory allocation and a high-level interface for vector manipulation~\cite{ch5:Bell2011}. Inspirations for \emph{good library design}, some of which we will present in this chapter, originates from guidelines proposed throughout literature~\cite{ch5:Hoefler2011,ch5:Gamma1995,ch5:Skjellum1994}. An identification of desirable properties, that any library should strive to achieve, is pointed out by Korson and McGregor~\cite{ch5:Korson1992}, in particular we mention easy-to-use, extensible, and intuitive.
+For demonstrative purposes we present details from an in-house generic CUDA-based C++ library for fast assembling of partial differential equation (PDE) solvers, utilizing the computational resources of GPUs. This library has been developed as part of research activities associated with the GPUlab, at the Technical University of Denmark and, therefore, is referred to as the {\em GPUlab library}. It falls into the category of computational libraries, as categorized by Hoefler and Snir~\cite{ch5:Hoefler2011}. Memory allocation and basic algebraic operations are supported via object-oriented components, without the user having to write CUDA specific kernels. As a back-end vector class, the parallel CUDA Thrust template-based library is used, enabling easy memory allocation and a high-level interface for vector manipulation~\cite{ch5:Bell2011}. Inspirations for \emph{good library design}, some of which we will present in this chapter, originate from guidelines proposed throughout the literature~\cite{ch5:Hoefler2011,ch5:Gamma1995,ch5:Skjellum1994}. An identification of desirable properties, which any library should strive to achieve, is pointed out by Korson and McGregor~\cite{ch5:Korson1992}. In particular we mention being easy-to-use, extensible, and intuitive.
  
  
  %We rely on template-based implementations to allow full flexibility to change and assemble types~\cite{ch5:Vandevoorde2002}
  
  %The library presented in this chapter falls into the category of computational libraries, as defined by Hoefler and Snir~\cite{ch5:Hoefler2011}, and seeks to fulfil the relevant objectives presented, such as designing components around data structures (classes), utilizing inheritance for component reuse, and hiding irrelevant information from the end users. An identification of desirable attributes for library development is also pointed out by Korson and McGregor~\cite{ch5:Korson1992}, in particular the library is designed to be easy-to-use, extensibility, and intuitive.
  
-The library is designed to be effective and scalable for fast prototyping of PDE solvers, (primarily) based on matrix-free implementations of finite difference (stencil) approximations on logically structured grids. It offers functionalities that will help assemble PDE solvers that automatically exploit heterogeneous architectures, much faster than manually having to manage GPU memory allocation, memory transfers, kernel launching, etc.
+The library is designed to be effective and scalable for fast prototyping of PDE solvers, (primarily) based on matrix-free implementations of finite difference (stencil) approximations on logically structured grids. It offers functionalities that will help assemble PDE solvers that automatically exploit heterogeneous architectures much faster than manually having to manage GPU memory allocation, memory transfers, kernel launching, etc.
  
-In the following sections we demonstrate how software components that play important roles in many scientific applications can be designed to fit a simple framework that will run efficiently on heterogeneous systems. One example is finite difference approximations, commonly used to find numerical solutions to differential equations. Matrix-free implementations minimize both memory consumption and main memory access, two important features for efficient GPU utilization and for enabling solution of large-scale problems. The bottleneck problem for many PDE applications, is to find the solution of a large sparse linear systems, arising from the discretization. In order to help solve these systems, the library includes a set of iterative solvers. All iterative solvers are template-based, such that vector and matrix classes, along with their underlying implementations, can be freely interchanged or whole new solvers can even be implemented without much coding effort. The generic nature of the library, along with a predefined set of interface rules, allow assembling of components into PDE solvers. The use of parameterized type binding, allows the user to control the assembling of PDE solvers at a high abstraction level, without having to change the remaining implementation.
+In the following sections we demonstrate how software components that play important roles in scientific applications can be designed to fit a simple framework that will run efficiently on heterogeneous systems. One example is finite difference approximations, commonly used to find numerical solutions to differential equations. Matrix-free implementations minimize both memory consumption and memory access, two important features for efficient GPU utilization and for enabling the solution of large-scale problems. The bottleneck problem for many PDE applications is to solve large sparse linear systems, arising from the discretization. In order to help solve these systems, the library includes a set of iterative solvers. All iterative solvers are template-based, such that vector and matrix classes, along with their underlying implementations, can be freely interchanged. New solvers can also be implemented without much coding effort. The generic nature of the library, along with a predefined set of interface rules, allows assembling components into PDE solvers. The use of parameterized-type binding allows the user to assemble PDE solvers at a high abstraction level, without having to change the remaining implementation.
  
-Since this chapter is mostly dedicated to the discussion of software development for high performance heterogeneous systems, the focus will be merely on the development and usage of an in-house GPU-based library, more than on specific scientific applications. We demonstrate how to use the library on two elementary model problems, and refer to Chapter \ref{ch7} for a detailed description of an advanced application tool for free surface water wave simulations. These examples are assembled using the library components presented in this chapter.
+Since this chapter is mostly dedicated to the discussion of software development for high performance heterogeneous systems, the focus will be more on the development and usage of an in-house GPU-based library, than on specific scientific applications. We demonstrate how to use the library on two elementary model problems and refer the reader to Chapter \ref{ch7} for a detailed description of an advanced application tool for free surface water wave simulations. These examples are assembled using the library components presented in this chapter.
  
  % Make sure that we present a way to implement PDE solver components, and that our library is an example that demonstrate the use.
  
@@ -66,43 +62,43 @@ Since this chapter is mostly dedicated to the discussion of software development
  %- The following sections are ordered as follows...
  
  \subsection{Test environments}\label{ch5:sec:testenvironments}
-Throughout the chapter we have used two different test environments; a high-end desktop computer located at the GPUlab -- Technical University of Denmark, and a GPU cluster located at the Center for Computing and Visualization, Brown University, USA. Hardware details for the two systems are as follows:
+Throughout the chapter we use three different test environments: two high-end desktop computers located at the GPUlab--Technical University of Denmark, and a GPU cluster located at the Center for Computing and Visualization, Brown University, USA. Hardware details for the two systems are as follows:
  \begin{description}
-\item[Test environment 1.] Desktop computer, Linux Ubuntu v10.04, dual Intel Xeon E5620 (2.4 GHz) quad-core Westmere processor, 12 GB of DDR-3 memory (1066 MHz), 2x Nvidia GeForce 590 GPU with 3GB DDR5 memory, PCIe 2.0.
-\item[Test environment 2.] GPU cluster, Linux, up to 44 compute nodes based on dual Intel Xeon E5540 (2.53 GHz) quad-core Nehalem processors, 24 GB of DDR-3 memory (1333 MHz), 2x Nvidia Tesla M2050 GPUs with 3GB GDDR5 memory, 40 Gb/s Quad-Data-Rate (QDR) InfiniBand interconnect.
+\item[Test environment 1.] Desktop computer, Linux Ubuntu, Intel Xeon E5620 (2.4GHz) quad-core Westmere processor, 12GB of DDR-3 memory (1066 MHz), 2x NVIDIA GeForce GTX590 GPU with 3GB DDR5 memory, PCIe 2.0.
+\item[Test environment 2.] Desktop computer, Linux Ubuntu, Intel Core i7-3820 (3.60GHz) Sandy Bridge processors, 32GB RAM, 2x NVIDIA Tesla K20 GPUs with 5GB DDR5 memory, PCIe 2.0.
+\item[Test environment 3.] GPU cluster, Linux, up to 44 compute nodes based on dual Intel Xeon E5540 (2.53GHz) quad-core Nehalem processors, 24 GB of DDR-3 memory (1333MHz), 2x NVIDIA Tesla M2050 GPUs with 3GB GDDR5 memory, 40 Gb/s Quad-Data-Rate (QDR) InfiniBand interconnect.
  \end{description}
  
  
-\section{Heterogenous library design for PDE solvers}
-A generic CUDA-based C++ library has been developed to ease the assembling of PDE solvers. The template \index{templates} based design allows users to assemble solver parts easily, and to supply their own implementation of problem specific parts. In the following, we present an overview of the library and the supported features, we introduce the concepts of the library components and give short code examples to ease understanding. The library is a starting point for fast assembling of GPU-based PDE solvers, developed mainly to support finite difference operations on regular grids. However, this is not a limitation, since existing vector objects could be used as base classes for extending to other discretization methods or grid types as well.
+\section{Heterogeneous library design for PDE solvers}
+A generic CUDA-based C++ library has been developed to ease the assembling of PDE solvers. The template-based\index{templates} design allows users to assemble solver parts easily and to supply their own implementation of problem specific parts. In the following, we present an overview of the library and the supported features, introduce the concepts of the library components, and give short code examples to ease understanding. The library is a starting point for fast assembling of GPU-based PDE solvers, developed mainly to support finite difference operations on regular grids. However, this is not a limitation, since existing vector objects could be used as base classes for extending to other discretization methods or grid types as well.
  
  \subsection{Component and concept design}
-The library is grouped into component classes. Each component should fulfill a set of simple interface and template rules, called concepts \index{concept}, in order to guarantee compatibility with the rest of the library. In the context of PDE solving, we present five component types; vectors, matrices, iterative solvers for linear system of equations, preconditioners for the iterative solvers, and time integrators. Figure \ref{ch5:fig:componentdesign} lists the five components along with the type definitions they must provide, and the methods they should implement. It is possible to extend the implementation of these components with more functionalities, that relate to specific problems, but this is the minimum requirement for compatibility with the remaining library. With these rather concept rules fulfilled, components can rely on other components to have implementations of their member functions.
+The library is grouped into component classes. Each component should fulfill a set of simple interface and template rules, called concepts\index{concept}, in order to guarantee compatibility with the rest of the library. In the context of PDE solving, we present five component classes: vectors, matrices, iterative solvers for linear system of equations, preconditioners for the iterative solvers, and time integrators. Figure \ref{ch5:fig:componentdesign} lists the five components along with a subset of the type definitions they should provide and the methods they should implement. It is possible to extend the implementation of these components with more functionality that relate to specific problems, but this is the minimum requirement for compatibility with the remaining library. With these concept rules fulfilled, components can rely on other components to have their respective functions implemented.
  
  \begin{figure}[!htb]
-\begin{center}
+\centering
  \input{Chapters/chapter5/figures/component_design.tikz}
-\end{center}
-\caption[Schematic representation of the five main components, their type definitions, and member functions.]{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known in beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign}
+\caption[Schematic representation of the five main components, their type definitions, and member functions.]{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known beforehand. The concepts ensure compliance among components.}\label{ch5:fig:componentdesign}
  \end{figure}
  
-A component is implemented as a generic C++ class, that normally takes as template arguments the same types that it offers through type definitions: a matrix takes a vector class as template argument, and a vector class takes a working precision value type. The matrix can then access the working precision via the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to maintain code simplicity. We will demonstrate use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}.
+A component is implemented as a generic C++ class, and normally takes as a template arguments the same types that it offers through type definitions: a matrix takes a vector as template argument, and a vector takes the working precision type. The matrix can then access the working precision through the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to reduce the number of arguments and maintain code simplicity. We will demonstrate use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}.
  
-The generic configuration allows the developer to define and assemble solver parts at the very beginning of the program, using type definitions. Changing PDE solver parts at a later time is then only a matter of changing the proper definitions. We will give examples of how to assemble PDE solvers in section \ref{ch5:sec:modelproblems}.
+The generic configuration allows the developer to define and assemble solver parts at the very beginning of the program using type definitions. Changing PDE parts at a later time is then only a matter of changing type definitions. We will give two model examples of how to assemble PDE solvers in Section \ref{ch5:sec:modelproblems}.
  
  \subsection{A matrix-free finite difference component}\index{finite difference}\index{matrix-free}
-Common vector operations, such as memory allocation, element-wise assignments, and basic algebraic transformations, require many lines of codes for a purely CUDA-based implementation. These CUDA specific operations and kernels are hidden from the user behind library implementations, to ensure a high abstraction level. The vector classes inherit from the CUDA-based Thrust library and therefore offer the same level of abstraction, that enhance developer productivity and enables performance portability. Creating and allocating device (GPU) memory for two vectors can be done in a simple and intuitive way using the GPUlab library, as shown in the following example where two vectors are added together:
+Common vector operations, such as memory allocation, element-wise assignments, and basic algebraic transformations, require many lines of codes for a purely CUDA-based implementation. These CUDA-specific operations and kernels are hidden from the user behind library implementations, to ensure a high abstraction level. The vector class inherits from the CUDA-based Thrust library and therefore offer the same level of abstraction that enhances developer productivity and enables performance portability. Creating and allocating device (GPU) memory for two vectors can be done in a simple and intuitive way using the GPUlab library, as shown in Listing \ref{ch5:lst:vector} where two vectors are added together.
  
-\lstinputlisting[label=ch5:lst:vector,caption={Allocating, initializing, and adding together two vectors on the GPU. First example are using pure CUDA C, second example use the built-in library template-based vector class.}]{Chapters/chapter5/code/ex1.cu}
+\lstinputlisting[label=ch5:lst:vector,caption={allocating, initializing, and adding together two vectors on the GPU: first example uses pure CUDA C; second example uses the built-in library template-based vector class}]{Chapters/chapter5/code/ex1.cu}
  
-The vector class (and derived classes hereof) works with the rest of the library components. Therefore we encourage to use this class as the base vector object. Matrix-vector multiplications are usually what makes PDE-based applications differ, and the need to write a user specific implementation of the matrix-vector product is essential to solve specific PDE problems. The PDE and the choice of discretization method determine the structure and sparsity of the resulting matrix. Spatial discretization is supported by the library with finite difference approximations, and it offers an efficient low-storage (matrix-free) flexible order\index{flexible order} implementation, to help developers tailor their custom codes. These matrix-free operators are feasible for problems where the matrix structure is known in advance and can be exploited, such that the matrix values can either be precomputed or computed on-the-fly. Furthermore, the low constant memory requirement makes them perfect in the context of solving large scale problems, whereas traditional sparse matrix formats require increasingly more memory~\cite{ch5:Bell2009}.
+The vector class (and derived classes hereof) is compliant with the rest of the library components. Matrix-vector multiplications are usually what makes PDE-based applications different from each other, and the need to write a user specific implementation of the matrix-vector product is essential when solving specific PDE problems. The PDE and the choice of discretization method determine the structure and sparsity of the resulting matrix. Spatial discretization is supported by the library with finite difference approximations, and it offers an efficient low-storage (matrix-free) flexible order\index{flexible order} implementation to help developers tailor their custom codes. These matrix-free operators are feasible for problems where the matrix structure is known in advance and can be exploited, such that the matrix values can be either precomputed or computed on the fly. Furthermore, the low constant memory requirement makes them perfect in the context of solving large scale problems, whereas traditional sparse matrix formats require increasingly more memory, see e.g., ~\cite{ch5:Bell2009} for details on GPU sparse matrix formats.
  
  Finite differences approximate the derivative of some function $u(x)$ as a weighted sum of neighboring elements. In compact notation we write
  \begin{align}\label{ch5:eq:fdstencil}
  \frac{\partial^q u(x_i)}{\partial x^q} \approx \sum_{n=-\alpha}^{\beta}c_n u(x_{i+n}),
  \end{align}
-where $q$ is the order of the derivative, $c_n$ is a set of finite difference coefficients, where $\alpha$ plus $\beta$ define the number of coefficients that are used for the approximation. The total set of contributing elements is called the stencil,\index{stencil} and the size of the stencil is called the rank, given as $\alpha + \beta + 1$. The stencil coefficients $c_n$ can be derived from a Taylor expansion based on the values of $\alpha$, $\beta$, and for a specific order $q$, using the method of undetermined coefficients~\cite{ch5:LeVeque2007}.
-An example of a three-point finite difference matrix that approximates the first ($q=1$) or second ($q=2$) derivative of a one dimensional uniformly distributed vector $u$ of length $8$ is given here
+where $q$ is the order of the derivative, $c_n$ is a set of finite difference coefficients, and $\alpha$ plus $\beta$ define the number of coefficients that are used for the approximation. The total set of contributing elements is called the stencil,\index{stencil} and the size of the stencil is called the rank, given as $\alpha + \beta + 1$. The stencil coefficients $c_n$ can be derived from a Taylor expansion based on the values of $\alpha$ and $\beta$, and $q$, using the method of undetermined coefficients~\cite{ch5:LeVeque2007}.
+An example of a three-point finite difference matrix that approximates the first ($q=1$) or second ($q=2$) derivative of a one-dimensional uniformly distributed vector $u$ of length $8$ is given here:
  \begin{eqnarray}\label{ch5:eq:stencilmatrix}
  \left[\begin{array}{c c c c c c c c}
  c_{00} & c_{01} & c_{02} &      0 &      0 &      0 &      0 &      0 \\
@@ -136,7 +132,7 @@ u^{(q)}_6 \\
  u^{(q)}_7
  \end{array}\right].
  \end{eqnarray}
-It is clear from this example, that the matrix is sparse and that there are many repetitions of the same coefficients. Notice, that the coefficients only differ near the boundaries, where off-centered stencils are used. It is natural to pack this information into a stencil operator, that stores only the unique set of coefficients:
+It is clear from this example that the matrix is sparse and that the same coefficients are repeated for all centered rows. The coefficients differ only near the boundaries, where off-centered stencils are used. It is natural to pack this information into a stencil operator that stores only the unique set of coefficients:
  \begin{eqnarray}\label{ch5:eq:stencilcoeffs}
  \myvec{c} & = &
  \left[\begin{array}{c c c}
@@ -145,62 +141,65 @@ c_{10} & c_{11} & c_{12} \\
  c_{20} & c_{21} & c_{22} \\
  \end{array}\right].
  \end{eqnarray}
-Library matrix components calculate these compact stencil coefficient matrices, and implements member functions that compute the finite difference approximation to vector objects. Unit scaled coefficients (assuming grid spacing is one) are computed and stored to be accessible both via CPU and GPU memory. On the GPU, the constant memory space is used for faster memory access~\cite{ch5:cudaguide}. In order to apply a stencil on a non-unit spaced grid, with grid space $\Delta x$, a scale factor of $1/(\Delta x)^q$ will have to be multiplied to the finite difference sum, i.e. $(c_{00}u_0 + c_{01}u_1 + c_{02}u_2)/(\Delta x)^q \approx u^{(q)}_0$.
+Matrix components precompute these compact stencil coefficients and provides member functions that computes the finite difference approximation of input vectors. Unit scaled coefficients (assuming grid spacing is one) are computed and stored to be accessible via both CPU and GPU memory. On the GPU, the constant memory space is used for faster memory access~\cite{ch5:cudaguide}. In order to apply a stencil on a non unit-spaced grid, with grid space $\Delta x$, the scale factor $1/(\Delta x)^q$ will have to be multiplied by the finite difference sum, i.e., $(c_{00}u_0 + c_{01}u_1 + c_{02}u_2)/(\Delta x)^q \approx u^{(q)}_0$ as in the first row of \eqref{ch5:eq:stencilmatrix}.
  
-Setting up a two dimensional grid of size $N_x \times N_y$ in the unit square and computing the first derivative hereof, is illustrated in Listing~\ref{ch5:lst:stencil}. The grid is a vector component, derived from the vector class. It is by default treated as a device object and memory is automatically allocated on the device to fit the grid size. The finite difference approximation as in \eqref{ch5:eq:fdstencil}, are performed via a CUDA kernel behind the scenes during the calls to mult and diff\_x, utilizing the memory hierarchy as the CUDA guidelines prescribe~\cite{ch5:cudaguide,ch5:cudapractice}. To increase developer productivity, kernel launch configurations have default settings, based on CUDA guidelines principles and experiences from performance testings, such that the user does not have to explicitly specify them. For problem specific finite difference approximations, where the built-in stencil operators are insufficient, a pointer to the coefficient matrix \eqref{ch5:eq:stencilcoeffs} can be passed to customized kernels.
+Setting up a two-dimensional grid of size $N_x \times N_y$ in the unit square and computing the first derivative hereof is illustrated in Listing~\ref{ch5:lst:stencil}. The grid is a vector component, derived from the vector class. It is by default treated as a device object and memory is automatically allocated on the device to fit the grid size. The finite difference approximation as in \eqref{ch5:eq:fdstencil}, is performed via a CUDA kernel behind the scenes during the calls to \texttt{mult} and \texttt{diff\_x}, utilizing the memory hierarchy as the CUDA guidelines prescribe~\cite{ch5:cudaguide,ch5:cudapractice}. To increase developer productivity, kernel launch configurations have default settings, based on CUDA guidelines, principles, and experiences from performance testings, such that the user does not have to explicitly specify them. For problem-specific finite difference approximations, where the built-in stencil operators are insufficient, a pointer to the coefficient matrix \eqref{ch5:eq:stencilcoeffs} can be accessed as demonstrated in Listing \ref{ch5:lst:stencil} and passed to customized kernels.
  
-\lstinputlisting[label=ch5:lst:stencil,caption={Two dimensional finite difference stencil example. Computing the first derivative using five points ($\alpha=\beta=2$) per dimension, a total nine point stencil.}]{Chapters/chapter5/code/ex2.cu}
+\lstinputlisting[label=ch5:lst:stencil,caption={two-dimensional finite difference stencil example: computing the first derivative using five points ($\alpha=\beta=2$) per dimension, a total nine-point stencil}]{Chapters/chapter5/code/ex2.cu}
  
-In the following sections we demonstrate how to go from a PDE posed as an initial value problem (IVP) or a boundary value problem (BVP) to a working application solver, by combining existing library components along with new custom tailored components. We also demonstrate how to apply spatial and temporal domain decomposition strategies, that can make existing solvers take advantage of systems equipped with multiple GPUs. In the following we demonstrate how to rapidly assemble a PDE solver using library components.
+In the following sections we demonstrate how to go from an initial value problem (IVP) or a boundary value problem (BVP) to a working application solver by combining existing library components along with new custom-tailored components. We also demonstrate how to apply spatial and temporal domain decomposition strategies that can make existing solvers take advantage of systems equipped with multiple GPUs. Next section demonstrates how to rapidly assemble a PDE solver using library components.
  
  \section{Model problems}\label{ch5:sec:modelproblems}
  
  %Physical phenomenons are at all times present around us. Many of them are well defined via the laws of physics and can be described via differential equations. Though many differential equations are well defined, their solutions are not, and the need for computers to approximate and simulate solutions are inevitable. The accuracy of these simulations is critical for the quality of scientific applications within all fields of engineering. Improved accuracy often comes with the price of increased computational times, that can even be too overwhelming for a single computer to run. Fortunately, the fundamental change in chip-design led development of single core systems into multi- and many core high-performance architectures, that perfectly fits the engineering needs for increased computational resources. The massively parallel architecture of GPUs is a particular good match for accurate methods with high computational intensity\cite{ch5:Kloeckner2011}.
  
-We present two elementary PDE model problems, to demonstrate how to assemble PDE solvers, using library components that follow the guidelines described above. The first model problem is the unsteady parabolic heat conduction equation, the second model problem is the elliptic Poisson equation. The two model problems consist of elements that play important roles in solving a broad range of more advanced PDE problems.
+We present two elementary PDE model problems, to demonstrate how to assemble PDE solvers, using library components that follow the guidelines described above. The first model problem is the unsteady parabolic heat conduction equation; the second model problem is the elliptic Poisson equation. The two model problems consist of elements that play important roles in solving a broad range of more advanced PDE problems.
  
-We refer to Chapter \ref{ch7} for an example of a scientific application relevant for coastal and maritime engineering analysis, that has been assembled using customized library components similar to those presented in the following.
+We refer the reader to Chapter \ref{ch7} for an example of a scientific application relevant for coastal and maritime engineering analysis that has been assembled using customized library components similar to those presented in the following.
  
  \subsection{Heat conduction equation}\index{heat conduction}
-First, we consider a two dimensional heat conduction problem defined on a unit square. The heat conduction equation is a parabolic partial differential diffusion equation, including both spatial and temporal derivatives. It describes how the diffusion of heat in a medium changes with time. Diffusion equations find great importance in many fields of sciences, e.g. fluid dynamics, where the fluid motion is uniquely described by the Navier-Stokes equations, which include a diffusive viscous term~\cite{ch5:chorin1993,ch5:Ferziger1996}.%, or in financial science where diffusive terms are present in the Black-Scholes equations for estimation of option price trends~\cite{}.
+First, we consider a two-dimensional heat conduction problem defined on a unit square. The heat conduction equation is a parabolic partial differential diffusion equation, including both spatial and temporal derivatives. It describes how the diffusion of heat in a medium changes with time. Diffusion equations are of great importance in many fields of sciences, e.g., fluid dynamics, where the fluid motion is uniquely described by the Navier-Stokes equations, which include a diffusive viscous term~\cite{ch5:chorin1993,ch5:Ferziger1996}.%, or in financial science where diffusive terms are present in the Black-Scholes equations for estimation of option price trends~\cite{}.
  
-The heat problem is an IVP \index{initial value problem} that describes how the heat distribution evolves from a specified initial state. Together with homogeneous Dirichlet boundary conditions\index{boundary conditions}, the heat problem is given as
+The heat problem is an IVP \index{initial value problem}, it describes how the heat distribution evolves from a specified initial state. Together with homogeneous Dirichlet boundary conditions\index{boundary conditions}, the heat problem in the unit square is given as
  \begin{subequations}\begin{align}
-\frac{\partial u}{\partial t} - \kappa\nabla^2u =  0, & \qquad (x,y)\in \Omega([0,1]\times[0,1]),\label{ch5:eq:heateqdt}\\
+\frac{\partial u}{\partial t} - \kappa\nabla^2u =  0, & \qquad (x,y)\in \Omega([0,1]\times[0,1]),\quad t\geq 0, \label{ch5:eq:heateqdt}\\
  u =  0, & \qquad (x,y) \in \partial\Omega,\label{ch5:eq:heateqbc}
  \end{align}\label{ch5:eq:heateq}\end{subequations}
-where $u(x,y,t)$ is the unknown heat distribution, $\kappa$ is a heat conductivity constant (assume $\kappa=1$ from now) and $\nabla^2$ is the two dimensional Laplace\index{Laplace operator} differential operator $(\partial_{xx}+\partial_{yy})$. We use the following initial condition, because it has a known analytical solution over the entire time span, and it satisfies the homogeneous boundary condition given by \eqref{ch5:eq:heateqbc},
+where $u(x,y,t)$ is the unknown heat distribution defined within the domain $\Omega$, $t$ is the time, $\kappa$ is a heat conductivity constant (let $\kappa=1$), and $\nabla^2$ is the two-dimensional Laplace\index{Laplace operator} differential operator $(\partial_{xx}+\partial_{yy})$. We use the following initial condition:
  \begin{align}\label{ch5:eq:heatinit}
-u(x,y,t_0) = \sin(\pi x)\,\sin(\pi y), & \qquad (x,y) \in \Omega.
+u(x,y,t_0) = \sin(\pi x)\,\sin(\pi y), & \qquad (x,y) \in \Omega,
  \end{align}
-An illustrative example of the numerical solution to the heat problem, using \eqref{ch5:eq:heatinit} as the initial condition is given in Figure \ref{ch5:fig:heatsolution}.
+because it has a known analytic solution over the entire time span, and it satisfies the homogeneous boundary condition given by \eqref{ch5:eq:heateqbc}. An illustrative example of the numerical solution to the heat problem, using \eqref{ch5:eq:heatinit} as the initial condition, is given in Figure \ref{ch5:fig:heatsolution}.
  \begin{figure}[!htbp]
-    \begin{center}
-    \setlength\figurewidth{0.3\textwidth}        %
-    \setlength\figureheight{0.3\textwidth}       %
-    \subfigure[$t=0.00s$]%{\input{Chapters/chapter5/figures/HeatSolution0.tikz}}
-{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_conv.pdf}}
-    \subfigure[$t=0.05s$]%{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}}
-{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_049307_conv.pdf}}
-    %\subfigure[$t=0.10s$]{\input{Chapters/chapter5/figures/HeatSolution0.099723.tikz}}
-{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_099723_conv.pdf}}
-    \end{center}
-    \caption{Discrete solution at times $t=0s$ and $t=0.05s$, using \eqref{ch5:eq:heatinit} as initial condition and a small $20\times20$ numerical grid.}\label{ch5:fig:heatsolution}
+       \scriptsize
+    \centering
+    \setlength\figurewidth{0.26\textwidth}        %
+    \setlength\figureheight{0.26\textwidth}       %
+    \subfigure[$t=0.00s$]{\input{Chapters/chapter5/figures/HeatSolution0.tikz}}%
+%{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_conv.pdf}}
+    \subfigure[$t=0.05s$]{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}}%
+%{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_049307_conv.pdf}}
+    \subfigure[$t=0.10s$]{\input{Chapters/chapter5/figures/HeatSolution0.099723.tikz}}%
+%{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_099723_conv.pdf}}
+    \caption{Discrete solution, at times $t=0s$ and $t=0.05s$, using \eqref{ch5:eq:heatinit} as the initial condition and a small $20\times20$ numerical grid.}\label{ch5:fig:heatsolution}
  \end{figure}
  
  We use a Method of Lines (MoL)\index{method of lines} approach to solve \eqref{ch5:eq:heateq}. Thus,  the spatial derivatives are replaced with finite difference approximations, leaving only the temporal derivative as unknown. The spatial derivatives are approximated from $\myvec{u}^n$, where $\myvec{u}^n$ represents the approximate solution to $u(t_n)$ at a given time $t_n$ with time step size $\delta t$ such that $t_n=n\delta t$ for $n=0,1,\ldots$. The finite difference approximation\index{finite difference} can be interpreted as a matrix-vector product as sketched in \eqref{ch5:eq:stencilmatrix}, and so the semi-discrete heat conduction problem becomes
  \begin{align}\label{ch5:eq:discreteheateq}
  \frac{\partial u}{\partial t} = \mymat{A}\myvec{u}, \qquad \mymat{A} \in \mathbb{R}^{N\times N}, \quad \myvec{u} \in \mathbb{R}^{N},
  \end{align}
-where $\mymat{A}$ is the sparse finite difference matrix and $N$ is the number of unknowns in the discrete system. The temporal derivative is now free to be approximated by any suitable choice of a time integration method\index{time integration}. The most simple integration scheme would be the first order explicit forward Euler's method\index{forward Euler}, resulting in an update on the form
+where $\mymat{A}$ is the sparse finite difference matrix and $N$ is the number of unknowns in the discrete system. The temporal derivative is now free to be approximated by any suitable choice of a time-integration method\index{time integration}. The most simple integration scheme would be the first-order accurate explicit forward Euler method\index{forward Euler},
  \begin{align}\label{ch5:eq:forwardeuler}
  \myvec{u}^{n+1} = \myvec{u}^n + \delta t\,\mymat{A}\myvec{u}^n,
  \end{align}
-where $n+1$ refers to the solution at the next time step. The forward Euler method can be exchanged with alternative high-order accurate time integration methods, such as Runge-Kutta methods or linear multistep methods, if numerical instability becomes an issue, see e.g. \cite{ch5:LeVeque2007} for details on numerical stability analysis. For demonstrative purpose, we simply use conservative time step sizes to avoid stability issues. However, the component based library design provides exactly the flexibility for the application developer to select or change PDE solver parts, such as the time integrator, with little coding effort. A generic implementation of the forward Euler method, that satisfy the library concept rules is illustrated in Listing~\ref{ch5:lst:euler}. According to the component guidelines in Figure \ref{ch5:fig:componentdesign}, a time integrator is basically a functor, which means that it implements the parenthesis operator, taking five template arguments: a right hand side operator, the state vector, integration start time, integration end time, and a time step size. The method takes as many time steps necessary to integrate from the start till end, continuously updating the state vector according to \eqref{ch5:eq:forwardeuler}. Notice, that nothing in Listing~\ref{ch5:lst:euler} indicates wether GPUs are used or not. However, it is likely that the underlying implementation of the right hand side functor and the \texttt{axpy} vector function, that can be used for scaling and summation of vectors, do rely on fast GPU kernels. However, it is not something that the developer of the component has to account for. For this reason, the template-based approach, along with simple interface concepts, make it easy to create new components that will fit well into the library.
+where $n+1$ refers to the solution at the next time step. The forward Euler method can be exchanged with alternative high-order accurate time integration methods, such as Runge-Kutta methods or linear multistep methods, if numerical instability becomes an issue, see, e.g., \cite{ch5:LeVeque2007} for details on numerical stability analysis. For demonstrative purpose, we simply use conservative time step sizes to avoid stability issues. However, the component-based library design provides exactly the flexibility for the application developer to select or change PDE solver parts, such as the time integrator, with little coding effort. A generic implementation of the forward Euler method that satisfies the library concept rules is illustrated in Listing~\ref{ch5:lst:euler}. According to the component guidelines in Figure \ref{ch5:fig:componentdesign}, a time integrator is basically a functor, which means that it implements the parenthesis operator, taking five template arguments: a right hand side operator, the state vector, integration start time, integration end time, and a time step size. The method takes as many time steps necessary to integrate from the start to the end, continuously updating the state vector according to \eqref{ch5:eq:forwardeuler}. Notice, that nothing in Listing~\ref{ch5:lst:euler} indicates wether GPUs are used or not. However, it is likely that the underlying implementation of the right hand side functor and the \texttt{axpy} vector function, do rely on fast GPU kernels. However, it is not something that the developer of the component has to account for. For this reason, the template-based approach, along with simple interface concepts, make it easy to create new components that will fit well into a generic library.
  
-\lstinputlisting[label=ch5:lst:euler,caption={Generic implementation of explicit first order forward Euler integration.}]{Chapters/chapter5/code/ex3.cu}
  
-A basic approach to solve the heat conduction problem has now been outlined, and we are ready to assemble the PDE solver.
+The basic numerical approach to solve the heat conduction problem has now been outlined, and we are ready to assemble the PDE solver.
+
+\pagebreak
+\lstinputlisting[label=ch5:lst:euler,caption={generic implementation of explicit first-order forward Euler integration}]{Chapters/chapter5/code/ex3.cu}
+
  
  %More accurate methods, like variations of Runge-Kutta methods or multi-step methods might be desirable, particulary if the accuracy of the spatial approximation is also high, see e.g. LeVeque for examples hereof~\cite{ch5:LeVeque2007}.
  
@@ -209,20 +208,20 @@ A basic approach to solve the heat conduction problem has now been outlined, and
  
  
  \subsubsection{Assembling the heat conduction solver}
-Before we are able to numerically solve the discrete heat conduction problem \eqref{ch5:eq:heateq}, we need implementations to handle the the following items
+Before we are able to numerically solve the discrete heat conduction problem \eqref{ch5:eq:heateq}, we need implementations to handle the the following items:
  \begin{description}
-\item[Grid] - A discrete numerical grid, to represent the two dimensional heat distribution domain and the arithmetical working precision ($32$-bit single precision or $64$-bit double precision).
-\item[RHS] - A right hand side operator for \eqref{ch5:eq:discreteheateq}, that approximates the second order spatial derivatives (matrix-vector product).
-\item[Boundary conditions] - A strategy, that ensures that the Dirichlet conditions are satisfied on the boundary.
-\item[Time integrator] - A time integration scheme, that approximates the time derivative from equation \eqref{ch5:eq:discreteheateq}.
+\item[Grid.] A discrete numerical grid to represent the two-dimensional heat distribution domain and the arithmetical working precision ($32$-bit single precision or $64$-bit double precision).
+\item[RHS.] A right-hand side operator for \eqref{ch5:eq:discreteheateq} that approximates the second-order spatial derivatives (matrix-vector product).
+\item[Boundary conditions.] A strategy that ensures that the Dirichlet conditions are satisfied on the boundary.
+\item[Time integrator.] A time integration scheme, that approximates the time derivative from \eqref{ch5:eq:discreteheateq}.
  \end{description}
-All items are either directly available in the library or can be designed from components herein. The built-in stencil operator may assist in implementing the matrix-vector product, but we need to explicitly ensure that the Dirichlet boundary conditions are satisfied. We demonstrated in Listing \ref{ch5:lst:stencil} how to approximate the derivative using flexible order finite difference stencils. However, from \eqref{ch5:eq:heateqbc} we know, that boundary values are zero. Therefore we extend the stencil operator with a simple kernel call that assigns zero to the entire boundary. Listing \ref{ch5:lst:laplaceimpl} shows the code for the two dimensional Laplace right hand side operator. The constructor takes as argument the stencil half size $\alpha$ and assumes $\alpha=\beta$. Thus the total two dimensional stencil rank will be $4\alpha+1$. For simplicity we also assume that the grid is uniformly distributed, $N_x=N_y$. Performance optimizations for the stencil kernel, such as shared memory utilization, is taken care of in the underlying implementation, accordingly to CUDA guidelines~\cite{ch5:cudaguide,ch5:cudapractice}. The two macros, BLOCK1D and GRID1D, are used to help set up kernel configurations based on grid sizes.
+All items are either directly available in the library or can be designed from components herein. The built-in stencil operator may assist in implementing the matrix-vector product, but we need to explicitly ensure that the Dirichlet boundary conditions are satisfied. We demonstrated in Listing \ref{ch5:lst:stencil} how to approximate the derivative using flexible order finite difference stencils. However, from \eqref{ch5:eq:heateqbc} we know that boundary values are zero. Therefore, we extend the stencil operator with a simple kernel call that assigns zero to the entire boundary. Listing \ref{ch5:lst:laplaceimpl} shows the code for the two-dimensional Laplace right-hand side operator. The constructor takes as an argument the stencil half size $\alpha$ and assumes $\alpha=\beta$. Thus, the total two-dimensional stencil rank will be $4\alpha+1$. For simplicity we also assume that the grid is uniformly distributed, $N_x=N_y$. Performance optimizations for the stencil kernel, such as shared memory utilization, are handled in the underlying implementation, accordingly to CUDA guidelines~\cite{ch5:cudaguide,ch5:cudapractice}. The macros, \texttt{BLOCK1D} and \texttt{GRID1D}, are used to help set up kernel configurations based on grid sizes and \texttt{RAW\_PTR} is used to cast the vector object to a valid device memory pointer.
  
-\lstinputlisting[label=ch5:lst:laplaceimpl,caption={The right hand side Laplace operator. The built-in stencil approximates the two dimensional spatial derivatives, while the custom set\_dirichlet\_bc kernel takes care of satisfying boundary conditions.}]{Chapters/chapter5/code/laplacian.cu}
+\lstinputlisting[label=ch5:lst:laplaceimpl,caption={the right-hand side Laplace operator: the built-in stencil approximates the two dimensional spatial derivatives, while the custom set\_dirichlet\_bc kernel takes care of satisfying boundary conditions}]{Chapters/chapter5/code/laplacian.cu}
  
-With the right hand side operator in place, we are ready to implement the solver. For this simple PDE problem we compute all necessary initial data in the body of the main function and use the forward Euler time integrator to compute the solution till $t=t_{end}$. For more advanced PDE solvers, a built-in \texttt{ode\_solver} class is defined, that helps taking care of initialization and storage of multiple state variables. Declaring type definitions for all components at the beginning of the main file gives a good overview of the solver composition. In this way, it will be easy to control or change solver components at later times. Listing~\ref{ch5:lst:heattypedefs} lists the type definitions\index{type definitions} that are used to assemble the heat conduction solver. %\stefan{elaborate on the choices}.
+With the right-hand side operator in place, we are ready to implement the solver. For this simple PDE problem we compute all necessary initial data in the body of the main function and use the forward Euler time integrator to compute the solution until $t=t_{end}$. For more advanced solvers, a built-in \texttt{ode\_solver} class is defined that helps take care of initialization and storage of multiple state variables. Declaring type definitions for all components at the beginning of the main file gives a good overview of the solver composition. In this way, it will be easy to control or change solver components at later times. Listing~\ref{ch5:lst:heattypedefs} lists the type definitions\index{type definitions} that are used to assemble the heat conduction solver. %\stefan{elaborate on the choices}.
  
-\lstset{label=ch5:lst:heattypedefs,caption={Type definitions for all the heat conduction solver components used throughout the remaining code.}}
+\lstset{label=ch5:lst:heattypedefs,caption={type definitions for all the heat conduction solver components used throughout the remaining code}}
  \begin{lstlisting}
  typedef double                              value_type;
  typedef laplacian<value_type>               rhs_type;
@@ -230,9 +229,9 @@ typedef gpulab::grid<value_type>            vector_type;
  typedef vector_type::property_type          property_type;
  typedef gpulab::integration::forward_euler  time_integrator_type;
  \end{lstlisting}
-The grid is by default treated as a device object and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a non-periodic domain within the unit square and with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly.
+The grid is by default treated as a device object, and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a non periodic domain of size $N\times N$ within the unit square with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly.
  
-\lstset{label=ch5:lst:gridsetup,caption={Create 2D grid of discrete dimension \texttt{N} times \texttt{N} and physical dimension $0$ to $1$.}}
+\lstset{label=ch5:lst:gridsetup,caption={creating a two-dimensional grid of size \texttt{N} times \texttt{N} and physical dimension $0$ to $1$}}
  \begin{lstlisting}
  // Setup discrete and physical dimensions
  gpulab::grid_dim<int>          dim(N,N,1);
@@ -243,15 +242,15 @@ property_type                  props(dim,p0,p1);
  // Initialize vector
  vector_type                    u(props);
  \end{lstlisting}
-Hereafter the vector \texttt{u} can be initialized accordingly to \eqref{ch5:eq:heatinit}. Finally we need to instantiate the right hand side Laplacian operator from Listing \ref{ch5:lst:laplaceimpl} and the forward Euler time integrator in order to integrate from $t_0$ till $t_{end}$.
+Hereafter the vector \texttt{u} can be initialized accordingly to \eqref{ch5:eq:heatinit}. Finally we need to instantiate the right-hand side Laplacian operator from Listing \ref{ch5:lst:laplaceimpl} and the forward Euler time integrator in order to integrate from $t_0$ until $t_{end}$.
  
-\lstset{label=ch5:lst:timeintegrator,caption=Create time integrator and the right hand side Laplacian operator.}
+\lstset{label=ch5:lst:timeintegrator,caption=creating a time integrator and the right-hand side Laplacian operator.}
  \begin{lstlisting}
-rhs_type rhs(alpha);           // Create right hand side operator
+rhs_type rhs(alpha);           // Create right-hand side operator
  time_integrator_type solver;   // Create time integrator
-solver(&rhs,u,0.0f,tend,dt);   // Integrate from to tend using dt
+solver(&rhs,u,0.0f,tend,dt);   // Integrate from 0 to tend using dt
  \end{lstlisting}
-The last line invokes the forward Euler time integration scheme, defined in Listing \ref{ch5:lst:heattypedefs}. If the developer decides to change the integrator into another explicit scheme, only the time integrator type definition in Listing \ref{ch5:lst:heattypedefs} needs to be changed. The heat conduction solver is now complete.%The simple model problem illustrated, that a well designed library upholds developer productivity, by minimizing the
+The last line invokes the forward Euler time integration scheme defined in Listing \ref{ch5:lst:heattypedefs}. If the developer decides to change the integrator into another explicit scheme, only the time integrator type definition in Listing \ref{ch5:lst:heattypedefs} needs to be changed. The heat conduction solver is now complete.%The simple model problem illustrated, that a well designed library upholds developer productivity, by minimizing the
  
  % how few CUDA implementations that were needed, and thereby upholding developer productivity.
  
@@ -259,52 +258,54 @@ The last line invokes the forward Euler time integration scheme, defined in List
  
  \subsubsection{Numerical solutions to the heat conduction problem}
  
-Solution times for the heat conduction problem is in itself not very interesting, as it is only a simple model problem. What is interesting for GPU kernels, like the finite differences kernel, is that increased computational work often comes with a very little price. This is because the relative fast computations can be hidden by the slower memory fetches. Therefore we are able to improve the accuracy of the numerical solution via more accurate finite differences (larger stencil sizes), while improving the computational performance in terms of floating point operations per second (flops). Figure \ref{ch5:fig:stencilperformance} confirms, that larger stencils improve the kernel performance. Notice that even though these performance results are favorable compared to single core systems ($\sim 10$ GFlops double precision on a $2.5$-GHz processor), it is still far from the peak performance of the GeForce GTX590 ($\sim 2.4$ TFlops single precision). The reason is that the kernel is bandwidth bound, i.e. performance is limited by the time it takes to move memory between the global GPU memory and the chip. However, this is a general limitation for matrix-vector-like operations that approximate solutions to discretized PDE problems. Only matrix-matrix operations, that have a high ratio of computations versus memory transactions, are able to reach near-optimal performance results~\cite{ch5:Kirk2010}. These kind of operators are, however, rarely used to solve PDE problems.
+Solution time for the heat conduction problem is in itself not very interesting, as it is only a simple model problem. What is interesting for GPU kernels, such as the finite differences kernel, is that increased computational work often comes with a very small price, because the fast computations can be hidden by the relatively slower memory fetches. Therefore, we are able to improve the accuracy of the numerical solution via more accurate finite differences (larger stencil sizes), while improving the computational performance in terms of floating point operations per second (flops). Figure \ref{ch5:fig:stencilperformance} confirms, that larger stencils improve the kernel performance. Notice that even though these performance results are favorable compared to single core systems ($\sim 10$ GFlops double precision on a $2.5$-GHz processor), they are still far from their peak performance, e.g., $\sim 2.4$ TFlops single precision for the GeForce GTX590. The reason is that the kernel is bandwidth bound, i.e., performance is limited by the time it takes to move memory between the global GPU memory and the chip. The Tesla K20 performs better than the GeForce GTX590 because it obtains the highest bandwidth. Being bandwidth bound is a general limitation for matrix-vector-like operations that arise from the discretization of PDE problems. Only matrix-matrix multiplications, which have a high ratio of computations versus memory transactions, are able to reach near-optimal performance results~\cite{ch5:Kirk2010}. These kinds of operators are, however, rarely used to solve PDE problems.
  
  \begin{figure}[!htb]
-\setlength\figureheight{0.3\textwidth}
+\setlength\figureheight{0.35\textwidth}
  \setlength\figurewidth{0.4\textwidth}
-\begin{center}
-{\small
-%\input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz}
-{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}}
+\centering
+{\scriptsize
+\subfigure[GeForce GTX590, test environment 1.]{
+       \input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz}%
+}%
+\subfigure[Tesla K20c, test environment 2.]{
+       \input{Chapters/chapter5/figures/AlphaPerformanceTeslaK20c_N16777216.tikz}
+}%
+%{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}}
  }
-\end{center}
-\caption[Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$.]{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance}
+\caption[Single- and double-precision floating point operations per second for a two-dimensional stencil operator on a numerical grid of size $4096^2$.]{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils.}\label{ch5:fig:stencilperformance}
  \end{figure}
  
  
  
  
  \subsection{Poisson equation}\index{Poisson equation}
-The Poisson equation is a second-order elliptic differential equation, often encountered in applications within scientific fields like electrostatics and mechanics. We consider the two dimensional BVP \index{boundary volume problem} defined in terms of Poisson's equation with homogeneous Dirichlet boundary conditions on the form
+The Poisson equation is a second-order elliptic differential equation, often encountered in applications within scientific fields such as electrostatics and mechanics. We consider the two- dimensional BVP \index{boundary volume problem} defined in terms of Poisson's equation with homogeneous Dirichlet boundary conditions on the form
  \begin{subequations}\begin{align}
  \nabla^2 u = f(x,y),& \qquad (x,y) \in \Omega([0,1]\times[0,1]), \\
  u = 0,& \qquad (x,y) \in \partial\Omega.
  \end{align}\label{ch5:eq:poissoneq}\end{subequations}
-Notice the similarities to the heat conduction equation \eqref{ch5:eq:heateq}. In fact, this could be a steady state solution to the heat equation, when there is no temporal change $\frac{\partial u}{\partial t}=0$, but a source term $f(x,y)$. Since the Laplace operator and the boundary conditions are the same for both problems, we are able to reuse the same implementation with few modifications.
+Notice the similarities to the heat conduction equation \eqref{ch5:eq:heateq}. In fact, \eqref{ch5:eq:poissoneq} could be a steady-state solution to the heat equation, when there is no temporal change $\frac{\partial u}{\partial t}=0$, but a source term $f(x,y)$. Since the Laplace operator and the boundary conditions are the same for both problems, we are able to reuse the same implementation with few modifications.
  
-Opposite to the heat equation, there are no initial conditions. Instead, we seek some $u(x,y)$ that satisfies \eqref{ch5:eq:poissoneq}, given a source term $f(x,y)$, on the right hand side. For simplicity, assume that we know the exact solution, $u_{\textrm{true}}$, given as \eqref{ch5:eq:heatinit}. Then we use the method of manufactured solutions to derive an expression for the corresponding right hand side $f(x,y)$:
+Opposite to the heat equation, there are no initial conditions. Instead, we seek some $u(x,y)$ that satisfies \eqref{ch5:eq:poissoneq}, given a source term $f(x,y)$, on the right-hand side. For simplicity, assume that we know the exact solution, $u_{\textrm{true}}$, corresponding to \eqref{ch5:eq:heatinit}. Then we use the method of manufactured solutions to derive an expression for the corresponding right-hand side $f(x,y)$:
  \begin{align}
  f(x,y) = \nabla^2 u_{\textrm{true}} = -2\pi^2\,\sin(\pi x)\,\sin(\pi y). \label{ch5:eq:poissonrhs}
  \end{align}
-The spatial derivative in \eqref{ch5:eq:poissoneq} is again approximated with finite differences\index{finite difference}, similar to the example in \eqref{ch5:eq:stencilmatrix}, except boundary values are explicitly set to zero. The discrete form of the system can now be written as a sparse linear system of equations
+The spatial derivative in \eqref{ch5:eq:poissoneq} is again approximated with finite differences\index{finite difference}, similar to the example in \eqref{ch5:eq:stencilmatrix}, except boundary values are explicitly set to zero. The discrete form of the system can now be written as a sparse linear system of equations:
  \begin{align}
-\mymat{A}\myvec{u}=\myvec{f}, \quad \mymat{A} \in \mathbb{R}^{N\times N}, \label{ch5:eq:poissonsystem}
+\mymat{A}\myvec{u}=\myvec{f}, \qquad \myvec{u},\myvec{f} \in \mathbb{R}^{N}, \quad \mymat{A} \in \mathbb{R}^{N\times N}, \label{ch5:eq:poissonsystem}
  \end{align}
-where $\mymat{A}$ is the sparse matrix formed by finite difference coefficients, $N$ is the number of unknowns and $\myvec{f}$ is given by \eqref{ch5:eq:poissonrhs}. Equation \eqref{ch5:eq:poissonsystem} can be solved in numerous ways, but a few observations may help us do it efficiently. Direct solvers based on Gaussian elimination are accurate and use a finite number of operations for a constant problem size. However, the arithmetic complexity grows with the problem size with as much as $\mathcal{O}(N^3)$ and do not exploit the sparsity of $\mymat{A}$. Direct solvers are therefore mostly feasible for dense systems of limited sizes. Sparse direct solvers exist, but they are often difficult to parallelize, or only applicable for certain types of matrices. Regardless of the discretization technique, the discretization of an elliptic PDE into a linear system like \eqref{ch5:eq:poissonsystem}, yields a very sparse matrix $\mymat{A}$ when $N$ is very large. Iterative methods\index{iterative methods} for solving large sparse linear systems find broad use in scientific applications, because they only require an implementation of the matrix-vector product, and often use a limited amount of additional memory. Comprehensive introductions to iterative methods may be found in any of~\cite{ch5:Saad2003,ch5:Kelley1995,ch5:Barrett1994}.
+where $\mymat{A}$ is the sparse matrix formed by finite difference coefficients, $N$ is the number of unknowns, and $\myvec{f}$ is given by \eqref{ch5:eq:poissonrhs}. Equation \eqref{ch5:eq:poissonsystem} can be solved in numerous ways, but a few observations may help do it more efficiently. Direct solvers based on Gaussian elimination are accurate and use a finite number of operations for a constant problem size. However, the arithmetic complexity grows with the problem size by as much as $\mathcal{O}(N^3)$ and does not exploit the sparsity of $\mymat{A}$. Direct solvers are therefore mostly feasible for dense systems of limited sizes. Sparse direct solvers exist, but they are often difficult to parallelize, or applicable for only certain types of matrices. Regardless of the discretization technique, the discretization of an elliptic PDE into a linear system as in \eqref{ch5:eq:poissonsystem} yields a very sparse matrix $\mymat{A}$ when $N$ is large. Iterative methods\index{iterative methods} for solving large sparse linear systems find broad use in scientific applications, because they require only an implementation of the matrix-vector product, and they often use a limited amount of additional memory. Comprehensive introductions to iterative methods may be found in any of~\cite{ch5:Saad2003,ch5:Kelley1995,ch5:Barrett1994}.
  
-One benefit of the high abstraction level and the template-based library design, is to allow developers to implement their own components, such as iterative methods for solving sparse linear systems. The library includes three popular iterative methods: conjugate gradient, defect correction\index{defect correction}, and geometric multigrid. The conjugate gradient method is applicable only to systems with symmetric positive definite matrices. This is true for the two dimensional Poisson problem, when it is discretized with a five point finite difference stencil, because then there will be no off-centered approximations near the boundary. For high-order approximations ($\alpha>1$), we use the defect correction method with multigrid preconditioning. See \cite{ch5:Trottenberg2001} for details on high-order multigrid methods.
+One benefit of the high abstraction level and the template-based library design is to allow developers to implement their own components, such as iterative methods for solving sparse linear systems. The library includes three popular iterative methods: conjugate gradient, defect correction\index{defect correction}, and geometric multigrid. The conjugate gradient method is applicable only to systems with symmetric positive definite matrices. This is true for the two-dimensional Poisson problem, when it is discretized with a five-point finite difference stencil, because then there will be no off-centered approximations near the boundary. For high-order approximations ($\alpha>1$), we use the defect correction method with multigrid preconditioning. See e.g., \cite{ch5:Trottenberg2001} for details on multigrid methods.
  
-We will not present the implementation details for all three methods, but briefly demonstrate the simplicity of implementing the body of such an iterative solver, given a textbook recipe or mathematical formulation. The defect correction method iteratively improves the solution to $\mymat{A}\myvec{x}=\myvec{b}$, given an initial start guess $\myvec{x}^0$, by continuously solving a preconditioned error equation. One defect correction \index{defect correction!iteration} iteration can be written via mathematical notation as
+We will not present the implementation details for all three methods but briefly demonstrate the simplicity of implementing the body of such an iterative solver, given a textbook recipe or mathematical formulation. The defect correction method iteratively improves the solution to $\mymat{A}\myvec{x}=\myvec{b}$, given an initial start guess $\myvec{x}^0$, by continuously solving a preconditioned error equation. The defect correction\index{defect correction!iteration} iteration can be written as
  \begin{align}
-\myvec{x}^{k+1} = \myvec{x}^{k} + \mymat{M}^{-1}(\myvec{b}-\mymat{A}\myvec{x}^{k}), \quad \mathcal{M}\in\mathbb{R}^{N\times N}, \quad {\bf x},{\bf b}\in\mathbb{R}^N \label{ch5:eq:dc}
+\myvec{x}^{k+1} = \myvec{x}^{k} + \mymat{M}^{-1}(\myvec{b}-\mymat{A}\myvec{x}^{k}), \quad \mymat{A},\mathcal{M}\in\mathbb{R}^{N\times N}, \quad {\bf x},{\bf b}\in\mathbb{R}^N, \label{ch5:eq:dc}
  \end{align}
-where $k$ is the iteration number and $\mymat{M}$ is the preconditioner\index{preconditioning} which is based on approximation to the original coefficient matrix. To achieve efficient algebraic convergence\index{convergence}, a solve step with the preconditioner should be computational inexpensive compared to using $\mymat{A}$. How to implement \eqref{ch5:eq:dc} within the library context is illustrated in Listing \ref{ch5:lst:dc}. The host CPU is responsible for traversing through each line in Listing \ref{ch5:lst:dc} and testing for convergence, while the computationally expensive matrix-vector operation and preconditioning, can be executed on the GPU, if GPU-based components are used. The defect correction method has two attractive properties. First, global reduction is only required to monitor convergence once per iteration during convergence evaluation, which reduces communication requirements and provides a basis for efficient and scalable parallelization. Second, it has minimal memory footprint, which is good for the solution of very large systems.
+where $k$ is the iteration number and $\mymat{M}$ is the preconditioner\index{preconditioning} which should be an approximation to the original coefficient matrix $\mymat{A}$. To achieve fast numerical convergence\index{convergence}, applying the preconditioner should be a computationally inexpensive operation compared to solving the original system. How to implement \eqref{ch5:eq:dc} within the library context is illustrated in Listing \ref{ch5:lst:dc}. The host CPU traverses each line in Listing \ref{ch5:lst:dc} and tests for convergence, while the computationally expensive matrix-vector operation and preconditioning, can be executed on the GPU, if GPU-based components are used. The defect correction method has two attractive properties. First, global reduction is required to monitor convergence only once per iteration during convergence evaluation, which reduces communication requirements and provides a basis for efficient and scalable parallelization. Second, it has a minimal constant memory footprint, making it a suitable method for solving very large systems.
  
-In the following section we demonstrate how to assemble a solver for the discrete Poisson problem, using one of the tree iterative methods to efficiently solve \eqref{ch5:eq:poissonsystem}.
-
-\lstset{label=ch5:lst:dc,caption={Main loop for the iterative defect correction solver. The solver is instantiated with template argument types for the matrix and vector classes, allowing underlying implementations to be based on GPU kernels.}}
+\lstset{label=ch5:lst:dc,caption={main loop for the iterative defect correction solver: the solver is instantiated with template argument types for the matrix and vector classes, allowing underlying implementations to be based on GPU kernels}}
  \begin{lstlisting}
  while(r.nrm2() > tol)
  {
@@ -323,14 +324,15 @@ while(r.nrm2() > tol)
  }
  \end{lstlisting}
  
+In the following section we demonstrate how to assemble a solver for the discrete Poisson problem, using one of the three iterative methods to efficiently solve \eqref{ch5:eq:poissonsystem}.
  
-\subsubsection{Assembling the Poisson Solver}
+\subsubsection{Assembling the Poisson solver}
  
-Assembling the Poisson solver follows almost the same procedure as the heat conduction solver, except the time integration part is exchanged with an iterative method to to solve the system of linear equations \eqref{ch5:eq:poissonsystem}. For the discrete matrix-vector product we reuse the Laplace operator from the heat conduction problem in Listing \ref{ch5:lst:laplaceimpl} with a few modifications. The Laplace operator now appears as a matrix component, so to be compatible with the component interface rules in Figure \ref{ch5:fig:componentdesign} the parenthesis operator is renamed to \texttt{mult}, taking two vector components as argument.
+Assembling the Poisson solver follows almost the same procedure as the heat conduction solver, except the time integration part is exchanged with an iterative method to solve the system of linear equations \eqref{ch5:eq:poissonsystem}. For the discrete matrix-vector product we reuse the Laplace operator from the heat conduction problem in Listing \ref{ch5:lst:laplaceimpl} with few modifications. The Laplace operator is now a matrix component, so to be compatible with the component interface rules in Figure \ref{ch5:fig:componentdesign}, a \texttt{mult} function taking two vector arguments is implemented instead of the parentheses operator. We leave out this code example as it almost identical to the one in Listing \ref{ch5:lst:laplaceimpl}.
  
-At the beginning of the solver implementation we list the type definitions for the Poisson solver that will be used throughout the implementation. Here we use a geometric multigrid\index{multigrid} method as a preconditioner for the defect correction method. Therefore the multigrid solver is assembled first, so that it can be used in the assembling of the defect correction method. Listing \ref{ch5:lst:poissontypedefs} defines the types for the vector, the matrix, the multigrid preconditioner and the defect correction solver. The geometric multigrid method needs two additional template arguments, that are specific for multigrid, namely a smoother and a grid restriction/interpolation operator. These arguments are free to be implemented and supplied by the developer if special care is required for their specific problems, e.g. for a custom grid structure. For the Poisson problem on a regular grid, the library contains built-in restriction and interpolation operators, and a red-black Gauss-Seidel smoother. We refer to \cite{ch5:Trottenberg2001} for extensive details on multigrid methods. The monitor and config types that appear in Listing \ref{ch5:lst:poissontypedefs} are used for convergence monitoring within the iterative solver and to control run time parameters, such as tolerances, iteration limits, etc.
+At the beginning of the solver implementation we list the type definitions for the Poisson solver that will be used throughout the implementation. Here we use a geometric multigrid\index{multigrid} method as a preconditioner for the defect correction method. Therefore the multigrid solver is assembled first, so that it can be used in the assembling of the defect correction solver. Listing \ref{ch5:lst:poissontypedefs} defines the types for the vector, the matrix, the multigrid preconditioner, and the defect correction solver. The geometric multigrid method needs two additional template arguments that are specific for multigrid, namely, a smoother and a grid restriction/interpolation operator. These arguments are free to be implemented and supplied by the developer if special care is required, e.g., for a custom grid structure. For the Poisson problem on a regular grid, the library contains built-in restriction and interpolation operators, and a red-black Gauss-Seidel smoother. We refer the reader to \cite{ch5:Trottenberg2001} for extensive details on multigrid methods. The monitor and config types that appear in Listing \ref{ch5:lst:poissontypedefs} are used for convergence monitoring within the iterative solver and to control runtime parameters, such as tolerances and iteration limits.
  
-\lstset{label=ch5:lst:poissontypedefs,caption={Type definitions for the Laplacian matrix component and the multigrid preconditioned iterative defect correction\index{defect correction} solver.}}
+\lstset{label=ch5:lst:poissontypedefs,caption={type definitions for the Laplacian matrix component and the multigrid preconditioned iterative defect correction\index{defect correction} solver}}
  \begin{lstlisting}
  typedef double                                   value_type;
  typedef gpulab::grid<value_type>                 vector_type;
@@ -341,7 +343,7 @@ typedef gpulab::solvers::multigrid_types<
           vector_type
         , matrix_type
         , gpulab::solvers::gauss_seidel_rb_2d
-       , gpulab::solvers::grid_handler_2d_boundary>   mg_types;
+       , gpulab::solvers::grid_handler_2d>           mg_types;
  typedef gpulab::solvers::multigrid<mg_types>     mg_solver_type;
  typedef mg_solver_type::monitor_type             monitor_type;
  typedef monitor_type::config_type                config_type;
@@ -351,13 +353,14 @@ typedef gpulab::solvers::defect_correction_types<
           vector_type
         , matrix_type
         , monitor_type
-       , mg_solver_type>                              dc_types;
+       , mg_solver_type>                                 dc_types;
  typedef gpulab::solvers::defect_correction<dc_types> dc_solver_type;
  \end{lstlisting}
  
-With the type definitions set up, the implementation for the Poisson solver follows in Listing \ref{ch5:lst:poissonsolver}. Some of the initializations are left out, as they follow the same procedure as for the Heat conduction example. The defect correction and geometric multigrid solvers are initiated and then the multigrid is set as a preconditioner to the defect correction method. Finally the system is solved via a call to \texttt{solve()}.
+With the type definitions set up, the implementation for the Poisson solver follows in Listing \ref{ch5:lst:poissonsolver}. Some of the initializations are left out, as they follow the same procedure as for the heat conduction example. The defect correction and geometric multigrid solvers are initialized and then multigrid is set as a preconditioner to the defect correction method. Finally the system is solved via a call to \texttt{solve()}.
  
-\lstset{label=ch5:lst:poissonsolver,caption={Initializing the preconditioned defect correction solver to approximate the solution to $\mymat{A}\myvec{u}=\myvec{f}$.}}
+\pagebreak
+\lstset{label=ch5:lst:poissonsolver,caption={initializing the preconditioned defect correction solver to approximate the solution to $\mymat{A}\myvec{u}=\myvec{f}$}}
  \begin{lstlisting}
  matrix_type A(alpha);               // High-order matrix
  matrix_type M(1);                   // Low-order matrix
@@ -380,26 +383,22 @@ if(monitor.converged())
  \subsubsection{Numerical solutions to the Poisson problem}\label{ch5:sec:poissonresults}
  
  
-The discrete Poisson problem \eqref{ch5:eq:poissonsystem}, has been solved using the three iterative methods presented above. Convergence histories for the conjugate gradient method and geometric multigrid method, using two different resolutions are illustrated in Figure \ref{ch5:fig:poissonconvergence:a}. Multigrid methods have the attractive properties, that they are very robust and algorithmic efficient, independently of the problem size. Figure \ref{ch5:fig:poissonconvergence:a} confirms that the rate of convergence for the multigrid method is unchanged for both problem sizes. Only the attainable accuracy is slightly worsened, as a consequence of a more ill-conditioned system for large problem sizes.
+The discrete Poisson problem \eqref{ch5:eq:poissonsystem} has been solved using the three iterative methods presented above. Convergence histories for the conjugate gradient method and geometric multigrid method, using two different resolutions, are illustrated in Figure \ref{ch5:fig:poissonconvergence:a}. Multigrid methods are very robust and algorithmic efficient, independent of the problem size. Figure \ref{ch5:fig:poissonconvergence:a} confirms that the rate of convergence for the multigrid method is unchanged for both problem sizes. Only the attainable accuracy is slightly worsened, as a consequence of a more ill-conditioned system for large problem sizes.
  
-Defect correction in combination with multigrid preconditioning, enables efficient solution of high-order approximations of the Poisson problem, illustrated in Figure \ref{ch5:fig:poissonconvergence:b}. The multigrid preconditioning matrix $\mymat{M}$ is based on a low order approximation to \eqref{ch5:eq:poissonsystem}, whereas matrix $\mymat{A}$ is a high-order approximation. When $\mymat{M}$ is a close approximation to $\mymat{A}$, defect correction converges most rapidly. This is the effect that can be seen between the three convergence lines in Figure \ref{ch5:fig:poissonconvergence:b}.
+Defect correction in combination with multigrid preconditioning enables efficient solution of high-order approximations of the Poisson problem, illustrated in Figure \ref{ch5:fig:poissonconvergence:b}. The multigrid preconditioning matrix $\mymat{M}$ is based on a low-order approximation to \eqref{ch5:eq:poissonsystem}, whereas matrix $\mymat{A}$ is a high-order approximation. When $\mymat{M}$ is a close approximation to $\mymat{A}$, defect correction converges most rapidly. This is the effect that can be seen between the three convergence lines in Figure \ref{ch5:fig:poissonconvergence:b}.
  
  
  \begin{figure}[!htb]
-\setlength\figureheight{0.33\textwidth}
-\setlength\figurewidth{0.33\textwidth}
-\begin{center}
-\subfigure[Convergence history for the conjugate gradient and multigrid methods, for two different problem sizes.]{\label{ch5:fig:poissonconvergence:a}
+\centering
+\subfigure[Convergence histories for the conjugate gradient (CG) and multigrid (MG) methods, for two different problem sizes.]{\label{ch5:fig:poissonconvergence:a}
      %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}}
-    {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceMGvsCG_conv.pdf}}
-}
-
- \hspace{0.2cm}%
+    {\includegraphics[width=0.45\textwidth]{Chapters/chapter5/figures/ConvergenceMGvsCG_conv.pdf}}
+}%
+\hspace{0.5cm}%
  \subfigure[Defect correction convergence history for three different stencil sizes.]{\label{ch5:fig:poissonconvergence:b}
      %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}}
- {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceDC_conv.pdf}}
+ {\includegraphics[width=0.45\textwidth]{Chapters/chapter5/figures/ConvergenceDC_conv.pdf}}
  }
-\end{center}
  \caption{Algorithmic performance for the conjugate gradient, multigrid, and defect correction methods, measured in terms of the relative residual per iteration.}\label{ch5:fig:poissonconvergence}
  \end{figure}
  
@@ -409,85 +408,84 @@ Defect correction in combination with multigrid preconditioning, enables efficie
  
  \section{Optimization strategies for multi-GPU systems}\label{ch5:sec:multigpu}\index{multi-GPU}
  
-CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip performance. However, the role as a separate co-processor to the CPU can be a limiting factor for large scale scientific applications, because the per GPU memory capacity is fixed and is only in the range of a few Gigabytes. In comparison, it is not unusual for a high-end workstation to be equipped with $\sim32$GB of main memory, plus a terabyte capacity hard disk for secondary storage. Therefore, large scale scientific applications that process Gigabytes of data, require distributed computations on multiple GPU devices. Multi-GPU desktop computers and clusters can have a very attractive peak performance, but the addition of multiple devices introduces the potential performance bottleneck of slow data transfers across PCIe busses and network interconnections, as illustrated in Figure \ref{ch5:fig:gpu2gputransfer}. The ratio between data transfers and computational work has a significant impact on the possibility for latency hiding and thereby overall application performance.
+CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip performance. However, the role as a separate coprocessor to the CPU can be a limiting factor for large scale scientific applications, because the GPU memory capacity is fixed and is only in the range of a few gigabytes. In comparison, it is not unusual for a high-end workstation to be equipped with $\sim32$GB of main memory, plus a terabyte hard disk capacity for secondary storage. Therefore, large scale scientific applications that process gigabytes of data, require distributed computations on multiple GPU devices. Multi-GPU desktop computers and clusters can have a very attractive peak performance, but the addition of multiple devices introduces the potential performance bottleneck of slow data transfers across PCIe busses and network interconnections, as illustrated in Figure \ref{ch5:fig:gpu2gputransfer}. The ratio between data transfers and computational work has a significant impact on the possibility for latency hiding and thereby overall application performance.
  
  \begin{figure}[!htb]
  \begin{center}
  %\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/GPU2GPU.png}
  \input{Chapters/chapter5/figures/GPU2GPU.tikz}
  \end{center}
-\caption[Message passing between two GPUs involves several memory transfers across lower bandwidth connections.]{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required, if the data is not already sequentially stored in device memory.}\label{ch5:fig:gpu2gputransfer}
+\caption[Message passing between two GPUs involves several memory transfers across lower bandwidth connections.]{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. The kernel call is required if the data is not already sequentially stored in device memory. Recent generations of NVIDIA GPUs, CUDA, and MPI support direct transfers without explicitely transfering data to the host first.}\label{ch5:fig:gpu2gputransfer}
  \end{figure}
  
  
-Developing applications that exploit the full computational capabilities of modern clusters -- GPU-based or not -- is no trivial matter. Developers are faced with the complexity of distributing and coordinating computations on nodes consisting of many-core CPUs, GPUs and potentially other types of accelerators as well. These complexities give rise to challenges in finding numerical algorithms, that are well suited for such systems, forcing developers to search for novel methods for introducing concurrency.
+Developing applications that exploit the full computational capabilities of modern clusters--GPU-based or not--is no trivial matter. Developers are faced with the complexity of distributing and coordinating computations on nodes consisting of many-core CPUs, GPUs and potentially other types of accelerators as well. These complexities give rise to challenges in finding numerical algorithms, that are well suited for such systems, forcing developers to search for novel methods that utilize concurrency.
  
-To ease software development, we use MPI-2 for message passing and ensure a safe and private communication space by creation of a communicator private to the library during initialization, as recommended by Hoefler and Snir~\cite{ch5:Hoefler2011}. With the addition of GPUDirect to CUDA, it is possible to make direct transfers between GPUs, eliminating CPU overhead. Unfortunately there are rather strict system requirements to enable these direct transfers, and there is no stable MPI implementation that supports message passing via pointers to device memory. Therefore device memory is first transferred to the CPU main memory before invoking any MPI calls. We do provide device to device transfers via template-based library routines, that work directly with GPU vector objects. It hides the complexity of message passing from the developer and help developers design new components for multi-GPU execution.
+To ease software development, we use MPI-2 for message passing and ensure a safe and private communication space by creation of a communicator private to the library during initialization, as recommended by Hoefler and Snir~\cite{ch5:Hoefler2011}. With the addition of remote direct memory access (RDMA) for GPUDirect it is possible to make direct memory transfers between recent generation of GPUs (Kepler), eliminating CPU overhead. Unfortunately there are some strict system and driver requirements to enable these features. Therefore, in the following examples, device memory is first transferred to the CPU main memory before invoking any MPI calls. The library provides device-to-device transfers via template-based routines that work directly with GPU vector objects. This hides the complexity of message passing from the developer and helps developers design new components for multi-GPU execution. 
  
  %(GPUDirect v2 and Unified Virtual Addressing can help). Effort is being made to create GPU aware MPI implementations, that will allow direct memory transfers(cite: openMPI, MVAPICH2). However, only 64-bit systems with Unified Virtual Addressing (UVA) from CUDA v4.0 and Tesla 20-serie (Kepler) GPUs are capable of direct transfers. Therefore only a limited number of today's GPU clusters will be able to utilize these features.
  %For now, the simple solution is to first transfer data to the CPU main memory before invoking the appropriate MPI calls.
  
-In the following sections we present two very different methods for distributed computing based on spatial and temporal decomposition. Each method has its own characteristic, which make them attractive for various types of PDE problems, and for different problem sizes.
+In the following sections we present two very different methods for distributed computing based on spatial and temporal decomposition. Each method has its own characteristic, which makes the method attractive for various types of PDE problems and for different problem sizes.
  
  
  
  \subsection{Spatial domain decomposition}\label{ch5:sec:dd}\index{domain decomposition}
  
-Domain decomposition methods can be used for distributing computational work in the numerical solution of boundary value problems~\cite{ch5:Smith1996}. These methods add parallelism, by splitting the spatial dimensions, on which the boundary values are defined, into a number of smaller boundary value problems, and coordinating the solution between adjacent subdomains. Domain decomposition techniques, such as the classical overlapping Schwarz methods, may be considered as preconditioners \index{preconditioning} to the system of equations, that arise from discretization of the PDE, e.g. as for the Poisson problem in \eqref{ch5:eq:poissonsystem}. The algorithmic efficiency of such methods depends on the size of the domain overlaps, while an additional coarse grid correction method is sometimes necessary to maintain fast global convergence.
+Domain decomposition methods can be used for distributing computational work in the numerical solution of boundary value problems~\cite{ch5:Smith1996}. These methods add parallelism by splitting the spatial dimensions, on which the boundary values are defined, into a number of smaller boundary value problems and then coordinating the solution between adjacent subdomains. Domain decomposition techniques, such as the classical overlapping Schwarz methods, may be considered as preconditioners \index{preconditioning} to the system of equations that arise from the discretization of PDEs, e.g., as for the Poisson problem in \eqref{ch5:eq:poissonsystem}. The algorithmic efficiency of such methods depends on the size of the domain overlaps, while an additional coarse grid correction method is sometimes necessary to maintain fast global convergence.
  %A detailed study and analysis of convergence criterions, overlapping zones, local domain solvers etc. can lead to parallel efficiencies that scale linearly with the number of processors for large problem sizes, see e.g. \cite{ch5:Cai2005}.
  
-An alternative to the preconditioning strategy is to have each subdomain query information from adjacent subdomains whenever needed. For PDEs that are discretized onto regular shaped grids, this can be an attractive strategy, as the decomposition of subdomains, and thereby the communication topology\index{topology}, is straightforward. The library supports decomposition of regular grids by either pre- or user-defined topologies. A topology in this context, is a description of connectivity between processors, that share the same grid along with information about the local and global discretization. Layers of ghost points are inserted at the artificially introduced boundaries to account for grid points that reside in adjacent subdomains. The ghost point values can be updated upon request from the user. An illustrative example is given in Figure \ref{ch5:fig:dd2d}; the arrows indicate message passing between adjacent subdomains, when updating grid points within the ghost layers.
+An alternative to the preconditioning strategy is to have each subdomain query information from adjacent subdomains whenever needed. For PDEs that are discretized onto regular shaped grids, this can be an attractive strategy, as the decomposition of subdomains, and thereby the communication topology\index{topology}, is straightforward. The library supports decomposition of regular grids by either pre- or user-defined topologies. A topology in this context, is a description of connectivity between processors that share the same grid along with information about the local and global discretization. Layers of ghost points are inserted at the artificially introduced boundaries to account for grid points that reside in adjacent subdomains. The ghost point values can be updated upon request from the user. An illustrative example is given in Figure \ref{ch5:fig:dd2d}; the arrows indicate message passing between adjacent subdomains, when updating grid points within the ghost layers.
  \begin{figure}[!htb]
  \begin{center}
  \input{Chapters/chapter5/figures/dd2d.tikz}
  \end{center}
-\caption[Domain distribution along the horizontal direction of a two dimensional grid into three subdomains.]{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d}
+\caption[Domain distribution of a two-dimensional grid into three subdomains.]{Domain distribution of a two-dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points, respectively.}\label{ch5:fig:dd2d}
  \end{figure}
  
-Topologies are introduced via an extra template argument to the grid component. A grid is by default not distributed, because the default template argument is a non-distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, that makes sure that the topology instance updates all ghost points. The library contains two topology strategies, based on one dimensional and two dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
+Topologies are introduced via an extra template argument to the grid class. A grid is by default not decomposed, because the default template argument is based on a non distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, which makes sure that all ghost points are updated according to the grid topology. The library contains topologies based on one-dimensional and two-dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
  
-If these updates are performed whenever information from adjacent subdomains are needed, i.e. before a stencil operation, all distributed interior points will be exactly the same as they would be for the non-distributed setup. Therefore, one advantage of this method is, that the algorithmic efficiency of an application can be preserved, if grid updates are invoked at the proper times.
+If grid ghost layers are updated whenever information from adjacent subdomains is needed, e.g., before a stencil operation, all interior points will be exactly the same as they would be for the non distributed setup. Therefore, one advantage of this approach is that the algorithmic efficiency of an application can be preserved, if grid updates are consistently invoked at the proper times.
  
-Distributed performance for the finite difference stencil operation is illustrated in Figure \ref{ch5:fig:multigpu}. The timings include the compute time for the finite difference approximation and the time for updating ghost layers via message passing, cf. Figure \ref{ch5:fig:gpu2gputransfer}. It is obvious from Figure \ref{ch5:fig:multigpu:a}, that communication overhead dominates for smaller problem sizes, where the non-distributed grid (1 GPU) is fastest. However, communication times do not grow as rapidly for larger problems as computation times, due to the surface to volume ratio. Therefore message passing becomes less influential for large problems, where reasonable performance speedups are obtained. Figure \ref{ch5:fig:multigpu:b} demonstrates how the computational performance on multi-GPU systems can be significantly improved for various stencil sizes. With this simple domain decomposition technique, developers are able to implement applications based on heterogeneous distributed computing, without explicitly dealing with message passing, while it is still possible to provide user specific implementations of the topology class for customized grid updates.
+Distributed performance for the finite difference stencil operation is illustrated in Figure \ref{ch5:fig:multigpu}. The timings include the compute time for the finite difference approximation and the time for updating ghost layers via message passing. It is obvious from Figure \ref{ch5:fig:multigpu:a} that communication overhead dominates for the smallest problem sizes, where the non distributed grid (1 GPU) is fastest. However, communication overhead does not grow as rapidly as computation times, due to the surface-to-volume ratio. Therefore message passing becomes less influential for large problems, where reasonable performance speedups are obtained. Figure \ref{ch5:fig:multigpu:b} demonstrates how the computational performance on multi-GPU systems can be significantly improved for various stencil sizes. With this simple domain decomposition technique, developers are able to implement applications based on heterogeneous distributed computing, without explicitly dealing with message passing and it is still possible to provide user specific implementations of the topology class for customized grid updates.
  
  % TODO: Should we put in the DD algebra?
  
  \begin{figure}[!htb]
-    \setlength\figureheight{0.27\textwidth}
-    \setlength\figurewidth{0.55\textwidth}
-    \begin{center}
+    \setlength\figureheight{0.39\textwidth}
+    \setlength\figurewidth{0.39\textwidth}
+%      \centering
      \subfigure[Absolute timings, $\alpha=3$.]{
-    %{\small\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}}
-    {\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050_conv.pdf}}
+    {\scriptsize\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}}
+    %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050_conv.pdf}}
      \label{ch5:fig:multigpu:a}
-    }
+    }%
      \subfigure[Performance at $N=4069^2$, single precision.]{
-   % {\small\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}}
-{\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216_conv.pdf}}
+    {\scriptsize\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}}
+       %{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216_conv.pdf}}
      \label{ch5:fig:multigpu:b}
      }
-    \end{center}
-    \caption{Performance timings for distributed stencil operations, including communication and computation times. Test environment 2.}\label{ch5:fig:multigpu}
+    \caption{Performance timings for distributed stencil operations, including communication and computation times. Executed on test environment 3.}\label{ch5:fig:multigpu}
  \end{figure}
  
-\subsection{Parareal -- Parallel time integration}\label{ch5:sec:parareal}\index{parareal}
+\subsection{Parareal--parallel time integration}\label{ch5:sec:parareal}\index{parareal}
  
-The use of spatial domain decomposition methods is widespread, as they have proven to be efficient on a wide range of problems. Unfortunately, applications that are concerned with limited numerical problem sizes, rapidly reach their speedup limit for a low number of processors, due to a degradation of the scalability when the number of processors become large as this leads to an increasingly unfavourable communication to compute ratio of the individual subdomains. This issue is continuously worsened by the fact that communication speed has been increasing at a far slower pace than compute speed for the past many years, and is expected to continue to do so for years to come. It is often referred to as "the memory wall"~\cite{ch5:Asanovic:EECS-2006-183}, one of the grand challenges facing development and architectural design of future high-performance systems~\cite{ch5:Keyes2011,ch5:ScientificGrandChallenges2010}. Finally, there are applications based on ordinary differential equations, where classical domain decomposition methods are not even applicable~\cite{ch5:YMTR08}. For these type of applications, a method of adding parallelism in the temporal integration is of great interest. Contrary to space however, time is -- by its very nature -- sequential, which precludes a straightforward implementation of a parallel approach.
+The use of spatial domain decomposition methods is widespread, as they have proven to be efficient on a wide range of problems. Unfortunately, applications that are concerned with limited numerical problem sizes can rapidly reach a speedup limit for a low number of processors due to scalability degradation when the number of processors becomes large, as this leads to an increasingly unfavourable communication-to-compute ratio. This issue is continuously worsened by the fact that communication speed has been increasing at a far slower pace than compute speed for the past several years, and this trend is expected to continue for years to come. It is often referred to as \emph{the memory wall}~\cite{ch5:Asanovic:EECS-2006-183}, one of the grand challenges facing development and architectural design of future high-performance systems~\cite{ch5:Keyes2011,ch5:ScientificGrandChallenges2010}. Also, there are applications based on ordinary differential equations, where classical domain decomposition methods are not even applicable~\cite{ch5:YMTR08}. For these type of applications, a method of adding parallelism in the temporal integration is of great interest. Contrary to space however, time is--by its very nature--sequential, which precludes a straightforward implementation of a parallel approach.
  
-One method that introduces concurrency to the solution of evolution problems is the parareal algorithm. Parareal is an iterative method imposed on a time decomposition. Gander and Vandewalle showed in \cite{ch5:MS07}, that the algorithm can be written as both a multiple shooting method and as a two level multigrid in time approach, even though the leading idea came from spatial domain decomposition. The method has many exciting features: it is fault tolerant and has different communication characteristics than those of the classical domain decomposition methods. It has also been demonstrated to work effectively on a wide range of problems, and most importantly, once the proper distribution infrastructure is in place, it can easily be wrapped around any type of numerical integrator, for any type of initial value problem.
+One method that introduces concurrency to the solution of evolution problems is the parareal algorithm. Parareal is an iterative method imposed on a time decomposition. Gander and Vandewalle showed in \cite{ch5:MS07} that the algorithm can be written both as a multiple shooting method and as a two-level multigrid-in-time approach, even though the leading idea came from spatial domain decomposition. The method has many exciting features: it is fault tolerant and has different communication characteristics than those of the classical domain decomposition methods. It has also been demonstrated to work effectively on a wide range of problems, and most importantly, once the proper distribution infrastructure is in place, it can easily be wrapped around any type of numerical integrator, for any type of initial value problem.
+\label{ch5:parareal}
+
+\subsubsection{The parareal algorithm}
  
+The parareal algorithm was first presented in 2001, in a paper by Lions et al.~\cite{ch5:LMT01}, and later introduced in a slightly revised predictor-corrector form in 2002 by Baffico et al.~\cite{ch5:LSY02}. The parareal-in-time approach proposes to break the global problem of time evolution into a series of independent evolution problems on smaller intervals, see Figure \ref{ch5:fig:ParallelInTime}. %
  \begin{figure}[!htb]
  \def\FigureSize{0.6}
  \begin{center}
  \input{Chapters/chapter5/figures/ParallelInTime.tikz}
  \end{center}
-\caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time sub-main to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime}
-\end{figure}
-\label{ch5:parareal}
-
-\subsubsection{The Parareal algorithm}
-
-The parareal algorithm was first presented in 2001, in a paper by Lions et al.~\cite{ch5:LMT01}, and later introduced in a slightly revised predictor-corrector form in 2002 by Baffico et al.~\cite{ch5:LSY02}. The parareal in time approach proposes to break the global problem of time evolution into a series of independent evolution problems on smaller intervals, see Figure \ref{ch5:fig:ParallelInTime}. Initial states for these problems are needed, and supplied by a simple, less accurate, but computationally cheap sequential integrator. The smaller independent evolution problems can then be solved in parallel. The information, generated during the concurrent solution of the independent evolution problems with accurate propagators and inaccurate initial states, is used in a predictor-corrector fashion in conjunction with the coarse integrator to propagate the solution faster, now using the information generated in parallel. We define the decomposition into $N$ intervals, that is
+\caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time subdomain to compute the initial value problem. Consistency at the time subdomain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm}\label{ch5:fig:ParallelInTime}
+\end{figure}%
+Initial states for these problems are needed and supplied by a simple, less accurate, but computationally cheap sequential integrator. The smaller independent evolution problems can then be solved in parallel. The information, generated during the concurrent solution of the independent evolution problems with accurate propagators and inaccurate initial states, is used in a predictor-corrector fashion in conjunction with the coarse integrator to propagate the solution faster, now using the information generated in parallel. We define the decomposition into $N$ intervals, that is,
  \begin{align}
   & T_{0}<T_{1}<\cdots<T_{n}=n\Delta T<T_{n+1}<T_{N},
  \end{align}
@@ -495,26 +493,27 @@ where $\Delta T$ is the size of the time intervals and $n=0,1,\ldots,N$. The gen
  \begin{align}\label{ch5:eq:difproblem}
   \frac{\partial u}{dt}+\mathcal{A}u=0, \qquad u(T_{0})=u^{0}, \quad t\in\left[T_{0},T_{N}\right],
  \end{align}
-where $\mathcal{A}$ is an operator from one Hilbert space to another. To solve the differential problem \eqref{ch5:eq:difproblem} we define an operator $\mathcal{F}_{\Delta T}$ that operates on some initial state $U_{n} \approx u(T_{n})$, and return an approximate solution to \eqref{ch5:eq:difproblem}, at time $T_{n}+\Delta T$. Such an operator is achieved by the implementation of a numerical time integrator, using some small time-step $\delta t\ll\Delta T$ in the integration. The numerical solution to \eqref{ch5:eq:difproblem} is then obtained by applying the fine propagator sequentially for $n=1,2,\ldots,N$.
+where $\mathcal{A}$ is an operator from one Hilbert space to another. To solve the differential problem \eqref{ch5:eq:difproblem} we define an operator $\mathcal{F}_{\Delta T}$ that operates on some initial state $U_{n} \approx u(T_{n})$ and returns an approximate solution to \eqref{ch5:eq:difproblem}, at time $T_{n}+\Delta T$. Such an operator is achieved by the implementation of a numerical time integrator, using some small time-step $\delta t\ll\Delta T$ in the integration. The numerical solution to \eqref{ch5:eq:difproblem} can then be obtained by applying the fine propagator sequentially for $n=1,2,\ldots,N$.
  \begin{align}\label{ch5:eq:fineOp}
  \hat{U}_{n}=\mathcal{F}_{\Delta T}\left(T_{n-1},\hat{U}_{n-1}\right), \qquad\hat{U}_{0}=u^{0}.
  \end{align}
-For the purpose of parallel acceleration of the otherwise purely sequential process of obtaining $\mathcal{F}_{\Delta T}^{N} u^{0} \approx u(T_{N})$, we define the coarse propagator $\mathcal{G}_{\Delta T}$. $\mathcal{G}_{\Delta T}$ also operates on some initial state $U_{n}$, propagating the solution over the time interval $\Delta T$, but now using a time step $\delta T$. Typically $\delta t<\delta T<\Delta T$. For the parareal algorithm to be effective, the coarse propagator $\mathcal{G}_{\Delta T}$ has to be substantially faster to evaluate than the fine propagator $\mathcal{F}_{\Delta T}$. There are many ways of constructing the coarse propagator, the simplest one being to apply the same numerical integrator as for the fine propagator, but using a coarser time discretization. We refer to \cite{ch5:ASNP12} for an introduction to other methods. The coarse operator reads
+
+For the purpose of parallel acceleration of the otherwise purely sequential process of obtaining $\mathcal{F}_{\Delta T}^{N} u^{0} \approx u(T_{N})$, we define the coarse propagator $\mathcal{G}_{\Delta T}$. $\mathcal{G}_{\Delta T}$ also operates on some initial state $U_{n}$, propagating the solution over the time interval $\Delta T$, but now using a time step $\delta T$. Typically $\delta t<\delta T<\Delta T$. For the parareal algorithm to be effective, the coarse propagator $\mathcal{G}_{\Delta T}$ has to be substantially faster to evaluate than the fine propagator $\mathcal{F}_{\Delta T}$. There are many ways of constructing the coarse propagator, the simplest one being to apply the same numerical integrator as for the fine propagator, but using a coarser time discretization. We refer the reader to \cite{ch5:ASNP12} for an introduction to other methods. The coarse operator reads
  \begin{align}\label{ch5:eq:coarseOp}
  \tilde{U}_{n}=\mathcal{\mathcal{G}}_{\Delta T}\left(T_{n-1},\tilde{U}_{n-1}\right),\qquad\tilde{U_{0}}=u^{0}.
  \end{align}
-Using the defined $\mathcal{F}_{\Delta T}$ and $\mathcal{G}_{\Delta T}$ operators, the predictor-corrector form of the parareal algorithm can be written in a single line
+Using the defined $\mathcal{F}_{\Delta T}$ and $\mathcal{G}_{\Delta T}$ operators, the predictor-corrector form of the parareal algorithm can be written in a single line as
  \begin{equation}\label{ch5:eq:PARAREAL}
  U_{n}^{k+1}=\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k+1}\right)+\mathcal{\mathcal{F}}_{\Delta T}\left(U_{n-1}^{k}\right)-\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k}\right),\quad U_{0}^{k}=u^{0},
  \end{equation}
-with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. $N$ being the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time integration component, using a fully distributed work scheduling model, as proposed by Aubanel (2010)~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer.  It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time integration using forward Euler for coarse propagation and fourth order Runge-Kutta for fine propagation, could then be defined as follows:
-\lstset{label=ch5:lst:parareal,caption={Assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation.}}
+with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. $N$ being the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time-integration component, using a fully distributed work scheduling model, as proposed by Aubanel~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer.  It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time-integration using forward Euler for coarse propagation and fourth order Runge-Kutta for fine propagation could then be defined as in Listings \ref{ch5:lst:parareal}. The number of GPUs used for parallelization depends on the number of MPI processes executing the application.
+\lstset{label=ch5:lst:parareal,caption={assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation}}
  \begin{lstlisting}
  typedef gpulab::integration::forward_euler          coarse;
  typedef gpulab::integration::ERK4                   fine;
  typedef gpulab::integration::parareal<coarse,fine>  integrator;
  \end{lstlisting}
-The number of GPUs used for parallelization depends on the number of MPI processes that the application is initiated with.
+
  
  %Each application of $\mathcal{\mathcal{G}}_{\Delta T}$ or $\mathcal{\mathcal{F}}_{\Delta T}$ to a state $\tilde{U}_{n}^{k}$ is performed on a GPU while the corrections themselves are performed on CPUs. In addition to an efficient distribution model, a stopping strategy is needed as parareal is an iterative algorithm. Various choices has been implemented in the library, these are presented in section \ref{ch5:subsec:Lib_impl}.
  
@@ -523,24 +522,28 @@ The number of GPUs used for parallelization depends on the number of MPI process
  \begin{center}
  \input{Chapters/chapter5/figures/FullyDistributed.tikz}
  \end{center}
-\caption[Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel.]{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades trough to rank $N$ where the final solution can be fetched. }
+\caption[Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel.]{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel~\cite{ch5:EA10}. Each GPU is responsible for computing the solution on a single time subdomain. The computation is initiated at rank $0$ and cascades through to rank $N$ where the final solution can be fetched.}
  \end{figure}
  
  \subsubsection{Computational complexity}
  
-In the analysis of the computational complexity, we first recognize that both the coarse and the fine propagator, regardless of the type of discretization scheme, involves a complexity that is proportional to the number of time steps being used. Let us define two scalar values $\mathcal{C}_\mathcal{F}$ and $\mathcal{C}_\mathcal{G}$ as the computational cost of performing a single step with the fine and coarse propagators. The computational complexity of a propagator integrating over an interval $\Delta T$ is then given by $\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}$ and $\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}$ respectively. $R$ is introduced as the relation between the two, that is, $R$ is a measure of how much faster the coarse propagator is compared to the fine propagator over the time interval $\Delta T$. The total computational cost for parareal over $N$ intervals is proportional to
+In the analysis of the computational complexity, we first recognize that both the coarse and the fine propagators, regardless of the type of discretization scheme, involve a complexity that is proportional to the number of time steps being used. Let us define two scalar values $\mathcal{C}_\mathcal{F}$ and $\mathcal{C}_\mathcal{G}$ as the computational cost of performing a single step with the fine and coarse propagators. The computational complexity of a propagator integrating over an interval $\Delta T$ is then given by $\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}$ and $\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}$, respectively. $R$ is introduced as the relation between the two; that is, $R$ is a measure of how much faster the coarse propagator is compared to the fine propagator in integrating the time interval $\Delta T$. The total computational cost for parareal over $N$ intervals is then proportional to
  \begin{align}
-\left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+kN\mathcal{C_{F}}\frac{\Delta T}{\delta t},
+\left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+kN\mathcal{C_{F}}\frac{\Delta T}{\delta t}.
  \end{align}
-recognising that the second term can be distributed over $N$ processors, we are left with
+Recognizing that the second term can be distributed over $N$ processors, we are left with
  \begin{align}\label{ch5:eq:ComComplexPara}
  \left(k+1\right)N\mathcal{C_{G}}\frac{\Delta T}{\delta T}+k\mathcal{C_{F}}\frac{\Delta T}{\delta t}.
  \end{align}
-The above should be compared to the computational complexity of a purely sequential propagation, using only the fine operator, $\frac{T_{N}-T_{0}}{\delta t}\mathcal{C}_\mathcal{F}=N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}$. From the latter we can estimate the speed-up, here denoted $\psi$, as the ratio between the computational complexity of the purely sequential solution, and the complexity of the solution obtained by the parareal algorithm \eqref{ch5:eq:ComComplexPara}. Neglecting the influence of communication speed and correction time, we are left with the estimate
+The above should be compared to the computational complexity of a purely sequential propagation, using only the fine operator,
+\begin{align}
+\frac{T_{N}-T_{0}}{\delta t}\mathcal{C}_\mathcal{F}=N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}.
+\end{align}
+We can now estimate the speedup, here denoted $\psi$, as the ratio between the computational complexity of the purely sequential solution, and the complexity of the solution obtained by the parareal algorithm \eqref{ch5:eq:ComComplexPara}. Neglecting the influence of communication speed and correction time, we are left with the estimate
  \begin{align}\label{ch5:eq:EstiSpeedBasicPara}
  \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}.
  \end{align}
-If we in addition assume that the time spend on coarse propagation is negligible compared to time spend on the fine propagation, i.e. the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge pose an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio between the speed of the coarse and the fine integrator $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}$, denoted $R$. Using a slow, but more accurate coarse integrator, will lead to convergence in fewer iterations $k$, but at the same time it also makes $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}$ larger. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law also lower the upper bound on possible attainable speedup. Thus, the ratio $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}$ \emph{can not} be made arbitrarily small since the relation is inversely proportional to the iterations $k$, needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spend on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency is typically observed to be in the range of 20--50\% in the literature, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two dimensional heat problem \ref{ch5:eq:heateqbc}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method as both fine and coarse propagator is presented. R is regulated by changing the coarse time discretization. In Figure \ref{ch5:fig:pararealGPUs} speed-up and parallel efficiency measurements are presented. Notice how when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite this requiring an extra parareal iteration that increases the total computational complexity.
+If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity.
  
  \begin{figure}[!htb]
      \setlength\figureheight{0.32\textwidth}
@@ -555,14 +558,14 @@ If we in addition assume that the time spend on coarse propagation is negligible
      \label{ch5:fig:pararealRvsGPUs:b}
      }
      \end{center}
-    \caption[Parareal convergence properties as a function of $R$ and number GPUs used.]{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
+    \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
  \end{figure}
  
  \begin{figure}[!htb]
      \setlength\figureheight{0.32\textwidth}
-    \setlength\figurewidth{0.35\textwidth}
+    \setlength\figurewidth{0.34\textwidth}
      \begin{center}
-    \subfigure[Measured parallel speed-up]{
+    \subfigure[Measured parallel speedup]{
      {\small\input{Chapters/chapter5/figures/pararealSpeedupvsRvsGPUs.tikz}}
      \label{ch5:fig:pararealGPUs:a}
      }
@@ -571,7 +574,7 @@ If we in addition assume that the time spend on coarse propagation is negligible
      \label{ch5:fig:pararealGPUs:b}
      }
      \end{center}
-    \caption[Parareal performance properties as a function of $R$ and number GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs}
+    \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}
  \end{figure}
  
  %TODO: Do we make this into a subsubsection:
@@ -596,19 +599,19 @@ If we in addition assume that the time spend on coarse propagation is negligible
  
  \section{Conclusion and outlook}
  
-Massively parallel heterogeneous systems continue to enter the consumer market, and there has been no identification that this trend will stop for the next years to come. However, these parallel architectures require software vendors to adjust to new programming models and optimization strategies. Good software libraries are important tools for reducing time and complexity of adjusting to new architectures and they provide the user with an intuitive programming interface.
+Massively parallel heterogeneous systems continue to enter the consumer market, and there has been no identification that this trend will stop for years to come. However, these parallel architectures require software vendors to adjust to new programming models and optimization strategies. Good software libraries are important tools for reducing the time and complexity of adjusting to new architectures, and they provide the user with an intuitive programming interface.
  
-We presented ideas for a generic GPU-based library for fast assembling of PDE solvers. A high-level interface and some simple implementation concepts were created with the objective of enhancing developer productivity and to ensure performance portability. Two elementary PDE model problems were used to demonstrate assembling of both spatial and temporal approximation techniques. Results confirmed that numerical methods for solution of PDE systems are bandwidth limited on GPU systems. Therefore higher-order methods can be advantageous, because extra computational work can be performed during memory transfer idle times, leading to increased flop performance.
+We have presented ideas for a generic GPU-based library for fast assembling of PDE solvers. A high-level interface and simple implementation concepts were created with the objective of enhancing developer productivity and to ensure performance portability. Two elementary PDE model problems were used to demonstrate assembling of both spatial and temporal approximation techniques. Results confirmed that numerical methods for solution of PDE systems are bandwidth limited on GPU systems. Therefore higher-order methods can be advantageous, when extra computational work can be performed during memory transfer idle times, leading to increased flop performance.
  
-Decomposition strategies for spatial and temporal parallelization on multi-GPU systems has been presented, with reasonable performance speedups. Novel results for the parareal algorithm on multi-GPU systems was presented, and an example of parallel efficiency for the heat conduction problem was demonstrated. Furthermore, a domain decomposition technique that preserves algorithmic efficiency was presented for the Poisson problem, with good parallel scalability.
+Decomposition strategies for spatial and temporal parallelization on multi-GPU systems have been presented, with reasonable performance speedups. Novel results for the parareal algorithm on multi-GPU systems have been presented, and an example of parallel efficiency for the heat conduction problem has been demonstrated. Furthermore, a domain decomposition technique that preserves algorithmic efficiency has been presented for the Poisson problem.
  
-The library has already successfully been used for development of a fast tool intended for scientific applications within maritime engineering, see Chapter \ref{ch7} for details. We intend to further extend the library, as we explore new techniques, suitable for parallelization on heterogeneous systems, that fit the scope of our applications.
+The library has already successfully been used for development of a fast tool intended for scientific applications within maritime engineering; see Chapter \ref{ch7} for details. We intend to further extend the library, as we explore new techniques, suitable for parallelization on heterogeneous systems, that fit the scope of our applications.
  
-For more information about the library and our work within the GPUlab group at the Technical University of Denmark, we encourage you to visit \texttt{http://gpulab.imm.dtu.dk}.
+For more information about the library and our work within the GPUlab group at the Technical University of Denmark, we encourage you to visit \texttt{http://gpulab.imm.dtu.dk}
  
-\section*{Acknowledgements}
-This work was supported by grant no. 09-070032 %on "Desktop Scientific Computing on Consumer Graphics Cards"
-from the Danish Research Council for Technology and Production Sciences. A special thank goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests was done in GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at Center for Computing and Visualization, Brown University, USA. Nvidia Corporation is acknowledged for generous hardware donations to facilities of GPUlab.
+\section*{Acknowledgments}
+This work have been supported by grant no. 09-070032 %on "Desktop Scientific Computing on Consumer Graphics Cards"
+from the Danish Research Council for Technology and Production Sciences. A special thanks goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests were done in the GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at the Center for Computing and Visualization, Brown University, USA. The NVIDIA Corporation is acknowledged for generous hardware donations to facilities of the GPUlab.
  
  %-------------------------------------------------------------------------
  %\backmatter % bibliography, index and similar things, which are not numbered.