new

[book_gpu.git] / BookGPU / Chapters / chapter7 / ch7.tex
diff --git a/BookGPU/Chapters/chapter7/ch7.tex b/BookGPU/Chapters/chapter7/ch7.tex

index d1918d4462846cec6afe78a361be2f537386a470..f20975db909f43ea364fa335cc8bf6ad5b1f9f5a 100644 (file)
--- a/BookGPU/Chapters/chapter7/ch7.tex
+++ b/BookGPU/Chapters/chapter7/ch7.tex
@@ -1,10 +1,10 @@
  
-\chapterauthor{Allan P. Engsig-Karup, Stefan L. Glimberg, Allan S. Nielsen, Ole Lindberg}{Technical University of Denmark}
+\chapterauthor{Allan P. Engsig-Karup, Stefan L. Glimberg, Allan S. Nielsen, and Ole Lindberg}{Technical University of Denmark}
  %\chapterauthor{Stefan L. Glimberg}{Technical University of Denmark}
  %\chapterauthor{Allan S. Nielsen}{Technical University of Denmark}
  %\chapterauthor{Ole Lindberg}{Technical University of Denmark}
  
-\chapter{Fast hydrodynamics on heterogenous many-core hardware}
+\chapter{Fast hydrodynamics on heterogeneous many-core hardware}
  \label{ch7}
  
  \begin{figure}[!htb]
@@ -14,21 +14,21 @@
  \end{figure}
  
  
-In this chapter, we use our library for heterogenous and massively parallel GPU implementations. The library is written in Compute Unified Device Architecture (CUDA) C/C++ and a fully nonlinear and dispersive free surface water wave model \cite{ch7:EngsigKarupEtAl2011} is implemented. We describe how flexible-order finite difference\index{finite difference} (stencil) approximations to the partial differential equations of the model can be prototyped using library components provided in an in-house library. In this library hardware-specific implementation details are hidden via template-based components, as described in Chapter \ref{ch5}. We provide details of the modeling basis and important unique numerical properties which have been made tunable to create a powerful and robust tool that can be tailored for specific purposes in engineering analysis. The model is based on unified potential flow theory and can be applied in scientific applications related to maritime engineering. It can be applied for cost-efficient estimation of broad banded wave propagation, transformation of irregular multidirectional waves over variable depth, kinematics and structural wave loads over large areas and scales.
+In this chapter, we use our library for heterogeneous and massively parallel GPU implementations. The library is written in Compute Unified Device Architecture (CUDA) C/C++ and a fully nonlinear and dispersive free surface water wave model \cite{ch7:EngsigKarupEtAl2011} is implemented. We describe how flexible-order finite difference\index{finite difference} (stencil) approximations to the partial differential equations of the model can be prototyped using library components provided in an in-house library. In this library hardware-specific implementation details are hidden via template-based components, as described in Chapter \ref{ch5}. We provide details of the modeling basis and important unique numerical properties which have been made tunable to create a powerful and robust tool that can be tailored for specific purposes in engineering analysis. The model is based on unified potential flow theory and can be applied in scientific applications related to maritime engineering. It can be applied for cost-efficient estimation of broad banded wave propagation, transformation of irregular multidirectional waves over variable depth, kinematics and structural wave loads over large areas and scales.
  
-A main motivation of this work is to deliver exceptional performance to minimize calculation times, using modern parallel hardware technologies in combination with a proper choice of discretization methods and data-local algorithms with optimal complexity. This enable work and memory requirements to grow (scale) linearly with problem size on a suitable hardware system. For the wave model this is achieved by explicit time integration and iterative solution of a large nonsymmetric and sparse linear $\sigma$-transformed Laplace problem. For the latter, we use an iterative Preconditioned Defect Correction (PDC) method, accelerated using a geometric multigrid preconditioning strategy. We use modern programming paradigms in the form of Message Passing Interface (MPI) and CUDA for development of a novel massively parallel wave modeling tool, executable on modern heterogenous many-core hardware.
+A main motivation of this work is to deliver exceptional performance to minimize calculation times, using modern parallel hardware technologies in combination with a proper choice of discretization methods and data-local algorithms with optimal complexity. This enable work and memory requirements to grow (scale) linearly with problem size on a suitable hardware system. For the wave model this is achieved by explicit time integration and iterative solution of a large nonsymmetric and sparse linear $\sigma$-transformed Laplace problem. For the latter, we use an iterative Preconditioned Defect Correction (PDC) method, accelerated using a geometric multigrid preconditioning strategy. We use modern programming paradigms in the form of Message Passing Interface (MPI) and CUDA for development of a novel massively parallel wave modeling tool, executable on modern heterogeneous many-core hardware.
  
  One purpose of the developed numerical model is to ultimately perform hydrodynamic calculations in the time domain for practical analysis and simulation, e.g., to enable computationally intensive interactive real-time simulations. Realistic interactive simulations are, with present technology and available computational resources, a tremendous challenge in this setting. Yet, our aim is to take a first step in this direction and compute first-order accurate hydrodynamics for near-realistic simulations of unsteady ship-wave dynamics in a large ship simulator, used for training purposes in seakeeping operations. For this type of application, a mandatory ingredient for real-time and interactive simulation is a truly high-performance parallel implementation to ensure data processing in time for interactive visualization and responses. Details of the model properties, implementation, and promising novel combinations of techniques and algorithms for acceleration of performance are presented. Numerical experiments and benchmarks are provided to demonstrate the accuracy and efficiency of the model across recent generations of many-core CUDA-enabled hardware.
  
  \section{On hardware trends and challenges in scientific applications}
  
-During the last two decades we have seen how computer graphics hardware has been developed from fixed pipeline processors with no level of programmability, to flexible high-performance hardware platforms, suitable for general purpose scientific computations other than computer graphics. This trend has contributed to a disruptive breakthrough in high-performance computing on mass-produced commodity hardware and fuelled new opportunities for computational science and engineering for a broad range of scientific as well as modern business applications. This emphasizes the increasingly important role of computers in simulation of real world dynamics \cite{ch7:Keyes201170}. In recent years, the CUDA programming model, based on the standard C/C++ programming language and introduced by NVIDIA Corporation worldwide, has become popular as a proprietary and widely used standard in high performance communities. It is, by design and supported functionality, easy and sufficient to be deployed for wide improvement of existing and new applications across science and engineering fields, that can benefit from the the use of heterogenous hardware.
+During the last two decades we have seen how computer graphics hardware has been developed from fixed pipeline processors with no level of programmability, to flexible high-performance hardware platforms, suitable for general purpose scientific computations other than computer graphics. This trend has contributed to a disruptive breakthrough in high-performance computing on mass-produced commodity hardware and fuelled new opportunities for computational science and engineering for a broad range of scientific as well as modern business applications. This emphasizes the increasingly important role of computers in simulation of real world dynamics \cite{ch7:Keyes201170}. In recent years, the CUDA programming model, based on the standard C/C++ programming language and introduced by NVIDIA Corporation worldwide, has become popular as a proprietary and widely used standard in high performance communities. It is, by design and supported functionality, easy and sufficient to be deployed for wide improvement of existing and new applications across science and engineering fields, that can benefit from the the use of heterogeneous hardware.
  
  We should be careful about speculating about the future and extrapolating from current trends. The TOP 500 list\footnote{\url{http://www.top500.org.}} of supercomputers shows that there are some general noticeable hardware trends and gives indication of what to expect in the near future. First, since 2005 we have seen how power constraints and resulting heat dissipation problems forced chip producers to increase the number of cores rather than clock frequency.  Multicore processors have become the new standard and many-core processors are becoming available as a standard in commodity hardware, from personal laptops to supercomputing clusters.
  
  This trend suggests that there will be less fast low-latency memory available per core in the future, favoring data-locality in algorithms. In addition, we have also seen how communication speed to computation speed ratio decreases, making it increasingly difficult to supply data to hungry floating point units. In addition, there will likely be increasing amounts of data to store as a result of increasing processing capabilities. The rapidly increasing floating point performance following Moore's law for transistor production has resulted in a significant memory gap which leaves most scientific applications based on partial differential equations (PDEs) bandwidth bound rather than compute bound. This trend is driven by pure commercial needs and not the needs of high-performance computing. Roads to better performance include standardization of software infrastructure, rethinking algorithms to better exploit memory hierarchies optimally to boost strong scaling properties, increasing locality in algorithms, and introducing as much concurrency and work as possible to both utilize and exploit the many cores. Also, software that can utilize many cores should be fault-tolerant to maximize time to solution for application users. We should also expect to see multiple layers of parallelism that will have to be exploited and possibly autotuned to optimally utilize available hardware resources. This introduces new challenges in compilers, requires programming experts with hardware knowledge, and introduces new trends in software developments to leverage productivity and utilize available computational resources in more optimal ways. We have observed a fundamental paradigm shift of underlying hardware design towards much more heterogeneity and parallelism.
  
-A key problem is that improvements in performance require porting legacy codes\footnote{In the worst case, a legacy code is an undocumented serial code developed a long time ago by a developer no longer around.} to new hardware, and possibly changing algorithms which have been developed for the conventional single core processors decades ago. Without this, it may be impossible to utilize and scale algorithmic work optimally to achieve high performance on modern and emerging hardware. This problem is currently addressed with rapid progress by researchers and industry by development of new optimized libraries that can utilize such new hardware. While we have seen significant improvements in such efforts, and today see much more rapid development of applications, there are still relatively few scientific applications running entirely on heterogenous hardware. 
+A key problem is that improvements in performance require porting legacy codes\footnote{In the worst case, a legacy code is an undocumented serial code developed a long time ago by a developer no longer around.} to new hardware, and possibly changing algorithms which have been developed for the conventional single core processors decades ago. Without this, it may be impossible to utilize and scale algorithmic work optimally to achieve high performance on modern and emerging hardware. This problem is currently addressed with rapid progress by researchers and industry by development of new optimized libraries that can utilize such new hardware. While we have seen significant improvements in such efforts, and today see much more rapid development of applications, there are still relatively few scientific applications running entirely on heterogeneous hardware. 
  
  %The main justification for porting or developing application on such hardware is a significant performance band expected (significant) 
  
@@ -37,7 +37,7 @@ A key problem is that improvements in performance require porting legacy codes\f
  %However, increasing amounts of applications are utilizing accelerators to parts of their code to gain speedups albeit with less dramatic improvements of performance as one can potentially find by adapting most, if not all of the application to modern hardware.
  
  In this work, we explore some of these trends by developing, by bottom-up-design, a water-wave model which can be utilized in maritime engineering and with the intended use on affordable office desktops as well as on more expensive modern compute clusters for engineering analysis purposes.  
-
+\clearpage
  \section{On modeling paradigms for highly nonlinear and dispersive water waves}
  \label{ch7:sec:modernwavemodellingparadigms}
  
@@ -200,7 +200,7 @@ From a numerical point of view, an efficient and scalable discretization strateg
  
  We present scalability and performance tests based on the same two test environments outlined in Chapter \ref{ch5}, Section \ref{ch5:sec:testenvironments}, plus a fourth test environment based on the most recent hardware generation:
  \begin{description}
-\item[Test environment 4.] Desktop with dual-socket Sandy Bridge Intel Xeon E5-2670 (2.60 GHz) processors, 64GB RAM, 2x Nvidia Tesla K20 GPUs.
+\item[Test environment 4.] Desktop with dual-socket Sandy Bridge Intel Xeon E5-2670 (2.60 GHz) processors, 64GB RAM, 2x NVIDIA Tesla K20 GPUs.
  \end{description}
  Performance results can be used to predict actual runtimes as described in \cite{ch7:EngsigKarupEtAl2011}, e.g., for estimation of whether a real-time constraint for a given problem size can be met.
  
@@ -266,7 +266,7 @@ where $S$ is a structural surface.
  
  \subsection{Finite difference approximations}\index{finite difference}
  
-The numerical scheme is implemented as a flexible-order finite difference collocation scheme where all finite difference approximations of derivatives are constructed from one-dimensional approximations in a standard way, each having the maximum possible accuracy. In explicit numerical schemes, finite difference approximations can be implemented using a matrix-free technique to exploit that only a few different stencils are in fact needed. This can significantly reduce memory requirements of the implemented model by exploiting that the same small set of stencils can be reused. See Chapter \ref{ch5} for more details about matrix-free stencil operations supported in our in-house library for heterogenous and massively parallel computing using GPUs.
+The numerical scheme is implemented as a flexible-order finite difference collocation scheme where all finite difference approximations of derivatives are constructed from one-dimensional approximations in a standard way, each having the maximum possible accuracy. In explicit numerical schemes, finite difference approximations can be implemented using a matrix-free technique to exploit that only a few different stencils are in fact needed. This can significantly reduce memory requirements of the implemented model by exploiting that the same small set of stencils can be reused. See Chapter \ref{ch5} for more details about matrix-free stencil operations supported in our in-house library for heterogeneous and massively parallel computing using GPUs.
  
  %\newpage
  \subsection{Time integration}\index{time integration}
@@ -359,7 +359,7 @@ Subtracting $g^n$ in \eqref{ch7:eq:discreteupdate} and dividing by a pseudo time
  \frac{g^{*,n+1}-g^n}{\tau} =\frac{(1-\Gamma)}{\tau} (g_e^n-g^n).
  \end{align}
  %
-The first term is similar to a first-order accurate Forward Euler\index{forward Euler} approximation of a rate of change term. This motivates an {\em embedded penalty forcing technique} based on adding a correction term of the form
+The first term is similar to a first-order accurate Forward Euler\index{Euler!forward Euler} approximation of a rate of change term. This motivates an {\em embedded penalty forcing technique} based on adding a correction term of the form
  %
  \begin{align}\label{ch7:eq:penalty}
  \partial_t g = \mathcal{N}(g) + \frac{1-\Gamma(x)}{\tau} (g_e(t,x)-g(t,x)), \quad {\bf x}\in\Omega_\Gamma,
@@ -393,7 +393,7 @@ The profiles can be reversed by a change of coordinate, i.e., $\Gamma(1-\hat{x})
  \includegraphics[width=0.98\textwidth]{Chapters/chapter7/figures/nonlinearwavespenalty-eps-converted-to.pdf}
  % Nx = 540, 6th order, vertical clustering, Nz=6;
  }
-\caption[Snapshots at intervals $T/8$ over one wave period in time.]{Snapshots at intervals $T/8$ over one wave period in time of computed (a) small-amplitude $(kh,kH)=(0.63,0.005)$ and (b) finite-amplitude $(kh,kH)=(1,0.41)$ stream function waves elevations having reached a steady state after transient startup. Combined wave generation and absorption zones occur in the left relaxation zone of both (a) and (b). In b) an absorption zone is positioned next to the right boundary and causes minor visible reflections. }
+\caption[Snapshots at intervals $T/8$ over one wave period in time.]{Snapshots at intervals $T/8$ over one wave period in time of computed (a) small-amplitude $(kh,kH)=(0.63,0.005)$ and (b) finite-amplitude $(kh,kH)=(1,0.41)$ stream function waves elevations having reached a steady state after transient startup. Combined wave generation and absorption zones occur in the left relaxation zone of both (a) and (b). In (b) an absorption zone is positioned next to the right boundary and causes minor visible reflections. }
  \label{ch7:figstandwave}
  \end{figure}
  
@@ -412,7 +412,7 @@ it is attractive to use iterative methods for large system sizes $n=N_xN_yN_z$ a
  \end{align}
  where $\mathcal{M}$ is a preconditioner with the property that $\mathcal{M}^{-1}\approx \mathcal{A}^{-1}$ can be computed at low cost.
  
-The bottleneck problem in a unified potential flow model is the solution of a discrete $\sigma$-transformed Laplace problem stated in the compact forms \eqref{ch7:linsyscompact} or \eqref{ch7:eq:linsys}. It is attractive to find an efficient iterative strategy where convergence\index{convergence} is understood via a convergence theory, has modest storage requirements, has minimal global communication requirements (in the form of global inner products), and is suitable for flexible-order\index{flexible order} discretizations. The class of geometric Multigrid Methods\index{multigrid} fulfills these requirements and has shown to be among the most efficient iterative strategies for a wide class of problems \cite{ch7:Trottenberg01}. In particular, the time required to solve a system of linear equations to a given accuracy level can be made to scale proportional to the number of unknowns.
+The bottleneck problem in a unified potential flow model is the solution of a discrete $\sigma$-transformed Laplace problem stated in the compact forms \eqref{ch7:linsyscompact} or \eqref{ch7:eq:linsys}. It is attractive to find an efficient iterative strategy where convergence\index{convergence} is understood via a convergence theory, has modest storage requirements, has minimal global communication requirements (in the form of global inner products), and is suitable for flexible-order\index{flexible order} discretizations. The class of geometric Multigrid methods\index{multigrid} fulfills these requirements and has been shown to be among the most efficient iterative strategies for a wide class of problems \cite{ch7:Trottenberg01}. In particular, the time required to solve a system of linear equations to a given accuracy level can be made to scale proportional to the number of unknowns.
  
  There are several known approaches to multigrid methods \cite{ch7:MR744926} for high-order discretizations. Among these, Defect Correction\index{defect correction} Methods (DCMs) \cite{ch7:Stetter1978,ch7:AuzingerStetter1982,ch7:Hackbusch1982,ch7:Auzinger1987,ch7:Trottenberg01} have been employed successfully, e.g., in computational fluid dynamics \cite{ch7:LaytonEtAl2002}, in numerical simulations since the early 1970s.  
  The fundamental idea of DCMs is to combine the good stability properties of low-order discretizations with high-order accuracy discretizations for explicit residual evaluations. These iterative methods impose low storage requirements, have scalable work effort under suitable choices of preconditioning strategies, and may be accelerated using a multigrid method based on low-order discretizations while still achieving high-order accuracy.
@@ -429,7 +429,7 @@ where $\Phi^{[k]},{\boldsymbol \delta}^{[k]},{\bf r}^{[k]}\in\mathbb{R}^n$ are t
  
  \subsection{Distributed computations via domain decomposition}\label{ch7:sec:dd}\index{domain decomposition}
  
-Numerical modeling of large ocean areas to account for nonlinear wave-wave interactions and wave-structure interactions requires large degrees of spatial resolution, significant computational resources, and parallel computations to be practical. The recent generations of programmable GPUs are heavily optimized for on-chip bandwidth performance but not capacity. This implies that for the solution of large-scale PDE problems, distributed computing on multiple GPU devices is required due to limited capacity in the global memory space of current GPUs. Via a combination of MPI and CUDA we have recently demonstrated how both small and large systems can be solved efficiently by heterogenous computations using a data domain decomposition technique in parallel \cite{ch7:GlimbergEtAl2012}. The idea is to distribute the computational tasks to multiple GPUs, to enable reduced computational times and increased problem sizes. A homogenous partitioning of the data is used to ensure that the load balance across the GPUs is uniform. Data distribution and message passing introduce a data transfer bottleneck in the form of the Peripheral Component Interconnect express (PCIe) link and network interconnection. Thus, if the computational intensity of the local problem is not large enough to enable sufficient latency hiding of this bottleneck, the whole application is likely to be (severely) limited by the PCIe link or network bandwidth performance rather than the high on-chip bandwidth of the individual GPUs.
+Numerical modeling of large ocean areas to account for nonlinear wave-wave interactions and wave-structure interactions requires large degrees of spatial resolution, significant computational resources, and parallel computations to be practical. The recent generations of programmable GPUs are heavily optimized for on-chip bandwidth performance but not capacity. This implies that for the solution of large-scale PDE problems, distributed computing on multiple GPU devices is required due to limited capacity in the global memory space of current GPUs. Via a combination of MPI and CUDA we have recently demonstrated how both small and large systems can be solved efficiently by heterogeneous computations using a data domain decomposition technique in parallel \cite{ch7:GlimbergEtAl2012}. The idea is to distribute the computational tasks to multiple GPUs, to enable reduced computational times and increased problem sizes. A homogenous partitioning of the data is used to ensure that the load balance across the GPUs is uniform. Data distribution and message passing introduce a data transfer bottleneck in the form of the Peripheral Component Interconnect express (PCIe) link and network interconnection. Thus, if the computational intensity of the local problem is not large enough to enable sufficient latency hiding of this bottleneck, the whole application is likely to be (severely) limited by the PCIe link or network bandwidth performance rather than the high on-chip bandwidth of the individual GPUs.
  
  The ratio between necessary data transfers and computational work for the proposed numerical model for free surface water waves is high enough to expect reasonable latency hiding. The data domain decomposition method consists of a logically structured division of the computational domain into multiple subdomains. Each of these subdomains is connected via fictitious ghost layers at the artificial boundaries of width corresponding to the half-width of the finite difference stencils employed. This results in a favorable volume-to-boundary ratio as the problem size increases, and diminishing communication overhead for message passing. Information between subdomains is exchanged through ghost layers at every step of the iterative PDC method, in connection with the matrix-vector evaluation for the $\sigma$-transformed Laplace problem, and before relaxation steps in the multigrid method. A single global synchronization point occurs at most once each iteration, if convergence is monitored, where a global reduction step (inner product) between all processor nodes takes place. The main advantage of this decomposition strategy is that the decomposition into multiple subdomains is straightforward. However, it comes with the cost of extra data transfers to update the set of fictitious ghost layers.
  
@@ -444,7 +444,7 @@ The ratio between necessary data transfers and computational work for the propos
      {\scriptsize\input{Chapters/chapter7/figures/TeslaK20SpeedupGPUvsCPU3D.tikz}}
      }
      \end{center}
-    \caption[Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed and double precision arithmetics.]{Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed, and double precision arithmetics. Three-dimensional nonlinear waves, using sixth order finite difference approximations, preconditioned with one multigrid V-cycle and with one pre- and post- red-black Gauss-Seidel smoothing operation. Speedup compared to fastest known serial implementation. Using Test environment 4, CPU timings represent starting points for our investigations and have been obtained using the Fortran 90 code. These references results are based on a single-core (non-parallel) run on a Intel Core i7, 2.80GHz processor.}\label{ch7:fig:perftimings}
+    \caption[Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed, and double precision arithmetics.]{Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed, and double precision arithmetics. Three-dimensional nonlinear waves, using sixth order finite difference approximations, preconditioned with one multigrid V-cycle and with one pre- and post- red-black Gauss-Seidel smoothing operation. Speedup compared to fastest known serial implementation. Using test environment 4, CPU timings represent starting points for our investigations and have been obtained using the Fortran 90 code. These references results are based on a single-core (nonparallel) run on a Intel Core i7, 2.80GHz processor.}\label{ch7:fig:perftimings}
  \end{figure}
  
  The parallel domain decomposition solver has been validated against the sequential solvers with respect to algorithmic efficiency to establish that the code produces correct results. An analysis of the numerical efficiency has also been carried out on different GPU systems to identify comparative behaviors as both the problems sizes and number of compute nodes vary. For example, performance scalings on Test environment 1 and Test environment 3 are presented in Figure \ref{ch7:fig:multigpuperformance}. The figure confirms that there is only a limited benefit from using multiple GPUs for small problem sizes, since the computational intensity is simply too low to efficiently hide the latency of message passing. A substantial speedup is achieved compared to the single GPU version, while being able to solve even larger systems.
@@ -469,12 +469,12 @@ Future work involves extending the domain decomposition method to include suppor
  
  \subsection{Assembling the wave model from library components}
  
-It is described in Chapter \ref{ch5} how we have developed a heterogenous library has for fast prototyping of PDE solvers, utilizing the massively parallel architecture of CUDA-enabled GPUs. The objective is to provide a set of generic components within a single framework, such that software developers can assemble application-specific solvers efficiently at a high abstraction level, requiring a minimum of CUDA specific kernel implementations and parameter tuning.
+It is described in Chapter \ref{ch5} how we have developed a heterogeneous library has for fast prototyping of PDE solvers, utilizing the massively parallel architecture of CUDA-enabled GPUs. The objective is to provide a set of generic components within a single framework, such that software developers can assemble application-specific solvers efficiently at a high abstraction level, requiring a minimum of CUDA specific kernel implementations and parameter tuning.
  
-The CUDA-based numerical wave model has been developed based on all the numerical techniques described in preceding sections. These techniques are a part of the library implemented as generic components, which makes them useful for the numerical solutions of PDE systems. The components of the numeral model as described in Section \ref{ch7:sec:nummodel} include an ERK4 time integrator, flexible-order finite difference approximations on regular grids, and an iterative multigrid PDC solver for the $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}. Application developers can use either these components directly from the library or are free to combine their own implementations with library components, if they need alternative strategies that are not present in the library.
+The CUDA-based numerical wave model has been developed based on all the numerical techniques described in preceding sections. These techniques are a part of the library implemented as generic components, which makes them useful for the numerical solutions of PDE systems. The components of the numeral model as described in Section \ref{ch7:sec:nummodel} include an ERK4 time integrator, flexible-order finite difference approximations on regular grids, and an iterative multigrid PDC solver for the $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}. Application developers either either can use  these components directly from the library or are free to combine their own implementations with library components, if they need alternative strategies that are not present in the library.
  
  For the unified potential flow model the user will need to provide implementations of the following components: the right-hand side operator for the semidiscrete free surface variables \eqref{ch7:FSorigin}, the matrix-vector operator for the discretized $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}, a smoother for the multigrid relaxation step, and the potential flow solver itself, that reads initial data and advances the solution in time. In order to make the library as generic as possible, all components are template-based, which makes it possible to assemble the PDE solver by combining type definitions in the preamble of the application. An excerpt of the potential flow assembling is given in Listing \ref{ch7:lst:solversetup}.
-
+\pagebreak
  \lstset{label=ch7:lst:solversetup,caption={generic assembling of the potential flow solver for fully nonlinear free surface water waves}
  %,basicstyle=\scriptsize
  }
@@ -510,7 +510,7 @@ typedef free_surface::potential_flow_solver_3d<potential_flow_types> potential_f
  
  Hereafter, the potential flow solver is aware of all component types that should be used to solve the entire PDE system, and it will be easy for developers to exchange parts at later times. The \texttt{laplace\_sigma\_stencil\_3d} class implements both the matrix-vector and right-hand side operator. The flexible-order finite difference kernel for the matrix-free matrix-vector product for the two-dimensional Laplace problem is presented in Listing \ref{ch7:lst:fd2d}. Library macros and reusable kernel routines are used throughout the implementations to enhance developer productivity and hide hardware specific details. This kernel can be used both for matrix-vector products for the original system and for the preconditioning.
  
-\lstset{label=ch7:lst:fd2d,caption={CUDA kernel implementation for the two dimensional finite difference approximation to the transformed Laplace equation}
+\lstset{label=ch7:lst:fd2d,caption={CUDA kernel implementation for the two-dimensional finite difference approximation to the transformed Laplace equation}
  %,basicstyle=\scriptsize\ttfamily
  }
  \begin{lstlisting}
@@ -683,7 +683,7 @@ E_c(N_x,N_z,kh,h/L) & = \frac{c^2}{c_s^2},
  E_m(N_x,N_z,kh,h/L) &= \frac{1}{h}\int_{-h}^\eta \left( \frac{\phi(z)-\phi_s(z)}{\phi_s(z)} \right)^2 dz,
  \end{align}
  \end{subequations}
-where $m$ is one of the scalar functions $\phi,u,w$ describing kinematics, $c$ is the numerical phase celerity of regular waves, and $c_s=\sqrt{g\tanh(kh)/k}$ is the exact phase celerity according to linear Stokes Theory \cite{ch7:SJ01}.  Measurements of the errors are taken in the vertical below the crest of a wave which is well resolved in the horizontal direction.
+where $m$ is one of the scalar functions $\phi,u,w$ describing kinematics; $c$ is the numerical phase celerity of regular waves; and $c_s=\sqrt{g\tanh(kh)/k}$ is the exact phase celerity according to linear Stokes Theory \cite{ch7:SJ01}.  Measurements of the errors are taken in the vertical below the crest of a wave which is well resolved in the horizontal direction.
  
  % kinematicAnalysis.m
  \begin{figure}[!htb]
@@ -696,7 +696,7 @@ where $m$ is one of the scalar functions $\phi,u,w$ describing kinematics, $c$ i
  \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/lineardispersion_Nx30-HL90-p6_Linear-eps-converted-to.pdf}
  }
  \end{center}
-\caption{The accuracy in phase celerity $c$ determined by \eqref{ch7:errdisp} for small-amplitude (linear) wave.
+\caption[The accuracy in phase celerity $c$ determined by \eqref{ch7:errdisp} for small-amplitude (linear) wave.]{The accuracy in phase celerity $c$ determined by \eqref{ch7:errdisp} for small-amplitude (linear) wave.
  $N_z\in[6,12]$. Sixth order scheme.}
  \label{ch7:figlinear}
  \end{figure}
@@ -731,11 +731,11 @@ The numerical model detailed has been subject to careful verification and valida
  
  \subsection{On precision requirements in engineering applications}\index{precision}
  
-Practical engineering applications are widely used for analysis purposes and give support to decision making in engineering design. For engineering purposes the turn-around time for producing analysis results is of crucial importance as it affects cost-benefit of work efforts. The key interest is often just "engineering  accuracy" in computed end results which suggests that we can do with less precision in calculations. One may ask: {\em what are the precision requirements for engineering applications?} 
+Practical engineering applications are widely used for analysis purposes and give support to decision making in engineering design. For engineering purposes the turn-around time for producing analysis results is of crucial importance as it affects cost-benefit of work efforts. The key interest is often just ``engineering  accuracy'' in computed end results which suggests that we can do with less precision in calculations. One may ask: {\em what are the precision requirements for engineering applications?} 
  
-In a recent study \cite{ch7:EngsigKarupEtAl2011,ch7:Glimberg2011}, it was shown that the PDC method when executed on GPUs can be utilized to efficiently solve water wave problems. This was done by trading accuracy for speed in parts of the PDC algorithm, e.g., by using either single, or mixed-precision\index{precision!mixed} computations. Without preconditioning the PDC method reduces to a classical iterative refinement technique, which is known to be fault tolerant \cite{ch7:Higham:2002:ASN}.
+In a recent study \cite{ch7:EngsigKarupEtAl2011,ch7:Glimberg2011}, it was shown that the PDC method when executed on GPUs can be utilized to efficiently solve water-wave problems. This was done by trading accuracy for speed in parts of the PDC algorithm, e.g., by using either single, or mixed-precision\index{precision!mixed} computations. Without preconditioning the PDC method reduces to a classical iterative refinement technique, which is known to be fault tolerant \cite{ch7:Higham:2002:ASN}.
  
-Previously reported performance results for the wave model can be taken a step further. We seek to demonstrate how single precision computations can be used for engineering analysis without significantly affecting accuracy in final computational results. At the same time improvements in computational speed can be almost a factor of two for large problems as a direct result of reduced data transfer, cf. Figure \ref{ch7:fig:perftimings}. Therefore, in pursue of high performance, it is of interest to exploit the reduced data transfers associated with replacing double-precision with single-precision floating point calculations. In a well-organized code this step can be taken with minimal programming effort.
+Previously reported performance results for the wave model can be taken a step further. We seek to demonstrate how single-precision computations can be used for engineering analysis without significantly affecting accuracy in final computational results. At the same time improvements in computational speed can be almost a factor of two for large problems as a direct result of reduced data transfer, cf. Figure \ref{ch7:fig:perftimings}. Therefore, in pursue of high performance, it is of interest to exploit the reduced data transfers associated with replacing double-precision with single-precision floating point calculations. In a well-organized code this step can be taken with minimal programming effort.
  %
  \begin{figure}[!htb]
  \begin{center}
@@ -751,7 +751,7 @@ Previously reported performance results for the wave model can be taken a step f
  \label{ch7:convhist}
  \end{figure}
  
-Most scientific applications \cite{ch7:lessismore} use double-precision calculations to minimize accumulation of round-off errors, to employ higher precision for ill-conditioned problems, or to stabilize critical sections in the code that require higher precision. Round-off errors tend to accumulate more slowly when higher precision is used, thereby avoiding significant losses of accuracy due to round-off errors. Paradoxically, for many computational tasks, the need for high precision connected with the above-mentioned restrictions does not apply at all, or only applies to a portion of the tasks. In addition, on modern hardware there can be relative differences in peak floating point performance of up to about one order of magnitude in favor of single-precision over double-precision math calculations depending on choice of hardware architecture. However, for bandwidth-bound applications, the key performance metric is not floating-point performance, but rather bandwidth performance. In both cases, data transfer can be effectively halved by switching to single precision storage, in which case bandwidth performance increases and at the same time makes it possible to feed floating point units with data at effectively twice the speed. If maximizing performance is an ultimate goal, such considerations suggests that it can be possible to compute faster by using single- over double-precision arithmetics if accuracy requirements can be fulfilled.
+Most scientific applications \cite{ch7:lessismore} use double-precision calculations to minimize accumulation of round-off errors, to employ higher precision for ill-conditioned problems, or to stabilize critical sections in the code that require higher precision. Round-off errors tend to accumulate more slowly when higher precision is used, thereby avoiding significant losses of accuracy due to round-off errors. Paradoxically, for many computational tasks, the need for high precision connected with the above-mentioned restrictions does not apply at all, or only applies to a portion of the tasks. In addition, on modern hardware there can be relative differences in peak floating-point performance of up to about one order of magnitude in favor of single-precision over double-precision math calculations depending on choice of hardware architecture. However, for bandwidth-bound applications, the key performance metric is not floating-point performance, but rather bandwidth performance. In both cases, data transfer can be effectively halved by switching to single-precision storage, in which case bandwidth performance increases and at the same time makes it possible to feed floating-point units with data at effectively twice the speed. If maximizing performance is an ultimate goal, such considerations suggests that it can be possible to compute faster by using single- over double-precision arithmetics if accuracy requirements can be fulfilled.
  %
  
  As a simple illustrative numerical experiment, we can consider the iterative solution of \eqref{ch7:eq:linsys} using the PDC method using single- and double-precision math, respectively. As a simple test case, we consider the solution of periodic stream function waves in two spatial dimensions. Computed convergence histories are presented in Figure \ref{ch7:convhist} where it is clear that the main difference is the attainable accuracy level achievable before stagnation. In both cases, the attainable accuracy, defined in terms of the absolute error for the exact stream function solution to the governing equations, is associated with accuracy of approximately $10^{-4}$ (solid line). In single-precision math, the algebraic and estimated errors measured in the two-norm can reach the level of machine precision. These results suggest that single-precision math is sufficient for calculating accurate solutions at the chosen spatial resolution. We find that the iterative solution of the $\sigma$-transformed Laplace problem by the PDC method does not immediately lead to significant accumulation of round-off errors and further investigations are warranted for unsteady computations.
@@ -761,9 +761,9 @@ Elaborating on this example, we examine the propagation of a regular stream func
  \label{ch7:filter}
  \mathcal{F}u(x_i) = \sum_{n=-\alpha}^{\alpha} c_n u(x_{i+n}),
  \end{align}
-where $c_n\in\mathbb{R}$ are the stencil coefficients and $\alpha\in\mathbb{Z}_+$ is the stencil half-width. An active filter can for example be based on employing a Savitzky-Golay smoothening filter \cite{ch7:PT90}, e.g., the mild 7-point SG(6,10) filter, and applying it after every 10th time step to each of the collocation nodes defining the free surface variables. The same procedure can be used for stabilization of nonlinear simulations to remove high-frequency "saw-tooth" instabilities as shown in \cite{ch7:EBL08}. This filtering technique can also remove high-frequency noise resulting from round-off errors in computations that would otherwise potentially pollute the computational results and in the worst case leave them useless. The effect of this type of filtering on the numerical efficiency of the model is insignificant.
+where $c_n\in\mathbb{R}$ are the stencil coefficients and $\alpha\in\mathbb{Z}_+$ is the stencil half-width. An active filter can for example be based on employing a Savitzky-Golay smoothening filter \cite{ch7:PT90}, e.g., the mild 7-point SG(6,10) filter, and applying it after every 10th time step to each of the collocation nodes defining the free surface variables. The same procedure can be used for stabilization of nonlinear simulations to remove high-frequency ``saw-tooth'' instabilities as shown in \cite{ch7:EBL08}. This filtering technique can also remove high-frequency noise resulting from round-off errors in computations that would otherwise potentially pollute the computational results and in the worst case leave them useless. The effect of this type of filtering on the numerical efficiency of the model is insignificant.
  
-Results from numerical experiments are presented in Figure \ref{ch7:filtering}, and most of the errors can be attributed to phase errors resulting from difference in exact versus numerical phase speed. In numerical experiments, we find that while results computed in double-precision are not significantly affected by accumulation of round-off errors, the single-precision results are. In figures \ref{ch7:filtering} (a) and (b), a direct solver based on sparse Gaussian elimination within MATLAB\footnote{\url{http://www.mathworks.com}.} is used to solve the linear system at every stage and a comparison is made between single and unfiltered double-precision calculations. It is shown in Figure \ref{ch7:filtering} a) that without a filter, the single-precision calculations result in "blow-up" after which the solver fails just before 50 wave periods of calculation time. However, in Figure \ref{ch7:filtering} (b) it is demonstrated that invoking a smoothening filter, cf. \eqref{ch7:filter}, stabilizes the accumulation of round-off errors and the calculations continue and achieve reduced accuracy compared to the computed double-precision results. Thus, it is confirmed that such a filter can be used to control and suppress high-frequency oscillations that results from accumulation of round-off errors. In contrast, replacing the direct solver with an iterative PDC method using the GPU-accelerated wave model appears to be much more attractive upon inspection of Figures \ref{ch7:filtering} (c) and (d). The single-precision results are found to be stable with and {\em without} the filter-based strategy for this problem. The calculations show that single-precision math leads to slightly faster error accumulation for this choice of resolution, however, with only small differences in error level during long time integration. This highlights that fault-tolerance of the iterative PDC method contributes to securing robustness of the calculations.
+Results from numerical experiments are presented in Figure \ref{ch7:filtering}, and most of the errors can be attributed to phase errors resulting from difference in exact versus numerical phase speed. In numerical experiments, we find that while results computed in double-precision are not significantly affected by accumulation of round-off errors, the single-precision results are. In Figures \ref{ch7:filtering} (a) and (b), a direct solver based on sparse Gaussian elimination within MATLAB\footnote{\url{http://www.mathworks.com}.} is used to solve the linear system at every stage and a comparison is made between single- and unfiltered double-precision calculations. It is shown in Figure \ref{ch7:filtering} a) that without a filter, the single-precision calculations result in ``blow-up'' after which the solver fails just before 50 wave periods of calculation time. However, in Figure \ref{ch7:filtering} (b) it is demonstrated that invoking a smoothening filter, cf. \eqref{ch7:filter}, stabilizes the accumulation of round-off errors and the calculations continue and achieve reduced accuracy compared to the computed double-precision results. Thus, it is confirmed that such a filter can be used to control and suppress high-frequency oscillations that results from accumulation of round-off errors. In contrast, replacing the direct solver with an iterative PDC method using the GPU-accelerated wave model appears to be much more attractive upon inspection of Figures \ref{ch7:filtering} (c) and (d). The single-precision results are found to be stable with and {\em without} the filter-based strategy for this problem. The calculations show that single-precision math leads to slightly faster error accumulation for this choice of resolution, however, with only small differences in error level during long time integration. This highlights that fault-tolerance of the iterative PDC method contributes to securing robustness of the calculations.
  
  \begin{figure}[!htb]
  \begin{center}
@@ -781,15 +781,16 @@ Results from numerical experiments are presented in Figure \ref{ch7:filtering},
  \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/ComparisonDCWithFiltering-eps-converted-to.pdf}
  }
  \end{center}
-\caption[Comparison of accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering.]{Comparison between accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering. The double-precision result is unfiltered in each comparison and shows to be less sensitive to round-off errors. Medium steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0502)$. Discretization is based on $(N_x,N_z)=(30,6)$, A courant number of $C_r=0.5$, and sixth order stencils.}
+\caption[Comparison of accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering.]{Comparison of accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering. The double-precision result is unfiltered in each comparison and shows to be less sensitive to round-off errors. Medium steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0502)$. Discretization is based on $(N_x,N_z)=(30,6)$, a courant number of $C_r=0.5$, and sixth order stencils.}
  \label{ch7:filtering}
  \end{figure}
  
-Last, we demonstrate using a classical benchmark for propagation of nonlinear waves over a semicircular shoal that single-precision math is likely to be sufficient for achieving engineering accuracy. The benchmark is based on Whalin's experiment~\cite{ch7:Whalin1971} which is often used in validation of dispersive water wave models for coastal engineering applications, e.g., see previous work \cite{ch7:EBL08}. Experimental results exists for incident waves with wave periods $T=1,2,3\,$s and wave heights $H=0.0390, 0.0150, 0.0136\,$m. All three test cases have been discretized with a computational grid of size ($257 \times 41 \times 7$) to resolve the physical dimensions of $L_x=35\,$m, $L_y=6.096\,$m. The still water depth decreases in the direction of the incident waves as a semicircular shoal from $0.4572\,$m to $0.1524\,$m with an illustration of a snapshot of the free surface given in Figure \ref{ch7:fig:whalinsetup}. The time step $\Delta t$ is computed based on a constant Courant number of $Cr=c\Delta x/\Delta t=0.8$, where $c$ is the incident wave speed and $\Delta x$ is the grid spacing. Waves are generated in the generation zone $0 \leq x/L \leq 1.5$, where $L$ is the wave length of incident waves, and absorbed again in the zone $35 - 2L \leq x \leq 35\,$m.
+Last, we demonstrate using a classical benchmark for propagation of nonlinear waves over a semicircular shoal that single-precision math is likely to be sufficient for achieving engineering accuracy. The benchmark is based on Whalin's experiment~\cite{ch7:Whalin1971} which is often used in validation of dispersive water wave models for coastal engineering applications, e.g., see previous work \cite{ch7:EBL08}. Experimental results exist for incident waves with wave periods $T=1,2,3\,$s and wave heights $H=0.0390, 0.0150, 0.0136\,$m. All three test cases have been discretized with a computational grid of size ($257 \times 41 \times 7$) to resolve the physical dimensions of $L_x=35\,$m, $L_y=6.096\,$m. The still water depth decreases in the direction of the incident waves as a semicircular shoal from $0.4572\,$m to $0.1524\,$m with an illustration of a snapshot of the free surface given in Figure \ref{ch7:fig:whalinsetup}. The time step $\Delta t$ is computed based on a constant Courant number of $Cr=c\Delta x/\Delta t=0.8$, where $c$ is the incident wave speed and $\Delta x$ is the grid spacing. Waves are generated in the generation zone $0 \leq x/L \leq 1.5$, where $L$ is the wave length of incident waves, and absorbed again in the zone $35 - 2L \leq x \leq 35\,$m.
  %
  
  A harmonic analysis of the wave spectrum at the shoal center line is computed and plotted in Figure \ref{ch7:whalinresults} for comparison with the analogous results obtained from the experiments data. The three harmonic amplitudes are computed via a Fast Fourier Transform (FFT) method using the last three wave periods up to $t=50\,$s. There is a satisfactory agreement between the computed and experimental results and no noticeable loss in accuracy resulting from the use of single-precision math.
-%
+
+\pagebreak
  \begin{figure}[!htb]
      \setlength\figureheight{0.3\textwidth}
      \setlength\figurewidth{0.32\textwidth}
@@ -797,17 +798,17 @@ A harmonic analysis of the wave spectrum at the shoal center line is computed an
      \subfigure[Whalin test case at $t=50$s, wave period $T=2$s, and grid dimensions ($257 \times 41 \times 7$)]{\label{ch7:fig:whalinsetup}
      \includegraphics[width=0.5\textwidth]{Chapters/chapter7/figures/WhalinWaveSol_t50_T2_single.pdf}
      }
-    \subfigure[$T=1s$]{
+    \subfigure[$T=1$s]{
      {\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T1_single.tikz}}
      }
-    \subfigure[$T=2s$]{
+    \subfigure[$T=2$s]{
      {\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T2_single.tikz}}
      }
-    \subfigure[$T=3s$]{
+    \subfigure[$T=3$s]{
      {\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T3_single.tikz}}
      }
  %    \end{center}
-    \caption[Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s.]{Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s respectively. Measured experimental and computed results (single-precision) are in good agreement. Test environment 1.}\label{ch7:whalinresults}
+    \caption[Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s.]{Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s. Measured experimental and computed results (single-precision) are in good agreement. Test environment 1.}\label{ch7:whalinresults}
  \end{figure}
  
  \subsection{Acceleration via parallelism in time using parareal}\label{ch7:parareal}\index{parareal}
@@ -822,7 +823,7 @@ The parareal algorithm has been introduced as a component in our in-house GPU li
  In Section \ref{ch5:parareal} it is assumed that communication costs can be neglected and a simple model for the algorithmic work complexity is derived. It is found that there are four key discretization parameters for parareal that need to be balanced appropriately in order to achieve high parallel efficiency: the number of coarse-grained time intervals $N$, the number of iterations $K$, the ratio between the computational cost of the coarse to the fine propagator $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$, and the ratio between fine and coarse time step sizes $\delta t/\delta T$.
  
  % How to obtain speed-up
-Ideally, the ratio $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$ is small and convergence happens in $k=1$ iteration. This is seldom the case though, as it requires the coarse propagator to achieve accuracy close to that of the fine propagator while at the same time being substantially cheaper computationally, these two objectives obviously being conflicting. Obtaining the highest possible speed-up is a matter of trade-off, typically, the more GPUs used, the faster the coarse propagator should be. The performance of parareal is problem- and discretization-dependent and as such one would suspect that different wave parameters influence the suitability of the method. This was investigated in \cite{ch7:ASNP12} and indeed the performance does change with wave parameters. Typically the method works better for deep water waves with low to medium wave amplitudes.
+Ideally, the ratio $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$ is small and convergence happens in $k=1$ iteration. This is seldom the case though, as it requires the coarse propagator to achieve accuracy close to that of the fine propagator while at the same time being substantially cheaper computationally, these two objectives obviously being conflicting. Obtaining the highest possible speed-up is a matter of trade-off, typically, the more GPUs used, the faster the coarse propagator should be. The performance of parareal is problem- and discretization-dependent and as such one would suspect that different wave parameters influence the suitability of the method. This was investigated in \cite{ch7:ASNP12} and indeed the performance does change with wave parameters. Typically the method works better for deep water waves with low- to medium-wave amplitudes.
  %
  \begin{figure}[!htb]
      \begin{center}
@@ -842,12 +843,12 @@ Ideally, the ratio $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$ is small an
  %
  
  % What did we do and what are the results
-We have performed a scalability study for parareal using 2D nonlinear stream function waves based on a discretization with $(N_x,N_z)=(33,9)$ collocation points, cf. Figure \ref{ch7:fig:DDPA_SPEEDUP}. The study shows that moderate speedup is possible for this hyperbolic system. Using four GPU nodes, a speedup of slightly more than two was achieved while using sixteen GPU nodes resulted in a speedup of slightly less than five. As noticed in Figure \ref{ch7:fig:DDPA_SPEEDUP}, parallel efficiency decreases quite fast when using more GPUs. This limitation is due to the usage of a fairly slow and accurate coarse propagator and is linked to a known difficulty with parareal applied to hyperbolic systems. For hyperbolic systems, instabilities tend to arise when using a very inaccurate coarse propagator. This prevents using a large number of time subdomains, as this by Amdahl's law also requires a very fast coarse propagator. The numbers are still impressive though, considering that the speedup due to parareal comes as additional speedup to an already efficient and fast code.
-Performance results for the Whalin test case are also shown in Figure \ref{ch7:fig:whalinparareal}. There is a natural limitation to how much we can increase $R$ (the ratio between the complexity of the fine and coarse propagators), because of stability issues with the coarse propagator. In this test case we simulate from $t=[0,1]$s, using up to $32$ GPUs. For low $R$ and only two GPUs, there is no speedup gain, but for the configuration with eight or more GPUs and $R\geq6$, we are able to get more than $2$ times speedup. Though these hyperbolic systems are not optimal for performance tuning using the parareal method, results still confirm that reasonable speedups are in fact possible on heterogenous systems.
+We have performed a scalability study for parareal using 2D nonlinear stream function waves based on a discretization with $(N_x,N_z)=(33,9)$ collocation points, cf. Figure \ref{ch7:fig:DDPA_SPEEDUP}. The study shows that moderate speedup is possible for this hyperbolic system. Using four GPU nodes, a speedup of slightly more than two was achieved while using sixteen GPU nodes resulted in a speedup of slightly less than five. As noticed in Figure \ref{ch7:fig:DDPA_SPEEDUP}, parallel efficiency decreases quite fast when using more GPUs. This limitation is due to the usages of a fairly slow and accurate coarse propagator and is linked to a known difficulty with parareal applied to hyperbolic systems. For hyperbolic systems, instabilities tend to arise when using a very inaccurate coarse propagator. This prevents using a large number of time subdomains, as this by Amdahl's law also requires a very fast coarse propagator. The numbers are still impressive though, considering that the speedup due to parareal comes as additional speedup to an already efficient and fast code.
+Performance results for the Whalin test case are also shown in Figure \ref{ch7:fig:whalinparareal}. There is a natural limitation to how much we can increase $R$ (the ratio between the complexity of the fine and coarse propagators), because of stability issues with the coarse propagator. In this test case we simulate from $t=[0,1]$s, using up to $32$ GPUs. For low $R$ and only two GPUs, there is no speedup gain, but for the configuration with eight or more GPUs and $R\geq6$, we are able to get more than $2$ times speedup. Though these hyperbolic systems are not optimal for performance tuning using the parareal method, results still confirm that reasonable speedups are in fact possible on heterogeneous systems.
  
  \begin{figure}[!htb]
-    \setlength\figureheight{0.3\textwidth}
-    \setlength\figurewidth{0.32\textwidth}
+    \setlength\figureheight{0.29\textwidth}
+    \setlength\figurewidth{0.29\textwidth}
  %    \begin{center}
      \subfigure[Speedup]{
      {\small\input{Chapters/chapter7/figures/WhalinPararealSpeedup.tikz}}
@@ -868,7 +869,7 @@ Finally, we remark that the parareal algorithm is also a fault-tolerant algorith
  %\newpage
  \subsection{Towards real-time interactive ship simulation}\index{real-time simulation}
  
-A fast GPU-accelerated ship hydrodynamic model is developed for real-time interactive ship simulation by modification of the unified potential flow model presented in Section \ref{ch7:goveq}. The target scientific application is an interactive full mission marine simulator, where multiple ships controlled by naval officers can navigate in a near-realistic virtual marine environment. Full mission simulators are used for education and training of naval officers in critical manoeuvring operations and for evaluation of ship and marine infrastructure designs. To predict the motion of ships, a hydrodynamics model is required for prediction of forces by \eqref{ch7:forcecalc} which is affected by the kinematic properties of the model, cf. Section \ref{ch7:dispkin}. The state-of-the-art for such a hydrodynamic model in today's real-time ship simulators is based on fast interpolation and proper scaling of experimental model data. The amount of experimental model data is limited with respect to hull forms and configurations, requiring the need for extrapolation that compromises the accuracy.
+A fast GPU-accelerated ship hydrodynamic model is developed for real-time interactive ship simulation by modification of the unified potential flow model presented in Section \ref{ch7:goveq}. The target scientific application is an interactive full mission marine simulator, where multiple ships controlled by naval officers can navigate in a near-realistic virtual marine environment. Full mission simulators are used for education and training of naval officers in critical maneuvering operations and for evaluation of ship and marine infrastructure designs. To predict the motion of ships, a hydrodynamics model is required for prediction of forces by \eqref{ch7:forcecalc} which is affected by the kinematic properties of the model, cf. Section \ref{ch7:dispkin}. The state-of-the-art for such a hydrodynamic model in today's realtime ship simulators is based on fast interpolation and proper scaling of experimental model data. The amount of experimental model data is limited with respect to hull forms and configurations, requiring the need for extrapolation that compromises the accuracy.
  
  The objective of current and ongoing work is aimed at removing these limitations by replacing the existing hydrodynamic model and instead calculating at full-scale the flow field, wave field, ship-structure, and ship-ship interaction forces in real-time using massive parallel computation technology. The potential flow model (OceanWave3D) presented in Section \ref{ch7:goveq} is suitable as the modeling basis for this purpose since it is robust, accurate, efficient, and scalable to arbitrarily large domains. Furthermore, it can accurately account for dispersive waves in the range from shallow to deep waters in marine settings where the sea bed may be uneven.
  
@@ -901,7 +902,7 @@ where $U$ is the velocity of the ship. The unsteady linear water problem is used
   \end{subequations}
  where the pressure on the ship hull $p_{ship}$ is calculated explicitly based on a quasi-static approximation which is determined by assuming $\partial_t\phi_1\approx0$ and rewriting \eqref{ch7:quasistatic}. In general, a ship hull is a complex surface in three-dimensional space, but its draft can be approximated by a single-valued function of the horizontal coordinates $\eta_0 = \eta_0(x,y)$, and the no-flux condition on the ship hull is approximated by a flat-ship approximation. Radiation boundary conditions are approximated by a Sommerfelt absorbing boundary condition \cite{ch7:DgayguiJolySJAM1994} on the vertical sides of the physical domain to let waves escape the domain.
  
-The modified numerical model can still be based on flexible-order finite difference method as discussed in Section \ref{ch7:sec:nummodel}. The computational bottleneck problem is the efficient solution of the Laplace problem twice which can be done efficiently by the GPU-accelerated iterative PDC method as explained in section \ref{ch7:PDCmethod}. A snapshot of the steady state wave field is provided in the introduction to this chapter. Computed resistance curves for a Series 60 hull moving at forward speed corresponding to Froude number $F_n=0.316$ knots in calm water are compared to experimental data \cite{ch7:TodaEtAl1992} in Figure \ref{ch7:fig:shiphydro} (a). The computed Kelvin wave system is shown in Figure \ref{ch7:fig:shiphydro} (b). The computed results compare well with experiments at moderate ship Froudes numbers $F_n=U/\sqrt{gh}$ in the range 0.1-0.25 as expected for a linear model. The real-time constraint required to fulfill the interactive and visualization requirements can currently be met with the GPU-accelerated hydrodynamics model for problem sizes of approximately $10^6$ for ship Froude numbers in the range 0.1-0.3. The modeling and real-time aspects will be addressed in more detail in ongoing work.
+The modified numerical model can still be based on flexible-order finite difference method as discussed in Section \ref{ch7:sec:nummodel}. The computational bottleneck problem is the efficient solution of the Laplace problem twice which can be done efficiently by the GPU-accelerated iterative PDC method as explained in section \ref{ch7:PDCmethod}. A snapshot of the steady state wave field is provided in the introduction to this chapter. Computed resistance curves for a Series 60 hull moving at forward speed corresponding to Froude number $F_n=0.316$ knots in calm water are compared to experimental data \cite{ch7:TodaEtAl1992} in Figure \ref{ch7:fig:shiphydro} (a). The computed Kelvin wave system is shown in Figure \ref{ch7:fig:shiphydro} (b). The computed results compare well with experiments at moderate ship Froude numbers $F_n=U/\sqrt{gh}$ in the range 0.1--0.25 as expected for a linear model. The real-time constraint required to fulfill the interactive and visualization requirements can currently be met with the GPU-accelerated hydrodynamics model for problem sizes of approximately $10^6$ for ship Froude numbers in the range 0.1--0.3. The modeling and real-time aspects will be addressed in more detail in ongoing work.
  
  %
  \begin{figure}[!htb]
@@ -921,14 +922,14 @@ The modified numerical model can still be based on flexible-order finite differe
  %\newpage
  \section{Conclusion and future work}
  
-We have presented implementation details together with several novel results on development of a new massively parallel and scalable tool for simulation of nonlinear free surface water waves on heterogenous hardware. The tool is based on the unified potential flow model referred to as OceanWave3D \cite{ch7:EBL08} which provides the basis for efficient and scalable simulation of water waves over uneven bottoms on arbitrary domain sizes. We have demonstrated in a few examples how we can accelerate performance by using single-precision math without comprimising accuracy. We have shown that performance can be accelerated by introducing concurrency in the time integration using the parareal algorithm and for the first time in a heterogenous setup based on the use of multiple GPUs. Interestingly, we find that parallel computations using parareal may be more efficient than using conventional data-parallel distributed computations in a multi-GPU setup for moderate problem sizes. We have measured absolute performance and scalability using several of the most recent generations of NVIDIA GPUs to detail the efficiency of the current code. This is useful to predict time to results as explained in \cite{ch7:EngsigKarupEtAl2011} and may be compared against other wave models in fair comparisons. 
+We have presented implementation details together with several novel results on development of a new massively parallel and scalable tool for simulation of nonlinear free surface water waves on heterogeneous hardware. The tool is based on the unified potential flow model referred to as OceanWave3D \cite{ch7:EBL08} which provides the basis for efficient and scalable simulation of water waves over uneven bottoms on arbitrary domain sizes. We have demonstrated in a few examples how we can accelerate performance by using single-precision math without compromising accuracy. We have shown that performance can be accelerated by introducing concurrency in the time integration using the parareal algorithm and for the first time in a heterogeneous setup based on the use of multiple GPUs. Interestingly, we find that parallel computations using parareal may be more efficient than using conventional data-parallel distributed computations in a multi-GPU setup for moderate problem sizes. We have measured absolute performance and scalability using several of the most recent generations of NVIDIA GPUs to detail the efficiency of the current code. This is useful to predict time to results as explained in \cite{ch7:EngsigKarupEtAl2011} and may be compared against other wave models in fair comparisons. 
  
  Work in progress focuses on extending the governing equations to account for lack of physics such as wave runup and wave breaking. Also, we plan to extend the domain decomposition method to unstructured grids of blocks that can be boundary-fitted to more general bottom-mounted structures to be able to address wave-structure problems, cf. \cite{ch7:EHBM06,ch7:EHBW08}. For example, this will provide the basis for simulations of wave transformations in large harbor areas or predict wave climates in near-coastal areas.
  
  We anticipate that a tool based on the proposed parallel solution strategies will be useful for further advancement in fast and robust analysis techniques and large-scale simulation of free surface wave simulation (e.g., for use as an efficient far-field solver at large scales) and be a basis for next-generation wave models. We also expect that the tool can be useful for hybrid-solution strategies with local flow features possibly resolved by other models and for advancing state-of-the-art in fast physics-based wave-body simulations, e.g., ship-wave interactions in ship simulation where real-time constraints are imposed due to visualization. These subjects will be part of ongoing work addressing application aspects. 
  
-\section{Acknowledgment}
+\section{Acknowledgments}
  
  This work was supported by grant no. 09-070032 from the Danish Research Council for Technology and Production Sciences. A special thank goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests was done in the GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at Center for Computing and Visualization, Brown University, USA. NVIDIA Corporation is acknowledged for generous hardware donations to facilities of the GPUlab.
-
+\clearpage
  \putbib[Chapters/chapter7/biblio7]