BookGPU/Chapters/chapter7/ch7.tex

   1
   2 \chapterauthor{Allan P. Engsig-Karup, Stefan L. Glimberg, Allan S. Nielsen, Ole Lindberg}{Technical University of Denmark}
   3 %\chapterauthor{Stefan L. Glimberg}{Technical University of Denmark}
   4 %\chapterauthor{Allan S. Nielsen}{Technical University of Denmark}
   5 %\chapterauthor{Ole Lindberg}{Technical University of Denmark}
   6
   7 \chapter{Fast hydrodynamics on heterogenous many-core hardware}
   8 \label{ch7}
   9
  10 \begin{figure}[!htb]
  11 \centering
  12 \includegraphics[width=0.95\textwidth]{Chapters/chapter7/figures/figSeries60CB06Type7StedaySnapshot-eps-converted-to.pdf}
  13 \caption{Snapshot of steady state wave field generated by a Series 60 ship hull.}
  14 \end{figure}
  15
  16 \newpage
  17 In this chapter, we use our library for heterogenous and massively parallel GPU implementations. The library is written in Compute Unified Device Architecture (CUDA) C/C++ and a fully nonlinear and dispersive free surface water wave model \cite{ch7:EngsigKarupEtAl2011} is implemented. We describe how flexible-order finite difference\index{finite difference} (stencil) approximations to the partial differential equations of the model can be prototyped using library components provided in an in-house library. In this library hardware-specific implementation details are hidden via template-based components, as described in Chapter \ref{ch5}. We provide details of the modeling basis and important unique numerical properties which have been made tuneable to create a powerful and robust tool that can be tailored for specific purposes in engineering analysis. The model is based on unified potential flow theory and can be applied in scientific applications related to maritime engineering. It can be applied for cost-efficient estimation of broad banded wave propagation, transformation of irregular multidirectional waves over variable depth, kinematics and structural wave loads over large areas and scales.
  18
  19 A main motivation of this work is to deliver exceptional performance to minimize calculation times, using modern parallel hardware technologies in combination with a proper choice of discretization methods and data-local algorithms with optimal complexity. This enable work and memory requirements to grow (scale) linearly with problem size on a suitable hardware system. For the wave model this is achieved by explicit time integration and iterative solution of a large nonsymmetric and sparse linear $\sigma$-transformed Laplace problem. For the latter, we use an iterative Preconditioned Defect Correction (PDC) method, accelerated using a geometric multigrid preconditioning strategy. We use modern programming paradigms in the form of Message Passing Interface (MPI) and CUDA for development of a novel massively parallel wave modelling tool, executable on modern heterogenous many-core hardware.
  20
  21 One purpose of the developed numerical model is to ultimately perform hydrodynamic calculations in the time domain for practical analysis and simulation, e.g., to enable computationally intensive interactive real-time simulations. Realistic interactive simulations are, with present technology and available computational resources, a tremendous challenge in this setting. Yet, our aim is to take a first step in this direction and compute first-order accurate hydrodynamics for near-realistic simulations of unsteady ship-wave dynamics in a large ship simulator, used for training purposes in seakeeping operations. For this type of application, a mandatory ingredient for real-time and interactive simulation is a truly high-performance parallel implementation to ensure data processing in time for interactive visualization and responses. Details of the model properties, implementation, and promising novel combinations of techniques and algorithms for acceleration of performance are presented. Numerical experiments and benchmarks are provided to demonstrate the accuracy and efficiency of the model across recent generations of many-core CUDA-enabled hardware.
  22
  23 \section{On hardware trends and challenges in scientific applications}
  24
  25 During the last two decades we have seen how computer graphics hardware has been developed from fixed pipeline processors with no level of programmability, to flexible high-performance hardware platforms, suitable for general purpose scientific computations other than computer graphics. This trend has contributed to a disruptive breakthrough in high-performance computing on mass-produced commodity hardware and fuelled new opportunities for computational science and engineering for a broad range of scientific as well as modern business applications. This emphasizes the increasingly important role of computers in simulation of real world dynamics \cite{ch7:Keyes201170}. In recent years, the CUDA programming model, based on the standard C/C++ programming language and introduced by NVIDIA Corporation worldwide, has become popular as a proprietary and widely used standard in high performance communities. It is, by design and supported functionality, easy and sufficient to be deployed for wide improvement of existing and new applications across science and engineering fields, that can benefit from the the use of heterogenous hardware.
  26
  27 We should be careful about speculating about the future and extrapolating from current trends. The TOP 500 list\footnote{\url{http://www.top500.org.}} of supercomputers shows that there are some general noticeable hardware trends and gives indication of what to expect in the near future. First, since 2005 we have seen how power constraints and resulting heat dissipation problems forced chip producers to increase the number of cores rather than clock frequency.  Multicore processors have become the new standard and many-core processors are becoming available as a standard in commodity hardware, from personal laptops to supercomputing clusters.
  28
  29 This trend suggests that there will be less fast low-latency memory available per core in the future, favoring data-locality in algorithms. In addition, we have also seen how communication speed to computation speed ratio decreases, making it increasingly difficult to supply data to hungry floating point units. In addition, there will likely be increasing amounts of data to store as a result of increasing processing capabilities. The rapidly increasing floating point performance following Moore's law for transistor production has resulted in a significant memory gap which leaves most scientific applications based on partial differential equations (PDEs) bandwidth bound rather than compute bound. This trend is driven by pure commercial needs and not the needs of high-performance computing. Roads to better performance include standardization of software infrastructure, rethinking algorithms to better exploit memory hierarchies optimally to boost strong scaling properties, increasing locality in algorithms, and introducing as much concurrency and work as possible to both utilize and exploit the many cores. Also, software that can utilize many cores should be fault-tolerant to maximize time to solution for application users. We should also expect to see multiple layers of parallelism that will have to be exploited and possibly autotuned to optimally utilize available hardware resources. This introduces new challenges in compilers, requires programming experts with hardware knowledge, and introduces new trends in software developments to leverage productivity and utilize available computational resources in more optimal ways. We have observed a fundamental paradigm shift of underlying hardware design towards much more heterogeneity and parallelism.
  30
  31 A key problem is that improvements in performance require porting legacy codes\footnote{In the worst case, a legacy code is an undocumented serial code developed a long time ago by a developer no longer around.} to new hardware, and possibly changing algorithms which have been developed for the conventional single core processors decades ago. Without this, it may be impossible to utilize and scale algorithmic work optimally to achieve high performance on modern and emerging hardware. This problem is currently addressed with rapid progress by researchers and industry by development of new optimized libraries that can utilize such new hardware. While we have seen significant improvements in such efforts, and today see much more rapid development of applications, there are still relatively few scientific applications running entirely on heterogenous hardware.
  32
  33 %The main justification for porting or developing application on such hardware is a significant performance band expected (significant)
  34
  35 %This will change with improvements in library software and tools provided by key vendors
  36 %
  37 %However, increasing amounts of applications are utilizing accelerators to parts of their code to gain speedups albeit with less dramatic improvements of performance as one can potentially find by adapting most, if not all of the application to modern hardware.
  38
  39 In this work, we explore some of these trends by developing, by bottom-up-design, a water wave model which can be utilized in maritime engineering and with the intended use on affordable office desktops as well as on more expensive modern compute clusters for engineering analysis purposes.
  40
  41 \section{On modeling paradigms for highly nonlinear and dispersive water waves}
  42 \label{ch7:sec:modernwavemodellingparadigms}
  43
  44 We see the development of new or improved hardware technologies as a key driver for exploring new and revisiting existing approaches that can contribute to next-generation modeling techniques.
  45
  46 For instance, the dominant wave modeling paradigm today for numerical simulation in coastal engineering tools is the use of Boussinesq-type \index{Boussinesq models} formulations for approximate solution of unified potential flow\index{potential flow} equations over varying bathymetry \cite{ch7:MS98}. The use of Boussinesq-type models in engineering tools was pioneered in 1978 by Abott et. al. \cite{ch7:AbottEtAl1978,ch7:AbottEtAl1984} based on the original Boussinesq equations due to Peregrine \cite{ch7:Peregrine1967} for calculations of waves in a harbor area. New formulations for highly nonlinear and dispersive water waves, useful for description of wave propagation in the important application range from deep to shallow areas, have been the subject of intense research for more than two decades. Such higher order formulations can be derived by first introducing an infinite Mclaurin series solution to the Laplace equation as described in \cite{ch7:AMS99}. This technique was later generalized to arbitrary expansion levels \cite{ch7:MBL02}. By analytical truncation of such series solutions, a polynomial variation in the vertical is assumed and provides the basis for efficient higher order Boussinesq-type formulations \cite{ch7:MBS03,ch7:Bingham2009467} for fully nonlinear and dispersive water waves. It is attractive, since it is then possible to eliminate the vertical coordinate in the analytical formulation of the Laplace problem. The resulting approximate model contains higher order derivatives to describe dispersion and these require careful treatment in numerical models. Thus, this truncation procedure inherently limits the practical application range, however, it can be significantly improved via Pad\'e approximations together with the introduction of free parameters for extending the finite application range by mathematical optimization to enhance accuracy in dispersion, kinematics and shoaling characteristics.
  47
  48 Main challenges of Boussinesq-type models are accurate and large-scale simulation of waves propagating towards near-shore from deep to shallow waters through surf zones, while accounting for high-order dispersive and nonlinear effects \cite{ch7:Cavaleri2007603}. Within the last two decades, much research has focused on extending the application range through improved formulations in terms of dispersion, shoaling, kinematic and nonlinear properties. The ultimate high-order Boussinesq-type model due to \cite{ch7:MBS03} was at the time considered a breakthrough in this direction, and since then promising new formulations have been proposed. For example, the methodology behind Boussinesq-type formulations can be extended via a multilayer approach \cite{ch7:LynettEtAl2004a,ch7:LynettEtAl2004b,ch7:ChazelEtAl2010} that makes it possible to achieve a similar range of application and levels of accuracy, but without higher derivatives in the formulation that can cause numerical difficulties.
  49
  50 Boussinesq-type formulations for free surface waves are conventionally evaluated against the unified potential flow theory to evaluate limits to application range and accuracy limits. The use of unified potential theory as a basis for numerical models has traditionally been perceived as too expensive \cite{ch7:Lin2008} to solve in comparison with the typically more efficient Boussinesq-type models. This may be true in a strict comparison between the models, especially with respect to applications towards the more restricted shallow regions. However, this is in spite of the fact that a numerical unified potential flow model can be used for a larger range of practical scientific applications. A unified potential flow model has at most second-order derivatives in the formulation. In a numerical setting it has good opportunities for balancing accuracy and work effort by appropriate tuning of discrete parameters. This comes without a need for changing the underlying wave model to extend application range towards deep waters. Thus, the main problem related to the practical use of a unified model in maritime applications is arguably an issue of numerical efficiency.
  51
  52 To address this issue, we have recently proposed a new approach in a proof-of-concept that combines modern many-core hardware with appropriate numerical and parallel strategies to facilitate efficient, accurate and scalable solution of water wave problems~\cite{ch7:EngsigKarupEtAl2011}. The use of potential theory for unsteady water wave computations can be traced at least back to 1975 \cite{ch7:HausslingVanEseltine1975}, and the fully nonlinear potential equations have been solved using various numerical methods since then, e.g., see reviews \cite{ch7:Yeung1982,ch7:TsaiYue1996,ch7:DiasBridges2006,ch7:Lin2008}. In the context of the finite-difference method, an efficient and scalable second-order geometric multigrid approach was first proposed by Li and Fleming in 1997 \cite{ch7:LiFleming1997}. Since then, the numerical strategy has been significantly improved in several works \cite{ch7:BinghamZhang2007,ch7:EBL08} that have led to more efficient and robust discretization techniques, with the objective of developing a general purpose strategy, that can be used for a broad range of practical maritime applications. Recently, a comparison with a High-Order Spectral (HOS) model \cite{ch7:DucrozetEtAl2011} was also reported to assess  accuracy and relative differences in efficiency on single-core hardware against a superior spectral modeling basis for a numerical wave tank setup in a structured domain with a flat sea bed.
  53
  54 \section{Governing equations}
  55 \label{ch7:goveq}
  56
  57 We describe how, by physical principles via mathematical procedures and assumptions, it is possible to formulate a fully nonlinear and dispersive water wave model, describing incompressible, irrotational and inviscid fluid\index{fluid} flow above an uneven seabed.
  58
  59 Conservation of mass\index{mass conservation} for an infinitely small control volume can be stated as
  60 \begin{align}
  61 \frac{\partial \rho}{\partial t} + \nabla \cdot (\rho {\bf u}) = 0,
  62 \end{align}
  63 where $\rho$ is fluid density, ${\bf u}=(u,v,w)$ is the velocity field vector, and $\nabla=(\partial/\partial x, \partial/\partial y, \partial/\partial z)$ is a Cartesian gradient operator. If we assume that fluid density is constant, i.e., that the fluid is incompressible, the mass continuity equation simplifies to
  64 \begin{align}
  65 \label{ch7:conteq}
  66  \nabla \cdot {\bf u} = 0,
  67 \end{align}
  68 which shows that the divergence of the velocity field is zero everywhere.
  69
  70 Conservation of momentum\index{momentum conservation} for an infinitely small control volume is expressed by the famous Navier-Stokes equations
  71 \begin{align}
  72 \label{ch7:NSeq}
  73 \rho \frac{D{\bf u}}{Dt} = -\nabla p + \mu \nabla^2{\bf u} + {\bf F},
  74 \end{align}
  75 where $p$ is pressure and ${\bf F}$ is the net force vector acting on the fluid volume, assumed to be of the form ${\bf F}=\rho{\bf g}$ with ${\bf g}=(0,0,-g_z)$ accounting for gravitational effects in the vertical direction. This implies that surface tension effects on the free surface are neglected. Exact solutions to the Navier-Stokes equations are in general difficult to obtain, and this motivates the use of numerical methods for direct simulation of fluid flow and if necessary analytical simplifications to account only for physics of interest.
  76
  77 The material derivative for a comoving coordinate system used in Lagrangian formulations
  78 \begin{align}
  79 \frac{D}{Dt}\equiv \frac{\partial }{\partial t} + ({\bf u}\cdot \nabla),
  80 \end{align}
  81 is defined as the sum of a time derivative and a convective term measured in a static (Eularian) coordinate system and accounts for the time rate of change following the motion. Thus, for the velocity vector ${\bf u}$, the total acceleration is defined as
  82 \begin{align}
  83 \label{ch7:momentum}
  84 \frac{D{\bf u}}{Dt}\equiv \frac{\partial {\bf u}}{\partial t} + \frac{1}{2}\nabla {\bf u}^2 +{\boldsymbol \omega} \times {\bf u},
  85 \end{align}
  86 where the curl of the velocity field ${\boldsymbol \omega}\equiv\nabla\times {\bf u}$ is referred to as the vorticity vector field accounting for rotation of fluid particles. If we assume that the flow is irrotational
  87 \begin{align}
  88 \nabla\times {\bf u} = 0,
  89 \end{align}
  90 and make use of the following relationship known from vector calculus
  91 \begin{align}
  92 \frac{1}{2}\nabla({\bf u}\cdot {\bf u}) = ({\bf u}\cdot \nabla){\bf u} + {\bf u}\times (\nabla \times {\bf u}),
  93 \end{align}
  94 we can rewrite \eqref{ch7:momentum} as
  95 \begin{align}
  96 \label{ch7:momentum2}
  97 \frac{D{\bf u}}{Dt} = \frac{\partial {\bf u}}{\partial t} +  \frac{1}{2}\nabla({\bf u}\cdot {\bf u}).
  98 \end{align}
  99 Since the Navier-Stokes equations \eqref{ch7:NSeq} can be derived by application of Newton's second law to an infinitely small fluid volume, we can establish the following. Changes in momentum in an infinitely small control volume of a fluid is simply the sum of forces due to pressure gradients, dissipative forces, gravitational forces, and possibly other forces acting inside the fluid volume.
 100
 101 To accurately simulate propagation of long gravity waves and high Reynolds number flows in the context of maritime applications it is for many applications acceptable to assume that viscous forces are small in comparison with inertial forces. In this case, it is reasonable to assume that the fluid is inviscid and we can neglect the viscous terms in \eqref{ch7:NSeq}. Then we obtain the set of Euler equations
 102 \begin{align}
 103 \label{ch7:momentum3}
 104 \frac{D{\bf u}}{Dt} = - \frac{1}{\rho}\nabla p + \frac{1}{\rho} {\bf F},
 105 \end{align}
 106 which does not account for any losses in energy via dissipative physical mechanisms.
 107
 108 If we introduce a scalar velocity potential function $\phi$ that relates to the velocity components in the following way
 109 \begin{align}
 110 \label{ch7:vectorfield}
 111 {\bf u}\equiv \nabla \phi = \left(\frac{\partial \phi}{\partial x}, \frac{\partial \phi}{\partial y}, \frac{\partial \phi}{\partial z} \right),
 112 \end{align}
 113 the number of unknowns can be lowered and we find that the scalar velocity potential function due to \eqref{ch7:conteq} must satisfy the Laplace equation
 114 \begin{align}
 115 \label{ch7:laplaceeq}
 116 \nabla^2\phi=0.
 117 \end{align}
 118 Solutions to this equation are completely determined by the boundary conditions\index{boundary condition}. Thus, it is possible, given appropriate boundary conditions, to determine the scalar velocity potential $\phi$ in all of the domain by solving the resulting Laplace problem. When the scalar velocity potential function is known, detailed information of the kinematics can immediately be obtained.
 119
 120 With the definition of the vector field \eqref{ch7:vectorfield}, we can collect the momentum equations \eqref{ch7:momentum2} and \eqref{ch7:momentum3} and express the momentum equation as
 121 \begin{align}
 122 \rho\left( \frac{\partial \nabla \phi}{\partial t} + \nabla \left(\frac{1}{2} \nabla\phi \cdot \nabla\phi\right)  \right) = -\nabla p - \nabla(\rho g z),
 123 \end{align}
 124 having assumed that the net force can be decomposed into pressure and gravity forces only. This set of equations can be rewritten as
 125 \begin{align}
 126 \nabla \left[ \rho \frac{\partial \phi}{\partial t} + \rho \frac{1}{2} \nabla\phi \cdot \nabla\phi + p + \rho g z \right] = 0,
 127 \end{align}
 128 and by integration in space we arrive at the unsteady Bernoulli's equation
 129 \begin{align}
 130 \rho \frac{\partial \phi}{\partial t} + \rho \frac{1}{2} \nabla\phi \cdot \nabla\phi + p + \rho g z = G(t),
 131 \end{align}
 132 where $G(t)$ is an arbitrary function of the integration that can be assumed to be zero as it defines only a reference value for the unphysical scalar velocity potential function. Bernoulli's equation is typically used as a dynamic condition for the free fluid surface, expressing that the fluid pressure at the free surface is equal to the pressure in the air above the free surface.
 133
 134 In the following, we assume that the displacement of the free surface $z=\eta({\bf x},t)$ is described in a Cartesian coordinate system with the $xy$-plane located at the still water level and the positive $z$-axis pointing upwards. It is typical to assume a constant atmospheric pressure at the free surface by defining $p=0$ as reference. This leaves us with a dynamic boundary condition for the free surface velocity potential function stated as
 135 \begin{align}
 136 \frac{\partial \phi}{\partial t} = -\frac{1}{2} \nabla\phi \cdot \nabla\phi - g z, \quad z=\eta.
 137 \end{align}
 138 At the free surface, we can determine a kinematic free surface condition by determining the rate of change of the streamline for the surface as
 139 \begin{align}
 140 \frac{\partial \eta}{\partial t} = -\frac{\partial \phi}{\partial x} \frac{\partial \eta}{\partial x} - \frac{\partial \phi}{\partial y} \frac{\partial \eta}{\partial y} +  \frac{\partial \phi}{\partial z}, \quad z=\eta.
 141 \end{align}
 142 Spatial and temporal differentiations of the free surface variables are related by the chain rule
 143 %
 144 \begin{align}
 145 \boldsymbol{\nabla}\tilde{\phi} &= (\boldsymbol{\nabla}\phi)_{z=\eta} + \left(\frac{\partial \phi}{\partial z}\right)_{z=\eta}\boldsymbol{\nabla}\eta,  \\
 146 \frac{\partial \tilde{\phi}}{\partial t}  &= \left( \frac{\partial \phi}{\partial t} \right)_{z=\eta} +\frac{\partial  \eta}{\partial t} \left(\frac{\partial \phi}{\partial z} \right)_{z=\eta},
 147 \end{align}
 148 %
 149 and can be used to transform the free surface problem to variables defined solely at the free surface as
 150 %
 151 \begin{subequations}
 152 \begin{align}
 153 \frac{\partial}{\partial t} \eta &= -\boldsymbol{\nabla}\eta\cdot\boldsymbol{\nabla}\tilde{\phi}+\tilde{w}(1+\boldsymbol{\nabla}\eta\cdot\boldsymbol{\nabla}\eta), \label{ch7:FSeta} \\
 154 \frac{\partial}{\partial t} \tilde{\phi} &= -g\eta - \frac{1}{2}\left(\boldsymbol{\nabla}\tilde{\phi}\cdot\boldsymbol{\nabla}\tilde{\phi}-\tilde{w}^2(1+\boldsymbol{\nabla}\eta\cdot\boldsymbol{\nabla}\eta)\right),
 155 \label{ch7:FSphi}
 156 \end{align}
 157 \label{ch7:FSorigin}
 158 \end{subequations}
 159 %
 160 with the $\boldsymbol{\nabla}$-operator from here conveniently re-defined as a horizontal gradient operator $\boldsymbol{\nabla}=(\partial_x,\partial_y)$ and tilde's used for free surface variables. To solve the set of unsteady free surface equations \eqref{ch7:FSorigin}, we need a closure between the horizontal and vertical free surface velocity variables. This can be established by solving the Laplace equation \eqref{ch7:laplaceeq} in the interior domain together with suitable boundary conditions.
 161
 162 A kinematic bottom condition can be derived by assuming that the fluid particles follow a streamline along the solid sea bed. Consider the rate of change of such a streamline at still-water depth $z=-h({\bf x},t)$ and we find
 163 \begin{align}
 164 \label{ch7:kinbot}
 165 \frac{\partial z}{\partial t} = - \frac{\partial h}{\partial x}  \frac{\partial \phi}{\partial x} - \frac{\partial h}{\partial y}  \frac{\partial \phi}{\partial y} - \frac{\partial h}{\partial t}, \quad z=-h({\bf x},t).
 166 \end{align}
 167 We assume that the sea bed is static allowing us to neglect the last term. Thus, specifying $\tilde\phi$ as a Dirichlet condition at the free surface together with a kinematic bottom boundary condition at the sea bed defines a Laplace problem
 168 %
 169 \begin{subequations}
 170 \label{ch7:eq:laplaceproblem}
 171 \begin{align}
 172 \phi & =  \tilde{\phi}, \quad z = \eta({\bf x},t), \\
 173 \boldsymbol{\nabla}^2\phi + \partial_{zz}\phi & = 0, \quad -h\leq z<\eta({\bf x},t), \label{ch7:Laplace} \\
 174 \partial_z \phi + \boldsymbol{\nabla}h\cdot\boldsymbol{\nabla}\phi &= 0, \quad z=-h, \label{ch7:KB}
 175 \end{align}
 176 \end{subequations}
 177 where we have used $\partial_t z \equiv \partial_z \phi$ to rewrite the first term of the kinematic bottom condition \eqref{ch7:kinbot}.
 178 %
 179 The moving free surface makes the spatial fluid domain $\Omega$ vary in time. The main challenges in solving these equations numerically are dealing with the time-dependent fluid domain and nonlinearity of the equations.
 180 %
 181
 182 \subsection{Boundary conditions}\index{boundary condition}
 183
 184 We consider three types of boundaries, namely, fully reflective boundaries, incident wave boundaries, and absorbing boundaries. The fully reflective boundaries are handled through numerical approximations of the boundary conditions for solid walls and bottom surfaces stating that the velocity in the normal direction is zero
 185 \begin{align}
 186 {\bf n}\cdot \nabla \phi = 0, \quad {\bf x}\in\partial\Omega,
 187 \end{align}
 188 where ${\bf n}=(n_x,n_y)$ is a two-dimensional normal vector pointing outwards from the solid surface. A complementary condition for the free surface elevation variable is
 189 \begin{align}
 190 {\bf n} \cdot \nabla\eta = 0.
 191 \end{align}
 192
 193 Incident wave and absorbing boundary conditions are imposed via an embedded penalty forcing technique as described in Section \ref{ch7:wavegen}.
 194
 195 \section{The numerical model}
 196 \label{ch7:sec:nummodel}
 197
 198 The unified potential flow model is attractive as a basis due to the underlying analytical properties.
 199 From a numerical point of view, an efficient and scalable discretization strategy should be based on using a data-local method, e.g., a flexible-order finite difference method for discretely approximating the governing equations and imposing boundary conditions via fictitious ghost points techniques as described in \cite{ch7:BinghamZhang2007,ch7:EBL08}. Such an approach has several attractive features from a scientific computing perspective. For example, finite difference methods are among the simplest and most efficient methods due to the use of structured grids and data structures. This results in low implementation and computational complexity which maps efficiently to modern computer architectures. Formal accuracy and tuneable numerics are achieved by employing flexible-order finite difference\index{finite difference} (local stencil)\index{stencil} approximations.
 200
 201 We present scalability and performance tests based on the same two test environments outlined in Chapter \ref{ch5}, Section \ref{ch5:sec:testenvironments}, plus a third test environment based on the most recent hardware generation:
 202 \begin{description}
 203 \item[Test environment 4.] Desktop with dual-socket Sandy Bridge Intel Xeon E5-2670 (2.60 GHz) processors, 64GB RAM, 2x Nvidia Tesla K20 GPUs.
 204 \end{description}
 205 Performance results can be used to predict actual runtimes as described in \cite{ch7:EngsigKarupEtAl2011}, e.g., for estimation of whether a real-time constraint for a given problem size can be met.
 206
 207 \subsection{Strategies for efficient solution of the Laplace problem}\index{Laplace problem}
 208
 209 As explained in Section \ref{ch7:sec:modernwavemodellingparadigms}, for the formulation of potential flow problems there are two widely used paradigms for solving the Laplace problem efficiently. The most widely used approach is the Boussinesq-type where essentially the three-dimensional formulation is reduced to a two-dimensional formulation. The main argument for this type of model reduction procedure is the resulting efficiency in the numerical models. The price paid is typically high-order derivatives in the approximate formulation and is justified by the efficient solution of an approximate Laplace problem.
 210
 211 A second approach is to transform the partial differential equation mathematically to provide a basis for an efficient direct solution of the discrete Laplace problem for the entire volume. This strategy is based on a paradigm where approximations are done by discrete approximations rather than analytical manipulations of the equation. At a first look, this approach introduces more complexity in the formulation, e.g., by the introduction of mixed derivatives, however, essentially does not limit the application range beyond the numerical approximations and properties hereof. Using this second approach, it is standard to introduce a $\sigma$-transformation in the vertical coordinate
 212 \begin{align}
 213 \label{ch7:sigtrans}
 214 \sigma \equiv \frac{z+h(\boldsymbol{x})}{d(\boldsymbol{x},t)}, \quad 0\leq \sigma \leq 1,
 215 \end{align}
 216 where $d(\boldsymbol{x},t)=\eta(\boldsymbol{x},t) + h(\boldsymbol{x})$ is the height of the water column above the bottom. This enables a transformation of the problem to a time-constant computational domain at the expense of time-varying coefficients.
 217
 218 The fluid domain is mapped to a time-invariant computational domain in which the Laplace problem \eqref{ch7:eq:laplaceproblem} is expressed as
 219 %
 220 \begin{subequations}
 221 \label{ch7:TransformedLaplace}
 222 \begin{align}
 223   \Phi & =  \tilde{\phi}, \quad \sigma = 1, \label{ch7:FSDirichlet} \\
 224   \boldsymbol{\nabla}^2\Phi+\boldsymbol{\nabla}^2\sigma(\partial_\sigma\Phi) +2\boldsymbol{\nabla}\sigma\cdot\boldsymbol{\nabla}(\partial_\sigma\Phi) + & \nonumber \\
 225   (\boldsymbol{\nabla}\sigma\cdot\boldsymbol{\nabla}\sigma+(\partial_z\sigma)^2)\partial_{\sigma\sigma}\Phi&=0, \quad 0\leq\sigma<1, \label{ch7:convlaplace}\\
 226 {\bf n} \cdot ({\bf \nabla},\partial_z\sigma\partial_\sigma)\Phi & = 0, \quad ({\bf x},\sigma)\in\partial\Omega, \label{ch7:strucbcs}
 227 %  (\partial_z\sigma+\boldsymbol{\nabla}h\cdot\boldsymbol{\nabla}\sigma)(\partial_\sigma\Phi) + \boldsymbol{\nabla}h\cdot\boldsymbol{\nabla}\Phi&=0, \quad \sigma=0, \label{eq:BBC}
 228 \end{align}
 229 \end{subequations}
 230 %
 231 where the scalar velocity function $\Phi(\boldsymbol{x},\sigma,t)=\phi(\boldsymbol{x},z,t)$ contains all information about the flow kinematics in the entire fluid volume. The spatial derivatives of the coordinate $\sigma$ appearing in \eqref{ch7:TransformedLaplace} are expressed as
 232 %
 233 \begin{subequations}
 234 \label{ch7:timedepcoef}
 235 \begin{align}
 236 \boldsymbol{\nabla}\sigma & = \tfrac{1-\sigma}{d}\boldsymbol{\nabla}h
 237 - \tfrac{\sigma}{d}\boldsymbol{\nabla}\eta, \\
 238 \boldsymbol{\nabla}^2\sigma & = \tfrac{1-\sigma}{d}\left(\boldsymbol{\nabla}^2
 239 h-\tfrac{\boldsymbol{\nabla} h\cdot\boldsymbol{\nabla} h}{d}\right)
 240 -\tfrac{\sigma}{d}\left(\boldsymbol{\nabla}^2 \eta-\tfrac{\boldsymbol{\nabla}
 241 \eta\cdot\boldsymbol{\nabla}\eta}{d}\right)  \nonumber \\
 242  &-\tfrac{1-2\sigma}{d^2}\nabla h\cdot\nabla\eta
 243 -\tfrac{\boldsymbol{\nabla}\sigma}{d}\cdot\left(\boldsymbol{\nabla} h
 244 + \boldsymbol{\nabla}\eta\right), \\
 245 \partial_z  \sigma & =
 246 \tfrac{1}{d}.
 247 \end{align}
 248 \end{subequations}
 249 %
 250 All of these coefficients can be computed explicitly from the known two-dimensional free surface and bottom positions at given instances of time.
 251
 252 The velocity field can be determined from a known $\Phi$ using the relation
 253 \begin{align}
 254 ({\bf u},w) = (\boldsymbol{\nabla}, \partial_z \sigma \partial_\sigma) \Phi.
 255 \end{align}
 256 The flow can be computed from the scalar velocity potential and used for estimating nonhydrostatic pressure and resulting wave loads. An exact expression for local pressure as a function of the vertical coordinate can be found by vertical integration of the vertical momentum equation to be of the form
 257 \begin{align}
 258 p(z) = \rho g (\eta -z ) + \int_{z}^\eta \partial_t w dz + \frac{1}{2}(\tilde{u}^2-u(z)^2 + \tilde{v}^2 - v(z)^2 + \tilde{w}^2 - w(z)^2).
 259 \end{align}
 260 The integral term can be estimated numerically using an accurate quadrature rule. Estimation of structural forces is determined by integration of pressure
 261 \begin{align}
 262 {\bf F} =  -\iint_S p{\bf n} dS,
 263 \label{ch7:forcecalc}
 264 \end{align}
 265 where $S$ is a structural surface.
 266
 267 \subsection{Finite difference approximations}\index{finite difference}
 268
 269 The numerical scheme is implemented as a flexible-order finite difference collocation scheme where all finite difference approximations of derivatives are constructed from one-dimensional approximations in a standard way, each having the maximum possible accuracy. In explicit numerical schemes, finite difference approximations can be implemented using a matrix-free technique to exploit that only a few different stencils are in fact needed. This can significantly reduce memory requirements of the implemented model by exploiting that the same small set of stencils can be reused. See Chapter \ref{ch5} for more details about matrix-free stencil operations supported in our in-house library for heterogenous and massively parallel computing using GPUs.
 270
 271 %\newpage
 272 \subsection{Time integration}\index{time integration}
 273
 274 For users of scientific applications, robustness is of paramount importance for the solution of time-dependent PDEs. This makes stability considerations relevant in the context of both explicit and iterative numerical methods often considered most suitable for massively parallel applications. In the following, we address aspects of explicit time integration schemes which are associated with a stability\index{stability} requirement on time steps.
 275
 276 A method of lines\index{method of lines} approach is used for the discretization of the wave model. The spatial discretization yields a system of ordinary differential equations which can be expressed as a semidiscrete system.
 277 %
 278 We use the classical fourth-order Explicit Runge-Kutta Method (ERK4). This algorithm is suitable for massive parallel computations via a data-parallel implementation where the spatial discretization terms are processed. As a means to introduce more concurrency into the time integration, we consider the Parareal algorithm as described in Section \ref{ch7:parareal}.
 279 For explicit time-integration schemes a Courant-Levy-Friedrichs (CFL) condition defines a necessary restriction for temporal stability of the form
 280 \begin{align}
 281 \Delta t\leq \frac{C}{\max_{n} |\lambda_n(\mathcal{J}_h)|},
 282 \end{align}
 283 with $C\in\mathbb{R}_+$ a CFL constant typically of size $\mathcal{O}(1)$ dependent on chosen scheme, and $\mathcal{J}_h\in\mathbb{R}^{2m\times2m}$, where $m=N_xN_y$ is a discrete Jacobian\index{Jacobian} matrix obtained by local linearization in time of \eqref{ch7:FSorigin}. For ERK4, $C=2\sqrt{2}$ if all eigenvalues are purely imaginary.
 284
 285 To gain insight into necessary conditions for stability, we employ a linear stability analysis based on the semi-discrete linear system
 286 %
 287 \begin{align}
 288 \label{ch7:linstab}
 289 \frac{d}{dt}\left[ \begin{array}{c} \eta \\ \tilde{\phi} \end{array} \right] = \mathcal{J} \left[ \begin{array}{c} \eta \\ \tilde{\phi} \end{array} \right], \quad \mathcal{J}=\left[ \begin{array}{cc} 0 & \mathcal{J}_{12} \\ -g & 0 \end{array} \right],
 290 \end{align}
 291 %
 292 where $\mathcal{J}_{12} = k \tanh(kh)$ for the continuous problem which results in purely imaginary eigenvalues of the Jacobian $\lambda(\mathcal{J}) = \pm i \sqrt{gk\tanh(kh)}$ that grow unbounded in the limit $kh\to\infty$ corresponding to deep water relative to wave length. The waves are described in terms of wave number $k=2\pi/L$ where $L$ is the wave length.
 293
 294 For any numerical method, a numerical discretization of the same system of equations should mimic the continuous eigenspectrum in well-resolved parts of the spectrum. For a given discretization method, the stability of the discrete problem \eqref{ch7:linstab} is governed by the eigenvalues of the discrete Neumann-Dirichlet operator $\mathcal{J}_{12,h}\in\mathbb{R}^{m\times m}$, which connects the vertical velocity at the free surface with that of the velocity potential at the free surface. It is possible to partition the discrete $\sigma$-transformed Laplace problem and the differential in the $z$-direction in terms of equations for the interior (i) and those at the free surface boundary (b) such that
 295 %
 296 \begin{align}
 297 \mathcal{A}\phi = \left[ \begin{array}{cc} \mathcal{A}_{bi} & \mathcal{A}_{bb} \\  \mathcal{A}_{ii} & \mathcal{A}_{ib} \end{array}  \right] \left[ \begin{array}{c}  \tilde{\phi} \\  \phi_{i} \end{array}  \right], \quad
 298 \mathcal{D} \phi = \left[ \begin{array}{cc} \mathcal{D}_{bi} & \mathcal{D}_{bb} \\  \mathcal{D}_{ii} & \mathcal{D}_{ib} \end{array}  \right] \left[ \begin{array}{c}  \tilde{\phi} \\  \phi_{i} \end{array}  \right].
 299 \end{align}
 300 %
 301 From these equations we find for the discrete block Jacobian operator
 302 %
 303 \begin{align}
 304 \frac{\partial \phi}{\partial z}\Big|_{z=\eta} = \mathcal{J}_{12,h} \tilde{\phi}, \quad  \mathcal{J}_{12,h} = \mathcal{D}_{bb} - \mathcal{D}_{bi} \mathcal{A}_{ii}^{-1} \mathcal{A}_{ib}.
 305 \end{align}
 306 %
 307 As shown in \cite{ch7:RobertsonSherwin1999} the eigenvalues of $\mathcal{J}_{12,h}$ are related to the eigenvalues of the discrete Jacobian $\mathcal{J}_h$ through the following relationship
 308 %
 309 \begin{align}
 310 \lambda(\mathcal{J}_h) = \pm i \sqrt{ \lambda(\mathcal{J}_{12,h}) g }.
 311 \end{align}
 312 %
 313 and are all imaginary confirming the hyperbolic (energy-conserving) nature of the potential flow formulation. Thus, for a given discretization of the linearized equations, it is possible to compute the eigenvalues of the discrete block operator to determine the eigenspectrum of the full operator. A discrete analysis of the eigenvalues is given in \cite[Section 4.1]{ch7:EBL08}, but it is not clearly pointed out that in fact the discrete eigenspectrum is compact (bounded) for a fixed polynomial order in the vertical, i.e., that for a constant depth $h$
 314 \begin{align}
 315 \max|\lambda(\mathcal{J}_h)| = \lim_{kh\to\infty}|\lambda(\mathcal{J}_h)|\leq C(N_z)\sqrt{\frac{g}{h}}.
 316 \end{align}
 317 Similar results were reported for the first time in the context of high-order Boussinesq-type equations in \cite{ch7:ENG06,ch7:EHBM06} and recently it has been shown \cite{ch7:EE13} that widely used implicitly-implicit Boussinesq-type equations can be re-formulated to have bounded eigenspectra using high-order discretisation methods. This is an important practical property of the discrete scheme as it is favorable to numerical stability. It implies that the linear model is not severely limited by the spatial resolution in the horizontal for a specific choice of the number  of collocation nodes ($N_z$) in the vertical. This suggests that the model is quite robust due to insensitivity in the choice of time step, with the implication that local grid adaptivity can be used for improving spatial accuracy. Interestingly, for the unified potential flow model we find that this also holds for nonlinear simulations. Large time steps can be chosen when using dense grids and high-order numerics without severely degrading overall numerical stability and efficiency. This is confirmed in numerical experiments and demonstrated in Figure \ref{ch7:numexp}. However, for very steep nonlinear waves and very densely clustered nonuniform grids, stability is found to be compromised without filtering. A proper filtering strategy, e.g., based on a super collocation technique \cite{ch7Kirby03de-aliasingon}, can be used to remedy stability problems without destroying accuracy.
 318 %
 319 \begin{figure}[!htb]
 320 \centering
 321 \subfigure[Grid scaling, $x=(1-a)\xi^3+a\xi$.]{
 322 % MainLaplace2D_ex03.m
 323 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/scalingNx25-eps-converted-to.pdf}
 324 }
 325 \subfigure[High-order spatial discretization and stable explicit time-stepping with large time steps for a nonlinear standing wave. Scaling based on $a=0$. ]{
 326 % MainLaplace2D_ex03.m
 327 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/standingwaveglozman-eps-converted-to.pdf}
 328 }
 329 \subfigure[Uniform grid ($a=1$).]{
 330 % MainLaplace2D_ex035_nonlinearLaplace.m
 331 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/SFwaves_snapshots_uniform-eps-converted-to.pdf}
 332 }
 333 \subfigure[Clustered grid ($a=0.05$).]{
 334 % MainLaplace2D_ex035_nonlinearLaplace.m
 335 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/SFwaves_snapshots_clustered-eps-converted-to.pdf}
 336 }
 337 \caption[Numerical experiments to assess stability properties of numerical wave model.]{Numerical experiments to assess stability properties of numerical wave model. In three cases, computed snapshots are taken of the wave elevation over one wave period of time. In (a) the grid distribution of nodes in a one-parameter mapping for the grid is illustrated. Results from changes in wave elevation are illustrated for (b) a mildly nonlinear standing wave on a highly clustered grid, (c) regular stream function wave of medium steepness in shallow water $(kh,H/L)=(0.5,0.0292)$ on a uniform grid ($N_x=80$), and (d) a nonuniform grid with a minimal grid spacing 20 times smaller(!). In every case the step size remains fixed at $\Delta t = T/160$ s corresponding to a Courant number $C_r=c\tfrac{\Delta t}{\Delta x}=0.5$ for the uniform grid. A sixth order scheme and explicit EKR4 time-stepping are used in each test case.}
 338 \label{ch7:numexp}
 339 \end{figure}
 340 %\newpage
 341
 342 \subsection{Wave generation and absorption}
 343 \label{ch7:wavegen}
 344
 345 To simulate waves using a numerical model, a general purpose technique for both generating and absorbing waves inside the finite numerical domain is needed. It is preferable that the technique is suitable for easy integration in a software library component setup. One such technique is the line relaxation\index{relaxation} method attributed to \cite{ch7:LD83}. This is a simple ad hoc technique sufficiently accurate for engineering purposes. It modifies the computed solution every time step during simulation by a postprocessing step where the relaxed solution becomes
 346 %determined as
 347 \begin{align}
 348 \label{ch7:eq:discreteupdate}
 349 g^*(x_i,t) = \Gamma(x_i)g(x_i,t) + (1-\Gamma(x_i))g_e(x_i,t), \quad x_i\in\Omega_\Gamma.
 350 \end{align}
 351 Here $g(x,t)$ is one of the free surface variables $\tilde{\phi},\eta$ at an instant in time, and $0\leq \Gamma(x)\leq 1$ is a single-valued function within the relaxation region $x_i\in\Omega_\Gamma$. The first term acts as a sponge layer which is responsible for effectively dissipating energy inside a specified relaxation zone\index{relaxation!zone} $\Omega_\Gamma$. The terms containing $g_e(x,t)$, where $g_e$ is an analytical solution (e.g., such as the stream function wave theory \cite{ch7:Dean1965}), act as source terms in the relaxation zone. This makes it possible to generate arbitrary waves accurately in the computational domain in accordance with an analytical representation of incident waves.
 352
 353 We can interpret \eqref{ch7:eq:discreteupdate} as a discrete update of the solution at an isolated spatial point inside a relaxation zone. We introduce the notations $g^n=g(x_i,t_n)$ and $g^{*,n+1}=g^*(x_i,t_{n+1})$ and assume that $t_{n+1}=t_n+\tau$ is an instant in time. Then we can rewrite \eqref{ch7:eq:discreteupdate} to motivate an analytical modification to time-dependent equations that can provide similar modification (forcing) in simulation.
 354
 355 Subtracting $g^n$ in \eqref{ch7:eq:discreteupdate} and dividing by a pseudo time step size $\tau$, we obtain the equivalent form
 356
 357 %%
 358 \begin{align}
 359 \frac{g^{*,n+1}-g^n}{\tau} =\frac{(1-\Gamma)}{\tau} (g_e^n-g^n).
 360 \end{align}
 361 %
 362 The first term is similar to a first-order accurate Forward Euler\index{forward Euler} approximation of a rate of change term. This motivates an {\em embedded penalty forcing technique} based on adding a correction term of the form
 363 %
 364 \begin{align}\label{ch7:eq:penalty}
 365 \partial_t g = \mathcal{N}(g) + \frac{1-\Gamma(x)}{\tau} (g_e(t,x)-g(t,x)), \quad {\bf x}\in\Omega_\Gamma,
 366 \end{align}
 367 %
 368 where $\mathcal{N}$ represents a general nonlinear operator for the right-hand side. The immediate advantage is that a time-stepping scheme can easily be interchanged in a model implementation. The added term is a source term resulting in forcing inside relaxation zones when $g_e(t,x)\neq g(t,x)$ and $\Gamma(x)\neq1$ and otherwise has no effect. The strength of the forcing is influenced by the arbitrary parameter $\tau\in\mathbb{R}_+$ which can be defined to match the time scale of the dynamics. We have found that  $\tau\approx\Delta t$ works well, however, it is possible that a more optimal choice exist. Note that a too small $\tau$ may degrade the numerical stability of the model.
 369
 370 A simple validation of the zones is shown in Figure \ref{ch7:figstandwave} where waves are generated at the left wall and propagate to the right wall, where reflection occurs leading to formation of standing waves due to the resulting interaction with the incident waves inside the numerical wave tank.
 371
 372 The following relaxation functions proposed in \cite{ch7:ENG06} guarantee continuity across the interface of the relaxation zone and computational domain and are used in simulations for sponge layers and wave generation zones, respectively
 373 %
 374 \begin{align}
 375 % \label{relaxfunc1}
 376 \Gamma_{s}(\hat{x})  = 1-(1-\hat{x})^p, \quad \Gamma_{g}(\hat{x}) = -2\hat{x}^3+3\hat{x}^2, \quad \hat{x}\in[0,1]
 377 \label{ch7:relaxfunc2}
 378 \end{align}
 379 %
 380 where $\hat{x}\equiv x/x_L$ is coordinate normalised by the length $x_L$ of the zone in question.
 381
 382 The profiles can be reversed by a change of coordinate, i.e., $\Gamma(1-\hat{x})$, and scaled to interval sizes of interest. The first function satisfies the condition that any derivative at the left boundary vanishes at $\hat{x}=1$. The first derivatives of the second function vanish at both ends. The relaxation zones are positioned appropriately where waves are to be both/either generated and/or absorbed. A practical rule of thumb is that a relaxation used for absorption has a spatial length of at least two wave lengths. For absorption zones, we find that this technique is more efficient in velocity formulation of the free surface equations often used in Boussinesq-type formulations, e.g., see \cite{ch7:ENG06,ch7:EHBM06,ch7:EHBW08} in comparison with scalar potential formulations \eqref{ch7:FSorigin}. However, similar performance can be achieved by merely increasing the length of relaxation zones in such regions. Demonstrations of the technique are seen in Figure \ref{ch7:figstandwave} where vertical dashed lines indicate interfaces between relaxation zones and the computational region. Incident waves propagate from left to right in both examples.
 383 %
 384 \begin{figure}[!htb]
 385 \centering
 386 \subfigure[Wave generation, reflection and absorption of small-amplitude waves.]{
 387 % Script : MainLaplace2D_ex03penalityLINEAR_REFLECTEDWAVES.m
 388 \includegraphics[width=0.98\textwidth]{Chapters/chapter7/figures/standingwavespenalty-eps-converted-to.pdf}
 389 % Nx = 480, sixth order, vertical clustering, Nz=6;
 390 }
 391 \subfigure[Wave generation and absorption of steep finite-amplitude waves.]{
 392 % Script : MainLaplace2D_ex03penalityNONLINEAR_GENERATEWAVES.m
 393 \includegraphics[width=0.98\textwidth]{Chapters/chapter7/figures/nonlinearwavespenalty-eps-converted-to.pdf}
 394 % Nx = 540, 6th order, vertical clustering, Nz=6;
 395 }
 396 \caption[Snapshots at intervals $T/8$ over one wave period in time.]{Snapshots at intervals $T/8$ over one wave period in time of computed (a) small-amplitude $(kh,kH)=(0.63,0.005)$ and (b) finite-amplitude $(kh,kH)=(1,0.41)$ stream function waves elevations having reached a steady state after transient startup. Combined wave generation and absorption zones occur in the left relaxation zone of both (a) and (b). In b) an absorption zone is positioned next to the right boundary and causes minor visible reflections. }
 397 \label{ch7:figstandwave}
 398 \end{figure}
 399
 400 \subsection{Preconditioned Defect Correction (PDC) method}\index{preconditioning}\index{defect correction}
 401 \label{ch7:PDCmethod}
 402
 403 For the solution of sparse linear systems
 404 \begin{align}
 405 \label{ch7:linsyscompact}
 406 \mathcal{A}{\bf \Phi} = {\bf b}, \quad \mathcal{A}\in\mathbb{R}^{n\times n}, \quad {\bf b}\in\mathbb{R}^{n},
 407 \end{align}
 408 it is attractive to use iterative methods for large system sizes $n=N_xN_yN_z$ and for parallel implementations. Acceleration of suitable iterative methods can be done, e.g., by solving a left-preconditioned\index{preconditioning!left-preconditioning} system of the form %typical %A key challenge is preconditioning for acceleration of convergence of iterative methods.
 409 \begin{align}
 410 \label{ch7:eq:linsys}
 411 \mathcal{M}^{-1} ( \mathcal{A} {\bf \Phi} = {\bf b}), \quad \mathcal{M}\in\mathbb{R}^{n \times n},
 412 \end{align}
 413 where $\mathcal{M}$ is a preconditioner with the property that $\mathcal{M}^{-1}\approx \mathcal{A}^{-1}$ can be computed at low cost.
 414
 415 The bottleneck problem in a unified potential flow model is the solution of a discrete $\sigma$-transformed Laplace problem stated in the compact forms \eqref{ch7:linsyscompact} or \eqref{ch7:eq:linsys}. It is attractive to find an efficient iterative strategy where convergence\index{convergence} is understood via a convergence theory, has modest storage requirements, has minimal global communication requirements (in the form of global inner products), and is suitable for flexible-order\index{flexible order} discretizations. The class of geometric Multigrid Methods\index{multigrid} fulfills these requirements and has shown to be among the most efficient iterative strategies for a wide class of problems \cite{ch7:Trottenberg01}. In particular, the time required to solve a system of linear equations to a given accuracy level can be made to scale proportional to the number of unknowns.
 416
 417 There are several known approaches to multigrid methods \cite{ch7:MR744926} for high-order discretizations. Among these, Defect Correction\index{defect correction} Methods (DCMs) \cite{ch7:Stetter1978,ch7:AuzingerStetter1982,ch7:Hackbusch1982,ch7:Auzinger1987,ch7:Trottenberg01} have been employed successfully, e.g., in computational fluid dynamics \cite{ch7:LaytonEtAl2002}, in numerical simulations since the early 1970s.
 418 The fundamental idea of DCMs is to combine the good stability properties of low-order discretizations with high-order accuracy discretizations for explicit residual evaluations. These iterative methods impose low storage requirements, have scalable work effort under suitable choices of preconditioning strategies, and may be accelerated using a multigrid method based on low-order discretizations while still achieving high-order accuracy.
 419 Furthermore, it has been shown, that the rate of convergence of DCM combined with standard multigrid methods can achieve rates of convergence corresponding to the most efficient multigrid methods~\cite{ch7:Auzinger1987}.
 420
 421 Therefore, for the efficient and scalable solution of the unified potential flow model, we have recently \cite{ch7:EngsigKarupEtAl2011} proposed a Preconditioned Defect Correction (PDC) method for efficient iterative low-storage solution of high-order accurate discretization of the $\sigma$-transformed Laplace problem \eqref{ch7:TransformedLaplace}. The proposed strategy can be seen as a generalization of the multigrid strategy proposed by \cite{ch7:LiFleming1997}. The PDC method enables significant improvement of overall efficiency and accuracy with the preconditioning based on a second-order linearized version of the full coefficient matrix $\mathcal{A}$ as described in \cite{ch7:EBL08}.
 422
 423  Starting from some initial guess ${\bf \Phi}^{[0]}\in\mathbb{R}^n$, the PDC method for solving \eqref{ch7:eq:linsys} can be stated compactly as a three-step recurrence for $k=1,2,...$
 424 \begin{align}
 425 \label{ch7:eq:defectcorrectionprocedure}
 426 \Phi^{[k]} = \Phi^{[k-1]} + \delta^{[k-1]}, \;\; \delta^{[k-1]}=\mathcal{M}^{-1} {\bf r}^{[k-1]}, \;\; {\bf r}^{[k-1]}={\bf  b} - \mathcal{A}\Phi^{[k-1]},
 427 \end{align}
 428 where $\Phi^{[k]},{\boldsymbol \delta}^{[k]},{\bf r}^{[k]}\in\mathbb{R}^n$ are the approximate solution, the defect (preconditioned residual), and the residual of \eqref{ch7:eq:linsys} at the $k$th iteration, respectively. The algorithm can be speeded up by using mixed-precision calculations on modern many-core hardware as demonstrated in \cite{ch7:Glimberg2011}.
 429
 430 \subsection{Distributed computations via domain decomposition}\label{ch7:sec:dd}\index{domain decomposition}
 431
 432 Numerical modeling of large ocean areas to account for nonlinear wave-wave interactions and wave-structure interactions requires large degrees of spatial resolution, significant computational resources, and parallel computations to be practical. The recent generations of programmable GPUs are heavily optimized for on-chip bandwidth performance but not capacity. This implies that for the solution of large-scale PDE problems, distributed computing on multiple GPU devices is required due to limited capacity in the global memory space of current GPUs. Via a combination of MPI and CUDA we have recently demonstrated how both small and large systems can be solved efficiently by heterogenous computations using a data domain decomposition technique in parallel \cite{ch7:GlimbergEtAl2012}. The idea is to distribute the computational tasks to multiple GPUs, to enable reduced computational times and increased problem sizes. A homogenous partitioning of the data is used to ensure that the load balance across the GPUs is uniform. Data distribution and message passing introduce a data transfer bottleneck in the form of the Peripheral Component Interconnect express (PCIe) link and network interconnection. Thus, if the computational intensity of the local problem is not large enough to enable sufficient latency hiding of this bottleneck, the whole application is likely to be (severely) limited by the PCIe link or network bandwidth performance rather than the high on-chip bandwidth of the individual GPUs.
 433
 434 The ratio between necessary data transfers and computational work for the proposed numerical model for free surface water waves is high enough to expect reasonable latency hiding. The data domain decomposition method consists of a logically structured division of the computational domain into multiple subdomains. Each of these subdomains is connected via fictitious ghost layers at the artificial boundaries of width corresponding to the half-width of the finite difference stencils employed. This results in a favorable volume-to-boundary ratio as the problem size increases, and diminishing communication overhead for message passing. Information between subdomains is exchanged through ghost layers at every step of the iterative PDC method, in connection with the matrix-vector evaluation for the $\sigma$-transformed Laplace problem, and before relaxation steps in the multigrid method. A single global synchronization point occurs at most once each iteration, if convergence is monitored, where a global reduction step (inner product) between all processor nodes takes place. The main advantage of this decomposition strategy is that the decomposition into multiple subdomains is straightforward. However, it comes with the cost of extra data transfers to update the set of fictitious ghost layers.
 435
 436 \begin{figure}[!htb]
 437     \setlength\figureheight{0.30\textwidth}
 438     \setlength\figurewidth{0.33\textwidth}
 439     \begin{center}
 440     \subfigure[Test environment 4.]{
 441     {\scriptsize\input{Chapters/chapter7/figures/TeslaK20PerformanceScaling3D.tikz}}
 442     }
 443     \subfigure[Test environment 4.]{
 444     {\scriptsize\input{Chapters/chapter7/figures/TeslaK20SpeedupGPUvsCPU3D.tikz}}
 445     }
 446     \end{center}
 447     \caption[Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed and double precision arithmetics.]{Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed, and double precision arithmetics. Three-dimensional nonlinear waves, using sixth order finite difference approximations, preconditioned with one multigrid V-cycle and with one pre- and post- red-black Gauss-Seidel smoothing operation. Speedup compared to fastest known serial implementation. Using Test environment 4, CPU timings represent starting points for our investigations and have been obtained using the Fortran 90 code. These references results are based on a single-core (non-parallel) run on a Intel Core i7, 2.80GHz processor.}\label{ch7:fig:perftimings}
 448 \end{figure}
 449
 450 The parallel domain decomposition solver has been validated against the sequential solvers with respect to algorithmic efficiency to establish that the code produces correct results. An analysis of the numerical efficiency has also been carried out on different GPU systems to identify comparative behaviors as both the problems sizes and number of compute nodes vary. For example, performance scalings on Test environment 1 and Test environment 3 are presented in Figure \ref{ch7:fig:multigpuperformance}. The figure confirms that there is only a limited benefit from using multiple GPUs for small problem sizes, since the computational intensity is simply too low to efficiently hide the latency of message passing. A substantial speedup is achieved compared to the single GPU version, while being able to solve even larger systems.
 451 With the linear scaling of memory requirements and improved computational speed, the methodology based on multiple GPUs makes it possible to simulate water waves in very large numerical wave tanks with improved performance.
 452
 453 \begin{figure}[!htb]
 454     \setlength\figureheight{0.4\textwidth}
 455     \setlength\figurewidth{0.68\textwidth}
 456     \begin{center}
 457     \subfigure[Test environment 1.]{
 458     {\scriptsize\input{Chapters/chapter7/figures/GTX590MultiGPUScaling3D.tikz}}
 459     }
 460     \subfigure[Test environment 3.]{
 461     {\scriptsize\input{Chapters/chapter7/figures/TeslaM2050MultiGPUScaling3D.tikz}}
 462     }
 463     \end{center}
 464     \caption[Domain decomposition performance on multi-GPU systems.]{Domain decomposition performance on multi-GPU systems. In single precision, performance timings per PDC iteration are a function of increasing problem sizes. Same setup as in Figure \ref{ch7:fig:perftimings}.}
 465     \label{ch7:fig:multigpuperformance}
 466 \end{figure}
 467
 468 Future work involves extending the domain decomposition method to include support for more general curvilinear boundary-fitted meshes for representing the free surface plane and to include bottom-mounted structures which are typically encountered in near-costal areas. This will enable opportunities to resolve wave-structure interactions such as those encountered in large harbor regions and isolated human-made structures in offshore regions, e.g., legs of oil rigs or offshore windmill foundations.
 469
 470 \subsection{Assembling the wave model from library components}
 471
 472 It is described in Chapter \ref{ch5} how we have developed a heterogenous library has for fast prototyping of PDE solvers, utilizing the massively parallel architecture of CUDA-enabled GPUs. The objective is to provide a set of generic components within a single framework, such that software developers can assemble application-specific solvers efficiently at a high abstraction level, requiring a minimum of CUDA specific kernel implementations and parameter tuning.
 473
 474 The CUDA-based numerical wave model has been developed based on all the numerical techniques described in preceding sections. These techniques are a part of the library implemented as generic components, which makes them useful for the numerical solutions of PDE systems. The components of the numeral model as described in Section \ref{ch7:sec:nummodel} include an ERK4 time integrator, flexible-order finite difference approximations on regular grids, and an iterative multigrid PDC solver for the $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}. Application developers can use either these components directly from the library or are free to combine their own implementations with library components, if they need alternative strategies that are not present in the library.
 475
 476 For the unified potential flow model the user will need to provide implementations of the following components: the right-hand side operator for the semidiscrete free surface variables \eqref{ch7:FSorigin}, the matrix-vector operator for the discretized $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}, a smoother for the multigrid relaxation step, and the potential flow solver itself, that reads initial data and advances the solution in time. In order to make the library as generic as possible, all components are template-based, which makes it possible to assemble the PDE solver by combining type definitions in the preamble of the application. An excerpt of the potential flow assembling is given in Listing \ref{ch7:lst:solversetup}.
 477
 478 \lstset{label=ch7:lst:solversetup,caption={generic assembling of the potential flow solver for fully nonlinear free surface water waves}
 479 %,basicstyle=\scriptsize
 480 }
 481 \begin{lstlisting}
 482 // Basics
 483 typedef double value_type;
 484 typedef gpulab::grid<value_type> vector_type;
 485 typedef free_surface::laplace_sigma_stencil_3d<vector_type> matrix_type;
 486
 487 // Multigrid setup
 488 typedef gpulab::solvers::multigrid_types<
 489     vector_type
 490   , matrix_type
 491   , free_surface::gs_rb_low_order_3d
 492   , gpulab::solvers::grid_handler_3d_boundary> multigrid_types;
 493 typedef gpulab::solvers::multigrid<multigrid_types> preconditioner_type;
 494
 495 // Defect Correction setup
 496 typedef gpulab::solvers::defect_correction_types<
 497     vector_type
 498   , matrix_type
 499   , monitor_type
 500   , preconditioner_type> dc_types;
 501 typedef gpulab::solvers::defect_correction<dc_types> laplace_solver_type;
 502
 503 // Potential flow solver setup
 504 typedef free_surface::potential_flow_solver_types<
 505     vector_type
 506   , laplace_solver_type
 507   , gpulab::integration::ERK4> potential_flow_types;
 508 typedef free_surface::potential_flow_solver_3d<potential_flow_types> potential_flow_solver_type;
 509 \end{lstlisting}
 510
 511 Hereafter, the potential flow solver is aware of all component types that should be used to solve the entire PDE system, and it will be easy for developers to exchange parts at later times. The \texttt{laplace\_sigma\_stencil\_3d} class implements both the matrix-vector and right-hand side operator. The flexible-order finite difference kernel for the matrix-free matrix-vector product for the two-dimensional Laplace problem is presented in Listing \ref{ch7:lst:fd2d}. Library macros and reusable kernel routines are used throughout the implementations to enhance developer productivity and hide hardware specific details. This kernel can be used both for matrix-vector products for the original system and for the preconditioning.
 512
 513 \lstset{label=ch7:lst:fd2d,caption={CUDA kernel implementation for the two dimensional finite difference approximation to the transformed Laplace equation}
 514 %,basicstyle=\scriptsize\ttfamily
 515 }
 516 \begin{lstlisting}
 517 template <typename value_type, typename size_type>
 518 __global__ void laplace_sigma_transformed(
 519       value_type* out            , value_type const* p
 520     , value_type const* eta      , value_type const* etax
 521     , value_type const* etaxx    , value_type const* h
 522     , value_type const* hx       , value_type const* hxx
 523     , value_type dx              , value_type ds
 524     , size_type Nx               , size_type Ns
 525     , size_type xstart           , size_type alpha
 526     , value_type const* stencil1 , value_type const* stencil2)
 527 {
 528         size_type j    = IDX1Dx;
 529         size_type i    = IDX1Dy;
 530         size_type rank = alpha*2+1; // Total stencil size
 531
 532         // Get shared memory pointers
 533         value_type* ss1 = device::shared_memory<value_type>::get_pointer();
 534         value_type* ss2 = ss1 + rank*rank;
 535
 536         // Load stencils into ss1 and ss2
 537         gpulab::device::load_shared_memory(ss1,stencil1,rank*rank);
 538         gpulab::device::load_shared_memory(ss2,stencil2,rank*rank);
 539
 540         // Only internal points
 541         if(j>=xstart && j<Nx-xstart && i>0 && i<Ns-1)
 542         {
 543                 size_type offset_i = i < alpha ? 2*alpha-i : i >= Ns-alpha ? Ns-1-i : alpha;
 544                 size_type row_i    = offset_i*rank;
 545     // Always centered stencils in x-dir
 546                 size_type offset_j = alpha;
 547                 size_type row_j    = alpha*rank;
 548
 549                 value_type dhdx    = hx[j];
 550                 value_type dhdxx   = hxx[j];
 551                 value_type detadx  = etax[j];
 552                 value_type detadxx = etaxx[j];
 553
 554                 // Calculate the sigma transformation variables
 555                 value_type sig   = kernel::sigma(i,ds);
 556                 value_type d     = h[j]+eta[j];
 557                 value_type dsdz  = kernel::dsdz(d);
 558                 value_type dsdx  = kernel::dsdx (d,sig,detadx,dhdx);
 559                 value_type dsdxx = kernel::dsdxx(d,sig,detadx,detadxx,dhdx,dhdxx,dsdx);
 560
 561                 // dp/dxx
 562                 value_type dpdxx = values<value_type>::zero();
 563                 for(size_type s = 0; s<rank; ++s)
 564                 {
 565                         dpdxx += ss2[row_j+s]*p[i*Nx+j-alpha+s];
 566                 }
 567                 dpdxx /= (dx*dx);
 568
 569                 // Calculate dp/dss and dp/ds
 570                 value_type dpdss = values<value_type>::zero();
 571                 value_type dpds  = values<value_type>::zero();
 572                 value_type ps;
 573                 for(size_type s = 0; s<rank; ++s)
 574                 {
 575                         ps = p[(i+offset_i-s)*Nx+j];
 576                         dpds  += ss1[row_i+s]*ps;
 577                         dpdss += ss2[row_i+s]*ps;
 578                 }
 579                 dpds  /= ds;
 580                 dpdss /= (ds*ds);
 581
 582                 // Calculate dp/dxds
 583                 value_type dpdxds = values<value_type>::zero();
 584                 for(size_type ss = 0; ss<rank; ++ss)
 585                 {
 586                         value_type dpdx   = values<value_type>::zero();
 587                         for(size_type sx = 0; sx<rank; ++sx)
 588                         {
 589                                 dpdx += ss1[row_j+sx]*p[(i+offset_i-ss)*Nx+j-offset_j+sx];
 590                         }
 591                         dpdx /= dx;
 592                         dpdxds += ss1[row_i+ss]*dpdx;
 593                 }
 594                 dpdxds /= ds;
 595
 596                 // Calculate total
 597                 out[i*Nx+j] = dpdxx + dsdxx*dpds + 2.0*dsdx*dpdxds + (dsdx*dsdx + dsdz*dsdz)*dpdss;
 598         }
 599 }
 600 \end{lstlisting}
 601
 602 In a similar template-based approach, the kernel for the right-hand side operator of the two-dimensional problem is implemented and given in Listing \ref{ch7:lst:rhs2d}. The kernel computes the right-hand side updates for both surface variables, $\eta$ and $\tilde{\phi}$, and applies an embedded penalty forcing, cf. \eqref{ch7:eq:penalty}, for all nodes within generation or absorption zones. The penalty forcing functions are computed based on linear or nonlinear wave theory in a separate device function.
 603
 604 \lstset{label=ch7:lst:rhs2d,caption={CUDA kernel implementation for the 2D right-hand side}%,basicstyle=\scriptsize\ttfamily
 605 }
 606 \begin{lstlisting}
 607 template <typename value_type, typename size_type>
 608 __global__ void rhs(value_type const* p    , value_type const* p_surf
 609                   , value_type const* eta  , value_type* dp_surf_dt
 610                   , value_type* deta_dt    , value_type const* etax
 611                   , value_type const* h    , value_type dx
 612                   , value_type ds          , size_type Nx
 613                   , size_type xstart       , size_type xend
 614                   , unsigned int alpha     , value_type const* stencil
 615                   , value_type t           , value_type dt
 616                   , bool generate          , bool absorb)
 617 {
 618         size_type j    = IDX1D;
 619         size_type rank = alpha*2+1; // Total stencil size
 620
 621     // Load shared memory for stencil into ss
 622         value_type* ss = gpulab::device::shared_memory<value_type>::get_pointer();
 623     gpulab::device::load_shared_memory(ss,stencil,rank*rank);
 624
 625     // Only interior points
 626         if(j>=xstart && j<Nx-xstart)
 627         {
 628                 value_type g     = values<value_type>::gravity();
 629         value_type detax = etax[j];
 630
 631                 // Calculate the sigma transformation variables
 632                 value_type d     = h[j]+eta[j];
 633                 value_type dsdz  = kernel::dsdz(d);
 634
 635                 // Calculate dp/ds on surface
 636                 value_type dpds  = values<value_type>::zero();
 637                 for(size_type s=0; s<rank; ++s)
 638                 {
 639                         dpds += p[j+s*Nx]*ss[rank*rank-1-s];
 640                 }
 641                 dpds /= ds;
 642                 value_type w     = dsdz * dpds;
 643
 644                 // Calculate dp/dx on surface
 645                 value_type dpdx  = values<value_type>::zero();
 646                 for(size_type s=0; s<rank; ++s)
 647                 {
 648                         dpdx += p_surf[j-alpha+s]*ss[rank*alpha+s];
 649                 }
 650                 dpdx /= dx;
 651
 652         // Update right-hand side function
 653                 dp_surf_dt[j] = - g*eta[j] - 0.5*(dpdx*dpdx - w*w*(1.0 + (detax*detax)));
 654                 deta_dt[j]    = - detax*dpdx + w*(1.0 + (detax*detax));
 655
 656                 if(generate || absorb)
 657                 {
 658                         // Relaxation terms
 659                         value_type reta  = values<value_type>::zero();
 660                         value_type rp    = values<value_type>::zero();
 661             free_surface::kernel::rhs_relax(reta, rp, eta[j], p_surf[j], (((int)j)-(int)xstart)*dx, (xend-xstart-1)*dx, t+dt, dt, false, generate, absorb);
 662
 663                         deta_dt[j]    = deta_dt[j]    + reta;
 664                         dp_surf_dt[j] = dp_surf_dt[j] + rp;
 665                 }
 666         }
 667 }
 668 \end{lstlisting}
 669
 670 \section{Properties of the numerical model}
 671
 672 We now consider different properties of the numerical model in order to shed light on unique features and limits of the model with respect to maritime engineering applications. The presented results extend and complement earlier studies \cite{ch7:BinghamZhang2007,ch7:EBL08} for the same model. In particular, we seek to highlight that the properties are tunable to the practical applications of interest through proper choice of discretization parameters, and we therefore also provide details of numerical properties.
 673
 674 \subsection{Dispersion and kinematic properties}\index{dispersion}\index{kinematic}
 675 \label{ch7:dispkin}
 676
 677 The dispersion and kinematic properties of the unified model are determined by the tunable discretization parameters and should in general be chosen for specific problems. For assessment of errors, we introduce the metrics proposed in \cite{ch7:MBS03}
 678 \begin{subequations}
 679 \begin{align}
 680 E_c(N_x,N_z,kh,h/L) & = \frac{c^2}{c_s^2},
 681 \label{ch7:errdisp} \\ % \left( \frac{c-c_s}{c_s}  \right)^2, \\
 682 \label{ch7:errkin}
 683 E_m(N_x,N_z,kh,h/L) &= \frac{1}{h}\int_{-h}^\eta \left( \frac{\phi(z)-\phi_s(z)}{\phi_s(z)} \right)^2 dz,
 684 \end{align}
 685 \end{subequations}
 686 where $m$ is one of the scalar functions $\phi,u,w$ describing kinematics, $c$ is the numerical phase celerity of regular waves, and $c_s=\sqrt{g\tanh(kh)/k}$ is the exact phase celerity according to linear Stokes Theory \cite{ch7:SJ01}.  Measurements of the errors are taken in the vertical below the crest of a wave which is well resolved in the horizontal direction.
 687
 688 % kinematicAnalysis.m
 689 \begin{figure}[!htb]
 690 \begin{center}
 691 \subfigure[Uniform vertical grid.]{
 692 %\includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/lineardispersion_Nx30-HL90-p6-vergrid0_Linear.eps}
 693 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/lineardispersion_Nx30-HL90-p6-vergrid0_Linear-eps-converted-to.pdf}
 694 }
 695 \subfigure[Cosine-clustered vertical grid.]{
 696 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/lineardispersion_Nx30-HL90-p6_Linear-eps-converted-to.pdf}
 697 }
 698 \end{center}
 699 \caption{The accuracy in phase celerity $c$ determined by \eqref{ch7:errdisp} for small-amplitude (linear) wave.
 700 $N_z\in[6,12]$. Sixth order scheme.}
 701 \label{ch7:figlinear}
 702 \end{figure}
 703
 704 % kinematicAnalysis.m
 705 \begin{figure}[!htb]
 706 \begin{center}
 707 \subfigure[Linear]{
 708 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/kinematicsPHI_Nx30-HL90-p6_Linear-eps-converted-to.pdf}
 709 }
 710 \subfigure[Linear]{
 711 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/kinematicsW_Nx30-HL90-p6_Linear-eps-converted-to.pdf}
 712 }
 713 \subfigure[Nonlinear]{
 714 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/kinematicsPHI_Nx30-HL90-p6_Nonlinear-eps-converted-to.pdf}
 715 }
 716 \subfigure[Nonlinear]{
 717 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/kinematicsW_Nx30-HL90-p6_Nonlinear-eps-converted-to.pdf}
 718 }
 719 \end{center}
 720 \caption[Assessment of kinematic error is presented in terms of the depth-averaged error.]{Assessment of kinematic error is presented in terms of the depth-averaged error determined by \eqref{ch7:errkin} for, (a) scalar velocity potential, (b) vertical velocity for a small-amplitude (linear) wave, (c) scalar velocity potential, and (d) vertical velocity for a finite-amplitude (nonlinear) wave with wave height $H/L=90\%(H/L)_\textrm{max}$.
 721 $N_z\in[6,12]$. Sixth order scheme. Clustered vertical grid. }
 722 \label{ch7:figlinear2}
 723 \end{figure}
 724
 725 The accuracy of the dispersion and kinematic properties are found to be excellent with small differences between the accuracy of computed profiles for  linear and nonlinear waves. Figure \ref{ch7:figlinear} shows curves from a discrete analysis of how the accuracy can be improved by increasing the number of nodes in the vertical for (a) uniform grids and (b) cosine-clustered grids. Similarly, Figure \ref{ch7:figlinear2} shows how the accuracy of the kinematic properties of the model can be controlled by choosing an appropriate number of cosine clustered vertical collocation points.
 726
 727 %\section{Verification and Validation}
 728 \section{Numerical experiments}
 729
 730 The numerical model detailed has been subject to careful verification and validation utilizing a range of standard benchmarks, cf. \cite{ch7:EBL08,ch7:EngsigKarupEtAl2011}. Here we exclusively focus on properties and performance of the numerical wave model. We provide several new results that highlight possibilities for acceleration of the wave model via simple and readily applicable techniques that work well on massively parallel hardware. Finally, we describe how we can extend the implementation of the wave model into a novel GPU implementation of a linear ship-wave model for fast hydrodynamics calculations.
 731
 732 \subsection{On precision requirements in engineering applications}\index{precision}
 733
 734 Practical engineering applications are widely used for analysis purposes and give support to decision making in engineering design. For engineering purposes the turn-around time for producing analysis results is of crucial importance as it affects cost-benefit of work efforts. The key interest is often just "engineering  accuracy" in computed end results which suggests that we can do with less precision in calculations. One may ask: {\em what are the precision requirements for engineering applications?}
 735
 736 In a recent study \cite{ch7:EngsigKarupEtAl2011,ch7:Glimberg2011}, it was shown that the PDC method when executed on GPUs can be utilized to efficiently solve water wave problems. This was done by trading accuracy for speed in parts of the PDC algorithm, e.g., by using either single, or mixed-precision\index{precision!mixed} computations. Without preconditioning the PDC method reduces to a classical iterative refinement technique, which is known to be fault tolerant \cite{ch7:Higham:2002:ASN}.
 737
 738 Previously reported performance results for the wave model can be taken a step further. We seek to demonstrate how single precision computations can be used for engineering analysis without significantly affecting accuracy in final computational results. At the same time improvements in computational speed can be almost a factor of two for large problems as a direct result of reduced data transfer, cf. Figure \ref{ch7:fig:perftimings}. Therefore, in pursue of high performance, it is of interest to exploit the reduced data transfers associated with replacing double-precision with single-precision floating point calculations. In a well-organized code this step can be taken with minimal programming effort.
 739 %
 740 \begin{figure}[!htb]
 741 \begin{center}
 742 % MainLaplace2D_ex025nonlinearLaplaceSINGLE.m
 743 \subfigure[Single precision.]{
 744 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/PrecisionSINGLE-eps-converted-to.pdf}
 745 }
 746 \subfigure[Double precision.]{
 747 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/PrecisionDOUBLE-eps-converted-to.pdf}
 748 }
 749 \end{center}
 750 \caption[Comparison between convergence histories for single- and double-precision computations using a PDC method for the solution of the transformed Laplace problem.]{Comparison between convergence histories for single- and double-precision computations using a PDC method for the solution of the transformed Laplace problem. Estimated errors are qualitatively very close to the algebraic errors. Very steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0903)$. Discretization based on $(N_x,N_z)=(15,9)$ with sixth order stencils.}
 751 \label{ch7:convhist}
 752 \end{figure}
 753
 754 Most scientific applications \cite{ch7:lessismore} use double-precision calculations to minimize accumulation of round-off errors, to employ higher precision for ill-conditioned problems, or to stabilize critical sections in the code that require higher precision. Round-off errors tend to accumulate more slowly when higher precision is used, thereby avoiding significant losses of accuracy due to round-off errors. Paradoxically, for many computational tasks, the need for high precision connected with the above-mentioned restrictions does not apply at all, or only applies to a portion of the tasks. In addition, on modern hardware there can be relative differences in peak floating point performance of up to about one order of magnitude in favor of single-precision over double-precision math calculations depending on choice of hardware architecture. However, for bandwidth-bound applications, the key performance metric is not floating-point performance, but rather bandwidth performance. In both cases, data transfer can be effectively halved by switching to single precision storage, in which case bandwidth performance increases and at the same time makes it possible to feed floating point units with data at effectively twice the speed. If maximizing performance is an ultimate goal, such considerations suggests that it can be possible to compute faster by using single- over double-precision arithmetics if accuracy requirements can be fulfilled.
 755 %
 756
 757 As a simple illustrative numerical experiment, we can consider the iterative solution of \eqref{ch7:eq:linsys} using the PDC method using single- and double-precision math, respectively. As a simple test case, we consider the solution of periodic stream function waves in two spatial dimensions. Computed convergence histories are presented in Figure \ref{ch7:convhist} where it is clear that the main difference is the attainable accuracy level achievable before stagnation. In both cases, the attainable accuracy, defined in terms of the absolute error for the exact stream function solution to the governing equations, is associated with accuracy of approximately $10^{-4}$ (solid line). In single-precision math, the algebraic and estimated errors measured in the two-norm can reach the level of machine precision. These results suggest that single-precision math is sufficient for calculating accurate solutions at the chosen spatial resolution. We find that the iterative solution of the $\sigma$-transformed Laplace problem by the PDC method does not immediately lead to significant accumulation of round-off errors and further investigations are warranted for unsteady computations.
 758
 759 Elaborating on this example, we examine the propagation of a regular stream function wave in time. We consider the errors in wave elevation as a function of time with and without a filtering\index{filtering} strategy for single-precision calculations in comparison with double-precision calculations. With an objective to exert control on accumulation of round-off errors that appear as high-frequency noise, the idea is to employ an inexpensive stencil-based filtering strategy, for example, a central filter in one spatial dimension
 760 \begin{align}
 761 \label{ch7:filter}
 762 \mathcal{F}u(x_i) = \sum_{n=-\alpha}^{\alpha} c_n u(x_{i+n}),
 763 \end{align}
 764 where $c_n\in\mathbb{R}$ are the stencil coefficients and $\alpha\in\mathbb{Z}_+$ is the stencil half-width. An active filter can for example be based on employing a Savitzky-Golay smoothening filter \cite{ch7:PT90}, e.g., the mild 7-point SG(6,10) filter, and applying it after every 10th time step to each of the collocation nodes defining the free surface variables. The same procedure can be used for stabilization of nonlinear simulations to remove high-frequency "saw-tooth" instabilities as shown in \cite{ch7:EBL08}. This filtering technique can also remove high-frequency noise resulting from round-off errors in computations that would otherwise potentially pollute the computational results and in the worst case leave them useless. The effect of this type of filtering on the numerical efficiency of the model is insignificant.
 765
 766 Results from numerical experiments are presented in Figure \ref{ch7:filtering}, and most of the errors can be attributed to phase errors resulting from difference in exact versus numerical phase speed. In numerical experiments, we find that while results computed in double-precision are not significantly affected by accumulation of round-off errors, the single-precision results are. In figures \ref{ch7:filtering} (a) and (b), a direct solver based on sparse Gaussian elimination within MATLAB\footnote{\url{http://www.mathworks.com}.} is used to solve the linear system at every stage and a comparison is made between single and unfiltered double-precision calculations. It is shown in Figure \ref{ch7:filtering} a) that without a filter, the single-precision calculations result in "blow-up" after which the solver fails just before 50 wave periods of calculation time. However, in Figure \ref{ch7:filtering} (b) it is demonstrated that invoking a smoothening filter, cf. \eqref{ch7:filter}, stabilizes the accumulation of round-off errors and the calculations continue and achieve reduced accuracy compared to the computed double-precision results. Thus, it is confirmed that such a filter can be used to control and suppress high-frequency oscillations that results from accumulation of round-off errors. In contrast, replacing the direct solver with an iterative PDC method using the GPU-accelerated wave model appears to be much more attractive upon inspection of Figures \ref{ch7:filtering} (c) and (d). The single-precision results are found to be stable with and {\em without} the filter-based strategy for this problem. The calculations show that single-precision math leads to slightly faster error accumulation for this choice of resolution, however, with only small differences in error level during long time integration. This highlights that fault-tolerance of the iterative PDC method contributes to securing robustness of the calculations.
 767
 768 \begin{figure}[!htb]
 769 \begin{center}
 770 % DriverWavemodelDecomposition.m
 771 \subfigure[Direct solve without filter.]{
 772 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/ComparisonLUNoFiltering-eps-converted-to.pdf}
 773 }
 774 \subfigure[Direct solve with filter.]{
 775 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/ComparisonLUWithFiltering-eps-converted-to.pdf}
 776 } \\
 777 \subfigure[Iterative PDC solve without filter.]{
 778 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/ComparisonDCNoFiltering-eps-converted-to.pdf}
 779 }
 780 \subfigure[Iterative PDC solve with filter.]{
 781 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/ComparisonDCWithFiltering-eps-converted-to.pdf}
 782 }
 783 \end{center}
 784 \caption[Comparison of accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering.]{Comparison between accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering. The double-precision result is unfiltered in each comparison and shows to be less sensitive to round-off errors. Medium steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0502)$. Discretization is based on $(N_x,N_z)=(30,6)$, A courant number of $C_r=0.5$, and sixth order stencils.}
 785 \label{ch7:filtering}
 786 \end{figure}
 787
 788 Last, we demonstrate using a classical benchmark for propagation of nonlinear waves over a semicircular shoal that single-precision math is likely to be sufficient for achieving engineering accuracy. The benchmark is based on Whalin's experiment~\cite{ch7:Whalin1971} which is often used in validation of dispersive water wave models for coastal engineering applications, e.g., see previous work \cite{ch7:EBL08}. Experimental results exists for incident waves with wave periods $T=1,2,3\,$s and wave heights $H=0.0390, 0.0150, 0.0136\,$m. All three test cases have been discretized with a computational grid of size ($257 \times 41 \times 7$) to resolve the physical dimensions of $L_x=35\,$m, $L_y=6.096\,$m. The still water depth decreases in the direction of the incident waves as a semicircular shoal from $0.4572\,$m to $0.1524\,$m with an illustration of a snapshot of the free surface given in Figure \ref{ch7:fig:whalinsetup}. The time step $\Delta t$ is computed based on a constant Courant number of $Cr=c\Delta x/\Delta t=0.8$, where $c$ is the incident wave speed and $\Delta x$ is the grid spacing. Waves are generated in the generation zone $0 \leq x/L \leq 1.5$, where $L$ is the wave length of incident waves, and absorbed again in the zone $35 - 2L \leq x \leq 35\,$m.
 789 %
 790
 791 A harmonic analysis of the wave spectrum at the shoal center line is computed and plotted in Figure \ref{ch7:whalinresults} for comparison with the analogous results obtained from the experiments data. The three harmonic amplitudes are computed via a Fast Fourier Transform (FFT) method using the last three wave periods up to $t=50\,$s. There is a satisfactory agreement between the computed and experimental results and no noticeable loss in accuracy resulting from the use of single-precision math.
 792 %
 793 \begin{figure}[!htb]
 794     \setlength\figureheight{0.3\textwidth}
 795     \setlength\figurewidth{0.32\textwidth}
 796 %    \begin{center}
 797     \subfigure[Whalin test case at $t=50$s, wave period $T=2$s, and grid dimensions ($257 \times 41 \times 7$)]{\label{ch7:fig:whalinsetup}
 798     \includegraphics[width=0.5\textwidth]{Chapters/chapter7/figures/WhalinWaveSol_t50_T2_single.pdf}
 799     }
 800     \subfigure[$T=1s$]{
 801     {\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T1_single.tikz}}
 802     }
 803     \subfigure[$T=2s$]{
 804     {\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T2_single.tikz}}
 805     }
 806     \subfigure[$T=3s$]{
 807     {\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T3_single.tikz}}
 808     }
 809 %    \end{center}
 810     \caption[Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s.]{Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s respectively. Measured experimental and computed results (single-precision) are in good agreement. Test environment 1.}\label{ch7:whalinresults}
 811 \end{figure}
 812
 813 \subsection{Acceleration via parallelism in time using parareal}\label{ch7:parareal}\index{parareal}
 814
 815 % Performance is no longer free  with new hardware
 816 With modern many-core architectures, performance is no longer intrinsic and free on new generations of hardware. Added performance with new hardware now comes with the requirement of sufficient parallelism in the application to be accelerated. Methods, tricks, and techniques for extracting parallelism in scientific applications are thus becoming increasingly relevant to enable added numerical accuracy as well as minimization of time to solution in the pursuit of faster and better analysis for engineering applications.
 817
 818 % Parareal as component
 819 The parareal algorithm has been introduced as a component in our in-house GPU library as described in Section \ref{ch5:parareal}. The parareal library component makes it possible to easily investigate potential opportunities for further acceleration of the water wave model on a heterogeneous system and to assess practical feasibility of this algorithmic strategy for various wave types. We omit a detailed review of the parareal algorithm and refer to details given in Section \ref{ch5:sec:parareal} together with recent reviews \cite{ch7:LMT01,ch7:MS07,ch7:ASNP12}.
 820
 821 % Explain key parameters shortly
 822 In Section \ref{ch5:parareal} it is assumed that communication costs can be neglected and a simple model for the algorithmic work complexity is derived. It is found that there are four key discretization parameters for parareal that need to be balanced appropriately in order to achieve high parallel efficiency: the number of coarse-grained time intervals $N$, the number of iterations $K$, the ratio between the computational cost of the coarse to the fine propagator $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$, and the ratio between fine and coarse time step sizes $\delta t/\delta T$.
 823
 824 % How to obtain speed-up
 825 Ideally, the ratio $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$ is small and convergence happens in $k=1$ iteration. This is seldom the case though, as it requires the coarse propagator to achieve accuracy close to that of the fine propagator while at the same time being substantially cheaper computationally, these two objectives obviously being conflicting. Obtaining the highest possible speed-up is a matter of trade-off, typically, the more GPUs used, the faster the coarse propagator should be. The performance of parareal is problem- and discretization-dependent and as such one would suspect that different wave parameters influence the suitability of the method. This was investigated in \cite{ch7:ASNP12} and indeed the performance does change with wave parameters. Typically the method works better for deep water waves with low to medium wave amplitudes.
 826 %
 827 \begin{figure}[!htb]
 828     \begin{center}
 829     \setlength\figureheight{0.35\textwidth}
 830     \setlength\figurewidth{0.35\textwidth}
 831     \subfigure[Performance scaling]{
 832 %        {\small\input{Chapters/chapter7/figures/PararealScaletestGTX590.tikz}}
 833       \includegraphics[width=0.47\textwidth]{Chapters/chapter7/figures/PararealScaletestGTX590_conv.pdf}
 834     }
 835     \subfigure[Speedup]{
 836        % {\small\input{Chapters/chapter7/figures/PararealSpeedupGTX590.tikz}}
 837  \includegraphics[width=0.47\textwidth]{Chapters/chapter7/figures/PararealSpeedupGTX590_conv.pdf}
 838     }
 839     \end{center}
 840     \caption[Parareal absolute timings and parareal speedup.]{(a) Parareal absolute timings for an increasingly number of water waves traveling one wave length; each wave resolution is ($33\times 9$). (b) Parareal speedup for two to sixteen compute nodes compared to the purely sequential single GPU solver. Is is noticeable how insensitive the parareal scheme is to the size of the problem solved. Test environment 3.}\label{ch7:fig:DDPA_SPEEDUP}
 841 \end{figure}
 842 %
 843
 844 % What did we do and what are the results
 845 We have performed a scalability study for parareal using 2D nonlinear stream function waves based on a discretization with $(N_x,N_z)=(33,9)$ collocation points, cf. Figure \ref{ch7:fig:DDPA_SPEEDUP}. The study shows that moderate speedup is possible for this hyperbolic system. Using four GPU nodes, a speedup of slightly more than two was achieved while using sixteen GPU nodes resulted in a speedup of slightly less than five. As noticed in Figure \ref{ch7:fig:DDPA_SPEEDUP}, parallel efficiency decreases quite fast when using more GPUs. This limitation is due to the usage of a fairly slow and accurate coarse propagator and is linked to a known difficulty with parareal applied to hyperbolic systems. For hyperbolic systems, instabilities tend to arise when using a very inaccurate coarse propagator. This prevents using a large number of time subdomains, as this by Amdahl's law also requires a very fast coarse propagator. The numbers are still impressive though, considering that the speedup due to parareal comes as additional speedup to an already efficient and fast code.
 846 Performance results for the Whalin test case are also shown in Figure \ref{ch7:fig:whalinparareal}. There is a natural limitation to how much we can increase $R$ (the ratio between the complexity of the fine and coarse propagators), because of stability issues with the coarse propagator. In this test case we simulate from $t=[0,1]$s, using up to $32$ GPUs. For low $R$ and only two GPUs, there is no speedup gain, but for the configuration with eight or more GPUs and $R\geq6$, we are able to get more than $2$ times speedup. Though these hyperbolic systems are not optimal for performance tuning using the parareal method, results still confirm that reasonable speedups are in fact possible on heterogenous systems.
 847
 848 \begin{figure}[!htb]
 849     \setlength\figureheight{0.3\textwidth}
 850     \setlength\figurewidth{0.32\textwidth}
 851 %    \begin{center}
 852     \subfigure[Speedup]{
 853     {\small\input{Chapters/chapter7/figures/WhalinPararealSpeedup.tikz}}
 854     }
 855     \subfigure[Efficiency]{
 856     {\small\input{Chapters/chapter7/figures/WhalinPararealEfficiency.tikz}}
 857     }
 858 %    \end{center}
 859     \caption[Parallel time integration using the parareal method.]{Parallel time integration using the parareal method. $R$ is the ratio between the complexity of the fine and coarse propagators. Test environment 3.}\label{ch7:fig:whalinparareal}
 860 \end{figure}
 861
 862 % Comparison with DD
 863 The parareal method is observed to be the most viable approach at speeding up small-scale problems due to the reduced communication and overhead involved. For sufficiently large problems, where sufficient work is available to hide the latency in data communication, we find the spatial domain decomposition method to be more favorable as it does not involve the addition of computational work and thereby allows for ideal speedup, something usually out of reach for the parareal algorithm. An important thing to note here is that it is technically possible to extend the work and wrap the parareal method around the domain decomposition method, thereby obtaining a multiplication of the combined speedup of both methods. This is of great interest in the sense that for any problem size, increasing the number of spatial sub-domains will eventually degrade speedup due to the latency in communication of boundaries. By exploiting the latency robustness of parareal in conjunction with domain decomposition parallelism, it may be possible to go large scale for problems that would otherwise be too small to exploit a large number of GPUs. These investigations are subject to future work.
 864
 865 % Final remarks
 866 Finally, we remark that the parareal algorithm is also a fault-tolerant algorithm. This property follows from the iterative nature of the algorithm and implies that a process can be lost during computations and regenerated without restarting the computations. This can be exploited to minimize total run time in case of such failures.
 867
 868 %\newpage
 869 \subsection{Towards real-time interactive ship simulation}\index{real-time simulation}
 870
 871 A fast GPU-accelerated ship hydrodynamic model is developed for real-time interactive ship simulation by modification of the unified potential flow model presented in Section \ref{ch7:goveq}. The target scientific application is an interactive full mission marine simulator, where multiple ships controlled by naval officers can navigate in a near-realistic virtual marine environment. Full mission simulators are used for education and training of naval officers in critical manoeuvring operations and for evaluation of ship and marine infrastructure designs. To predict the motion of ships, a hydrodynamics model is required for prediction of forces by \eqref{ch7:forcecalc} which is affected by the kinematic properties of the model, cf. Section \ref{ch7:dispkin}. The state-of-the-art for such a hydrodynamic model in today's real-time ship simulators is based on fast interpolation and proper scaling of experimental model data. The amount of experimental model data is limited with respect to hull forms and configurations, requiring the need for extrapolation that compromises the accuracy.
 872
 873 The objective of current and ongoing work is aimed at removing these limitations by replacing the existing hydrodynamic model and instead calculating at full-scale the flow field, wave field, ship-structure, and ship-ship interaction forces in real-time using massive parallel computation technology. The potential flow model (OceanWave3D) presented in Section \ref{ch7:goveq} is suitable as the modeling basis for this purpose since it is robust, accurate, efficient, and scalable to arbitrarily large domains. Furthermore, it can accurately account for dispersive waves in the range from shallow to deep waters in marine settings where the sea bed may be uneven.
 874
 875 The inclusion of ships in the wave model requires an approximate representation of such ships in the model. These ship approximations have to be chosen carefully with consideration to the computational performance of the numerical model to enable interactive real-time computing on today's modern hardware. For a first simple proof-of-concept we develop a linear wave and ship model to be used as the model basis. This implies that wave heights and ship draft are assumed to be of small amplitude corresponding and derived by a linearization technique around the mean sea level $z=0$ m.
 876
 877 The physical domain for the computation is bounded from below by the seabed and from above by the free surface of the sea or the hull of the ship. If the ship is navigating in open water, the ship's physical spatial domain is unbounded in the horizontal direction and in confined waters it is bounded by harbor structures, etc. The representation of the physical domain surrounding the ship is done by finite truncation in the horizontal directions. The resulting time-varying finite physical domain $\Omega(t)$ is a box fixed to follow the ship motion, with the ship in the middle of the top face and with Cartesian coordinate axes aligned with the horizontal components of the forward and sideward ship directions and the upward is the opposite of the direction of gravitational acceleration.
 878
 879 A linear model can be formulated in terms of kinematic and dynamic boundary conditions at the mean sea level \eqref{ch7:FSorigin} together with a Laplace problem subject to a variable depth kinematic boundary condition \eqref{ch7:eq:laplaceproblem}. The effects of ship hulls can be accounted for by splitting the scalar velocity potential function into steady $\phi_0$ and unsteady $\phi_1$ potentials such that $\phi = \phi_0 + \phi_1$ and with a quasi-static approximation of the pressure acting on the ship hull as suggested in \cite{ch7:LindbergEtAlIWWWFB2012}. This leads to a relatively simple ship model that enables a flexible and computationally efficient approximation of the ship geometry.
 880 The steady potential $\phi_0$ is calculated using a double-body approximation \cite{ch7:RavenJMST2010} of the ship and the unsteady potential $\phi_1$ is calculated by a linear free surface flow model. The resulting double-body flat-ship problem becomes
 881 \begin{subequations}
 882 \begin{align}
 883  \nabla^2 \phi_0 &= 0,  \quad -h \leq z \leq 0 \\
 884  \frac{\partial \phi_0}{\partial n} &= 0, \quad z = -h, \\
 885  \frac{\partial \phi_0}{\partial z} &= -U \frac{\partial \eta_0}{\partial x}, \quad z = 0,
 886  \end{align}
 887  \end{subequations}
 888 where $U$ is the velocity of the ship. The unsteady linear water problem is used to calculate the unsteady wave evolution around and away from the ship
 889 \begin{subequations}
 890 \begin{align}
 891  \nabla^2 \phi_1 &= 0, \quad -h \leq z \leq 0, \\
 892  \frac{\partial \phi_1}{\partial n} &= 0, \quad z = -h, \\
 893 \frac{\partial \phi_1}{\partial t}
 894 + (u_0 - U) \frac{\partial \phi_1}{\partial x}
 895 + v_0 \frac{\partial \phi_1}{\partial y}
 896  &= - \left(\frac{1}{2} \mathbf{u}_0 \cdot \mathbf{u}_0 + g \eta_0 + \frac{p_{ship}}{\rho} \right), \quad z=0, \label{ch7:quasistatic} \\
 897 \frac{\partial \eta_1}{\partial t}
 898 + (u_0 - U) \frac{\partial \eta_1}{\partial x}
 899 + v_0 \frac{\partial \eta_1}{\partial y} &= \frac{\partial \phi_1}{\partial z}, \quad z=0,
 900  \end{align}
 901  \end{subequations}
 902 where the pressure on the ship hull $p_{ship}$ is calculated explicitly based on a quasi-static approximation which is determined by assuming $\partial_t\phi_1\approx0$ and rewriting \eqref{ch7:quasistatic}. In general, a ship hull is a complex surface in three-dimensional space, but its draft can be approximated by a single-valued function of the horizontal coordinates $\eta_0 = \eta_0(x,y)$, and the no-flux condition on the ship hull is approximated by a flat-ship approximation. Radiation boundary conditions are approximated by a Sommerfelt absorbing boundary condition \cite{ch7:DgayguiJolySJAM1994} on the vertical sides of the physical domain to let waves escape the domain.
 903
 904 The modified numerical model can still be based on flexible-order finite difference method as discussed in Section \ref{ch7:sec:nummodel}. The computational bottleneck problem is the efficient solution of the Laplace problem twice which can be done efficiently by the GPU-accelerated iterative PDC method as explained in section \ref{ch7:PDCmethod}. A snapshot of the steady state wave field is provided in the introduction to this chapter. Computed resistance curves for a Series 60 hull moving at forward speed corresponding to Froude number $F_n=0.316$ knots in calm water are compared to experimental data \cite{ch7:TodaEtAl1992} in Figure \ref{ch7:fig:shiphydro} (a). The computed Kelvin wave system is shown in Figure \ref{ch7:fig:shiphydro} (b). The computed results compare well with experiments at moderate ship Froudes numbers $F_n=U/\sqrt{gh}$ in the range 0.1-0.25 as expected for a linear model. The real-time constraint required to fulfill the interactive and visualization requirements can currently be met with the GPU-accelerated hydrodynamics model for problem sizes of approximately $10^6$ for ship Froude numbers in the range 0.1-0.3. The modeling and real-time aspects will be addressed in more detail in ongoing work.
 905
 906 %
 907 \begin{figure}[!htb]
 908 \begin{center}
 909 \subfigure[Hydrodynamic force calculations.]{
 910 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/figSeries60CB06Type7Resistance-eps-converted-to.pdf}
 911 }
 912 \subfigure[Kelvin pattern.]{
 913 \includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/figSeries60CB06Type7kelvin-eps-converted-to.pdf}
 914 }
 915 \end{center}
 916 \caption{Computed results. Comparison with experiments for hydrodynamics force calculations confirming engineering accuracy for low Froude numbers.}
 917 \label{ch7:fig:shiphydro}
 918 \end{figure}
 919 %
 920
 921 %\newpage
 922 \section{Conclusion and future work}
 923
 924 We have presented implementation details together with several novel results on development of a new massively parallel and scalable tool for simulation of nonlinear free surface water waves on heterogenous hardware. The tool is based on the unified potential flow model referred to as OceanWave3D \cite{ch7:EBL08} which provides the basis for efficient and scalable simulation of water waves over uneven bottoms on arbitrary domain sizes. We have demonstrated in a few examples how we can accelerate performance by using single-precision math without comprimising accuracy. We have shown that performance can be accelerated by introducing concurrency in the time integration using the parareal algorithm and for the first time in a heterogenous setup based on the use of multiple GPUs. Interestingly, we find that parallel computations using parareal may be more efficient than using conventional data-parallel distributed computations in a multi-GPU setup for moderate problem sizes. We have measured absolute performance and scalability using several of the most recent generations of NVIDIA GPUs to detail the efficiency of the current code. This is useful to predict time to results as explained in \cite{ch7:EngsigKarupEtAl2011} and may be compared against other wave models in fair comparisons.
 925
 926 Work in progress focuses on extending the governing equations to account for lack of physics such as wave runup and wave breaking. Also, we plan to extend the domain decomposition method to unstructured grids of blocks that can be boundary-fitted to more general bottom-mounted structures to be able to address wave-structure problems, cf. \cite{ch7:EHBM06,ch7:EHBW08}. For example, this will provide the basis for simulations of wave transformations in large harbor areas or predict wave climates in near-coastal areas.
 927
 928 We anticipate that a tool based on the proposed parallel solution strategies will be useful for further advancement in fast and robust analysis techniques and large-scale simulation of free surface wave simulation (e.g., for use as an efficient far-field solver at large scales) and be a basis for next-generation wave models. We also expect that the tool can be useful for hybrid-solution strategies with local flow features possibly resolved by other models and for advancing state-of-the-art in fast physics-based wave-body simulations, e.g., ship-wave interactions in ship simulation where real-time constraints are imposed due to visualization. These subjects will be part of ongoing work addressing application aspects.
 929
 930 \section{Acknowledgment}
 931
 932 This work was supported by grant no. 09-070032 from the Danish Research Council for Technology and Production Sciences. A special thank goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests was done in the GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at Center for Computing and Visualization, Brown University, USA. NVIDIA Corporation is acknowledged for generous hardware donations to facilities of the GPUlab.
 933
 934 \putbib[Chapters/chapter7/biblio7]