\includegraphics[width=0.98\textwidth]{Chapters/chapter7/figures/nonlinearwavespenalty-eps-converted-to.pdf}
% Nx = 540, 6th order, vertical clustering, Nz=6;
}
-\caption[Snapshots at intervals $T/8$ over one wave period in time.]{Snapshots at intervals $T/8$ over one wave period in time of computed (a) small-amplitude $(kh,kH)=(0.63,0.005)$ and (b) finite-amplitude $(kh,kH)=(1,0.41)$ stream function waves elevations having reached a steady state after transient startup. Combined wave generation and absorption zones occur in the left relaxation zone of both (a) and (b). In b) an absorption zone is positioned next to the right boundary and causes minor visible reflections. }
+\caption[Snapshots at intervals $T/8$ over one wave period in time.]{Snapshots at intervals $T/8$ over one wave period in time of computed (a) small-amplitude $(kh,kH)=(0.63,0.005)$ and (b) finite-amplitude $(kh,kH)=(1,0.41)$ stream function waves elevations having reached a steady state after transient startup. Combined wave generation and absorption zones occur in the left relaxation zone of both (a) and (b). In (b) an absorption zone is positioned next to the right boundary and causes minor visible reflections. }
\label{ch7:figstandwave}
\end{figure}
\end{align}
where $\mathcal{M}$ is a preconditioner with the property that $\mathcal{M}^{-1}\approx \mathcal{A}^{-1}$ can be computed at low cost.
-The bottleneck problem in a unified potential flow model is the solution of a discrete $\sigma$-transformed Laplace problem stated in the compact forms \eqref{ch7:linsyscompact} or \eqref{ch7:eq:linsys}. It is attractive to find an efficient iterative strategy where convergence\index{convergence} is understood via a convergence theory, has modest storage requirements, has minimal global communication requirements (in the form of global inner products), and is suitable for flexible-order\index{flexible order} discretizations. The class of geometric Multigrid Methods\index{multigrid} fulfills these requirements and has shown to be among the most efficient iterative strategies for a wide class of problems \cite{ch7:Trottenberg01}. In particular, the time required to solve a system of linear equations to a given accuracy level can be made to scale proportional to the number of unknowns.
+The bottleneck problem in a unified potential flow model is the solution of a discrete $\sigma$-transformed Laplace problem stated in the compact forms \eqref{ch7:linsyscompact} or \eqref{ch7:eq:linsys}. It is attractive to find an efficient iterative strategy where convergence\index{convergence} is understood via a convergence theory, has modest storage requirements, has minimal global communication requirements (in the form of global inner products), and is suitable for flexible-order\index{flexible order} discretizations. The class of geometric Multigrid methods\index{multigrid} fulfills these requirements and has been shown to be among the most efficient iterative strategies for a wide class of problems \cite{ch7:Trottenberg01}. In particular, the time required to solve a system of linear equations to a given accuracy level can be made to scale proportional to the number of unknowns.
There are several known approaches to multigrid methods \cite{ch7:MR744926} for high-order discretizations. Among these, Defect Correction\index{defect correction} Methods (DCMs) \cite{ch7:Stetter1978,ch7:AuzingerStetter1982,ch7:Hackbusch1982,ch7:Auzinger1987,ch7:Trottenberg01} have been employed successfully, e.g., in computational fluid dynamics \cite{ch7:LaytonEtAl2002}, in numerical simulations since the early 1970s.
The fundamental idea of DCMs is to combine the good stability properties of low-order discretizations with high-order accuracy discretizations for explicit residual evaluations. These iterative methods impose low storage requirements, have scalable work effort under suitable choices of preconditioning strategies, and may be accelerated using a multigrid method based on low-order discretizations while still achieving high-order accuracy.
{\scriptsize\input{Chapters/chapter7/figures/TeslaK20SpeedupGPUvsCPU3D.tikz}}
}
\end{center}
- \caption[Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed and double precision arithmetics.]{Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed, and double precision arithmetics. Three-dimensional nonlinear waves, using sixth order finite difference approximations, preconditioned with one multigrid V-cycle and with one pre- and post- red-black Gauss-Seidel smoothing operation. Speedup compared to fastest known serial implementation. Using Test environment 4, CPU timings represent starting points for our investigations and have been obtained using the Fortran 90 code. These references results are based on a single-core (non-parallel) run on a Intel Core i7, 2.80GHz processor.}\label{ch7:fig:perftimings}
+ \caption[Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed, and double precision arithmetics.]{Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed, and double precision arithmetics. Three-dimensional nonlinear waves, using sixth order finite difference approximations, preconditioned with one multigrid V-cycle and with one pre- and post- red-black Gauss-Seidel smoothing operation. Speedup compared to fastest known serial implementation. Using test environment 4, CPU timings represent starting points for our investigations and have been obtained using the Fortran 90 code. These references results are based on a single-core (non-parallel) run on a Intel Core i7, 2.80GHz processor.}\label{ch7:fig:perftimings}
\end{figure}
The parallel domain decomposition solver has been validated against the sequential solvers with respect to algorithmic efficiency to establish that the code produces correct results. An analysis of the numerical efficiency has also been carried out on different GPU systems to identify comparative behaviors as both the problems sizes and number of compute nodes vary. For example, performance scalings on Test environment 1 and Test environment 3 are presented in Figure \ref{ch7:fig:multigpuperformance}. The figure confirms that there is only a limited benefit from using multiple GPUs for small problem sizes, since the computational intensity is simply too low to efficiently hide the latency of message passing. A substantial speedup is achieved compared to the single GPU version, while being able to solve even larger systems.
It is described in Chapter \ref{ch5} how we have developed a heterogenous library has for fast prototyping of PDE solvers, utilizing the massively parallel architecture of CUDA-enabled GPUs. The objective is to provide a set of generic components within a single framework, such that software developers can assemble application-specific solvers efficiently at a high abstraction level, requiring a minimum of CUDA specific kernel implementations and parameter tuning.
-The CUDA-based numerical wave model has been developed based on all the numerical techniques described in preceding sections. These techniques are a part of the library implemented as generic components, which makes them useful for the numerical solutions of PDE systems. The components of the numeral model as described in Section \ref{ch7:sec:nummodel} include an ERK4 time integrator, flexible-order finite difference approximations on regular grids, and an iterative multigrid PDC solver for the $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}. Application developers can use either these components directly from the library or are free to combine their own implementations with library components, if they need alternative strategies that are not present in the library.
+The CUDA-based numerical wave model has been developed based on all the numerical techniques described in preceding sections. These techniques are a part of the library implemented as generic components, which makes them useful for the numerical solutions of PDE systems. The components of the numeral model as described in Section \ref{ch7:sec:nummodel} include an ERK4 time integrator, flexible-order finite difference approximations on regular grids, and an iterative multigrid PDC solver for the $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}. Application developers either either can use these components directly from the library or are free to combine their own implementations with library components, if they need alternative strategies that are not present in the library.
For the unified potential flow model the user will need to provide implementations of the following components: the right-hand side operator for the semidiscrete free surface variables \eqref{ch7:FSorigin}, the matrix-vector operator for the discretized $\sigma$-transformed Laplace equation \eqref{ch7:TransformedLaplace}, a smoother for the multigrid relaxation step, and the potential flow solver itself, that reads initial data and advances the solution in time. In order to make the library as generic as possible, all components are template-based, which makes it possible to assemble the PDE solver by combining type definitions in the preamble of the application. An excerpt of the potential flow assembling is given in Listing \ref{ch7:lst:solversetup}.
-
+\pagebreak
\lstset{label=ch7:lst:solversetup,caption={generic assembling of the potential flow solver for fully nonlinear free surface water waves}
%,basicstyle=\scriptsize
}
Hereafter, the potential flow solver is aware of all component types that should be used to solve the entire PDE system, and it will be easy for developers to exchange parts at later times. The \texttt{laplace\_sigma\_stencil\_3d} class implements both the matrix-vector and right-hand side operator. The flexible-order finite difference kernel for the matrix-free matrix-vector product for the two-dimensional Laplace problem is presented in Listing \ref{ch7:lst:fd2d}. Library macros and reusable kernel routines are used throughout the implementations to enhance developer productivity and hide hardware specific details. This kernel can be used both for matrix-vector products for the original system and for the preconditioning.
-\lstset{label=ch7:lst:fd2d,caption={CUDA kernel implementation for the two dimensional finite difference approximation to the transformed Laplace equation}
+\lstset{label=ch7:lst:fd2d,caption={CUDA kernel implementation for the two-dimensional finite difference approximation to the transformed Laplace equation}
%,basicstyle=\scriptsize\ttfamily
}
\begin{lstlisting}
E_m(N_x,N_z,kh,h/L) &= \frac{1}{h}\int_{-h}^\eta \left( \frac{\phi(z)-\phi_s(z)}{\phi_s(z)} \right)^2 dz,
\end{align}
\end{subequations}
-where $m$ is one of the scalar functions $\phi,u,w$ describing kinematics, $c$ is the numerical phase celerity of regular waves, and $c_s=\sqrt{g\tanh(kh)/k}$ is the exact phase celerity according to linear Stokes Theory \cite{ch7:SJ01}. Measurements of the errors are taken in the vertical below the crest of a wave which is well resolved in the horizontal direction.
+where $m$ is one of the scalar functions $\phi,u,w$ describing kinematics; $c$ is the numerical phase celerity of regular waves; and $c_s=\sqrt{g\tanh(kh)/k}$ is the exact phase celerity according to linear Stokes Theory \cite{ch7:SJ01}. Measurements of the errors are taken in the vertical below the crest of a wave which is well resolved in the horizontal direction.
% kinematicAnalysis.m
\begin{figure}[!htb]
\subsection{On precision requirements in engineering applications}\index{precision}
-Practical engineering applications are widely used for analysis purposes and give support to decision making in engineering design. For engineering purposes the turn-around time for producing analysis results is of crucial importance as it affects cost-benefit of work efforts. The key interest is often just "engineering accuracy" in computed end results which suggests that we can do with less precision in calculations. One may ask: {\em what are the precision requirements for engineering applications?}
+Practical engineering applications are widely used for analysis purposes and give support to decision making in engineering design. For engineering purposes the turn-around time for producing analysis results is of crucial importance as it affects cost-benefit of work efforts. The key interest is often just ``engineering accuracy'' in computed end results which suggests that we can do with less precision in calculations. One may ask: {\em what are the precision requirements for engineering applications?}
-In a recent study \cite{ch7:EngsigKarupEtAl2011,ch7:Glimberg2011}, it was shown that the PDC method when executed on GPUs can be utilized to efficiently solve water wave problems. This was done by trading accuracy for speed in parts of the PDC algorithm, e.g., by using either single, or mixed-precision\index{precision!mixed} computations. Without preconditioning the PDC method reduces to a classical iterative refinement technique, which is known to be fault tolerant \cite{ch7:Higham:2002:ASN}.
+In a recent study \cite{ch7:EngsigKarupEtAl2011,ch7:Glimberg2011}, it was shown that the PDC method when executed on GPUs can be utilized to efficiently solve water-wave problems. This was done by trading accuracy for speed in parts of the PDC algorithm, e.g., by using either single, or mixed-precision\index{precision!mixed} computations. Without preconditioning the PDC method reduces to a classical iterative refinement technique, which is known to be fault tolerant \cite{ch7:Higham:2002:ASN}.
-Previously reported performance results for the wave model can be taken a step further. We seek to demonstrate how single precision computations can be used for engineering analysis without significantly affecting accuracy in final computational results. At the same time improvements in computational speed can be almost a factor of two for large problems as a direct result of reduced data transfer, cf. Figure \ref{ch7:fig:perftimings}. Therefore, in pursue of high performance, it is of interest to exploit the reduced data transfers associated with replacing double-precision with single-precision floating point calculations. In a well-organized code this step can be taken with minimal programming effort.
+Previously reported performance results for the wave model can be taken a step further. We seek to demonstrate how single-precision computations can be used for engineering analysis without significantly affecting accuracy in final computational results. At the same time improvements in computational speed can be almost a factor of two for large problems as a direct result of reduced data transfer, cf. Figure \ref{ch7:fig:perftimings}. Therefore, in pursue of high performance, it is of interest to exploit the reduced data transfers associated with replacing double-precision with single-precision floating point calculations. In a well-organized code this step can be taken with minimal programming effort.
%
\begin{figure}[!htb]
\begin{center}
\label{ch7:convhist}
\end{figure}
-Most scientific applications \cite{ch7:lessismore} use double-precision calculations to minimize accumulation of round-off errors, to employ higher precision for ill-conditioned problems, or to stabilize critical sections in the code that require higher precision. Round-off errors tend to accumulate more slowly when higher precision is used, thereby avoiding significant losses of accuracy due to round-off errors. Paradoxically, for many computational tasks, the need for high precision connected with the above-mentioned restrictions does not apply at all, or only applies to a portion of the tasks. In addition, on modern hardware there can be relative differences in peak floating point performance of up to about one order of magnitude in favor of single-precision over double-precision math calculations depending on choice of hardware architecture. However, for bandwidth-bound applications, the key performance metric is not floating-point performance, but rather bandwidth performance. In both cases, data transfer can be effectively halved by switching to single precision storage, in which case bandwidth performance increases and at the same time makes it possible to feed floating point units with data at effectively twice the speed. If maximizing performance is an ultimate goal, such considerations suggests that it can be possible to compute faster by using single- over double-precision arithmetics if accuracy requirements can be fulfilled.
+Most scientific applications \cite{ch7:lessismore} use double-precision calculations to minimize accumulation of round-off errors, to employ higher precision for ill-conditioned problems, or to stabilize critical sections in the code that require higher precision. Round-off errors tend to accumulate more slowly when higher precision is used, thereby avoiding significant losses of accuracy due to round-off errors. Paradoxically, for many computational tasks, the need for high precision connected with the above-mentioned restrictions does not apply at all, or only applies to a portion of the tasks. In addition, on modern hardware there can be relative differences in peak floating-point performance of up to about one order of magnitude in favor of single-precision over double-precision math calculations depending on choice of hardware architecture. However, for bandwidth-bound applications, the key performance metric is not floating-point performance, but rather bandwidth performance. In both cases, data transfer can be effectively halved by switching to single-precision storage, in which case bandwidth performance increases and at the same time makes it possible to feed floating-point units with data at effectively twice the speed. If maximizing performance is an ultimate goal, such considerations suggests that it can be possible to compute faster by using single- over double-precision arithmetics if accuracy requirements can be fulfilled.
%
As a simple illustrative numerical experiment, we can consider the iterative solution of \eqref{ch7:eq:linsys} using the PDC method using single- and double-precision math, respectively. As a simple test case, we consider the solution of periodic stream function waves in two spatial dimensions. Computed convergence histories are presented in Figure \ref{ch7:convhist} where it is clear that the main difference is the attainable accuracy level achievable before stagnation. In both cases, the attainable accuracy, defined in terms of the absolute error for the exact stream function solution to the governing equations, is associated with accuracy of approximately $10^{-4}$ (solid line). In single-precision math, the algebraic and estimated errors measured in the two-norm can reach the level of machine precision. These results suggest that single-precision math is sufficient for calculating accurate solutions at the chosen spatial resolution. We find that the iterative solution of the $\sigma$-transformed Laplace problem by the PDC method does not immediately lead to significant accumulation of round-off errors and further investigations are warranted for unsteady computations.
\end{align}
where $c_n\in\mathbb{R}$ are the stencil coefficients and $\alpha\in\mathbb{Z}_+$ is the stencil half-width. An active filter can for example be based on employing a Savitzky-Golay smoothening filter \cite{ch7:PT90}, e.g., the mild 7-point SG(6,10) filter, and applying it after every 10th time step to each of the collocation nodes defining the free surface variables. The same procedure can be used for stabilization of nonlinear simulations to remove high-frequency "saw-tooth" instabilities as shown in \cite{ch7:EBL08}. This filtering technique can also remove high-frequency noise resulting from round-off errors in computations that would otherwise potentially pollute the computational results and in the worst case leave them useless. The effect of this type of filtering on the numerical efficiency of the model is insignificant.
-Results from numerical experiments are presented in Figure \ref{ch7:filtering}, and most of the errors can be attributed to phase errors resulting from difference in exact versus numerical phase speed. In numerical experiments, we find that while results computed in double-precision are not significantly affected by accumulation of round-off errors, the single-precision results are. In figures \ref{ch7:filtering} (a) and (b), a direct solver based on sparse Gaussian elimination within MATLAB\footnote{\url{http://www.mathworks.com}.} is used to solve the linear system at every stage and a comparison is made between single and unfiltered double-precision calculations. It is shown in Figure \ref{ch7:filtering} a) that without a filter, the single-precision calculations result in "blow-up" after which the solver fails just before 50 wave periods of calculation time. However, in Figure \ref{ch7:filtering} (b) it is demonstrated that invoking a smoothening filter, cf. \eqref{ch7:filter}, stabilizes the accumulation of round-off errors and the calculations continue and achieve reduced accuracy compared to the computed double-precision results. Thus, it is confirmed that such a filter can be used to control and suppress high-frequency oscillations that results from accumulation of round-off errors. In contrast, replacing the direct solver with an iterative PDC method using the GPU-accelerated wave model appears to be much more attractive upon inspection of Figures \ref{ch7:filtering} (c) and (d). The single-precision results are found to be stable with and {\em without} the filter-based strategy for this problem. The calculations show that single-precision math leads to slightly faster error accumulation for this choice of resolution, however, with only small differences in error level during long time integration. This highlights that fault-tolerance of the iterative PDC method contributes to securing robustness of the calculations.
+Results from numerical experiments are presented in Figure \ref{ch7:filtering}, and most of the errors can be attributed to phase errors resulting from difference in exact versus numerical phase speed. In numerical experiments, we find that while results computed in double-precision are not significantly affected by accumulation of round-off errors, the single-precision results are. In Figures \ref{ch7:filtering} (a) and (b), a direct solver based on sparse Gaussian elimination within MATLAB\footnote{\url{http://www.mathworks.com}.} is used to solve the linear system at every stage and a comparison is made between single- and unfiltered double-precision calculations. It is shown in Figure \ref{ch7:filtering} a) that without a filter, the single-precision calculations result in ``blow-up'' after which the solver fails just before 50 wave periods of calculation time. However, in Figure \ref{ch7:filtering} (b) it is demonstrated that invoking a smoothening filter, cf. \eqref{ch7:filter}, stabilizes the accumulation of round-off errors and the calculations continue and achieve reduced accuracy compared to the computed double-precision results. Thus, it is confirmed that such a filter can be used to control and suppress high-frequency oscillations that results from accumulation of round-off errors. In contrast, replacing the direct solver with an iterative PDC method using the GPU-accelerated wave model appears to be much more attractive upon inspection of Figures \ref{ch7:filtering} (c) and (d). The single-precision results are found to be stable with and {\em without} the filter-based strategy for this problem. The calculations show that single-precision math leads to slightly faster error accumulation for this choice of resolution, however, with only small differences in error level during long time integration. This highlights that fault-tolerance of the iterative PDC method contributes to securing robustness of the calculations.
\begin{figure}[!htb]
\begin{center}
\includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/ComparisonDCWithFiltering-eps-converted-to.pdf}
}
\end{center}
-\caption[Comparison of accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering.]{Comparison between accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering. The double-precision result is unfiltered in each comparison and shows to be less sensitive to round-off errors. Medium steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0502)$. Discretization is based on $(N_x,N_z)=(30,6)$, A courant number of $C_r=0.5$, and sixth order stencils.}
+\caption[Comparison of accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering.]{Comparison of accuracy as a function of time for double-precision calculations vs. single-precision with and without filtering. The double-precision result is unfiltered in each comparison and shows to be less sensitive to round-off errors. Medium steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0502)$. Discretization is based on $(N_x,N_z)=(30,6)$, a courant number of $C_r=0.5$, and sixth order stencils.}
\label{ch7:filtering}
\end{figure}
-Last, we demonstrate using a classical benchmark for propagation of nonlinear waves over a semicircular shoal that single-precision math is likely to be sufficient for achieving engineering accuracy. The benchmark is based on Whalin's experiment~\cite{ch7:Whalin1971} which is often used in validation of dispersive water wave models for coastal engineering applications, e.g., see previous work \cite{ch7:EBL08}. Experimental results exists for incident waves with wave periods $T=1,2,3\,$s and wave heights $H=0.0390, 0.0150, 0.0136\,$m. All three test cases have been discretized with a computational grid of size ($257 \times 41 \times 7$) to resolve the physical dimensions of $L_x=35\,$m, $L_y=6.096\,$m. The still water depth decreases in the direction of the incident waves as a semicircular shoal from $0.4572\,$m to $0.1524\,$m with an illustration of a snapshot of the free surface given in Figure \ref{ch7:fig:whalinsetup}. The time step $\Delta t$ is computed based on a constant Courant number of $Cr=c\Delta x/\Delta t=0.8$, where $c$ is the incident wave speed and $\Delta x$ is the grid spacing. Waves are generated in the generation zone $0 \leq x/L \leq 1.5$, where $L$ is the wave length of incident waves, and absorbed again in the zone $35 - 2L \leq x \leq 35\,$m.
+Last, we demonstrate using a classical benchmark for propagation of nonlinear waves over a semicircular shoal that single-precision math is likely to be sufficient for achieving engineering accuracy. The benchmark is based on Whalin's experiment~\cite{ch7:Whalin1971} which is often used in validation of dispersive water wave models for coastal engineering applications, e.g., see previous work \cite{ch7:EBL08}. Experimental results exist for incident waves with wave periods $T=1,2,3\,$s and wave heights $H=0.0390, 0.0150, 0.0136\,$m. All three test cases have been discretized with a computational grid of size ($257 \times 41 \times 7$) to resolve the physical dimensions of $L_x=35\,$m, $L_y=6.096\,$m. The still water depth decreases in the direction of the incident waves as a semicircular shoal from $0.4572\,$m to $0.1524\,$m with an illustration of a snapshot of the free surface given in Figure \ref{ch7:fig:whalinsetup}. The time step $\Delta t$ is computed based on a constant Courant number of $Cr=c\Delta x/\Delta t=0.8$, where $c$ is the incident wave speed and $\Delta x$ is the grid spacing. Waves are generated in the generation zone $0 \leq x/L \leq 1.5$, where $L$ is the wave length of incident waves, and absorbed again in the zone $35 - 2L \leq x \leq 35\,$m.
%
A harmonic analysis of the wave spectrum at the shoal center line is computed and plotted in Figure \ref{ch7:whalinresults} for comparison with the analogous results obtained from the experiments data. The three harmonic amplitudes are computed via a Fast Fourier Transform (FFT) method using the last three wave periods up to $t=50\,$s. There is a satisfactory agreement between the computed and experimental results and no noticeable loss in accuracy resulting from the use of single-precision math.
\subfigure[Whalin test case at $t=50$s, wave period $T=2$s, and grid dimensions ($257 \times 41 \times 7$)]{\label{ch7:fig:whalinsetup}
\includegraphics[width=0.5\textwidth]{Chapters/chapter7/figures/WhalinWaveSol_t50_T2_single.pdf}
}
- \subfigure[$T=1s$]{
+ \subfigure[$T=1$s]{
{\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T1_single.tikz}}
}
- \subfigure[$T=2s$]{
+ \subfigure[$T=2$s]{
{\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T2_single.tikz}}
}
- \subfigure[$T=3s$]{
+ \subfigure[$T=3$s]{
{\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T3_single.tikz}}
}
% \end{center}