ch17

[book_gpu.git] / BookGPU / Chapters / chapter7 / ch7.tex
diff --git a/BookGPU/Chapters/chapter7/ch7.tex b/BookGPU/Chapters/chapter7/ch7.tex

index e74681b1476aa053df7711ea805c400ce07d1d07..e084f01a0e514e08a28a7a54521401e30064c70f 100644 (file)
--- a/BookGPU/Chapters/chapter7/ch7.tex
+++ b/BookGPU/Chapters/chapter7/ch7.tex
@@ -37,7 +37,7 @@ A key problem is that improvements in performance require porting legacy codes\f
  %However, increasing amounts of applications are utilizing accelerators to parts of their code to gain speedups albeit with less dramatic improvements of performance as one can potentially find by adapting most, if not all of the application to modern hardware.
  
  In this work, we explore some of these trends by developing, by bottom-up-design, a water-wave model which can be utilized in maritime engineering and with the intended use on affordable office desktops as well as on more expensive modern compute clusters for engineering analysis purposes.  
-
+\clearpage
  \section{On modeling paradigms for highly nonlinear and dispersive water waves}
  \label{ch7:sec:modernwavemodellingparadigms}
  
@@ -200,7 +200,7 @@ From a numerical point of view, an efficient and scalable discretization strateg
  
  We present scalability and performance tests based on the same two test environments outlined in Chapter \ref{ch5}, Section \ref{ch5:sec:testenvironments}, plus a fourth test environment based on the most recent hardware generation:
  \begin{description}
-\item[Test environment 4.] Desktop with dual-socket Sandy Bridge Intel Xeon E5-2670 (2.60 GHz) processors, 64GB RAM, 2x Nvidia Tesla K20 GPUs.
+\item[Test environment 4.] Desktop with dual-socket Sandy Bridge Intel Xeon E5-2670 (2.60 GHz) processors, 64GB RAM, 2x NVIDIA Tesla K20 GPUs.
  \end{description}
  Performance results can be used to predict actual runtimes as described in \cite{ch7:EngsigKarupEtAl2011}, e.g., for estimation of whether a real-time constraint for a given problem size can be met.
  
@@ -359,7 +359,7 @@ Subtracting $g^n$ in \eqref{ch7:eq:discreteupdate} and dividing by a pseudo time
  \frac{g^{*,n+1}-g^n}{\tau} =\frac{(1-\Gamma)}{\tau} (g_e^n-g^n).
  \end{align}
  %
-The first term is similar to a first-order accurate Forward Euler\index{forward Euler} approximation of a rate of change term. This motivates an {\em embedded penalty forcing technique} based on adding a correction term of the form
+The first term is similar to a first-order accurate Forward Euler\index{Euler!forward Euler} approximation of a rate of change term. This motivates an {\em embedded penalty forcing technique} based on adding a correction term of the form
  %
  \begin{align}\label{ch7:eq:penalty}
  \partial_t g = \mathcal{N}(g) + \frac{1-\Gamma(x)}{\tau} (g_e(t,x)-g(t,x)), \quad {\bf x}\in\Omega_\Gamma,
@@ -761,7 +761,7 @@ Elaborating on this example, we examine the propagation of a regular stream func
  \label{ch7:filter}
  \mathcal{F}u(x_i) = \sum_{n=-\alpha}^{\alpha} c_n u(x_{i+n}),
  \end{align}
-where $c_n\in\mathbb{R}$ are the stencil coefficients and $\alpha\in\mathbb{Z}_+$ is the stencil half-width. An active filter can for example be based on employing a Savitzky-Golay smoothening filter \cite{ch7:PT90}, e.g., the mild 7-point SG(6,10) filter, and applying it after every 10th time step to each of the collocation nodes defining the free surface variables. The same procedure can be used for stabilization of nonlinear simulations to remove high-frequency "saw-tooth" instabilities as shown in \cite{ch7:EBL08}. This filtering technique can also remove high-frequency noise resulting from round-off errors in computations that would otherwise potentially pollute the computational results and in the worst case leave them useless. The effect of this type of filtering on the numerical efficiency of the model is insignificant.
+where $c_n\in\mathbb{R}$ are the stencil coefficients and $\alpha\in\mathbb{Z}_+$ is the stencil half-width. An active filter can for example be based on employing a Savitzky-Golay smoothening filter \cite{ch7:PT90}, e.g., the mild 7-point SG(6,10) filter, and applying it after every 10th time step to each of the collocation nodes defining the free surface variables. The same procedure can be used for stabilization of nonlinear simulations to remove high-frequency ``saw-tooth'' instabilities as shown in \cite{ch7:EBL08}. This filtering technique can also remove high-frequency noise resulting from round-off errors in computations that would otherwise potentially pollute the computational results and in the worst case leave them useless. The effect of this type of filtering on the numerical efficiency of the model is insignificant.
  
  Results from numerical experiments are presented in Figure \ref{ch7:filtering}, and most of the errors can be attributed to phase errors resulting from difference in exact versus numerical phase speed. In numerical experiments, we find that while results computed in double-precision are not significantly affected by accumulation of round-off errors, the single-precision results are. In Figures \ref{ch7:filtering} (a) and (b), a direct solver based on sparse Gaussian elimination within MATLAB\footnote{\url{http://www.mathworks.com}.} is used to solve the linear system at every stage and a comparison is made between single- and unfiltered double-precision calculations. It is shown in Figure \ref{ch7:filtering} a) that without a filter, the single-precision calculations result in ``blow-up'' after which the solver fails just before 50 wave periods of calculation time. However, in Figure \ref{ch7:filtering} (b) it is demonstrated that invoking a smoothening filter, cf. \eqref{ch7:filter}, stabilizes the accumulation of round-off errors and the calculations continue and achieve reduced accuracy compared to the computed double-precision results. Thus, it is confirmed that such a filter can be used to control and suppress high-frequency oscillations that results from accumulation of round-off errors. In contrast, replacing the direct solver with an iterative PDC method using the GPU-accelerated wave model appears to be much more attractive upon inspection of Figures \ref{ch7:filtering} (c) and (d). The single-precision results are found to be stable with and {\em without} the filter-based strategy for this problem. The calculations show that single-precision math leads to slightly faster error accumulation for this choice of resolution, however, with only small differences in error level during long time integration. This highlights that fault-tolerance of the iterative PDC method contributes to securing robustness of the calculations.
  
@@ -789,7 +789,8 @@ Last, we demonstrate using a classical benchmark for propagation of nonlinear wa
  %
  
  A harmonic analysis of the wave spectrum at the shoal center line is computed and plotted in Figure \ref{ch7:whalinresults} for comparison with the analogous results obtained from the experiments data. The three harmonic amplitudes are computed via a Fast Fourier Transform (FFT) method using the last three wave periods up to $t=50\,$s. There is a satisfactory agreement between the computed and experimental results and no noticeable loss in accuracy resulting from the use of single-precision math.
-%
+
+\pagebreak
  \begin{figure}[!htb]
      \setlength\figureheight{0.3\textwidth}
      \setlength\figurewidth{0.32\textwidth}
@@ -807,7 +808,7 @@ A harmonic analysis of the wave spectrum at the shoal center line is computed an
      {\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T3_single.tikz}}
      }
  %    \end{center}
-    \caption[Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s.]{Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s respectively. Measured experimental and computed results (single-precision) are in good agreement. Test environment 1.}\label{ch7:whalinresults}
+    \caption[Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s.]{Harmonic analysis for the experiment of Whalin for $T=1,2,3\,$s. Measured experimental and computed results (single-precision) are in good agreement. Test environment 1.}\label{ch7:whalinresults}
  \end{figure}
  
  \subsection{Acceleration via parallelism in time using parareal}\label{ch7:parareal}\index{parareal}
@@ -822,7 +823,7 @@ The parareal algorithm has been introduced as a component in our in-house GPU li
  In Section \ref{ch5:parareal} it is assumed that communication costs can be neglected and a simple model for the algorithmic work complexity is derived. It is found that there are four key discretization parameters for parareal that need to be balanced appropriately in order to achieve high parallel efficiency: the number of coarse-grained time intervals $N$, the number of iterations $K$, the ratio between the computational cost of the coarse to the fine propagator $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$, and the ratio between fine and coarse time step sizes $\delta t/\delta T$.
  
  % How to obtain speed-up
-Ideally, the ratio $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$ is small and convergence happens in $k=1$ iteration. This is seldom the case though, as it requires the coarse propagator to achieve accuracy close to that of the fine propagator while at the same time being substantially cheaper computationally, these two objectives obviously being conflicting. Obtaining the highest possible speed-up is a matter of trade-off, typically, the more GPUs used, the faster the coarse propagator should be. The performance of parareal is problem- and discretization-dependent and as such one would suspect that different wave parameters influence the suitability of the method. This was investigated in \cite{ch7:ASNP12} and indeed the performance does change with wave parameters. Typically the method works better for deep water waves with low to medium wave amplitudes.
+Ideally, the ratio $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$ is small and convergence happens in $k=1$ iteration. This is seldom the case though, as it requires the coarse propagator to achieve accuracy close to that of the fine propagator while at the same time being substantially cheaper computationally, these two objectives obviously being conflicting. Obtaining the highest possible speed-up is a matter of trade-off, typically, the more GPUs used, the faster the coarse propagator should be. The performance of parareal is problem- and discretization-dependent and as such one would suspect that different wave parameters influence the suitability of the method. This was investigated in \cite{ch7:ASNP12} and indeed the performance does change with wave parameters. Typically the method works better for deep water waves with low- to medium-wave amplitudes.
  %
  \begin{figure}[!htb]
      \begin{center}
@@ -842,12 +843,12 @@ Ideally, the ratio $\mathcal{C}_\mathcal{G}/\mathcal{C}_\mathcal{F}$ is small an
  %
  
  % What did we do and what are the results
-We have performed a scalability study for parareal using 2D nonlinear stream function waves based on a discretization with $(N_x,N_z)=(33,9)$ collocation points, cf. Figure \ref{ch7:fig:DDPA_SPEEDUP}. The study shows that moderate speedup is possible for this hyperbolic system. Using four GPU nodes, a speedup of slightly more than two was achieved while using sixteen GPU nodes resulted in a speedup of slightly less than five. As noticed in Figure \ref{ch7:fig:DDPA_SPEEDUP}, parallel efficiency decreases quite fast when using more GPUs. This limitation is due to the usage of a fairly slow and accurate coarse propagator and is linked to a known difficulty with parareal applied to hyperbolic systems. For hyperbolic systems, instabilities tend to arise when using a very inaccurate coarse propagator. This prevents using a large number of time subdomains, as this by Amdahl's law also requires a very fast coarse propagator. The numbers are still impressive though, considering that the speedup due to parareal comes as additional speedup to an already efficient and fast code.
+We have performed a scalability study for parareal using 2D nonlinear stream function waves based on a discretization with $(N_x,N_z)=(33,9)$ collocation points, cf. Figure \ref{ch7:fig:DDPA_SPEEDUP}. The study shows that moderate speedup is possible for this hyperbolic system. Using four GPU nodes, a speedup of slightly more than two was achieved while using sixteen GPU nodes resulted in a speedup of slightly less than five. As noticed in Figure \ref{ch7:fig:DDPA_SPEEDUP}, parallel efficiency decreases quite fast when using more GPUs. This limitation is due to the usages of a fairly slow and accurate coarse propagator and is linked to a known difficulty with parareal applied to hyperbolic systems. For hyperbolic systems, instabilities tend to arise when using a very inaccurate coarse propagator. This prevents using a large number of time subdomains, as this by Amdahl's law also requires a very fast coarse propagator. The numbers are still impressive though, considering that the speedup due to parareal comes as additional speedup to an already efficient and fast code.
  Performance results for the Whalin test case are also shown in Figure \ref{ch7:fig:whalinparareal}. There is a natural limitation to how much we can increase $R$ (the ratio between the complexity of the fine and coarse propagators), because of stability issues with the coarse propagator. In this test case we simulate from $t=[0,1]$s, using up to $32$ GPUs. For low $R$ and only two GPUs, there is no speedup gain, but for the configuration with eight or more GPUs and $R\geq6$, we are able to get more than $2$ times speedup. Though these hyperbolic systems are not optimal for performance tuning using the parareal method, results still confirm that reasonable speedups are in fact possible on heterogenous systems.
  
  \begin{figure}[!htb]
-    \setlength\figureheight{0.3\textwidth}
-    \setlength\figurewidth{0.32\textwidth}
+    \setlength\figureheight{0.29\textwidth}
+    \setlength\figurewidth{0.29\textwidth}
  %    \begin{center}
      \subfigure[Speedup]{
      {\small\input{Chapters/chapter7/figures/WhalinPararealSpeedup.tikz}}
@@ -868,7 +869,7 @@ Finally, we remark that the parareal algorithm is also a fault-tolerant algorith
  %\newpage
  \subsection{Towards real-time interactive ship simulation}\index{real-time simulation}
  
-A fast GPU-accelerated ship hydrodynamic model is developed for real-time interactive ship simulation by modification of the unified potential flow model presented in Section \ref{ch7:goveq}. The target scientific application is an interactive full mission marine simulator, where multiple ships controlled by naval officers can navigate in a near-realistic virtual marine environment. Full mission simulators are used for education and training of naval officers in critical manoeuvring operations and for evaluation of ship and marine infrastructure designs. To predict the motion of ships, a hydrodynamics model is required for prediction of forces by \eqref{ch7:forcecalc} which is affected by the kinematic properties of the model, cf. Section \ref{ch7:dispkin}. The state-of-the-art for such a hydrodynamic model in today's real-time ship simulators is based on fast interpolation and proper scaling of experimental model data. The amount of experimental model data is limited with respect to hull forms and configurations, requiring the need for extrapolation that compromises the accuracy.
+A fast GPU-accelerated ship hydrodynamic model is developed for real-time interactive ship simulation by modification of the unified potential flow model presented in Section \ref{ch7:goveq}. The target scientific application is an interactive full mission marine simulator, where multiple ships controlled by naval officers can navigate in a near-realistic virtual marine environment. Full mission simulators are used for education and training of naval officers in critical maneuvering operations and for evaluation of ship and marine infrastructure designs. To predict the motion of ships, a hydrodynamics model is required for prediction of forces by \eqref{ch7:forcecalc} which is affected by the kinematic properties of the model, cf. Section \ref{ch7:dispkin}. The state-of-the-art for such a hydrodynamic model in today's realtime ship simulators is based on fast interpolation and proper scaling of experimental model data. The amount of experimental model data is limited with respect to hull forms and configurations, requiring the need for extrapolation that compromises the accuracy.
  
  The objective of current and ongoing work is aimed at removing these limitations by replacing the existing hydrodynamic model and instead calculating at full-scale the flow field, wave field, ship-structure, and ship-ship interaction forces in real-time using massive parallel computation technology. The potential flow model (OceanWave3D) presented in Section \ref{ch7:goveq} is suitable as the modeling basis for this purpose since it is robust, accurate, efficient, and scalable to arbitrarily large domains. Furthermore, it can accurately account for dispersive waves in the range from shallow to deep waters in marine settings where the sea bed may be uneven.
  
@@ -901,7 +902,7 @@ where $U$ is the velocity of the ship. The unsteady linear water problem is used
   \end{subequations}
  where the pressure on the ship hull $p_{ship}$ is calculated explicitly based on a quasi-static approximation which is determined by assuming $\partial_t\phi_1\approx0$ and rewriting \eqref{ch7:quasistatic}. In general, a ship hull is a complex surface in three-dimensional space, but its draft can be approximated by a single-valued function of the horizontal coordinates $\eta_0 = \eta_0(x,y)$, and the no-flux condition on the ship hull is approximated by a flat-ship approximation. Radiation boundary conditions are approximated by a Sommerfelt absorbing boundary condition \cite{ch7:DgayguiJolySJAM1994} on the vertical sides of the physical domain to let waves escape the domain.
  
-The modified numerical model can still be based on flexible-order finite difference method as discussed in Section \ref{ch7:sec:nummodel}. The computational bottleneck problem is the efficient solution of the Laplace problem twice which can be done efficiently by the GPU-accelerated iterative PDC method as explained in section \ref{ch7:PDCmethod}. A snapshot of the steady state wave field is provided in the introduction to this chapter. Computed resistance curves for a Series 60 hull moving at forward speed corresponding to Froude number $F_n=0.316$ knots in calm water are compared to experimental data \cite{ch7:TodaEtAl1992} in Figure \ref{ch7:fig:shiphydro} (a). The computed Kelvin wave system is shown in Figure \ref{ch7:fig:shiphydro} (b). The computed results compare well with experiments at moderate ship Froudes numbers $F_n=U/\sqrt{gh}$ in the range 0.1-0.25 as expected for a linear model. The real-time constraint required to fulfill the interactive and visualization requirements can currently be met with the GPU-accelerated hydrodynamics model for problem sizes of approximately $10^6$ for ship Froude numbers in the range 0.1-0.3. The modeling and real-time aspects will be addressed in more detail in ongoing work.
+The modified numerical model can still be based on flexible-order finite difference method as discussed in Section \ref{ch7:sec:nummodel}. The computational bottleneck problem is the efficient solution of the Laplace problem twice which can be done efficiently by the GPU-accelerated iterative PDC method as explained in section \ref{ch7:PDCmethod}. A snapshot of the steady state wave field is provided in the introduction to this chapter. Computed resistance curves for a Series 60 hull moving at forward speed corresponding to Froude number $F_n=0.316$ knots in calm water are compared to experimental data \cite{ch7:TodaEtAl1992} in Figure \ref{ch7:fig:shiphydro} (a). The computed Kelvin wave system is shown in Figure \ref{ch7:fig:shiphydro} (b). The computed results compare well with experiments at moderate ship Froude numbers $F_n=U/\sqrt{gh}$ in the range 0.1--0.25 as expected for a linear model. The real-time constraint required to fulfill the interactive and visualization requirements can currently be met with the GPU-accelerated hydrodynamics model for problem sizes of approximately $10^6$ for ship Froude numbers in the range 0.1--0.3. The modeling and real-time aspects will be addressed in more detail in ongoing work.
  
  %
  \begin{figure}[!htb]
@@ -921,14 +922,14 @@ The modified numerical model can still be based on flexible-order finite differe
  %\newpage
  \section{Conclusion and future work}
  
-We have presented implementation details together with several novel results on development of a new massively parallel and scalable tool for simulation of nonlinear free surface water waves on heterogenous hardware. The tool is based on the unified potential flow model referred to as OceanWave3D \cite{ch7:EBL08} which provides the basis for efficient and scalable simulation of water waves over uneven bottoms on arbitrary domain sizes. We have demonstrated in a few examples how we can accelerate performance by using single-precision math without comprimising accuracy. We have shown that performance can be accelerated by introducing concurrency in the time integration using the parareal algorithm and for the first time in a heterogenous setup based on the use of multiple GPUs. Interestingly, we find that parallel computations using parareal may be more efficient than using conventional data-parallel distributed computations in a multi-GPU setup for moderate problem sizes. We have measured absolute performance and scalability using several of the most recent generations of NVIDIA GPUs to detail the efficiency of the current code. This is useful to predict time to results as explained in \cite{ch7:EngsigKarupEtAl2011} and may be compared against other wave models in fair comparisons. 
+We have presented implementation details together with several novel results on development of a new massively parallel and scalable tool for simulation of nonlinear free surface water waves on heterogenous hardware. The tool is based on the unified potential flow model referred to as OceanWave3D \cite{ch7:EBL08} which provides the basis for efficient and scalable simulation of water waves over uneven bottoms on arbitrary domain sizes. We have demonstrated in a few examples how we can accelerate performance by using single-precision math without compromising accuracy. We have shown that performance can be accelerated by introducing concurrency in the time integration using the parareal algorithm and for the first time in a heterogenous setup based on the use of multiple GPUs. Interestingly, we find that parallel computations using parareal may be more efficient than using conventional data-parallel distributed computations in a multi-GPU setup for moderate problem sizes. We have measured absolute performance and scalability using several of the most recent generations of NVIDIA GPUs to detail the efficiency of the current code. This is useful to predict time to results as explained in \cite{ch7:EngsigKarupEtAl2011} and may be compared against other wave models in fair comparisons. 
  
  Work in progress focuses on extending the governing equations to account for lack of physics such as wave runup and wave breaking. Also, we plan to extend the domain decomposition method to unstructured grids of blocks that can be boundary-fitted to more general bottom-mounted structures to be able to address wave-structure problems, cf. \cite{ch7:EHBM06,ch7:EHBW08}. For example, this will provide the basis for simulations of wave transformations in large harbor areas or predict wave climates in near-coastal areas.
  
  We anticipate that a tool based on the proposed parallel solution strategies will be useful for further advancement in fast and robust analysis techniques and large-scale simulation of free surface wave simulation (e.g., for use as an efficient far-field solver at large scales) and be a basis for next-generation wave models. We also expect that the tool can be useful for hybrid-solution strategies with local flow features possibly resolved by other models and for advancing state-of-the-art in fast physics-based wave-body simulations, e.g., ship-wave interactions in ship simulation where real-time constraints are imposed due to visualization. These subjects will be part of ongoing work addressing application aspects. 
  
-\section{Acknowledgment}
+\section{Acknowledgments}
  
  This work was supported by grant no. 09-070032 from the Danish Research Council for Technology and Production Sciences. A special thank goes to Professor Jan S. Hesthaven for supporting parts of this work. Scalability and performance tests was done in the GPUlab at DTU Informatics, Technical University of Denmark and using the GPU-cluster at Center for Computing and Visualization, Brown University, USA. NVIDIA Corporation is acknowledged for generous hardware donations to facilities of the GPUlab.
-
+\clearpage
  \putbib[Chapters/chapter7/biblio7]