X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/4f8020b1c6324b056b53b4325cd80d59bb7cf19f..cdcc8368664a64aed4169db27a06df9d0c9d6527:/BookGPU/Chapters/chapter5/ch5.tex?ds=inline diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex index cbf81f3..8ee7775 100644 --- a/BookGPU/Chapters/chapter5/ch5.tex +++ b/BookGPU/Chapters/chapter5/ch5.tex @@ -1,7 +1,7 @@ -\chapterauthor{Stefan L. Glimberg}{Technical University of Denmark} -\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark} -\chapterauthor{Allan S. Nielsen}{Technical University of Denmark} -\chapterauthor{Bernd Dammann}{Technical University of Denmark} +\chapterauthor{Stefan L. Glimberg, Allan P. Engsig-Karup, Allan S. Nielsen and Bernd Dammann}{Technical University of Denmark} +%\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark} +%\chapterauthor{Allan S. Nielsen}{Technical University of Denmark} +%\chapterauthor{Bernd Dammann}{Technical University of Denmark} @@ -83,7 +83,7 @@ The library is grouped into component classes. Each component should fulfill a s \begin{center} \input{Chapters/chapter5/figures/component_design.tikz} \end{center} -\caption{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known in beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign} +\caption[Schematic representation of the five main components, their type definitions, and member functions.]{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known in beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign} \end{figure} A component is implemented as a generic C++ class, that normally takes as template arguments the same types that it offers through type definitions: a matrix takes a vector class as template argument, and a vector class takes a working precision value type. The matrix can then access the working precision via the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to maintain code simplicity. We will demonstrate use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}. @@ -174,13 +174,16 @@ where $u(x,y,t)$ is the unknown heat distribution, $\kappa$ is a heat conductivi u(x,y,t_0) = \sin(\pi x)\,\sin(\pi y), & \qquad (x,y) \in \Omega. \end{align} An illustrative example of the numerical solution to the heat problem, using \eqref{ch5:eq:heatinit} as the initial condition is given in Figure \ref{ch5:fig:heatsolution}. -\begin{figure}[!htb] +\begin{figure}[!htbp] \begin{center} \setlength\figurewidth{0.3\textwidth} % - \setlength\figureheight{0.32\textwidth} % - \subfigure[$t=0.00s$]{\input{Chapters/chapter5/figures/HeatSolution0.tikz}} - \subfigure[$t=0.05s$]{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}} + \setlength\figureheight{0.3\textwidth} % + \subfigure[$t=0.00s$]%{\input{Chapters/chapter5/figures/HeatSolution0.tikz}} +{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_conv.pdf}} + \subfigure[$t=0.05s$]%{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}} +{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_049307_conv.pdf}} %\subfigure[$t=0.10s$]{\input{Chapters/chapter5/figures/HeatSolution0.099723.tikz}} +{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_099723_conv.pdf}} \end{center} \caption{Discrete solution at times $t=0s$ and $t=0.05s$, using \eqref{ch5:eq:heatinit} as initial condition and a small $20\times20$ numerical grid.}\label{ch5:fig:heatsolution} \end{figure} @@ -219,8 +222,8 @@ All items are either directly available in the library or can be designed from c With the right hand side operator in place, we are ready to implement the solver. For this simple PDE problem we compute all necessary initial data in the body of the main function and use the forward Euler time integrator to compute the solution till $t=t_{end}$. For more advanced PDE solvers, a built-in \texttt{ode\_solver} class is defined, that helps taking care of initialization and storage of multiple state variables. Declaring type definitions for all components at the beginning of the main file gives a good overview of the solver composition. In this way, it will be easy to control or change solver components at later times. Listing~\ref{ch5:lst:heattypedefs} lists the type definitions\index{type definitions} that are used to assemble the heat conduction solver. %\stefan{elaborate on the choices}. -\lstset{label=ch5:lst:heattypedefs,caption={Type definitions for all the heat conduction solver components used throughout the remaining code.}} -\begin{lstlisting} +\lstset{caption={Type definitions for all the heat conduction solver components used throughout the remaining code.}} +\begin{lstlisting}[label=ch5:lst:heattypedefs] typedef double value_type; typedef laplacian rhs_type; typedef gpulab::grid vector_type; @@ -229,8 +232,8 @@ typedef gpulab::integration::forward_euler time_integrator_type; \end{lstlisting} The grid is by default treated as a device object and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a non-periodic domain within the unit square and with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly. -\lstset{label=ch5:lst:gridsetup,caption={Create 2D grid of discrete dimension \texttt{N} times \texttt{N} and physical dimension $0$ to $1$.}} -\begin{lstlisting} +\lstset{caption={Create 2D grid of discrete dimension \texttt{N} times \texttt{N} and physical dimension $0$ to $1$.}} +\begin{lstlisting}[label=ch5:lst:gridsetup] // Setup discrete and physical dimensions gpulab::grid_dim dim(N,N,1); gpulab::grid_dim p0(0,0); @@ -242,8 +245,8 @@ vector_type u(props); \end{lstlisting} Hereafter the vector \texttt{u} can be initialized accordingly to \eqref{ch5:eq:heatinit}. Finally we need to instantiate the right hand side Laplacian operator from Listing \ref{ch5:lst:laplaceimpl} and the forward Euler time integrator in order to integrate from $t_0$ till $t_{end}$. -\lstset{label=ch5:lst:timeintegrator,caption=Create time integrator and the right hand side Laplacian operator.} -\begin{lstlisting} +\lstset{caption=Create time integrator and the right hand side Laplacian operator.} +\begin{lstlisting}[label=ch5:lst:timeintegrator] rhs_type rhs(alpha); // Create right hand side operator time_integrator_type solver; // Create time integrator solver(&rhs,u,0.0f,tend,dt); // Integrate from to tend using dt @@ -263,10 +266,11 @@ Solution times for the heat conduction problem is in itself not very interesting \setlength\figurewidth{0.4\textwidth} \begin{center} {\small -\input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz} +%\input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz} +{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}} } \end{center} -\caption{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance} +\caption[Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$.]{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance} \end{figure} @@ -300,8 +304,8 @@ where $k$ is the iteration number and $\mymat{M}$ is the preconditioner\index{pr In the following section we demonstrate how to assemble a solver for the discrete Poisson problem, using one of the tree iterative methods to efficiently solve \eqref{ch5:eq:poissonsystem}. -\lstset{label=ch5:lst:dc,caption={Main loop for the iterative defect correction solver. The solver is instantiated with template argument types for the matrix and vector classes, allowing underlying implementations to be based on GPU kernels.}} -\begin{lstlisting} +\lstset{caption={Main loop for the iterative defect correction solver. The solver is instantiated with template argument types for the matrix and vector classes, allowing underlying implementations to be based on GPU kernels.}} +\begin{lstlisting}[label=ch5:lst:dc] while(r.nrm2() > tol) { // Calculate residual @@ -326,8 +330,8 @@ Assembling the Poisson solver follows almost the same procedure as the heat cond At the beginning of the solver implementation we list the type definitions for the Poisson solver that will be used throughout the implementation. Here we use a geometric multigrid\index{multigrid} method as a preconditioner for the defect correction method. Therefore the multigrid solver is assembled first, so that it can be used in the assembling of the defect correction method. Listing \ref{ch5:lst:poissontypedefs} defines the types for the vector, the matrix, the multigrid preconditioner and the defect correction solver. The geometric multigrid method needs two additional template arguments, that are specific for multigrid, namely a smoother and a grid restriction/interpolation operator. These arguments are free to be implemented and supplied by the developer if special care is required for their specific problems, e.g. for a custom grid structure. For the Poisson problem on a regular grid, the library contains built-in restriction and interpolation operators, and a red-black Gauss-Seidel smoother. We refer to \cite{ch5:Trottenberg2001} for extensive details on multigrid methods. The monitor and config types that appear in Listing \ref{ch5:lst:poissontypedefs} are used for convergence monitoring within the iterative solver and to control run time parameters, such as tolerances, iteration limits, etc. -\lstset{label=ch5:lst:poissontypedefs,caption={Type definitions for the Laplacian matrix component and the multigrid preconditioned iterative defect correction\index{defect correction} solver.}} -\begin{lstlisting} +\lstset{caption={Type definitions for the Laplacian matrix component and the multigrid preconditioned iterative defect correction\index{defect correction} solver.}} +\begin{lstlisting}[label=ch5:lst:poissontypedefs] typedef double value_type; typedef gpulab::grid vector_type; typedef laplacian matrix_type; @@ -353,8 +357,8 @@ typedef gpulab::solvers::defect_correction dc_solver_type; With the type definitions set up, the implementation for the Poisson solver follows in Listing \ref{ch5:lst:poissonsolver}. Some of the initializations are left out, as they follow the same procedure as for the Heat conduction example. The defect correction and geometric multigrid solvers are initiated and then the multigrid is set as a preconditioner to the defect correction method. Finally the system is solved via a call to \texttt{solve()}. -\lstset{label=ch5:lst:poissonsolver,caption={Initializing the preconditioned defect correction solver to approximate the solution to $\mymat{A}\myvec{u}=\myvec{f}$.}} -\begin{lstlisting} +\lstset{caption={Initializing the preconditioned defect correction solver to approximate the solution to $\mymat{A}\myvec{u}=\myvec{f}$.}} +\begin{lstlisting}[label=ch5:lst:poissonsolver] matrix_type A(alpha); // High-order matrix matrix_type M(1); // Low-order matrix @@ -386,10 +390,14 @@ Defect correction in combination with multigrid preconditioning, enables efficie \setlength\figurewidth{0.33\textwidth} \begin{center} \subfigure[Convergence history for the conjugate gradient and multigrid methods, for two different problem sizes.]{\label{ch5:fig:poissonconvergence:a} - {\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}} -} \hspace{0.2cm}% + %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}} + {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceMGvsCG_conv.pdf}} +} + + \hspace{0.2cm}% \subfigure[Defect correction convergence history for three different stencil sizes.]{\label{ch5:fig:poissonconvergence:b} - {\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}} + %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}} + {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceDC_conv.pdf}} } \end{center} \caption{Algorithmic performance for the conjugate gradient, multigrid, and defect correction methods, measured in terms of the relative residual per iteration.}\label{ch5:fig:poissonconvergence} @@ -408,7 +416,7 @@ CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip perfo %\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/GPU2GPU.png} \input{Chapters/chapter5/figures/GPU2GPU.tikz} \end{center} -\caption{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required, if the data is not already sequentially stored in device memory.}\label{ch5:fig:gpu2gputransfer} +\caption[Message passing between two GPUs involves several memory transfers across lower bandwidth connections.]{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required, if the data is not already sequentially stored in device memory.}\label{ch5:fig:gpu2gputransfer} \end{figure} @@ -433,7 +441,7 @@ An alternative to the preconditioning strategy is to have each subdomain query i \begin{center} \input{Chapters/chapter5/figures/dd2d.tikz} \end{center} -\caption{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d} +\caption[Domain distribution along the horizontal direction of a two dimensional grid into three subdomains.]{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d} \end{figure} Topologies are introduced via an extra template argument to the grid component. A grid is by default not distributed, because the default template argument is a non-distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, that makes sure that the topology instance updates all ghost points. The library contains two topology strategies, based on one dimensional and two dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program. @@ -449,11 +457,13 @@ Distributed performance for the finite difference stencil operation is illustrat \setlength\figurewidth{0.55\textwidth} \begin{center} \subfigure[Absolute timings, $\alpha=3$.]{ - {\small\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}} + %{\small\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}} + {\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050_conv.pdf}} \label{ch5:fig:multigpu:a} } \subfigure[Performance at $N=4069^2$, single precision.]{ - {\small\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}} + % {\small\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}} +{\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216_conv.pdf}} \label{ch5:fig:multigpu:b} } \end{center} @@ -471,7 +481,7 @@ One method that introduces concurrency to the solution of evolution problems is \begin{center} \input{Chapters/chapter5/figures/ParallelInTime.tikz} \end{center} -\caption{Time domain decomposition. A compute node is assigned to each individual time sub-main to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime} +\caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time sub-main to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime} \end{figure} \label{ch5:parareal} @@ -498,8 +508,8 @@ Using the defined $\mathcal{F}_{\Delta T}$ and $\mathcal{G}_{\Delta T}$ operator U_{n}^{k+1}=\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k+1}\right)+\mathcal{\mathcal{F}}_{\Delta T}\left(U_{n-1}^{k}\right)-\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k}\right),\quad U_{0}^{k}=u^{0}, \end{equation} with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. $N$ being the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time integration component, using a fully distributed work scheduling model, as proposed by Aubanel (2010)~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer. It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time integration using forward Euler for coarse propagation and fourth order Runge-Kutta for fine propagation, could then be defined as follows: -\lstset{label=ch5:lst:parareal,caption={Assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation.}} -\begin{lstlisting} +\lstset{caption={Assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation.}} +\begin{lstlisting}[label=ch5:lst:parareal] typedef gpulab::integration::forward_euler coarse; typedef gpulab::integration::ERK4 fine; typedef gpulab::integration::parareal integrator; @@ -513,7 +523,7 @@ The number of GPUs used for parallelization depends on the number of MPI process \begin{center} \input{Chapters/chapter5/figures/FullyDistributed.tikz} \end{center} -\caption{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades trough to rank $N$ where the final solution can be fetched. } +\caption[Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel.]{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades trough to rank $N$ where the final solution can be fetched. } \end{figure} \subsubsection{Computational complexity} @@ -545,7 +555,7 @@ If we in addition assume that the time spend on coarse propagation is negligible \label{ch5:fig:pararealRvsGPUs:b} } \end{center} - \caption{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs} + \caption[Parareal convergence properties as a function of $R$ and number GPUs used.]{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs} \end{figure} \begin{figure}[!htb] @@ -561,7 +571,7 @@ If we in addition assume that the time spend on coarse propagation is negligible \label{ch5:fig:pararealGPUs:b} } \end{center} - \caption{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs} + \caption[Parareal performance properties as a function of $R$ and number GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs} \end{figure} %TODO: Do we make this into a subsubsection: @@ -623,4 +633,4 @@ from the Danish Research Council for Technology and Production Sciences. A speci \putbib[Chapters/chapter5/biblio5] % Reset lst label and caption -\lstset{label=,caption=} \ No newline at end of file +%\lstset{label=,caption=}