X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/4f8020b1c6324b056b53b4325cd80d59bb7cf19f..f917bf7fb163427e5d96188533e7d90d5a5d6846:/BookGPU/Chapters/chapter5/ch5.tex diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex index cbf81f3..e37a709 100644 --- a/BookGPU/Chapters/chapter5/ch5.tex +++ b/BookGPU/Chapters/chapter5/ch5.tex @@ -1,7 +1,7 @@ -\chapterauthor{Stefan L. Glimberg}{Technical University of Denmark} -\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark} -\chapterauthor{Allan S. Nielsen}{Technical University of Denmark} -\chapterauthor{Bernd Dammann}{Technical University of Denmark} +\chapterauthor{Stefan L. Glimberg, Allan P. Engsig-Karup, Allan S. Nielsen and Bernd Dammann}{Technical University of Denmark} +%\chapterauthor{Allan P. Engsig-Karup}{Technical University of Denmark} +%\chapterauthor{Allan S. Nielsen}{Technical University of Denmark} +%\chapterauthor{Bernd Dammann}{Technical University of Denmark} @@ -83,7 +83,7 @@ The library is grouped into component classes. Each component should fulfill a s \begin{center} \input{Chapters/chapter5/figures/component_design.tikz} \end{center} -\caption{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known in beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign} +\caption[Schematic representation of the five main components, their type definitions, and member functions.]{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known in beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign} \end{figure} A component is implemented as a generic C++ class, that normally takes as template arguments the same types that it offers through type definitions: a matrix takes a vector class as template argument, and a vector class takes a working precision value type. The matrix can then access the working precision via the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to maintain code simplicity. We will demonstrate use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}. @@ -174,13 +174,16 @@ where $u(x,y,t)$ is the unknown heat distribution, $\kappa$ is a heat conductivi u(x,y,t_0) = \sin(\pi x)\,\sin(\pi y), & \qquad (x,y) \in \Omega. \end{align} An illustrative example of the numerical solution to the heat problem, using \eqref{ch5:eq:heatinit} as the initial condition is given in Figure \ref{ch5:fig:heatsolution}. -\begin{figure}[!htb] +\begin{figure}[!htbp] \begin{center} \setlength\figurewidth{0.3\textwidth} % - \setlength\figureheight{0.32\textwidth} % - \subfigure[$t=0.00s$]{\input{Chapters/chapter5/figures/HeatSolution0.tikz}} - \subfigure[$t=0.05s$]{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}} + \setlength\figureheight{0.3\textwidth} % + \subfigure[$t=0.00s$]%{\input{Chapters/chapter5/figures/HeatSolution0.tikz}} +{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_conv.pdf}} + \subfigure[$t=0.05s$]%{\input{Chapters/chapter5/figures/HeatSolution0.049307.tikz}} +{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_049307_conv.pdf}} %\subfigure[$t=0.10s$]{\input{Chapters/chapter5/figures/HeatSolution0.099723.tikz}} +{\includegraphics[width=0.48\textwidth]{Chapters/chapter5/figures/HeatSolution0_099723_conv.pdf}} \end{center} \caption{Discrete solution at times $t=0s$ and $t=0.05s$, using \eqref{ch5:eq:heatinit} as initial condition and a small $20\times20$ numerical grid.}\label{ch5:fig:heatsolution} \end{figure} @@ -263,10 +266,11 @@ Solution times for the heat conduction problem is in itself not very interesting \setlength\figurewidth{0.4\textwidth} \begin{center} {\small -\input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz} +%\input{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216.tikz} +{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}} } \end{center} -\caption{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance} +\caption[Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$.]{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance} \end{figure} @@ -386,10 +390,14 @@ Defect correction in combination with multigrid preconditioning, enables efficie \setlength\figurewidth{0.33\textwidth} \begin{center} \subfigure[Convergence history for the conjugate gradient and multigrid methods, for two different problem sizes.]{\label{ch5:fig:poissonconvergence:a} - {\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}} -} \hspace{0.2cm}% + %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceMGvsCG.tikz}} + {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceMGvsCG_conv.pdf}} +} + + \hspace{0.2cm}% \subfigure[Defect correction convergence history for three different stencil sizes.]{\label{ch5:fig:poissonconvergence:b} - {\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}} + %{\scriptsize \input{Chapters/chapter5/figures/ConvergenceDC.tikz}} + {\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/ConvergenceDC_conv.pdf}} } \end{center} \caption{Algorithmic performance for the conjugate gradient, multigrid, and defect correction methods, measured in terms of the relative residual per iteration.}\label{ch5:fig:poissonconvergence} @@ -408,7 +416,7 @@ CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip perfo %\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/GPU2GPU.png} \input{Chapters/chapter5/figures/GPU2GPU.tikz} \end{center} -\caption{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required, if the data is not already sequentially stored in device memory.}\label{ch5:fig:gpu2gputransfer} +\caption[Message passing between two GPUs involves several memory transfers across lower bandwidth connections.]{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required, if the data is not already sequentially stored in device memory.}\label{ch5:fig:gpu2gputransfer} \end{figure} @@ -433,7 +441,7 @@ An alternative to the preconditioning strategy is to have each subdomain query i \begin{center} \input{Chapters/chapter5/figures/dd2d.tikz} \end{center} -\caption{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d} +\caption[Domain distribution along the horizontal direction of a two dimensional grid into three subdomains.]{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d} \end{figure} Topologies are introduced via an extra template argument to the grid component. A grid is by default not distributed, because the default template argument is a non-distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, that makes sure that the topology instance updates all ghost points. The library contains two topology strategies, based on one dimensional and two dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program. @@ -449,11 +457,13 @@ Distributed performance for the finite difference stencil operation is illustrat \setlength\figurewidth{0.55\textwidth} \begin{center} \subfigure[Absolute timings, $\alpha=3$.]{ - {\small\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}} + %{\small\input{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050.tikz}} + {\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlpha3TimingsTeslaM2050_conv.pdf}} \label{ch5:fig:multigpu:a} } \subfigure[Performance at $N=4069^2$, single precision.]{ - {\small\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}} + % {\small\input{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216.tikz}} +{\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/MultiGPUAlphaPerformanceTeslaM2050_N16777216_conv.pdf}} \label{ch5:fig:multigpu:b} } \end{center} @@ -471,7 +481,7 @@ One method that introduces concurrency to the solution of evolution problems is \begin{center} \input{Chapters/chapter5/figures/ParallelInTime.tikz} \end{center} -\caption{Time domain decomposition. A compute node is assigned to each individual time sub-main to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime} +\caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time sub-main to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime} \end{figure} \label{ch5:parareal} @@ -513,7 +523,7 @@ The number of GPUs used for parallelization depends on the number of MPI process \begin{center} \input{Chapters/chapter5/figures/FullyDistributed.tikz} \end{center} -\caption{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades trough to rank $N$ where the final solution can be fetched. } +\caption[Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel.]{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades trough to rank $N$ where the final solution can be fetched. } \end{figure} \subsubsection{Computational complexity} @@ -545,7 +555,7 @@ If we in addition assume that the time spend on coarse propagation is negligible \label{ch5:fig:pararealRvsGPUs:b} } \end{center} - \caption{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs} + \caption[Parareal convergence properties as a function of $R$ and number GPUs used.]{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs} \end{figure} \begin{figure}[!htb] @@ -561,7 +571,7 @@ If we in addition assume that the time spend on coarse propagation is negligible \label{ch5:fig:pararealGPUs:b} } \end{center} - \caption{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs} + \caption[Parareal performance properties as a function of $R$ and number GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs} \end{figure} %TODO: Do we make this into a subsubsection: @@ -623,4 +633,4 @@ from the Danish Research Council for Technology and Production Sciences. A speci \putbib[Chapters/chapter5/biblio5] % Reset lst label and caption -\lstset{label=,caption=} \ No newline at end of file +%\lstset{label=,caption=}