\newlabel{sub@img:sap_example_med5}{{(c)}{33}}
\newlabel{img:sap_example_med3_it2}{{4.1(d)}{33}}
\newlabel{sub@img:sap_example_med3_it2}{{(d)}{33}}
-\@writefile{lof}{\contentsline {figure}{\numberline {4.1}{\ignorespaces Exemple of median filtering, applied to salt \& pepper noise reduction.\relax }}{33}}
+\@writefile{lof}{\contentsline {figure}{\numberline {4.1}{\ignorespaces Example of median filtering, applied to salt \& pepper noise reduction.\relax }}{33}}
\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Airplane image, corrupted by salt and pepper noise of density 0.25}}}{33}}
\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Image denoised by a $3\times 3$ median filter}}}{33}}
\@writefile{lof}{\contentsline {subfigure}{\numberline{(c)}{\ignorespaces {Image denoised by a $5\times 5$ median filter}}}{33}}
\newlabel{fig:sap_examples}{{4.1}{33}}
\newlabel{lst:medianGeneric}{{4.1}{34}}
\@writefile{lol}{\contentsline {lstlisting}{\numberline {4.1}Generic CUDA kernel achieving median filtering}{34}}
+\@writefile{lof}{\contentsline {figure}{\numberline {4.2}{\ignorespaces Example of $5\times 5$ median filtering\relax }}{35}}
+\newlabel{fig:median_1}{{4.2}{35}}
\newlabel{algoMedianGeneric}{{2}{35}}
\newlabel{algoMedianGeneric:memcpyH2D}{{1}{35}}
\newlabel{algoMedianGeneric:cptstart}{{3}{35}}
\newlabel{algoMedianGeneric:memcpyD2H}{{7}{35}}
\@writefile{loa}{\contentsline {algocf}{\numberline {2}{\ignorespaces generic n$\times $n median filter\relax }}{35}}
\@writefile{toc}{\contentsline {section}{\numberline {4.3}NVidia GPU tuning recipes}{35}}
-\@writefile{lof}{\contentsline {figure}{\numberline {4.2}{\ignorespaces Exemple of 5x5 median filtering\relax }}{35}}
-\newlabel{fig:median_1}{{4.2}{35}}
\@writefile{lof}{\contentsline {figure}{\numberline {4.3}{\ignorespaces Illustration of window overlapping in $5\times 5$ median filtering\relax }}{36}}
\newlabel{fig:median_overlap}{{4.3}{36}}
\@writefile{lot}{\contentsline {table}{\numberline {4.1}{\ignorespaces Performance results of \texttt {kernel medianR}. \relax }}{36}}
\newlabel{fig:compMedians2}{{4.8}{44}}
\newlabel{sec:median5}{{4.5.1}{44}}
\@writefile{toc}{\contentsline {subsection}{\numberline {4.5.1}A register-only 5$\times $5 median filter }{44}}
-\@writefile{lof}{\contentsline {figure}{\numberline {4.9}{\ignorespaces Reducing register count in a 5$\times $5 register-only median kernel outputting 2 pixels simultaneously. The first 7 forgetful selection stages are common to both processed center pixels. Only the last 5 selections have to be done separately.\relax }}{45}}
+\@writefile{lof}{\contentsline {figure}{\numberline {4.9}{\ignorespaces Reducing register count in a 5$\times $5 register-only median kernel outputting 2 pixels simultaneously.\relax }}{45}}
\newlabel{fig:median5overlap}{{4.9}{45}}
-\@writefile{lof}{\contentsline {figure}{\numberline {4.10}{\ignorespaces First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network. Arrows represent the result of the swapping function, with the lowest value at the starting point and the highest value at the end point.\relax }}{45}}
+\@writefile{lof}{\contentsline {figure}{\numberline {4.10}{\ignorespaces First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network.\relax }}{45}}
\newlabel{fig:median5forget}{{4.10}{45}}
\newlabel{lst:medianForget2pix5}{{4.5}{46}}
\@writefile{lol}{\contentsline {lstlisting}{\numberline {4.5}5$\times $5 median filter kernel processing 2 output pixel values per thread by a combined forgetful selection.}{46}}
\newlabel{sub@img:sap_example_sep_med5}{{(c)}{49}}
\newlabel{img:sap_example_sep_med3_it2}{{4.11(d)}{49}}
\newlabel{sub@img:sap_example_sep_med3_it2}{{(d)}{49}}
-\@writefile{lof}{\contentsline {figure}{\numberline {4.11}{\ignorespaces Exemple of separable median filtering (smoother), applied to salt \& pepper noise reduction.\relax }}{49}}
+\@writefile{lof}{\contentsline {figure}{\numberline {4.11}{\ignorespaces Example of separable median filtering (smoother), applied to salt \& pepper noise reduction.\relax }}{49}}
\@writefile{lof}{\contentsline {subfigure}{\numberline{(a)}{\ignorespaces {Airplane image, corrupted by salt and pepper noise of density 0.25}}}{49}}
\@writefile{lof}{\contentsline {subfigure}{\numberline{(b)}{\ignorespaces {Image denoised by a $3\times 3$ separable smoother}}}{49}}
\@writefile{lof}{\contentsline {subfigure}{\numberline{(c)}{\ignorespaces {Image denoised by a $5\times 5$ separable smoother}}}{49}}
\subfigure[Image denoised by a $3\times 3$ median filter]{\label{img:sap_example_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3.png}}\\
\subfigure[Image denoised by a $5\times 5$ median filter]{\label{img:sap_example_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med5.png}}\qquad
\subfigure[Image denoised by 2 iterations of a $3\times 3$ median filter]{\label{img:sap_example_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3_it2.png}}\\
- \caption{Exemple of median filtering, applied to salt \& pepper noise reduction.}
+ \caption{Example of median filtering, applied to salt \& pepper noise reduction.}
\label{fig:sap_examples}
\end{figure}
Since inner loops that fill the histogram vector contain very few fetching instructions (from 9 to 49, depending on the window size), it is not surprising that their impact is negligible compared to that of the outer loops that fetch image pixels (from 256k to 16M instructions).
One could be tempted to claim that the CPU has no chance to win, but this is not so obvious, as it highly depends on what kind of algorithm is run and, above all, on how it is implemented. To illustrate this, we can notice that, despite a maximum effective throughput potential almost five times higher, measured GTX280 throughput values sometimes prove lower than CPU values, as shown in Table \ref{tab:medianHisto1}.
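+As a rough illustration of the histogram-based evaluation discussed above (a minimal sketch for one image pixel, ignoring borders; not the benchmarked implementation), the inner loops fill a 256-bin histogram with the $n\times n$ window values, whose bins are then accumulated up to half of the window population:
+\begin{lstlisting}
+// median of the n x n window centered on pixel (i,j), n odd
+unsigned char medianHisto(const unsigned char *img, int width,
+                          int i, int j, int n)
+{
+  int histo[256] = {0};
+  int r = n / 2;
+  for (int di = -r; di <= r; di++)       // n*n window fetches per pixel
+    for (int dj = -r; dj <= r; dj++)
+      histo[img[(i + di) * width + (j + dj)]]++;
+  int count = 0, level = -1;
+  while (count <= (n * n) / 2)           // walk gray levels up to the median
+    count += histo[++level];
+  return (unsigned char)level;
+}
+\end{lstlisting}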
+
+\lstinputlisting[label={lst:medianGeneric},caption=Generic CUDA kernel achieving median filtering]{Chapters/chapter3/code/medianGeneric.cu}
+
+
On the GPU side, we note a high dependence on window size, due to the redundancy induced by the multiple fetches of each pixel inside each block, which grows with the window size, as illustrated by Figure \ref{fig:median_overlap}. On the C2070 card, thanks to a more efficient caching mechanism, this effect is weaker. On GPUs, the dependency on image size is low and, owing to slightly more efficient data transfers when copying larger amounts of data, pixel throughput increases with image size. As an example, transferring a 4096$\times$4096 pixel image (32~MBytes) is a bit faster than transferring a 512$\times$512 pixel image (0.5~MBytes) 64 times.
%% whet the reader's appetite
-\lstinputlisting[label={lst:medianGeneric},caption=Generic CUDA kernel achieving median filtering]{Chapters/chapter3/code/medianGeneric.cu}
-
\begin{figure}
\centering
\includegraphics[width=8cm]{Chapters/chapter3/img/median_1.png}
- \caption{Exemple of 5x5 median filtering}
+ \caption{Example of $5\times 5$ median filtering}
\label{fig:median_1}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=6cm]{Chapters/chapter3/img/median5_overlap4.png}
- \caption{Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously. The first 7 forgetful selection stages are common to both processed center pixels. Only the last 5 selections have to be done separately.}
+ \caption[Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously.]{Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously. The first 7 forgetful selection stages are common to both processed center pixels. Only the last 5 selections have to be done separately.}
\label{fig:median5overlap}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=6cm]{Chapters/chapter3/img/forgetful_selection4.png}
- \caption{First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network. Arrows represent the result of the swapping function, with the lowest value at the starting point and the highest value at the end point.}
+ \caption[First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network.]{First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network. Arrows represent the result of the swapping function, with the lowest value at the starting point and the highest value at the end point.}
 \label{fig:median5forget}
\end{figure}
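+The selection principle of Figures \ref{fig:median5overlap} and \ref{fig:median5forget} can be sketched on the simpler $3\times 3$ case as follows (illustrative only: the actual kernels unroll all exchanges into fixed register-to-register swaps instead of looping over an indexed array). Starting with $k_9=\lceil 9/2\rceil +1=6$ values, each stage forgets the running minimum and maximum, then injects one fresh value:
+\begin{lstlisting}
+__device__ inline void mnmx(float &a, float &b) {
+  float lo = fminf(a, b), hi = fmaxf(a, b);
+  a = lo;  b = hi;                       // swapping function of the figure
+}
+
+// bubbles the minimum of v[0..k-1] into v[0] and the maximum into v[k-1]
+__device__ inline void pushExtremes(float *v, int k) {
+  for (int i = 0; i < k - 1; i++) mnmx(v[i], v[i + 1]);
+  for (int i = k - 2; i >= 1; i--) mnmx(v[i - 1], v[i]);
+}
+
+__device__ float median9(const float win[9]) {
+  float v[6];
+  for (int i = 0; i < 6; i++) v[i] = win[i];  // k9 = 6 starting values
+  for (int k = 6, next = 6; next < 9; next++, k--) {
+    pushExtremes(v, k);                  // v[0] = min, v[k-1] = max
+    for (int i = 0; i < k - 2; i++) v[i] = v[i + 1]; // forget both extremes
+    v[k - 2] = win[next];                // inject one unprocessed value
+  }
+  pushExtremes(v, 3);                    // three survivors: sort them
+  return v[1];                           // the median
+}
+\end{lstlisting}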
\subfigure[Image denoised by a $3\times 3$ separable smoother]{\label{img:sap_example_sep_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med3.png}}\\
\subfigure[Image denoised by a $5\times 5$ separable smoother]{\label{img:sap_example_sep_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med5.png}}\qquad
\subfigure[Image background estimation by a $55\times 55$ separable smoother]{\label{img:sap_example_sep_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med111.png}}\\
- \caption{Exemple of separable median filtering (smoother), applied to salt \& pepper noise reduction.}
+ \caption{Example of separable median filtering (smoother), applied to salt \& pepper noise reduction.}
\label{fig:sap_examples2}
\end{figure}
\begin{figure}
\centering
\includegraphics[width=11cm]{Chapters/chapter4/img/convo1.png}
- \caption{Principle of a generic convolution implementation. The center pixel is represented with a black background and the pixels of its neighborhood are denoted $I_{p,q}$ where $(p,q)$ is the relative position of the neighbor pixel. Elements $h_{t,u}$ are the values of the convolution mask.}
+ \caption[Principle of a generic convolution implementation.]{Principle of a generic convolution implementation. The center pixel is represented with a black background and the pixels of its neighborhood are denoted $I_{p,q}$ where $(p,q)$ is the relative position of the neighbor pixel. Elements $h_{t,u}$ are the values of the convolution mask.}
\label{fig:convoPrinciple}
\end{figure}
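+As a minimal sketch of the principle shown in Figure \ref{fig:convoPrinciple} (illustrative only: one thread per output pixel, odd mask of half-width $r$ kept in constant memory, clamped borders), each thread accumulates the weighted sum of its neighborhood:
+\begin{lstlisting}
+__constant__ float h[49];  // mask values h_{t,u}, up to 7x7, set by the host
+
+__global__ void kernelConvoNaive(const float *in, float *out,
+                                 int w, int ht, int r)
+{
+  int x = blockIdx.x * blockDim.x + threadIdx.x;
+  int y = blockIdx.y * blockDim.y + threadIdx.y;
+  if (x >= w || y >= ht) return;
+  float sum = 0.f;
+  for (int p = -r; p <= r; p++)          // rows of the neighborhood I_{p,q}
+    for (int q = -r; q <= r; q++) {
+      int ix = min(max(x + q, 0), w - 1);    // clamped border handling
+      int iy = min(max(y + p, 0), ht - 1);
+      sum += h[(p + r) * (2 * r + 1) + (q + r)] * in[iy * w + ix];
+    }
+  out[y * w + x] = sum;
+}
+\end{lstlisting}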
\begin{algorithm}
$\mathbf{4096\times 4096}$&4.700&1585 &13.05&533 &25.56&533 \\\hline
\end{tabular}
}
-\caption{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a C2070 card (fermi architecture). Data transfer duration are those of Table \ref{tab:memcpy1}.}
+\caption[Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a C2070 card.]{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a C2070 card (Fermi architecture). Data transfer durations are those of Table \ref{tab:memcpy1}.}
\label{tab:convoNonSepReg1}
\end{table}
$\mathbf{4096\times 4096}$&3.171&1075 &8.720&793 &17.076&569 \\\hline
\end{tabular}
}
-\caption{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer duration are those of Table \ref{tab:memcpy1}.}
+\caption[Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a GTX280.]{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer durations are those of Table \ref{tab:memcpy1}.}
\label{tab:convoNonSepReg3}
\end{table}
$\mathbf{4096\times 4096}$&1.964&2138 &3.926&1711 &6.416&1364 \\\hline
\end{tabular}
}
-\caption{Timings ($time$) and throughput values ($TP$ in Mpix/s) of our generic fixed mask size convolution kernel run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\caption[Timings ($time$) and throughput values ($TP$ in Mpix/s) of our generic fixed mask size convolution kernel run on a C2070 card.]{Timings ($time$) and throughput values ($TP$ in Mpix/s) of our generic fixed mask size convolution kernel run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
\label{tab:convoGene8x8p}
\end{table}
\begin{figure}
\centering
\includegraphics[width=12cm]{Chapters/chapter4/img/convoShMem.png}
- \caption{Organization of the prefetching stage of data, for a $5\times 5$ mask and a thread block size of $8\times 4$. Threads in both top corners of the top figure are identified either by a circle or by a star symbol. The image tile, loaded into shared memory includes the pixels to be updated by the threads of the block, as well as its 2-pixel wide halo. Here, circle and star symbols in the image tile show which pixels are actually loaded into one shared memory vector by its corresponding thread. }
+ \caption[Organization of the prefetching stage of data, for a $5\times 5$ mask and a thread block size of $8\times 4$.]{Organization of the prefetching stage of data, for a $5\times 5$ mask and a thread block size of $8\times 4$. Threads in both top corners of the top figure are identified either by a circle or by a star symbol. The image tile, loaded into shared memory, includes the pixels to be updated by the threads of the block, as well as its 2-pixel wide halo. Here, circle and star symbols in the image tile show which pixels are actually loaded into one shared memory vector by its corresponding thread.}
\label{fig:ShMem1}
\end{figure}
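+As an illustration of this prefetching stage (a sketch under simplified assumptions: $8\times 4$ block, $5\times 5$ mask, clamped addressing instead of proper border handling; names are not those of the benchmarked kernels), the $12\times 8$ tile, pixels plus halo, is covered cooperatively by the threads of the block:
+\begin{lstlisting}
+#define BLOCK_X 8
+#define BLOCK_Y 4
+#define HALO    2                        // half-width of the 5x5 mask
+__constant__ float h55[5][5];            // mask values, set by the host
+
+__global__ void kernelConvo5x5ShMem(const float *in, float *out, int w, int ht)
+{
+  __shared__ float tile[BLOCK_Y + 2 * HALO][BLOCK_X + 2 * HALO];
+  int bx0 = (int)(blockIdx.x * BLOCK_X) - HALO;  // tile origin in the image
+  int by0 = (int)(blockIdx.y * BLOCK_Y) - HALO;
+  // cooperative load: the 8x4 threads cover the whole 12x8 tile
+  for (int ty = threadIdx.y; ty < BLOCK_Y + 2 * HALO; ty += BLOCK_Y)
+    for (int tx = threadIdx.x; tx < BLOCK_X + 2 * HALO; tx += BLOCK_X) {
+      int ix = min(max(bx0 + tx, 0), w - 1);     // clamped addressing
+      int iy = min(max(by0 + ty, 0), ht - 1);
+      tile[ty][tx] = in[iy * w + ix];
+    }
+  __syncthreads();
+  int x = blockIdx.x * BLOCK_X + threadIdx.x;
+  int y = blockIdx.y * BLOCK_Y + threadIdx.y;
+  if (x < w && y < ht) {
+    float sum = 0.f;
+    for (int i = 0; i < 5; i++)
+      for (int j = 0; j < 5; j++)
+        sum += h55[i][j] * tile[threadIdx.y + i][threadIdx.x + j];
+    out[y * w + x] = sum;
+  }
+}
+\end{lstlisting}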
Still, we also implemented this method, in a manner similar to the one used by Nvidia in its SDK sample code.
$\mathbf{4096\times 4096}$&2090 &1637 &1195 &830&684&561\\\hline
\end{tabular}
}
-\caption{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread kernel using shared memory, run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\caption[Throughput values, in megapixels per second, of our generic 8 pixels per thread kernel using shared memory, run on a C2070 card.]{Throughput values, in megapixels per second, of our generic 8 pixels per thread kernel using shared memory, run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
\label{tab:convoGeneSh2}
\end{table}
\lstinputlisting[label={lst:convoGeneSh1},caption=CUDA kernel achieving a generic convolution operation after a preloading of data in shared memory.]{Chapters/chapter4/code/convoGeneSh1.cu}
$\mathbf{2048\times 2048}$&1.094 &1.191 &\bf 1.260 &\bf 1.444&\bf 1.545&\bf 1.722\\\hline
$\mathbf{4096\times 4096}$&4.262 &4.631 &\bf 5.000 &\bf 5.676&\bf 6.105&\bf 6.736\\\hline
\end{tabular}}
-\caption{Performances, in milliseconds, of our generic 8 pixels per thread 1-D convolution kernels using shared memory, run on a C2070 card. Timings include data copy. Bold values correspond to situations where separable-convolution kernels run faster than non separable ones.}
+\caption[Execution times, in milliseconds, of our generic 8 pixels per thread 1-D convolution kernels using shared memory, run on a C2070 card.]{Execution times, in milliseconds, of our generic 8 pixels per thread 1-D convolution kernels using shared memory, run on a C2070 card. Timings include data copy. Bold values correspond to situations where separable-convolution kernels run faster than non separable ones.}
\label{tab:convoSepSh1}
\end{table}
\begin{table}[h]
$\mathbf{4096\times 4096}$&1654 &1596 &\bf 1542 &\bf 1452&\bf 1400&\bf 1330\\\hline
\end{tabular}
}
-\caption{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread 1-D convolution kernel using shared memory, run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\caption[Throughput values, in megapixels per second, of our generic 8 pixels per thread 1-D convolution kernel using shared memory, run on a C2070 card.]{Throughput values, in megapixels per second, of our generic 8 pixels per thread 1-D convolution kernel using shared memory, run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
\label{tab:convoSepSh2}
\end{table}
\begin{center}
\input{Chapters/chapter5/figures/component_design.tikz}
\end{center}
-\caption{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known in beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign}
+\caption[Schematic representation of the five main components, their type definitions, and member functions.]{Schematic representation of the five main components, their type definitions, and member functions. Because components are template based, the argument types cannot be known beforehand. This concept rule set ensures compliance between components.}\label{ch5:fig:componentdesign}
\end{figure}
A component is implemented as a generic C++ class that normally takes as template arguments the same types that it offers through type definitions: a matrix takes a vector class as a template argument, and a vector class takes a working precision value type. The matrix can then access the working precision via the vector class. Components that rely on multiple template arguments can combine these arguments via type binders to maintain code simplicity. We will demonstrate the use of such type binders in the model problem examples. A thorough introduction to template-based programming in C++ can be found in \cite{ch5:Vandevoorde2002}.
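+A minimal sketch of this pattern (type and member names here are illustrative, not the library's actual interface) shows how the working precision propagates through the template arguments, and how a type binder collects several arguments behind one name:
+\begin{lstlisting}
+template <typename T>
+class vector {
+public:
+  typedef T value_type;                  // working precision
+  // ... data and member functions ...
+};
+
+template <typename V>
+class matrix {
+public:
+  typedef V vector_type;
+  typedef typename V::value_type value_type;  // recovered via the vector
+  // ... data and member functions ...
+};
+
+// A type binder gathers several template arguments behind one name.
+template <typename V, typename M>
+struct solver_types {
+  typedef V vector_type;
+  typedef M matrix_type;
+  typedef typename V::value_type value_type;
+};
+
+typedef vector<double>             vec_t;
+typedef matrix<vec_t>              mat_t;
+typedef solver_types<vec_t, mat_t> types_t;  // passed as a single argument
+\end{lstlisting}
+The matrix then reaches the working precision as \texttt{mat\_t::value\_type}, without taking it as a separate template argument.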
{\includegraphics[width=0.5\textwidth]{Chapters/chapter5/figures/AlphaPerformanceGTX590_N16777216_conv.pdf}}
}
\end{center}
-\caption{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance}
+\caption[Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$.]{Single and double precision floating point operations per second for a two dimensional stencil operator on a numerical grid of size $4096^2$. Various stencil sizes are used, $\alpha=1,2,3,4$, equivalent to $5$pt, $9$pt, $13$pt, and $17$pt stencils. Test environment 1.}\label{ch5:fig:stencilperformance}
\end{figure}
%\includegraphics[width=0.6\textwidth]{Chapters/chapter5/figures/GPU2GPU.png}
\input{Chapters/chapter5/figures/GPU2GPU.tikz}
\end{center}
-\caption{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required, if the data is not already sequentially stored in device memory.}\label{ch5:fig:gpu2gputransfer}
+\caption[Message passing between two GPUs involves several memory transfers across lower bandwidth connections.]{Message passing between two GPUs involves several memory transfers across lower bandwidth connections. A kernel that aligns the data to be transferred is required if the data is not already stored sequentially in device memory.}\label{ch5:fig:gpu2gputransfer}
\end{figure}
\begin{center}
\input{Chapters/chapter5/figures/dd2d.tikz}
\end{center}
-\caption{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d}
+\caption[Domain distribution along the horizontal direction of a two dimensional grid into three subdomains.]{Domain distribution along the horizontal direction of a two dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points respectively.}\label{ch5:fig:dd2d}
\end{figure}
Topologies are introduced via an extra template argument to the grid component. A grid is by default not distributed, because the default template argument is a non-distributed topology implementation. The grid class is extended with a new member function \texttt{update()} that makes the topology instance update all ghost points. The library contains two topology strategies, based on one dimensional and two dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
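+As a minimal sketch of this mechanism (illustrative names, not the library's actual signatures), the default topology turns \texttt{update()} into a no-op, while a distributed topology implementation would exchange the ghost points between MPI processes:
+\begin{lstlisting}
+struct no_topology {                     // default: the grid is not distributed
+  template <typename G> void update(G&) {}   // nothing to exchange
+};
+
+template <typename T, typename Topology = no_topology>
+class grid {
+  Topology topo;
+public:
+  void update() { topo.update(*this); }  // topology refreshes all ghost points
+  // ... grid data, axes and accessors ...
+};
+\end{lstlisting}
+A distributed topology thus drops in without changing the numerical code, since only the \texttt{update()} call site is affected.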
\begin{center}
\input{Chapters/chapter5/figures/ParallelInTime.tikz}
\end{center}
-\caption{Time domain decomposition. A compute node is assigned to each individual time sub-main to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime}
+\caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time sub-domain to compute the initial value problem. Consistency at the time sub-domain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm, eq. \eqref{ch5:eq:PARAREAL}. }\label{ch5:fig:ParallelInTime}
\end{figure}
\label{ch5:parareal}
\begin{center}
\input{Chapters/chapter5/figures/FullyDistributed.tikz}
\end{center}
-\caption{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades trough to rank $N$ where the final solution can be fetched. }
+\caption[Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel.]{\label{ch5:fig:FullyDistributedCores} Schematic visualization of a fully distributed work scheduling model for the parareal algorithm as proposed by Aubanel. Each GPU is responsible for computing the solution on a single time sub-domain. The computation is initiated at rank $0$ and cascades through to rank $N$, where the final solution can be fetched. }
\end{figure}
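+For reference, the predictor-corrector iteration of eq. \eqref{ch5:eq:PARAREAL} reduces to the following host-side sketch (scalar state and generic propagator types for brevity; in the distributed model above, each evaluation of the fine propagator maps to one GPU):
+\begin{lstlisting}
+#include <vector>
+// F: fine (accurate) propagator, G: coarse (cheap) propagator over one
+// time sub-domain; U[n] approximates the solution at time T_n.
+template <typename Fine, typename Coarse>
+void parareal(std::vector<double> &U, Fine F, Coarse G, int K)
+{
+  const int N = (int)U.size() - 1;
+  std::vector<double> Gold(N + 1), Uf(N + 1);
+  for (int n = 0; n < N; n++) {          // initial coarse prediction
+    Gold[n + 1] = G(U[n]);
+    U[n + 1] = Gold[n + 1];
+  }
+  for (int k = 0; k < K; k++) {
+    for (int n = 0; n < N; n++)          // fine sweeps: one per GPU, in parallel
+      Uf[n + 1] = F(U[n]);
+    for (int n = 0; n < N; n++) {        // sequential corrector sweep
+      double Gnew = G(U[n]);
+      U[n + 1] = Gnew + Uf[n + 1] - Gold[n + 1];
+      Gold[n + 1] = Gnew;
+    }
+  }
+}
+\end{lstlisting}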
\subsubsection{Computational complexity}
\label{ch5:fig:pararealRvsGPUs:b}
}
\end{center}
- \caption{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
+ \caption[Parareal convergence properties as a function of $R$ and the number of GPUs used.]{Parareal convergence properties as a function of $R$ and the number of GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
\end{figure}
\begin{figure}[!htb]
\label{ch5:fig:pararealGPUs:b}
}
\end{center}
- \caption{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs}
+ \caption[Parareal performance properties as a function of $R$ and the number of GPUs used.]{Parareal performance properties as a function of $R$ and the number of GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Test environment 2. }\label{ch5:fig:pararealGPUs}
\end{figure}
%TODO: Do we make this into a subsubsection:
\putbib[Chapters/chapter5/biblio5]
% Reset lst label and caption
-\lstset{label=,caption=}
+%\lstset{label=,caption=}
%\begin{algorithm}[H]
% \caption{Computing function in the final asynchronous scheme.}% without GPU computing.}
% \label{algo:ch6p2AsyncSyncComp}
-%\pagebreak
+\pagebreak
\begin{Listing}{algo:ch6p2AsyncSyncComp}{Computing function in the final asynchronous scheme}% without GPU computing.}
// Variables declarations and initialization
...
%\begin{algorithm}[H]
% \caption{Initialization of the main process of complete overlap with asynchronism.}
% \label{algo:ch6p2FullOverAsyncMain}
-\pagebreak
+%\pagebreak
\begin{Listing}{algo:ch6p2FullOverAsyncMain}{Initialization of the main process of complete overlap with asynchronism}
// Variables declarations and initialization
...
%\begin{algorithm}[H]
% \caption{Computing function in the final asynchronous scheme with CPU/GPU overlap.}
% \label{algo:ch6p2FullOverAsyncComp1}
-\pagebreak
+%\pagebreak
\begin{Listing}{algo:ch6p2FullOverAsyncComp1}{Computing function in the final asynchronous scheme with CPU/GPU overlap}
// Variables declarations and initialization
...
%\begin{algorithm}[H]
% \caption{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap.}
% \label{algo:ch6p2FullOverAsyncComp2}
-\pagebreak
+%\pagebreak
\begin{Listing}{algo:ch6p2FullOverAsyncComp2}{Auxiliary computing function in the final asynchronous scheme with CPU/GPU overlap}
// Variables declarations and initialization
... auxInput ... // Local array for input data
%\begin{algorithm}[H]
% \caption{Dynamic initialization of access mode and priorities.}
% \label{algo:ch6p3ORWLinit}
+\pagebreak
\begin{Listing}{algo:ch6p3ORWLinit}{Dynamic initialization of access mode and priorities}
/* One operation with priority 0 (highest) consists */
/* in copying from current to the GPU and run MM, there. */
% MainLaplace2D_ex035_nonlinearLaplace.m
\includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/SFwaves_snapshots_clustered-eps-converted-to.pdf}
}
-\caption{Numerical experiments to assess stability properties of numerical wave model. In three cases, computed snapshots are taken of the wave elevation over one wave period of time. In a) the grid distribution of nodes in a one-parameter mapping for the grid is illustrated. Results from changes in wave elevation are illustrated for b) a mildly nonlinear standing wave on a highly clustered grid, c) regular stream function wave of medium steepness in shallow water $(kh,H/L)=(0.5,0.0292)$ on a uniform grid ($N_x=80$) and d) for a nonuniform grid with a minimal grid spacing 20 times smaller(!). In every case the step size remains fixed at $\Delta t = T/160$ s corresponding to a Courant number $C_r=c\tfrac{\Delta t}{\Delta x}=0.5$ for the uniform grid. A 6'$th$ order scheme and explicit EKR4 time-stepping is used in each test case.}
+\caption[Numerical experiments to assess stability properties of the numerical wave model.]{Numerical experiments to assess stability properties of the numerical wave model. In three cases, computed snapshots are taken of the wave elevation over one wave period of time. In a) the grid distribution of nodes in a one-parameter mapping for the grid is illustrated. Results from changes in wave elevation are illustrated for b) a mildly nonlinear standing wave on a highly clustered grid, c) a regular stream function wave of medium steepness in shallow water $(kh,H/L)=(0.5,0.0292)$ on a uniform grid ($N_x=80$) and d) for a nonuniform grid with a minimal grid spacing 20 times smaller(!). In every case the step size remains fixed at $\Delta t = T/160$ s, corresponding to a Courant number $C_r=c\tfrac{\Delta t}{\Delta x}=0.5$ for the uniform grid. A $6^{th}$ order scheme and explicit ERK4 time-stepping are used in each test case.}
\label{ch7:numexp}
\end{figure}
%\newpage
\includegraphics[width=0.98\textwidth]{Chapters/chapter7/figures/nonlinearwavespenalty-eps-converted-to.pdf}
% Nx = 540, 6th order, vertical clustering, Nz=6;
}
-\caption{Snapshots at intervals $T/8$ over one wave period in time of computed a) small-amplitude $(kh,kH)=(0.63,0.005)$ and b) finite-amplitude $(kh,kH)=(1,0.41)$ stream function waves elevations having reached a steady state after transient startup. Combined wave generation and absorption zones in the western relaxation zone of both a) and b). In b) an absorption zone is positioned next to the eastern boundary and causes minor visible reflections. }
+\caption[Snapshots at intervals $T/8$ over one wave period in time.]{Snapshots at intervals $T/8$ over one wave period in time of computed wave elevations for a) a small-amplitude $(kh,kH)=(0.63,0.005)$ and b) a finite-amplitude $(kh,kH)=(1,0.41)$ stream function wave, each having reached a steady state after a transient startup. Combined wave generation and absorption zones occupy the western relaxation zone in both a) and b). In b), an absorption zone positioned next to the eastern boundary causes minor visible reflections. }
\label{ch7:figstandwave}
\end{figure}
The ratio between necessary data transfers and computational work for the proposed numerical model for free surface water waves is high enough to expect reasonable latency hiding. The data domain decomposition method consists of a logically structured division of the computational domain into multiple subdomains. Each of these subdomains is connected via fictitious ghost layers at the artificial boundaries, of width corresponding to the half-width of the finite difference stencils employed. This results in a favourable volume-to-boundary ratio as the problem size increases, diminishing the communication overhead for message passing. Information between subdomains is exchanged through ghost layers at every step of the iterative PDC method, in connection with the matrix-vector evaluation for the $\sigma$-transformed Laplace problem, and before relaxation steps in the multigrid method. A single global synchronization point occurs at most once per iteration, if convergence is monitored, where a global reduction step (inner product) between all processor nodes takes place. The main advantage of this decomposition strategy is that the decomposition into multiple subdomains is straightforward. However, it comes at the cost of extra data transfers to update the set of fictitious ghost layers.
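+A single ghost layer update along a one dimensional decomposition can be sketched as follows (standard MPI calls only; packing of non-contiguous layers and device-host staging are omitted, and boundary ranks pass \texttt{MPI\_PROC\_NULL} as neighbor):
+\begin{lstlisting}
+#include <mpi.h>
+// Exchanges one ghost layer with the west and east neighbors; `count'
+// is the number of values in a layer.
+void exchangeGhosts(double *sendW, double *recvW,
+                    double *sendE, double *recvE,
+                    int count, int west, int east, MPI_Comm comm)
+{
+  MPI_Request req[4];
+  MPI_Irecv(recvW, count, MPI_DOUBLE, west, 0, comm, &req[0]);
+  MPI_Irecv(recvE, count, MPI_DOUBLE, east, 1, comm, &req[1]);
+  MPI_Isend(sendE, count, MPI_DOUBLE, east, 0, comm, &req[2]);
+  MPI_Isend(sendW, count, MPI_DOUBLE, west, 1, comm, &req[3]);
+  MPI_Waitall(4, req, MPI_STATUSES_IGNORE);
+}
+\end{lstlisting}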
-The parallel domain decomposition solver has been validated against the sequential solvers with respect to algorithmic efficiency to establish that the code produce correct results. An analysis of the numerical efficiency have also been carried out on different GPU systems to identify comparative behaviors as both the problems sizes and number of compute nodes vary. For example, performance scalings on Test environment 1 and Test environment 2 are presented in figure \ref{ch7:fig:multigpuperformance}. The figure confirms that there is only a limited benefit from using multiple GPUs for small problem sizes, since the computational intensity is simply too low to efficiently hide the latency of message passing. A substantial speedup is achieved compared to the single GPU version, while being able to solve even larger systems.
-With the linear scaling of memory requirements and improved computational speed, the methodology based on multiple GPUs makes it possible to simulate water waves in very large numerical wave tanks with improved performance.
-
\begin{figure}[!htb]
\setlength\figureheight{0.30\textwidth}
\setlength\figurewidth{0.33\textwidth}
{\scriptsize\input{Chapters/chapter7/figures/TeslaK20SpeedupGPUvsCPU3D.tikz}}
}
\end{center}
- \caption{Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed and double precision arithmetics. Three dimensional nonlinear waves, using $6^{th}$ order finite difference approximations, preconditioned with one multigrid V-cycle and one pre- and post- Red-black Gauss-Seidel smoothing. Speedup compared to fastest known serial implementation. Using Test environment 3. CPU timings represent starting point for our investigations and has been obtained using Fortran 90 code and is based on a single-core run on a Intel Core i7, 2.80GHz processor.}\label{ch7:fig:perftimings}
+ \caption[Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed and double precision arithmetics.]{Performance timings per PDC iteration as a function of increasing problem size $N$, for single, mixed and double precision arithmetics. Three dimensional nonlinear waves, using $6^{th}$ order finite difference approximations, preconditioned with one multigrid V-cycle and one pre- and post- Red-black Gauss-Seidel smoothing. Speedup compared to the fastest known serial implementation. Using Test environment 3. The CPU timings, which represent the starting point for our investigations, were obtained using Fortran 90 code, based on a single-core run on an Intel Core i7 2.80GHz processor.}\label{ch7:fig:perftimings}
\end{figure}
+The parallel domain decomposition solver has been validated against the sequential solvers with respect to algorithmic efficiency, to establish that the code produces correct results. An analysis of the numerical efficiency has also been carried out on different GPU systems, to identify comparative behaviors as both the problem sizes and the number of compute nodes vary. For example, performance scalings on Test environment 1 and Test environment 2 are presented in figure \ref{ch7:fig:multigpuperformance}. The figure confirms that there is only a limited benefit from using multiple GPUs for small problem sizes, since the computational intensity is simply too low to efficiently hide the latency of message passing. A substantial speedup is achieved compared to the single GPU version, while being able to solve even larger systems.
+With the linear scaling of memory requirements and the improved computational speed, the methodology based on multiple GPUs makes it possible to simulate water waves in very large numerical wave tanks.
+
+
+
\begin{figure}[!htb]
{\scriptsize\input{Chapters/chapter7/figures/TeslaM2050MultiGPUScaling3D.tikz}}
}
\end{center}
- \caption{Domain decomposition performance on multi-GPU systems. Performance timings per PDC iteration as a function of increasing problem sizes using single precision. Same setup as in figure \ref{ch7:fig:perftimings}.}
+ \caption[Domain decomposition performance on multi-GPU systems.]{Domain decomposition performance on multi-GPU systems. Performance timings per PDC iteration as a function of increasing problem sizes using single precision. Same setup as in figure \ref{ch7:fig:perftimings}.}
\label{ch7:fig:multigpuperformance}
\end{figure}
\includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/kinematicsW_Nx30-HL90-p6_Nonlinear-eps-converted-to.pdf}
}
\end{center}
-\caption{Assessment of kinematic error is presented in terms of the depth-averaged error determined by \eqref{ch7:errkin} for a) scalar velocity potential and b) vertical velocity for a small-amplitude (linear) wave, and c) scalar velocity potential and d) vertical velocity for a finite-amplitude (nonlinear) wave with wave height $H/L=90\%(H/L)_\textrm{max}$.
+\caption[Assessment of the kinematic error in terms of the depth-averaged error.]{Assessment of the kinematic error in terms of the depth-averaged error determined by \eqref{ch7:errkin} for a) the scalar velocity potential and b) the vertical velocity for a small-amplitude (linear) wave, and c) the scalar velocity potential and d) the vertical velocity for a finite-amplitude (nonlinear) wave with wave height $H/L=90\%(H/L)_\textrm{max}$.
$N_z\in[6,12]$. Sixth order scheme. Clustered vertical grid. }
\label{ch7:figlinear2}
\end{figure}
\includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/PrecisionDOUBLE-eps-converted-to.pdf}
}
\end{center}
-\caption{Comparison between convergence histories for single and double precision computations using a PDC method for the solution of the transformed Laplace problem. Very steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0903)$. Discretizaiton based on $(N_x,N_z)=(15,9)$ with 6'$th$ order stencils.}
+\caption[Comparison between convergence histories for single and double precision computations using a PDC method for the solution of the transformed Laplace problem.]{Comparison between convergence histories for single and double precision computations using a PDC method for the solution of the transformed Laplace problem. Very steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0903)$. Discretization based on $(N_x,N_z)=(15,9)$ with $6^{th}$ order stencils.}
\label{ch7:convhist}
\end{figure}
\includegraphics[width=0.45\textwidth]{Chapters/chapter7/figures/ComparisonDCWithFiltering-eps-converted-to.pdf}
}
\end{center}
-\caption{Comparison between accuracy as a function of time for double precision calculations vs. single precision with and without filtering. The double precision result are unfiltered in each comparison and shows to be less sensitive to roundoff-errors. Medium steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0502)$. Discretization is based on $(N_x,N_z)=(30,6)$, A courant number of $C_r=0.5$ and 6'$th$ order stencils.}
+\caption[Comparison between accuracy as a function of time for double precision calculations vs. single precision with and without filtering.]{Comparison between accuracy as a function of time for double precision calculations vs. single precision with and without filtering. The double precision results are unfiltered in each comparison and prove less sensitive to round-off errors. Medium steep nonlinear stream function wave in intermediate water $(kh,H/L)=(1,0.0502)$. Discretization based on $(N_x,N_z)=(30,6)$, a Courant number of $C_r=0.5$ and $6^{th}$ order stencils.}
\label{ch7:filtering}
\end{figure}
{\scriptsize\input{Chapters/chapter7/figures/WhalinWaveHarmonics_T3_single.tikz}}
}
% \end{center}
- \caption{Harmonic analysis for the experiment of Whalin for $T=1,2,3\,s$ respectively. Measured experimental and computed results (single precision) are in good agreement. Test environment 1.}\label{ch7:whalinresults}
+ \caption[Harmonic analysis for the experiment of Whalin for $T=1,2,3\,s$ respectively.]{Harmonic analysis for the experiment of Whalin for $T=1,2,3\,s$ respectively. Measured experimental and computed results (single precision) are in good agreement. Test environment 1.}\label{ch7:whalinresults}
\end{figure}
\subsection{Acceleration via parallelism in time using `Parareal'}\label{ch7:parareal}\index{parareal}
\includegraphics[width=0.5\textwidth]{Chapters/chapter7/figures/PararealSpeedupGTX590_conv.pdf}
}
\end{center}
- \caption{(a) Parareal absolute timings for an increasingly number of water waves traveling one wave length, each wave resolution is ($33\times 9$). (b) Parareal speedup for two to sixteen compute nodes compared to the purely sequential single GPU solver. Notice how insensitive the parareal scheme is to the size of the problem solved. Test environment 2.}\label{ch7:fig:DDPA_SPEEDUP}
+ \caption[Parareal absolute timings and parareal speedup.]{(a) Parareal absolute timings for an increasing number of water waves traveling one wave length, each wave resolution being ($33\times 9$). (b) Parareal speedup for two to sixteen compute nodes compared to the purely sequential single GPU solver. Notice how insensitive the parareal scheme is to the size of the problem solved. Test environment 2.}\label{ch7:fig:DDPA_SPEEDUP}
\end{figure}
%
{\small\input{Chapters/chapter7/figures/WhalinPararealEfficiency.tikz}}
}
% \end{center}
- \caption{Parallel time integration using the parareal method. $R$ is the ratio between the complexity of the fine and coarse propagators. Test environment 2.}\label{ch7:fig:whalinparareal}
+ \caption[Parallel time integration using the parareal method.]{Parallel time integration using the parareal method. $R$ is the ratio between the complexity of the fine and coarse propagators. Test environment 2.}\label{ch7:fig:whalinparareal}
\end{figure}
% Comparison with DD
MM & $m \times (m-1)$ & $m \times (m-1)$ \\
\hline
\end{tabular}
- \caption{The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses. The parameters $n$, $m$ and $n'$ designate respectively the total number of jobs, the total number of machines and the number of remaining jobs to be scheduled for the sub-problems the lower bound is being computed.}
+ \caption[The different data structures of the $LB$ algorithm and their associated complexities in memory size and number of accesses.]{The different data structures of the $LB$ algorithm and their associated complexities in memory size and number of accesses. The parameters $n$, $m$ and $n'$ designate, respectively, the total number of jobs, the total number of machines and the number of remaining jobs to be scheduled for the sub-problem for which the lower bound is being computed.}
\label{ch8:tabMemComplex}
\end{table}
$20 \times 20$ & 3.800 (3.8KB) & 3.800 (7.6KB) & 400 (0.4KB) & 20 (0.04KB) & 380 (0.76KB) \\
\hline
\end{tabular}
-\caption{The sizes of each data structure for the different experimented problem instances. The sizes are given in number of elements and in bytes (between brackets).}
+\caption[The sizes of each data structure for the different problem instances used in the experiments.]{The sizes of each data structure for the different problem instances used in the experiments. The sizes are given in numbers of elements and in bytes (in brackets).}
\label{ch8:tabMemSizes}
\end{table}
% \hline
% \hline
\end{tabular}
- \caption{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ is placed in shared memory and all others are placed in global memory.}
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ is placed in shared memory and all others are placed in global memory.}
\label{ch8:PTM-on-SM}
\end{table}
% \hline
% \hline
\end{tabular}
- \caption{Speedup for different FSP instances and pool sizes obtained with data access optimization.
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization.
$JM$ is placed in shared memory and all others are placed in global memory.}
\label{ch8:JM-on-SM}
\end{table}
% \hline
% \hline
\end{tabular}
- \caption{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ and $JM$ are placed together in shared memory and all others are placed in global memory.}
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ and $JM$ are placed together in shared memory and all others are placed in global memory.}
\label{ch8:JM-PTM-on-SM}
\end{table}