diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex

index 55e56327ec1fad71bf272861e4f5239065b8fd10..6355b099460edd1f859ce812fcafe6f2b96b239e 100644 (file)
--- a/BookGPU/Chapters/chapter5/ch5.tex
+++ b/BookGPU/Chapters/chapter5/ch5.tex
@@ -546,11 +546,10 @@ We can now estimate the speedup, here denoted $\psi$, as the ratio between the c
  \begin{align}\label{ch5:eq:EstiSpeedBasicPara}
  \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}.
  \end{align}
-If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ needed for the algorithm to converge imposes an upper bound of $\frac{1}{k}$ on the obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R=\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$ between the cost of the fine and the coarse integrators. Using a slower but more accurate coarse integrator leads to convergence in fewer iterations $k$, but it also makes $R$ smaller. A smaller $R$ degrades the obtained speedup, as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it also lowers the upper bound on the attainable speedup. Conversely, $R$ \emph{cannot} be made arbitrarily large, since the number of iterations $k$ needed for convergence grows as the coarse integrator becomes less accurate. Obtaining speedup is therefore a trade-off between the time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion of this topic is available in \cite{ch5:ASNP12}. Parallel efficiencies reported in the literature are typically in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. 
Figure \ref{ch5:fig:pararealRvsGPUs} presents the number of iterations needed for convergence when the forward Euler method is used for both fine and coarse integration. $R$ is regulated by changing the time step size of the coarse integrator. Figure \ref{ch5:fig:pararealGPUs} presents speedup and parallel efficiency measurements. Notice that when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, even though it requires an extra parareal iteration that increases the total computational complexity.
+If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ needed for the algorithm to converge imposes an upper bound of $\frac{1}{k}$ on the obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R=\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$ between the cost of the fine and the coarse integrators. Using a slower but more accurate coarse integrator leads to convergence in fewer iterations $k$, but it also makes $R$ smaller. A smaller $R$ degrades the obtained speedup, as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it also lowers the upper bound on the attainable speedup. Conversely, $R$ \emph{cannot} be made arbitrarily large, since the number of iterations $k$ needed for convergence grows as the coarse integrator becomes less accurate. Obtaining speedup is therefore a trade-off between the time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion of this topic is available in \cite{ch5:ASNP12}. Parallel efficiencies reported in the literature are typically in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. 
Figure \ref{ch5:fig:pararealRvsGPUs} presents the number of iterations needed for convergence when the forward Euler method is used for both fine and coarse integration. $R$ is regulated by changing the time step size of the coarse integrator. Figure \ref{ch5:fig:pararealGPUs} presents speedup and parallel efficiency measurements. Notice that when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, even though it requires an extra parareal iteration that increases the total computational complexity.\eject
  
  
  
  
-%\clearpage
-\begin{figure}[!htb]
+\begin{figure}[t!]
      \setlength\figureheight{0.32\textwidth}
      \setlength\figurewidth{0.35\textwidth}
      \begin{center}
@@ -561,10 +560,11 @@ If we additionally assume that the time spent on coarse propagation is negligibl
      \subfigure[Iterations $K$ needed to obtain a relative error less than $10^{-5}$.]{
      {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
      \label{ch5:fig:pararealRvsGPUs:b}
-    }
+    }\vspace*{-8pt}
      \end{center}
      \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number of GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
  \end{figure}
+
  \begin{figure}[!htb]
      \setlength\figureheight{0.32\textwidth}
      \setlength\figurewidth{0.34\textwidth}
@@ -576,9 +576,9 @@ If we additionally assume that the time spent on coarse propagation is negligibl
      \subfigure[Measured parallel efficiency]{
      {\small\input{Chapters/chapter5/figures/pararealEfficiencyvsRvsGPUs.tikz}}
      \label{ch5:fig:pararealGPUs:b}
-    }
+    }\vspace*{-8pt}
      \end{center}
-    \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number of GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}
+    \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number of GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}\vspace*{-12pt}
  \end{figure}
  %TODO: Do we make this into a subsubsection:
  %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl}
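
The speedup estimate \eqref{ch5:eq:EstiSpeedBasicPara} touched by this hunk can be evaluated numerically to see how $R$ and $k$ interact. The following is a minimal sketch, not code from the book: the function names are hypothetical, and the values $N=16$, $k=3$, $R=64$ are illustrative assumptions rather than measurements. It uses $R=\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$, so the estimate becomes $\psi=N/\left((k+1)N/R+k\right)$.

```python
def parareal_speedup(n, k, r):
    """Estimated parareal speedup psi = N / ((k + 1) * N / R + k).

    n: number of time subdomains (N)
    k: number of parareal iterations needed for convergence
    r: cost ratio R = (C_F / C_G) * (delta T / delta t) of fine to coarse propagation
    """
    return n / ((k + 1) * n / r + k)


def parallel_efficiency(n, k, r):
    """Parallel efficiency psi / N for the same estimate."""
    return parareal_speedup(n, k, r) / n


# Illustrative values only (assumed, not taken from the measurements).
speedup = parareal_speedup(16, 3, 64)        # (k+1)N/R = 1, so psi = 16 / 4
efficiency = parallel_efficiency(16, 3, 64)

# Negligible coarse cost: psi approaches N / k.
limit_speedup = parareal_speedup(16, 3, 1e12)
```

With these assumed values the coarse propagation costs as much as one extra fine iteration, so the estimate drops from the limiting $N/k$ to $N/(k+1)$, which illustrates why a larger $R$ can pay off even when it costs an additional iteration.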