From 2ce2baf7820f44ab044b4df98722576116551e57 Mon Sep 17 00:00:00 2001 From: couturie Date: Tue, 8 Oct 2013 20:28:36 +0200 Subject: [PATCH] last version --- BookGPU/Chapters/chapter5/ch5.tex | 12 ++++----- BookGPU/Chapters/chapter6/PartieSync.tex | 3 ++- BookGPU/Chapters/chapter6/ch6.tex | 6 ++--- BookGPU/Chapters/chapter9/ch9.tex | 6 ++--- BookGPU/Makefile | 5 +++- BookGPU/sunil.cls | 34 +++++++++++++++++++----- 6 files changed, 45 insertions(+), 21 deletions(-) diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex index 55e5632..6355b09 100644 --- a/BookGPU/Chapters/chapter5/ch5.tex +++ b/BookGPU/Chapters/chapter5/ch5.tex @@ -546,11 +546,10 @@ We can now estimate the speedup, here denoted $\psi$, as the ratio between the c \begin{align}\label{ch5:eq:EstiSpeedBasicPara} \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}. \end{align} -If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity. 
+If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice that when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, even though it requires an extra parareal iteration that increases the total computational complexity.\eject -%\clearpage -\begin{figure}[!htb] +\begin{figure}[t!] \setlength\figureheight{0.32\textwidth} \setlength\figurewidth{0.35\textwidth} \begin{center} @@ -561,10 +560,11 @@ If we additionally assume that the time spent on coarse propagation is negligibl \subfigure[Iterations $K$ needed to obtain a relative error less than $10^{-5}$.]{ {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}} \label{ch5:fig:pararealRvsGPUs:b} - } + }\vspace*{-8pt} \end{center} \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number of GPUs used.
The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs} \end{figure} + \begin{figure}[!htb] \setlength\figureheight{0.32\textwidth} \setlength\figurewidth{0.34\textwidth} @@ -576,9 +576,9 @@ If we additionally assume that the time spent on coarse propagation is negligibl \subfigure[Measured parallel efficiency]{ {\small\input{Chapters/chapter5/figures/pararealEfficiencyvsRvsGPUs.tikz}} \label{ch5:fig:pararealGPUs:b} - } + }\vspace*{-8pt} \end{center} - \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs} + \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number of GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}\vspace*{-12pt} \end{figure} %TODO: Do we make this into a subsubsection: %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl} diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex index ce21565..7fdaace 100755 --- a/BookGPU/Chapters/chapter6/PartieSync.tex +++ b/BookGPU/Chapters/chapter6/PartieSync.tex @@ -506,7 +506,8 @@ and twice as many GPU buffers. \begin{figure}[t] \centering \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf} - \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU + \caption[Complete overlap of internode CPU communications,\break\hfill CPU/GPU data transfers, and GPU + computations, interleaving computation-communication iterations.]{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU computations, interleaving computation-communication iterations.} \label{fig:ch6p1overlapinterleaved} \end{figure} diff --git a/BookGPU/Chapters/chapter6/ch6.tex b/BookGPU/Chapters/chapter6/ch6.tex index 65698d7..8e2b7e1 100755 --- a/BookGPU/Chapters/chapter6/ch6.tex +++ b/BookGPU/Chapters/chapter6/ch6.tex @@ -32,9 +32,9 @@ %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Main document %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% -\chapterauthor{Sylvain Contassot-Vivier}{Université Lorraine, Loria UMR 7503 \& AlGorille INRIA Project Team, Nancy, France.} -\chapterauthor{Stephane Vialle}{SUPELEC, UMI GT-CNRS 2958 \& AlGorille INRIA Project Team, Metz, France.} -\chapterauthor{Jens Gustedt}{INRIA Nancy--Grand Est, AlGorille INRIA Project Team, Strasbourg, France.} +\chapterauthor{Sylvain Contassot-Vivier}{Université Lorraine, Loria UMR 7503 \& AlGorille INRIA Project Team, Nancy, France} +\chapterauthor{Stephane Vialle}{SUPELEC, UMI GT-CNRS 2958 \& AlGorille INRIA Project Team, Metz, France} +\chapterauthor{Jens Gustedt}{INRIA Nancy--Grand Est, AlGorille INRIA Project Team, Strasbourg, France} \chapter{Development methodologies for GPU and cluster of GPUs} diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex index 0294f48..442c53a 100644 --- a/BookGPU/Chapters/chapter9/ch9.tex +++ b/BookGPU/Chapters/chapter9/ch9.tex @@ -99,7 +99,7 @@ solutions. The process is repeated until a stopping criterion is satisfied.
\emph{Evolutionary algorithms}, \emph{swarm optimization}, and \emph{ant colonies} fall into this class. -\clearpage +%\clearpage \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta} Optimization problems, whether real-life or academic, are more often NP-hard and CPU time and/or memory consuming. Metaheuristics @@ -187,7 +187,7 @@ consuming. Unlike the two previous parallel models, the solution-level\index{metaheuristics!solution-level parallelism} parallel model is problem-dependent.} \end{itemize} -\clearpage +%\clearpage \section[Challenges for the design of GPU-based metaheuristics]{Challenges for the design of GPU-based\hfill\break metaheuristics} \label{ch8:sec:challenges} @@ -251,7 +251,7 @@ data (e.g., data for fitness evaluation that all threads concurrently access) on the constant memory, and the most accessed data structures (e.g., population of individuals for a CUDA thread block) on the shared memory. - +\clearpage \subsection{Threads synchronization} \index{GPU!threads synchronization} The thread synchronization issue is caused by both the GPU architecture and diff --git a/BookGPU/Makefile b/BookGPU/Makefile index a444ead..1f720d1 100644 --- a/BookGPU/Makefile +++ b/BookGPU/Makefile @@ -32,7 +32,10 @@ all: pdflatex ${BOOK} pdflatex ${BOOK} cp BookGPU.toc BookGPU.toc_old - cp BookGPU.toc_new BookGPU.toc #copy the corrected toc + cp BookGPU.toc_new2 BookGPU.toc #copy the corrected toc + cp bu16.bbl bu16.bbl_old + cp bu16.bbl_new bu16.bbl + pdflatex ${BOOK} diff --git a/BookGPU/sunil.cls b/BookGPU/sunil.cls index 9d42511..06b11df 100755 --- a/BookGPU/sunil.cls +++ b/BookGPU/sunil.cls @@ -815,10 +815,8 @@ \newcommand\subsubsection{\@ifstar{\@ssubsubsection}{\@trplarg{\@subsubsection}}} \def\@ssubsubsection#1{% - \if@ChapterTOCs - \myaddcontentsline{\@chaptoc}{chapsubsubsection}{\string\makebox[\subsecnumwidth][l]{}#1}\fi - \@startsection{subsubsection}{3}{\z@}{12\p@}{6\p@}{% - \SubsubsectionHeadFont}*{#1}} +% \if@ChapterTOCs \myaddcontentsline{\@chaptoc}{chapsubsubsection}{\string\makebox[\subsecnumwidth][l]{}#1}\fi + \@startsection{subsubsection}{3}{\z@}{12\p@}{6\p@}{\SubsubsectionHeadFont}*{#1}} \def\@subsubsection[#1][#2]#3{% \if@ChapterTOCs \addtocounter{subsubsection}{1}% @@ -991,7 +989,7 @@ \newcommand\@pnumwidth{1.55em} \newcommand\@tocrmarg{2.55em} \newcommand\@dotsep{4.5} -\setcounter{tocdepth}{3} +\setcounter{tocdepth}{1} \newcounter{numauthors} @@ -1050,7 +1048,7 @@ % \def\l@part#1#2{% \toc@draw - \gdef\toc@draw{\draw@part{\large #1}{\large #2}}} + \gdef\toc@draw{\draw@part{\large #1}{#2}}} \def\l@fm#1#2{% \toc@draw @@ -1140,7 +1138,7 @@ \toc@draw \gdef\toc@draw{\draw@section{#1}{#2}}} \def\draw@section#1#2{% - \@dottedtocline{1}{\section@toc@skip}{\SectionTOCWidth}{#1 }{{ + \@newdottedtocline{1}{\section@toc@skip}{\SectionTOCWidth}{#1 }{{ \tocfont #2}}} \newlength\subsection@toc@skip \subsection@toc@skip\section@toc@skip @@ -1185,6 +1183,28 @@ \@dottedtocline{5}{\subparagraph@toc@skip}{6em}{#1}{{ \tocfont #2}}} +\def\@newdottedtocline#1#2#3#4#5{% + \ifnum #1>\c@tocdepth + \else + \vskip \z@ %\@plus.2\p@ +%%% {\leftskip #2\relax\rightskip\@tocrmarg\parfillskip-\rightskip +%%% \parindent #2\relax\@afterindenttrue +%%% \interlinepenalty\@M +%%% \leavevmode +%%% \@tempdima #3\relax +%%% \advance\leftskip\@tempdima\null\hskip-\leftskip +%%% {#4\hfil}\nobreak +%%% \if@pdf +%%% \else +%%% \leaders\hbox{$\m@th\mkern\@dotsep mu\hbox{.}\mkern\@dotsep mu$}\hfill +%%% \nobreak +%%% \hb@xt@\@pnumwidth{\hfil\normalfont\normalcolor #5}% +%%%\fi +%%% \par} + \fi + } + 
+ \def\@dottedtocline#1#2#3#4#5{% \ifnum #1>\c@tocdepth \else -- 2.39.5
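As a concrete illustration of the speedup estimate \eqref{ch5:eq:EstiSpeedBasicPara} in the chapter 5 hunk above (values assumed purely for illustration, not taken from the reported measurements): with $N=16$ time subdomains, $k=2$ parareal iterations, and a coarse-to-fine cost ratio of $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}=10^{-2}$ (i.e., $R=100$), the estimate gives
\begin{align*}
\psi=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}
    =\frac{16}{3\cdot16\cdot10^{-2}+2}=\frac{16}{2.48}\approx 6.5,
\end{align*}
corresponding to a parallel efficiency of $\psi/N\approx 40\%$, in line with the 20--50\% range quoted in the text.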