last version

author couturie <couturie@extinction>

Tue, 8 Oct 2013 18:28:36 +0000 (20:28 +0200)

committer couturie <couturie@extinction>

Tue, 8 Oct 2013 18:28:36 +0000 (20:28 +0200)
author couturie <couturie@extinction>
Tue, 8 Oct 2013 18:28:36 +0000 (20:28 +0200)
committer couturie <couturie@extinction>
Tue, 8 Oct 2013 18:28:36 +0000 (20:28 +0200)
diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex

index 55e56327ec1fad71bf272861e4f5239065b8fd10..6355b099460edd1f859ce812fcafe6f2b96b239e 100644 (file)
--- a/BookGPU/Chapters/chapter5/ch5.tex
+++ b/BookGPU/Chapters/chapter5/ch5.tex
@@ -546,11 +546,10 @@ We can now estimate the speedup, here denoted $\psi$, as the ratio between the c
  \begin{align}\label{ch5:eq:EstiSpeedBasicPara}
  \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}.
  \end{align}
-If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity.
+If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity.\eject
  
  
-%\clearpage
-\begin{figure}[!htb]
+\begin{figure}[t!]
      \setlength\figureheight{0.32\textwidth}
      \setlength\figurewidth{0.35\textwidth}
      \begin{center}
@@ -561,10 +560,11 @@ If we additionally assume that the time spent on coarse propagation is negligibl
      \subfigure[Iterations $K$ needed to obtain a relative error less than $10^{-5}$.]{
      {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
      \label{ch5:fig:pararealRvsGPUs:b}
-    }
+    }\vspace*{-8pt}
      \end{center}
      \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number of GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
  \end{figure}
+
  \begin{figure}[!htb]
      \setlength\figureheight{0.32\textwidth}
      \setlength\figurewidth{0.34\textwidth}
@@ -576,9 +576,9 @@ If we additionally assume that the time spent on coarse propagation is negligibl
      \subfigure[Measured parallel efficiency]{
      {\small\input{Chapters/chapter5/figures/pararealEfficiencyvsRvsGPUs.tikz}}
      \label{ch5:fig:pararealGPUs:b}
-    }
+    }\vspace*{-8pt}
      \end{center}
-    \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}
+    \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}\vspace*{-12pt}
  \end{figure}
  %TODO: Do we make this into a subsubsection:
  %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl}
diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex

index ce215650b12551d15131ccdea204209de12cb70e..7fdaace7f3468648572dbfd896e3413a6ec3bac6 100755 (executable)
--- a/BookGPU/Chapters/chapter6/PartieSync.tex
+++ b/BookGPU/Chapters/chapter6/PartieSync.tex
@@ -506,7 +506,8 @@ and twice as many GPU buffers.
  \begin{figure}[t]
    \centering
    \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
-  \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
+  \caption[Complete overlap of internode CPU communications,\break\hfill CPU/GPU data transfers, and GPU
+  computations, interleaving computation-communication iterations.]{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
    computations, interleaving computation-communication iterations.}
    \label{fig:ch6p1overlapinterleaved}
  \end{figure}
diff --git a/BookGPU/Chapters/chapter6/ch6.tex b/BookGPU/Chapters/chapter6/ch6.tex

index 65698d76c92a89410d74b1339e6920bdf0c1efe1..8e2b7e14ebaa1ad1122c7bfdfddd22afd5aebd16 100755 (executable)
--- a/BookGPU/Chapters/chapter6/ch6.tex
+++ b/BookGPU/Chapters/chapter6/ch6.tex
@@ -32,9 +32,9 @@
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  % Main document
  %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\chapterauthor{Sylvain Contassot-Vivier}{Université Lorraine, Loria UMR 7503 \& AlGorille INRIA Project Team, Nancy, France.}
-\chapterauthor{Stephane Vialle}{SUPELEC, UMI GT-CNRS 2958 \& AlGorille INRIA Project Team, Metz, France.}
-\chapterauthor{Jens Gustedt}{INRIA Nancy--Grand Est, AlGorille INRIA Project Team, Strasbourg, France.}
+\chapterauthor{Sylvain Contassot-Vivier}{Université Lorraine, Loria UMR 7503 \& AlGorille INRIA Project Team, Nancy, France}
+\chapterauthor{Stephane Vialle}{SUPELEC, UMI GT-CNRS 2958 \& AlGorille INRIA Project Team, Metz, France}
+\chapterauthor{Jens Gustedt}{INRIA Nancy--Grand Est, AlGorille INRIA Project Team, Strasbourg, France}
  
  \chapter{Development methodologies for GPU and cluster of GPUs}
  
diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex

index 0294f4887303bb76aaaf4bf6c14cc246b0e2639c..442c53a03ba02139a8b7ac7f8c68b084c9723823 100644 (file)
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -99,7 +99,7 @@ solutions. The process is repeated until a stopping criterion is
  satisfied. \emph{Evolutionary algorithms}, \emph{swarm
  optimization}, and \emph{ant colonies} fall into this class.
  
-\clearpage
+%\clearpage
  \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta}
  Optimization problems, whether real-life or academic, are more
  often NP-hard and CPU time and/or memory consuming. Metaheuristics
@@ -187,7 +187,7 @@ consuming. Unlike the two previous parallel models, the
  solution-level\index{metaheuristics!solution-level parallelism}
  parallel model is problem-dependent.}
  \end{itemize}
-\clearpage
+%\clearpage
  \section[Challenges for the design of GPU-based  metaheuristics]{Challenges for the design of GPU-based\hfill\break  metaheuristics}
  \label{ch8:sec:challenges}
  
@@ -251,7 +251,7 @@ data (e.g., data for fitness evaluation that all threads
  concurrently access) on the constant memory, and the most accessed
  data structures (e.g., population of individuals for a CUDA thread
  block) on the shared memory.
-
+\clearpage
  \subsection{Threads synchronization}
  \index{GPU!threads synchronization} The thread
  synchronization issue is caused by both the GPU architecture and
diff --git a/BookGPU/Makefile b/BookGPU/Makefile

index a444ead71eda1bbac2447670a92ec9e039e5fb62..1f720d16bc6615d133f49849018ca3095ddc9dc7 100644 (file)
--- a/BookGPU/Makefile
+++ b/BookGPU/Makefile
@@ -32,7 +32,10 @@ all:
         pdflatex ${BOOK}
         pdflatex ${BOOK}
         cp BookGPU.toc  BookGPU.toc_old   
-       cp BookGPU.toc_new  BookGPU.toc   #copy the corrected toc
+       cp BookGPU.toc_new2  BookGPU.toc   #copy the corrected toc
+       cp bu16.bbl bu16.bbl_old
+       cp bu16.bbl_new bu16.bbl
+
  
  
         pdflatex ${BOOK}
diff --git a/BookGPU/sunil.cls b/BookGPU/sunil.cls

index 9d425119454e8c049f06eb512a4b62f164d1f142..06b11dfbf200795d76996e933aa57ed1bd5766e2 100755 (executable)
--- a/BookGPU/sunil.cls
+++ b/BookGPU/sunil.cls
@@ -815,10 +815,8 @@
  
  \newcommand\subsubsection{\@ifstar{\@ssubsubsection}{\@trplarg{\@subsubsection}}}
  \def\@ssubsubsection#1{%
-  \if@ChapterTOCs
-    \myaddcontentsline{\@chaptoc}{chapsubsubsection}{\string\makebox[\subsecnumwidth][l]{}#1}\fi
-  \@startsection{subsubsection}{3}{\z@}{12\p@}{6\p@}{%
-    \SubsubsectionHeadFont}*{#1}}
+%  \if@ChapterTOCs    \myaddcontentsline{\@chaptoc}{chapsubsubsection}{\string\makebox[\subsecnumwidth][l]{}#1}\fi
+  \@startsection{subsubsection}{3}{\z@}{12\p@}{6\p@}{\SubsubsectionHeadFont}*{#1}}
  \def\@subsubsection[#1][#2]#3{%
    \if@ChapterTOCs
      \addtocounter{subsubsection}{1}%
@@ -991,7 +989,7 @@
  \newcommand\@pnumwidth{1.55em}
  \newcommand\@tocrmarg{2.55em}
  \newcommand\@dotsep{4.5}
-\setcounter{tocdepth}{3}
+\setcounter{tocdepth}{1}
  
  
  \newcounter{numauthors}
@@ -1050,7 +1048,7 @@
  %
  \def\l@part#1#2{% 
  \toc@draw
- \gdef\toc@draw{\draw@part{\large #1}{\large #2}}}
+ \gdef\toc@draw{\draw@part{\large #1}{#2}}}
  
    \def\l@fm#1#2{%
    \toc@draw
@@ -1140,7 +1138,7 @@
    \toc@draw
    \gdef\toc@draw{\draw@section{#1}{#2}}}
  \def\draw@section#1#2{%
-  \@dottedtocline{1}{\section@toc@skip}{\SectionTOCWidth}{#1 }{{
+  \@newdottedtocline{1}{\section@toc@skip}{\SectionTOCWidth}{#1 }{{
  \tocfont #2}}}
  \newlength\subsection@toc@skip
  \subsection@toc@skip\section@toc@skip
@@ -1185,6 +1183,28 @@
    \@dottedtocline{5}{\subparagraph@toc@skip}{6em}{#1}{{
  \tocfont #2}}}
  
+\def\@newdottedtocline#1#2#3#4#5{%
+  \ifnum #1>\c@tocdepth
+  \else
+    \vskip \z@ %\@plus.2\p@
+%%%    {\leftskip #2\relax\rightskip\@tocrmarg\parfillskip-\rightskip
+%%%      \parindent #2\relax\@afterindenttrue
+%%%      \interlinepenalty\@M
+%%%      \leavevmode
+%%%      \@tempdima #3\relax
+%%%      \advance\leftskip\@tempdima\null\hskip-\leftskip
+%%%      {#4\hfil}\nobreak
+%%%      \if@pdf
+%%%      \else
+%%%        \leaders\hbox{$\m@th\mkern\@dotsep mu\hbox{.}\mkern\@dotsep mu$}\hfill
+%%%        \nobreak
+%%%        \hb@xt@\@pnumwidth{\hfil\normalfont\normalcolor #5}%
+%%%\fi
+%%%      \par}
+     \fi
+      }
+
+
  \def\@dottedtocline#1#2#3#4#5{%
    \ifnum #1>\c@tocdepth
    \else
author	couturie <couturie@extinction>
	Tue, 8 Oct 2013 18:28:36 +0000 (20:28 +0200)
committer	couturie <couturie@extinction>
	Tue, 8 Oct 2013 18:28:36 +0000 (20:28 +0200)
BookGPU/Chapters/chapter5/ch5.tex		patch \| blob \| history
BookGPU/Chapters/chapter6/PartieSync.tex		patch \| blob \| history
BookGPU/Chapters/chapter6/ch6.tex		patch \| blob \| history
BookGPU/Chapters/chapter9/ch9.tex		patch \| blob \| history
BookGPU/Makefile		patch \| blob \| history
BookGPU/sunil.cls		patch \| blob \| history