From: couturie
Date: Tue, 8 Oct 2013 18:28:36 +0000 (+0200)
Subject: last version
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/2ce2baf7820f44ab044b4df98722576116551e57

last version
---

diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex
index 55e5632..6355b09 100644
--- a/BookGPU/Chapters/chapter5/ch5.tex
+++ b/BookGPU/Chapters/chapter5/ch5.tex
@@ -546,11 +546,10 @@ We can now estimate the speedup, here denoted $\psi$, as the ratio between the c
 \begin{align}\label{ch5:eq:EstiSpeedBasicPara}
 \psi=\frac{N\frac{\Delta T}{\delta t}\mathcal{C}_\mathcal{F}}{\left(k+1\right)N\mathcal{C}_\mathcal{G}\frac{\Delta T}{\delta T}+k\mathcal{C}_\mathcal{F}\frac{\Delta T}{\delta t}}=\frac{N}{\left(k+1\right)N\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}+k}.
 \end{align}
-If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity.
+If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice that when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, even though it requires an extra parareal iteration that increases the total computational complexity.\eject
-%\clearpage
-\begin{figure}[!htb]
+\begin{figure}[t!]
 \setlength\figureheight{0.32\textwidth}
 \setlength\figurewidth{0.35\textwidth}
 \begin{center}
@@ -561,10 +560,11 @@ If we additionally assume that the time spent on coarse propagation is negligibl
 \subfigure[Iterations $K$ needed to obtain a relative error less than $10^{-5}$.]{
 {\small\input{Chapters/chapter5/figures/pararealKvsRvsGPUs.tikz}}
 \label{ch5:fig:pararealRvsGPUs:b}
- }
+ }\vspace*{-8pt}
 \end{center}
 \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number of GPUs used.
 The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
 \end{figure}
+
 \begin{figure}[!htb]
 \setlength\figureheight{0.32\textwidth}
 \setlength\figurewidth{0.34\textwidth}
@@ -576,9 +576,9 @@ If we additionally assume that the time spent on coarse propagation is negligibl
 \subfigure[Measured parallel efficiency]{
 {\small\input{Chapters/chapter5/figures/pararealEfficiencyvsRvsGPUs.tikz}}
 \label{ch5:fig:pararealGPUs:b}
- }
+ }\vspace*{-8pt}
 \end{center}
- \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}
+ \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number of GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}\vspace*{-12pt}
 \end{figure}
 %TODO: Do we make this into a subsubsection:
 %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl}
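A quick worked instance of the estimate \eqref{ch5:eq:EstiSpeedBasicPara} in the hunk above (the values are illustrative assumptions, not the measurements of Figure \ref{ch5:fig:pararealGPUs}): take $N=16$ time subdomains, $k=2$ iterations to converge, and a coarse-to-fine cost ratio $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}=1/R=0.02$. Then

\begin{align*}
\psi=\frac{16}{(2+1)\cdot 16\cdot 0.02+2}=\frac{16}{2.96}\approx 5.4,
\qquad \frac{\psi}{N}\approx 34\%,
\end{align*}

which sits inside the 20--50\% efficiency range quoted in the paragraph, whereas the idealized bound $\psi=N/k$ would give $8$.
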
diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex
index ce21565..7fdaace 100755
--- a/BookGPU/Chapters/chapter6/PartieSync.tex
+++ b/BookGPU/Chapters/chapter6/PartieSync.tex
@@ -506,7 +506,8 @@ and twice as many GPU buffers.
 \begin{figure}[t]
   \centering
   \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
-  \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
+  \caption[Complete overlap of internode CPU communications,\break\hfill CPU/GPU data transfers, and GPU
+    computations, interleaving computation-communication iterations.]{Complete overlap of internode CPU communications, CPU/GPU data transfers, and GPU
     computations, interleaving computation-communication iterations.}
   \label{fig:ch6p1overlapinterleaved}
 \end{figure}
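The caption above names the pattern; a minimal CUDA/MPI sketch of it follows, assuming double buffering ("twice as many GPU buffers") and pinned host memory from cudaMallocHost. All identifiers (update, iterate, the halo size) are invented for illustration and are not the chapter's actual code.

// Sketch: overlap GPU computation, CPU/GPU transfers, and internode MPI
// communication using two CUDA streams and two device buffers.
#include <cuda_runtime.h>
#include <mpi.h>

__global__ void update(const float *in, float *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = 0.5f * in[i];   // placeholder for the real stencil
}

void iterate(float *d_buf[2], float *h_halo, int n, int halo,
             MPI_Comm comm, int prev, int next, int iters)
{
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);
    for (int it = 0; it < iters; ++it) {
        float *d_in  = d_buf[it & 1];
        float *d_out = d_buf[(it + 1) & 1];
        // 1) launch this iteration's GPU computation asynchronously
        update<<<(n + 255) / 256, 256, 0, compute>>>(d_in, d_out, n);
        // 2) meanwhile, drain the halo of d_in to the host in the copy stream
        cudaMemcpyAsync(h_halo, d_in, halo * sizeof(float),
                        cudaMemcpyDeviceToHost, copy);
        cudaStreamSynchronize(copy);
        // 3) internode CPU communication overlaps the still-running kernel
        MPI_Sendrecv_replace(h_halo, halo, MPI_FLOAT, next, 0, prev, 0,
                             comm, MPI_STATUS_IGNORE);
        // 4) once the kernel finishes, stage the received halo into d_out
        cudaStreamSynchronize(compute);
        cudaMemcpyAsync(d_out, h_halo, halo * sizeof(float),
                        cudaMemcpyHostToDevice, copy);
        cudaStreamSynchronize(copy);    // halo visible before the next launch
    }
    cudaStreamDestroy(compute);
    cudaStreamDestroy(copy);
}

Step 3 is the point of the figure: the MPI exchange and the device-to-host/host-to-device copies interleave with the kernel of the same iteration instead of serializing after it.
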
diff --git a/BookGPU/Chapters/chapter6/ch6.tex b/BookGPU/Chapters/chapter6/ch6.tex
index 65698d7..8e2b7e1 100755
--- a/BookGPU/Chapters/chapter6/ch6.tex
+++ b/BookGPU/Chapters/chapter6/ch6.tex
@@ -32,9 +32,9 @@
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
 % Main document
 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
-\chapterauthor{Sylvain Contassot-Vivier}{Université Lorraine, Loria UMR 7503 \& AlGorille INRIA Project Team, Nancy, France.}
-\chapterauthor{Stephane Vialle}{SUPELEC, UMI GT-CNRS 2958 \& AlGorille INRIA Project Team, Metz, France.}
-\chapterauthor{Jens Gustedt}{INRIA Nancy--Grand Est, AlGorille INRIA Project Team, Strasbourg, France.}
+\chapterauthor{Sylvain Contassot-Vivier}{Université Lorraine, Loria UMR 7503 \& AlGorille INRIA Project Team, Nancy, France}
+\chapterauthor{Stephane Vialle}{SUPELEC, UMI GT-CNRS 2958 \& AlGorille INRIA Project Team, Metz, France}
+\chapterauthor{Jens Gustedt}{INRIA Nancy--Grand Est, AlGorille INRIA Project Team, Strasbourg, France}
 
 \chapter{Development methodologies for GPU and cluster of GPUs}
 
diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex
index 0294f48..442c53a 100644
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -99,7 +99,7 @@ solutions. The process is repeated until a stopping criterion is satisfied.
 \emph{Evolutionary algorithms}, \emph{swarm optimization}, and \emph{ant colonies} fall into this class.
-\clearpage
+%\clearpage
 \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta}
 Optimization problems, whether real-life or academic, are more often NP-hard and CPU time and/or memory consuming. Metaheuristics
@@ -187,7 +187,7 @@ consuming.
 Unlike the two previous parallel models, the solution-level\index{metaheuristics!solution-level parallelism} parallel model is problem-dependent.}
 \end{itemize}
-\clearpage
+%\clearpage
 \section[Challenges for the design of GPU-based metaheuristics]{Challenges for the design of GPU-based\hfill\break metaheuristics}
 \label{ch8:sec:challenges}
@@ -251,7 +251,7 @@ data (e.g., data for fitness evaluation that all threads concurrently access) on
 the constant memory, and the most accessed data structures (e.g.,
 population of individuals for a CUDA thread block) on the shared
 memory.
-
+\clearpage
 \subsection{Threads synchronization}
 \index{GPU!threads synchronization}
 The thread synchronization issue is caused by both the GPU architecture and
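The last hunk above recommends placing shared read-only data in constant memory and the most accessed per-block structures in shared memory. A hedged CUDA sketch of that placement (GENES, the dot-product "fitness", and all names are assumptions for illustration, not the chapter's code):

#include <cuda_runtime.h>

#define GENES 64                       // assumed chromosome length
__constant__ float c_evalData[GENES];  // read-only, cached and broadcast

// One thread block per individual, one thread per gene. Launch with
// evaluate<<<popSize, GENES>>> after cudaMemcpyToSymbol(c_evalData, ...).
__global__ void evaluate(const float *population, float *fitness, int popSize)
{
    __shared__ float s_term[GENES];    // the block's most-accessed data
    int ind = blockIdx.x;
    int g   = threadIdx.x;
    if (ind >= popSize) return;        // uniform per block, no divergence

    // each thread scores one gene against the constant evaluation data
    s_term[g] = population[ind * GENES + g] * c_evalData[g];
    __syncthreads();

    // tree reduction in shared memory; thread 0 writes the fitness
    for (int s = GENES / 2; s > 0; s >>= 1) {
        if (g < s) s_term[g] += s_term[g + s];
        __syncthreads();
    }
    if (g == 0) fitness[ind] = s_term[0];
}
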
diff --git a/BookGPU/Makefile b/BookGPU/Makefile
index a444ead..1f720d1 100644
--- a/BookGPU/Makefile
+++ b/BookGPU/Makefile
@@ -32,7 +32,10 @@ all:
 	pdflatex ${BOOK}
 	pdflatex ${BOOK}
 	cp BookGPU.toc BookGPU.toc_old
-	cp BookGPU.toc_new BookGPU.toc #copy the corrected toc
+	cp BookGPU.toc_new2 BookGPU.toc #copy the corrected toc
+	cp bu16.bbl bu16.bbl_old
+	cp bu16.bbl_new bu16.bbl
+	pdflatex ${BOOK}
diff --git a/BookGPU/sunil.cls b/BookGPU/sunil.cls
index 9d42511..06b11df 100755
--- a/BookGPU/sunil.cls
+++ b/BookGPU/sunil.cls
@@ -815,10 +815,8 @@
 \newcommand\subsubsection{\@ifstar{\@ssubsubsection}{\@trplarg{\@subsubsection}}}
 \def\@ssubsubsection#1{%
-  \if@ChapterTOCs
-  \myaddcontentsline{\@chaptoc}{chapsubsubsection}{\string\makebox[\subsecnumwidth][l]{}#1}\fi
-  \@startsection{subsubsection}{3}{\z@}{12\p@}{6\p@}{%
-    \SubsubsectionHeadFont}*{#1}}
+%  \if@ChapterTOCs \myaddcontentsline{\@chaptoc}{chapsubsubsection}{\string\makebox[\subsecnumwidth][l]{}#1}\fi
+  \@startsection{subsubsection}{3}{\z@}{12\p@}{6\p@}{\SubsubsectionHeadFont}*{#1}}
 \def\@subsubsection[#1][#2]#3{%
 \if@ChapterTOCs
 \addtocounter{subsubsection}{1}%
@@ -991,7 +989,7 @@
 \newcommand\@pnumwidth{1.55em}
 \newcommand\@tocrmarg{2.55em}
 \newcommand\@dotsep{4.5}
-\setcounter{tocdepth}{3}
+\setcounter{tocdepth}{1}
 
 \newcounter{numauthors}
@@ -1050,7 +1048,7 @@
 %
 \def\l@part#1#2{%
 \toc@draw
-  \gdef\toc@draw{\draw@part{\large #1}{\large #2}}}
+  \gdef\toc@draw{\draw@part{\large #1}{#2}}}
 \def\l@fm#1#2{%
 \toc@draw
@@ -1140,7 +1138,7 @@
 \toc@draw
 \gdef\toc@draw{\draw@section{#1}{#2}}}
 \def\draw@section#1#2{%
-  \@dottedtocline{1}{\section@toc@skip}{\SectionTOCWidth}{#1 }{{ \tocfont #2}}}
+  \@newdottedtocline{1}{\section@toc@skip}{\SectionTOCWidth}{#1 }{{ \tocfont #2}}}
 \newlength\subsection@toc@skip
 \subsection@toc@skip\section@toc@skip
@@ -1185,6 +1183,28 @@
 \@dottedtocline{5}{\subparagraph@toc@skip}{6em}{#1}{{ \tocfont #2}}}
+\def\@newdottedtocline#1#2#3#4#5{%
+  \ifnum #1>\c@tocdepth
+  \else
+    \vskip \z@ %\@plus.2\p@
+%%% {\leftskip #2\relax\rightskip\@tocrmarg\parfillskip-\rightskip
+%%% \parindent #2\relax\@afterindenttrue
+%%% \interlinepenalty\@M
+%%% \leavevmode
+%%% \@tempdima #3\relax
+%%% \advance\leftskip\@tempdima\null\hskip-\leftskip
+%%% {#4\hfil}\nobreak
+%%% \if@pdf
+%%% \else
+%%% \leaders\hbox{$\m@th\mkern\@dotsep mu\hbox{.}\mkern\@dotsep mu$}\hfill
+%%% \nobreak
+%%% \hb@xt@\@pnumwidth{\hfil\normalfont\normalcolor #5}%
+%%%\fi
+%%% \par}
+  \fi
+  }
+
+
 \def\@dottedtocline#1#2#3#4#5{%
 \ifnum #1>\c@tocdepth
 \else