From 65f8be28e79668334c6fcbc3a2af46a7c8a2eab0 Mon Sep 17 00:00:00 2001
From: couturie <couturie@extinction>
Date: Thu, 3 Oct 2013 21:33:31 +0200
Subject: [PATCH 1/1] new

---
 BookGPU/Chapters/chapter1/ch1.tex         |  8 +++---
 BookGPU/Chapters/chapter10/ch10.tex       |  2 +-
 BookGPU/Chapters/chapter18/ch18.tex       |  2 +-
 BookGPU/Chapters/chapter2/ch2.tex         |  4 +--
 BookGPU/Chapters/chapter3/ch3.tex         |  8 +++---
 BookGPU/Chapters/chapter4/ch4.tex         | 10 ++++----
 BookGPU/Chapters/chapter5/ch5.tex         | 31 +++++++++++------------
 BookGPU/Chapters/chapter6/PartieAsync.tex |  2 +-
 BookGPU/Chapters/chapter8/ch8.tex         |  4 +--
 BookGPU/Chapters/chapter9/ch9.tex         |  6 ++---
 BookGPU/Makefile                          |  2 ++
 11 files changed, 40 insertions(+), 39 deletions(-)

diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex
index 2fef3a4..bfa57c8 100755
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -59,10 +59,10 @@ have been  proposed. The  other well-known alternative  is OpenCL which  aims at
 proposing an alternative  to CUDA and which is  multiplatform and portable. This
 is a  great advantage since  it is even  possible to execute OpenCL  programs on
 traditional CPUs.  The main drawback is  that it is less close to the hardware
-and  consequently it sometimes  provides  less efficient  programs. Moreover,  CUDA
+and,  consequently, it sometimes  provides  less efficient  programs. Moreover,  CUDA
 benefits from  more mature compilation and optimization  procedures.  Other less
 known environments have been proposed,  but most of them have been discontinued,
-such FireStream by ATI which is  not maintained anymore and has been replaced by
+such as FireStream by ATI, which is  not maintained anymore and has been replaced by
 OpenCL and  BrookGPU  by  Stanford  University~\cite{ch1:Buck:2004:BGS}.   Another
 environment based on  pragma (insertion of pragma directives  inside the code to
 help  the  compiler  to generate  efficient  code)  is  called OpenACC.   For  a
@@ -267,8 +267,8 @@ to fill the shared  memory at the start of the kernel  with global data that are
 used very  frequently, then threads can  access it for  their computation.  Threads
 can obviously change  the content of this shared  memory either with computation
 or by loading  other data and they can  store its content in the  global memory. So
-shared memory can  be seen as a cache memory which is manageable manually. This
-obviously  requires an effort from the programmer.
+shared memory can  be seen as a cache memory, which is manually managed. This
+obviously  requires effort from the programmer.
 
 On  recent cards,  the programmer  may decide  what amount  of cache  memory and
 shared memory is attributed to a kernel. The cache memory is an L1 cache which is
diff --git a/BookGPU/Chapters/chapter10/ch10.tex b/BookGPU/Chapters/chapter10/ch10.tex
index 17b3b4f..c481bc4 100644
--- a/BookGPU/Chapters/chapter10/ch10.tex
+++ b/BookGPU/Chapters/chapter10/ch10.tex
@@ -3,7 +3,7 @@
 %\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
 
 %\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm}
-\chapter{Linear Programming on a GPU: A~Case~Study} 
+\chapter{Linear programming on a GPU: a~case~study} 
 \section{Introduction}
 \label{chXXX:sec:intro}
 The simplex method~\cite{VCLP} is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947. A more efficient method, named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power.
diff --git a/BookGPU/Chapters/chapter18/ch18.tex b/BookGPU/Chapters/chapter18/ch18.tex
index 9e92d3e..155cc04 100755
--- a/BookGPU/Chapters/chapter18/ch18.tex
+++ b/BookGPU/Chapters/chapter18/ch18.tex
@@ -81,7 +81,7 @@ naive and improved efficient generators for CPU and for GPU.
 These generators are finally experimented in Section~\ref{sec:experiments}.
 
 
-\section{Basic remindees}
+\section{Basic reminders}
 \label{section:BASIC RECALLS}
 
 This section is devoted to basic definitions and terminologies in the fields of
diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index 490d753..906c7b8 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -25,8 +25,8 @@ are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
 As GPUs have  their own memory, the first step consists  of allocating memory on
 the  GPU.    A  call  to   \texttt{cudaMalloc}\index{CUDA  functions!cudaMalloc}
 allocates memory on  the GPU.  The first parameter of this  function is a pointer
-on a  memory on the  device, i.e. the  GPU. The second parameter  represents the
-size of the allocated variables, this size is expressed in bits.
+to a  memory area on the  device, i.e., the  GPU. The second parameter  represents the
+size of the allocated variables; this size is expressed in bytes.
 \pagebreak
 \lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
 
diff --git a/BookGPU/Chapters/chapter3/ch3.tex b/BookGPU/Chapters/chapter3/ch3.tex
index b8ff22c..e9199dc 100755
--- a/BookGPU/Chapters/chapter3/ch3.tex
+++ b/BookGPU/Chapters/chapter3/ch3.tex
@@ -4,7 +4,7 @@
 \newcommand{\kr}{\includegraphics[scale=0.6]{Chapters/chapter3/img/kernRight.png}}
 
 
-\chapter{Setting up the environment.}
+\chapter{Setting up the environment}
 Image processing using a GPU often means using it as a general purpose computing processor, which soon brings up the issue of data transfers, especially when kernel runtime is fast and/or when large data sets are processed.
 The truth is that, in certain cases, data transfers between GPU and CPU are slower than the actual computation on GPU. 
 It remains that global runtime can still be faster than similar processes run on CPU.
@@ -15,7 +15,7 @@ Obviously, our code originally accepts various image dimensions and can process
 However, so as to propose concise and more readable code, we will assume the following limitations:
 16~bit-coded gray-level input images whose dimensions $H\times W$ are multiples of 512 pixels. 
 
-\section{Data transfers, memory management.}
+\section{Data transfers, memory management}
 This section deals with the following issues: 
 \begin{enumerate}
 \item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory hierarchy!texture memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images.
@@ -105,7 +105,7 @@ Designing a 2D median filter basically consists of defining a square window $H(i
 \begin{figure}[b]
    \centering
    \includegraphics[width=8cm]{Chapters/chapter3/img/median_1.png}
-   \caption{Example of 5x5 median filtering}
+   \caption{Example of 5x5 median filtering.}
    \label{fig:median_1}
 \end{figure}
 Figure \ref{fig:sap_examples} shows an example of a $512\times 512$ pixel image, corrupted by a  \textit{salt and pepper} noise and the denoised versions, output respectively by a $3\times 3$, a $5\times 5$, and 2 iterations of a $3\times 3$ median filter.
@@ -139,7 +139,7 @@ On the GPU's side, we note high dependence on window size due to the redundancy
 \begin{figure}[h]
    \centering
    \includegraphics[width=5cm]{Chapters/chapter3/img/median_overlap.png}
-   \caption{Illustration of window overlapping in 5x5 median filtering}
+   \caption{Illustration of window overlapping in 5x5 median filtering.}
    \label{fig:median_overlap}
 \end{figure}
 
diff --git a/BookGPU/Chapters/chapter4/ch4.tex b/BookGPU/Chapters/chapter4/ch4.tex
index ed3f531..be254fe 100644
--- a/BookGPU/Chapters/chapter4/ch4.tex
+++ b/BookGPU/Chapters/chapter4/ch4.tex
@@ -14,7 +14,7 @@ operation} basically consists of taking the sum of products of elements
 from two 2D functions, letting one of the two functions move over
 every element of the other, producing a third function that is typically
 viewed as a modified version of one of the original functions. To
-begin with, we shall examine non separable or generic convolutions,
+begin with, we shall examine nonseparable or generic convolutions,
 before addressing the matter of separable convolutions. We shall refer
 to $I$ as an $H\times L$ pixel gray-level image and to $I(x,y)$ as the gray-level
 value of each pixel of coordinates $(x,y)$.
@@ -239,7 +239,7 @@ However, our technique requires writing one kernel per mask size, which can be s
 
 \lstinputlisting[label={lst:convoGene8x8pL3},caption=CUDA kernel achieving a $3\times 3$ convolution operation with the mask in symbol memory and direct data fetches in texture memory]{Chapters/chapter4/code/convoGene8x8pL3.cu}
 
-\subsection{Using shared memory to store prefetched data\index{prefetching}.}
+\subsection{Using shared memory to store prefetched data\index{prefetching}}
  \index{memory hierarchy!shared memory}
 A more convenient way of coding a convolution kernel is to use shared memory to perform a prefetching stage of the whole halo before computing the convolution sums.
 This proves to be quite efficient and more versatile, but it obviously generates some overhead because 
@@ -302,7 +302,7 @@ This saves a lot of arithmetic operations, as a generic $n\times n$ convolution
 
 However, besides reducing the operation count, performing a separable convolution also means writing an intermediate image into global memory.
 CPU implementations of separable convolutions often use a single function to perform both 1D convolution stages. To do so, this function reads the input image and actually ouputs the transposed filtered image. 
-Applying this principle to GPUs is not efficient, as outputting the transposed image means non coalescent writes into global memory, generating severe performance loss. Hence the idea of developing two different kernels, one for each of the vertical and horizontal convolutions.
+Applying this principle to GPUs is not efficient, as outputting the transposed image means noncoalescent writes into global memory, generating severe performance loss. Hence the idea of developing two different kernels, one for each of the vertical and horizontal convolutions.
 
 Here, the use of shared memory is the best choice, as there is no overlapping between neighbor windows and thus no possible optimization.
 Moreover, to ensure efficiency, it is important to read the input image from texture memory, which implies an internal GPU data copy between both 1D convolution stages.
@@ -322,7 +322,7 @@ $\mathbf{1024\times 1024}$&0.306 &0.333 &\bf 0.333 &\bf 0.378&\bf 0.404&\bf 0.46
 $\mathbf{2048\times 2048}$&1.094 &1.191 &\bf 1.260 &\bf 1.444&\bf 1.545&\bf 1.722\\\hline
 $\mathbf{4096\times 4096}$&4.262 &4.631 &\bf 5.000 &\bf 5.676&\bf 6.105&\bf 6.736\\\hline
 \end{tabular}}  
-\caption[Performances, in milliseconds, of our generic 8 pixels per thread 1D convolution kernels using shared memory, run  on a C2070 card.]{Performances, in milliseconds, of our generic 8 pixels per thread 1D convolution kernels using shared memory, run  on a C2070 card. Timings include data copy. Bold values correspond to situations where separable-convolution kernels run faster than non separable ones.}
+\caption[Performances, in milliseconds, of our generic 8 pixels per thread 1D convolution kernels using shared memory, run  on a C2070 card.]{Performances, in milliseconds, of our generic 8 pixels per thread 1D convolution kernels using shared memory, run  on a C2070 card. Timings include data copy. Bold values correspond to situations where separable-convolution kernels run faster than nonseparable ones.}
 \label{tab:convoSepSh1}
 \end{table}
 \begin{table}[h]
@@ -337,7 +337,7 @@ $\mathbf{2048\times 2048}$&1598 &1541 &\bf 1503 &\bf 1410&\bf 1364&\bf 1290\\\hl
 $\mathbf{4096\times 4096}$&1654 &1596 &\bf 1542 &\bf 1452&\bf 1400&\bf 1330\\\hline
 \end{tabular}
 }  
-\caption[Throughput values, in megapixel per second, of our generic 8 pixels per thread 1D convolution kernel using shared memory, run on a C2070 card.]{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread 1D convolution kernel using shared memory, run on a C2070 card. Bold values correspond to situations where separable-convolution kernels run faster than non separable ones (data transfer durations are those of Table \ref{tab:memcpy1}).}
+\caption[Throughput values, in megapixel per second, of our generic 8 pixels per thread 1D convolution kernel using shared memory, run on a C2070 card.]{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread 1D convolution kernel using shared memory, run on a C2070 card. Bold values correspond to situations where separable-convolution kernels run faster than nonseparable ones (data transfer durations are those of Table \ref{tab:memcpy1}).}
 \label{tab:convoSepSh2}
 \end{table} 
 \begin{table}[h]
diff --git a/BookGPU/Chapters/chapter5/ch5.tex b/BookGPU/Chapters/chapter5/ch5.tex
index 3a4942d..55e5632 100644
--- a/BookGPU/Chapters/chapter5/ch5.tex
+++ b/BookGPU/Chapters/chapter5/ch5.tex
@@ -13,7 +13,7 @@
 %\end{itemize}
 
 \clearpage
-\section{Software development for heterogeneous architectures}
+\section{Software development for heterogeneous\hfill\break architectures}
 %Our library facilitates massively parallelization through GPU computing and contains components for various iterative strategies such as DC, CG, CGNR, BiCGSTAB and GMRES(m) for solution of large linear systems along with support for preconditioning strategies. The goal is to create a reusable library and framework which provide general components with performance similar to that of a dedicated solver. Preliminary results show that performance overhead can be kept minimal in this new software framework [8].
 
 
@@ -26,7 +26,7 @@
 
 % there is a price/penalty to pay when using libraries, i.e. one doesn't necessarily get the best performance, because the architectural details are not visible to the programmer (keyword: Visibility  [Berkeley dwarf paper]).  However, if the library permits to write own add-ons/kernels (flexibility), this is no longer an issue.
 
-Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast amount of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are now programmable and can be highly effective computing units, they still pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. This is particular true for heterogenous computers, where the architectural differences between the main and coprocessor can be so significant that they require completely different optimization strategies. The cache hierarchy management of CPUs and GPUs are an evident example hereof. In the past, industrial companies were able to boost application performance solely by upgrading their hardware systems, with an overt balance between investment and performance speedup. Today, the picture is different; not only do they have to invest in new hardware, but they also must account for the adaption and training of their software developers. What traditionally used to be a hardware problem, addressed by the chip manufacturers, has now become a software problem for application developers.
+Massively parallel processors, such as graphical processing units (GPUs), have in recent years proven to be effective for a vast number of scientific applications. Today, most desktop computers are equipped with one or more powerful GPUs, offering heterogeneous high-performance computing to a broad range of scientific researchers and software developers. Though GPUs are now programmable and can be highly effective computing units, they still pose challenges for software developers to fully utilize their efficiency. Sequential legacy codes are not always easily parallelized, and the time spent on conversion might not pay off in the end. This is particularly true for heterogeneous computers, where the architectural differences between the main processor and the coprocessor can be so significant that they require completely different optimization strategies. The cache hierarchy management of CPUs and GPUs is an evident example of this. In the past, industrial companies were able to boost application performance solely by upgrading their hardware systems, with an overt balance between investment and performance speedup. Today, the picture is different; not only do they have to invest in new hardware, but they also must account for the adaptation and training of their software developers. What traditionally used to be a hardware problem, addressed by the chip manufacturers, has now become a software problem for application developers.
 
 Software libraries\index{software library}\index{library|see{software library}} can be a tremendous help for developers as they make it easier to implement an application, without requiring special knowledge of the underlying computer architecture and hardware. A library may be referred to as \emph{opaque} when it automatically utilizes the available resources, without requiring specific details from the developer\cite{ch5:Asanovic:EECS-2006-183}. The ultimate goal for a successful library is to simplify the process of writing new software and thus to increase developer productivity. Since programmable heterogeneous CPU/GPU systems are a rather new phenomenon, there is a limited number of established software libraries that take full advantage of such heterogeneous high performance systems, and there are no de facto design standards for such systems either. Some existing libraries for conventional homogeneous systems have already added support for offloading computationally intense operations onto coprocessing GPUs. However, this approach comes at the cost of frequent memory transfers across the low bandwidth PCIe bus.
 
@@ -142,7 +142,7 @@ c_{10} & c_{11} & c_{12} \\
 c_{20} & c_{21} & c_{22} \\
 \end{array}\right].
 \end{eqnarray}
-Matrix components precompute these compact stencil coefficients and provides member functions that computes the finite difference approximation of input vectors. Unit scaled coefficients (assuming grid spacing is one) are computed and stored to be accessible via both CPU and GPU memory. On the GPU, the constant memory space is used for faster memory access~\cite{ch5:cudaguide}. In order to apply a stencil on a non unit-spaced grid, with grid space $\Delta x$, the scale factor $1/(\Delta x)^q$ will have to be multiplied by the finite difference sum, i.e., $(c_{00}u_0 + c_{01}u_1 + c_{02}u_2)/(\Delta x)^q \approx u^{(q)}_0$ as in the first row of \eqref{ch5:eq:stencilmatrix}.
+Matrix components precompute these compact stencil coefficients and provide member functions that compute the finite difference approximation of input vectors. Unit scaled coefficients (assuming grid spacing is one) are computed and stored to be accessible via both CPU and GPU memory. On the GPU, the constant memory space is used for faster memory access~\cite{ch5:cudaguide}. In order to apply a stencil on a nonunit-spaced grid, with grid space $\Delta x$, the scale factor $1/(\Delta x)^q$ will have to be multiplied by the finite difference sum, i.e., $(c_{00}u_0 + c_{01}u_1 + c_{02}u_2)/(\Delta x)^q \approx u^{(q)}_0$ as in the first row of \eqref{ch5:eq:stencilmatrix}.
 
 Setting up a two-dimensional grid of size $N_x \times N_y$ in the unit square and computing the first derivative hereof is illustrated in Listing~\ref{ch5:lst:stencil}. The grid is a vector component, derived from the vector class. It is by default treated as a device object and memory is automatically allocated on the device to fit the grid size. The finite difference approximation as in \eqref{ch5:eq:fdstencil}, is performed via a CUDA kernel behind the scenes during the calls to \texttt{mult} and \texttt{diff\_x}, utilizing the memory hierarchy as the CUDA guidelines prescribe~\cite{ch5:cudaguide,ch5:cudapractice}. To increase developer productivity, kernel launch configurations have default settings, based on CUDA guidelines, principles, and experiences from performance testings, such that the user does not have to explicitly specify them. For problem-specific finite difference approximations, where the built-in stencil operators are insufficient, a pointer to the coefficient matrix \eqref{ch5:eq:stencilcoeffs} can be accessed as demonstrated in Listing \ref{ch5:lst:stencil} and passed to customized kernels.
 \pagebreak
@@ -230,7 +230,7 @@ typedef gpulab::grid<value_type>            vector_type;
 typedef vector_type::property_type          property_type;
 typedef gpulab::integration::forward_euler  time_integrator_type;
 \end{lstlisting}
-The grid is by default treated as a device object, and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a non periodic domain of size $N\times N$ within the unit square with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly.
+The grid is by default treated as a device object, and memory is allocated on the GPU upon initialization of the grid. Setting up the grid can be done via the property type class. The property class holds information about the discrete and physical dimensions, along with fictitious ghost (halo) layers and periodicity conditions. For the heat conduction problem we use a nonperiodic domain of size $N\times N$ within the unit square with no ghost layers. Listing \ref{ch5:lst:gridsetup} illustrates the grid assembly.
 
 \lstset{label=ch5:lst:gridsetup,caption={creating a two-dimensional grid of size \texttt{N} times \texttt{N} and physical dimension $0$ to $1$}}
 \begin{lstlisting}
@@ -281,7 +281,7 @@ Solution time for the heat conduction problem is in itself not very interesting,
 
 
 \subsection{Poisson equation}\index{Poisson equation}
-The Poisson equation is a second-order elliptic differential equation, often encountered in applications within scientific fields such as electrostatics and mechanics. We consider the two- dimensional BVP \index{boundary volume problem} defined in terms of Poisson's equation with homogeneous Dirichlet boundary conditions on the form
+The Poisson equation is a second-order elliptic differential equation, often encountered in applications within scientific fields such as electrostatics and mechanics. We consider the two-dimensional BVP \index{boundary value problem} defined in terms of Poisson's equation with homogeneous Dirichlet boundary conditions of the form
 \begin{subequations}\begin{align}
 \nabla^2 u = f(x,y),& \qquad (x,y) \in \Omega([0,1]\times[0,1]), \\
 u = 0,& \qquad (x,y) \in \partial\Omega.
@@ -409,7 +409,7 @@ Defect correction in combination with multigrid preconditioning enables efficien
 
 \section{Optimization strategies for multi-GPU systems}\label{ch5:sec:multigpu}\index{multi-GPU}
 
-CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip performance. However, the role as a separate coprocessor to the CPU can be a limiting factor for large scale scientific applications, because the GPU memory capacity is fixed and is only in the range of a few gigabytes. In comparison, it is not unusual for a high-end workstation to be equipped with $\sim32$GB of main memory, plus a terabyte hard disk capacity for secondary storage. Therefore, large scale scientific applications that process gigabytes of data, require distributed computations on multiple GPU devices. Multi-GPU desktop computers and clusters can have a very attractive peak performance, but the addition of multiple devices introduces the potential performance bottleneck of slow data transfers across PCIe busses and network interconnections, as illustrated in Figure \ref{ch5:fig:gpu2gputransfer}. The ratio between data transfers and computational work has a significant impact on the possibility for latency hiding and thereby overall application performance.
+CUDA-enabled GPUs are optimized for high memory bandwidth and fast on-chip performance. However, the role as a separate coprocessor to the CPU can be a limiting factor for large-scale scientific applications, because the GPU memory capacity is fixed and is only in the range of a few gigabytes. In comparison, it is not unusual for a high-end workstation to be equipped with $\sim32$GB of main memory, plus a terabyte hard disk capacity for secondary storage. Therefore, large-scale scientific applications that process gigabytes of data require distributed computations on multiple GPU devices. Multi-GPU desktop computers and clusters can have a very attractive peak performance, but the addition of multiple devices introduces the potential performance bottleneck of slow data transfers across PCIe buses and network interconnections, as illustrated in Figure \ref{ch5:fig:gpu2gputransfer}. The ratio between data transfers and computational work has a significant impact on the possibility for latency hiding and thereby overall application performance.
 
 \begin{figure}[!htb]
 \begin{center}
@@ -420,7 +420,7 @@ CUDA enabled GPUs are optimized for high memory bandwidth and fast on-chip perfo
 \end{figure}
 
 
-Developing applications that exploit the full computational capabilities of modern clusters--GPU-based or not--is no trivial matter. Developers are faced with the complexity of distributing and coordinating computations on nodes consisting of many-core CPUs, GPUs and potentially other types of accelerators as well. These complexities give rise to challenges in finding numerical algorithms, that are well suited for such systems, forcing developers to search for novel methods that utilize concurrency.
+Developing applications that exploit the full computational capabilities of modern clusters--GPU-based or not--is no trivial matter. Developers are faced with the complexity of distributing and coordinating computations on nodes consisting of many-core CPUs, GPUs, and potentially other types of accelerators as well. These complexities give rise to challenges in finding numerical algorithms that are well suited for such systems, forcing developers to search for novel methods that utilize concurrency.
 
 To ease software development, we use MPI-2 for message passing and ensure a safe and private communication space by creation of a communicator private to the library during initialization, as recommended by Hoefler and Snir~\cite{ch5:Hoefler2011}. With the addition of remote direct memory access (RDMA) for GPUDirect it is possible to make direct memory transfers between recent generation of GPUs (Kepler), eliminating CPU overhead. Unfortunately there are some strict system and driver requirements to enable these features. Therefore, in the following examples, device memory is first transferred to the CPU main memory before invoking any MPI calls. The library provides device-to-device transfers via template-based routines that work directly with GPU vector objects. This hides the complexity of message passing from the developer and helps developers design new components for multi-GPU execution. 
 
@@ -444,11 +444,11 @@ An alternative to the preconditioning strategy is to have each subdomain query i
 \caption[Domain distribution of a two-dimensional grid into three subdomains.]{Domain distribution of a two-dimensional grid into three subdomains. {\large$\bullet$} and {\scriptsize$\textcolor[rgb]{0.5,0.5,0.5}{\blacksquare}$} represent internal grid points and ghost points, respectively.}\label{ch5:fig:dd2d}
 \end{figure}
 
-Topologies are introduced via an extra template argument to the grid class. A grid is by default not decomposed, because the default template argument is based on a non distribution topology implementation. The grid class is extended with a new member function \texttt{update()}, which makes sure that all ghost points are updated according to the grid topology. The library contains topologies based on one-dimensional and two-dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
+Topologies are introduced via an extra template argument to the grid class. A grid is by default not decomposed, because the default template argument is based on a nondistribution topology implementation. The grid class is extended with a new member function \texttt{update()}, which makes sure that all ghost points are updated according to the grid topology. The library contains topologies based on one-dimensional and two-dimensional distributions of the grid. The number of grid subdomains will be equal to the number of MPI processes executing the program.
 
-If grid ghost layers are updated whenever information from adjacent subdomains is needed, e.g., before a stencil operation, all interior points will be exactly the same as they would be for the non distributed setup. Therefore, one advantage of this approach is that the algorithmic efficiency of an application can be preserved, if grid updates are consistently invoked at the proper times.
+If grid ghost layers are updated whenever information from adjacent subdomains is needed, e.g., before a stencil operation, all interior points will be exactly the same as they would be for the nondistributed setup. Therefore, one advantage of this approach is that the algorithmic efficiency of an application can be preserved, if grid updates are consistently invoked at the proper times.
 
-Distributed performance for the finite difference stencil operation is illustrated in Figure \ref{ch5:fig:multigpu}. The timings include the compute time for the finite difference approximation and the time for updating ghost layers via message passing. It is obvious from Figure \ref{ch5:fig:multigpu:a} that communication overhead dominates for the smallest problem sizes, where the non distributed grid (1 GPU) is fastest. However, communication overhead does not grow as rapidly as computation times, due to the surface-to-volume ratio. Therefore message passing becomes less influential for large problems, where reasonable performance speedups are obtained. Figure \ref{ch5:fig:multigpu:b} demonstrates how the computational performance on multi-GPU systems can be significantly improved for various stencil sizes. With this simple domain decomposition technique, developers are able to implement applications based on heterogeneous distributed computing, without explicitly dealing with message passing and it is still possible to provide user specific implementations of the topology class for customized grid updates.
+Distributed performance for the finite difference stencil operation is illustrated in Figure \ref{ch5:fig:multigpu}. The timings include the compute time for the finite difference approximation and the time for updating ghost layers via message passing. It is obvious from Figure \ref{ch5:fig:multigpu:a} that communication overhead dominates for the smallest problem sizes, where the nondistributed grid (1 GPU) is fastest. However, communication overhead does not grow as rapidly as computation times, due to the surface-to-volume ratio. Therefore, message passing becomes less influential for large problems, where reasonable performance speedups are obtained. Figure \ref{ch5:fig:multigpu:b} demonstrates how the computational performance on multi-GPU systems can be significantly improved for various stencil sizes. With this simple domain decomposition technique, developers are able to implement applications based on heterogeneous distributed computing without explicitly dealing with message passing, and it is still possible to provide user-specific implementations of the topology class for customized grid updates.
 
 \clearpage
 
@@ -486,7 +486,7 @@ The parareal algorithm was first presented in 2001, in a paper by Lions et al.~\
 \begin{center}
 \input{Chapters/chapter5/figures/ParallelInTime.tikz}
 \end{center}
-\caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time subdomain to compute the initial value problem. Consistency at the time subdomain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm}\label{ch5:fig:ParallelInTime}
+\caption[Time domain decomposition.]{Time domain decomposition. A compute node is assigned to each individual time subdomain to compute the initial value problem. Consistency at the time subdomain boundaries is obtained with the application of a computationally cheap integrator in conjunction with the parareal iterative predictor-corrector algorithm.}\label{ch5:fig:ParallelInTime}
 \end{figure}%
 Initial states for these problems are needed and supplied by a simple, less accurate, but computationally cheap sequential integrator. The smaller independent evolution problems can then be solved in parallel. The information, generated during the concurrent solution of the independent evolution problems with accurate propagators and inaccurate initial states, is used in a predictor-corrector fashion in conjunction with the coarse integrator to propagate the solution faster, now using the information generated in parallel. We define the decomposition into $N$ intervals, that is,
 \begin{align}
@@ -509,7 +509,7 @@ Using the defined $\mathcal{F}_{\Delta T}$ and $\mathcal{G}_{\Delta T}$ operator
 \begin{equation}\label{ch5:eq:PARAREAL}
 U_{n}^{k+1}=\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k+1}\right)+\mathcal{\mathcal{F}}_{\Delta T}\left(U_{n-1}^{k}\right)-\mathcal{\mathcal{G}}_{\Delta T}\left(U_{n-1}^{k}\right),\quad U_{0}^{k}=u^{0},
 \end{equation}
-with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. $N$ being the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time-integration component, using a fully distributed work scheduling model, as proposed by Aubanel~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer.  It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time-integration using forward Euler for coarse propagation and fourth order Runge-Kutta for fine propagation could then be defined as in Listings \ref{ch5:lst:parareal}. The number of GPUs used for parallelization depends on the number of MPI processes executing the application.
+with the initial prediction $U^{0}_{n} = \mathcal{G}_{\Delta T}^{n} u^{0}$ for $n=1\ldots N$ and $k=1\ldots K$. Here $N$ is the number of time subdomains, while $K\geq1$ is the number of predictor-corrector iterations applied. The parareal algorithm is implemented in the library as a separate time-integration component, using a fully distributed work scheduling model, as proposed by Aubanel~\cite{ch5:EA10}. The model is schematically presented in Figure \ref{ch5:fig:FullyDistributedCores}. The parareal component hides all communication and work distribution from the application developer.  It is defined such that a user only has to decide what coarse and fine propagators to use. Setting up the type definitions for parareal time-integration using forward Euler for coarse propagation and fourth-order Runge-Kutta for fine propagation could then be done as in Listing~\ref{ch5:lst:parareal}. The number of GPUs used for parallelization depends on the number of MPI processes executing the application.
 \lstset{label=ch5:lst:parareal,caption={assembling a parareal time integrator using forward Euler for coarse propagation and a Runge-Kutta method for fine propagation}}
 \begin{lstlisting}
 typedef gpulab::integration::forward_euler          coarse;
@@ -548,6 +548,8 @@ We can now estimate the speedup, here denoted $\psi$, as the ratio between the c
 \end{align}
 If we additionally assume that the time spent on coarse propagation is negligible compared to the time spent on the fine propagation, i.e., the limit $\frac{\mathcal{C}_\mathcal{G}}{\mathcal{C}_\mathcal{F}}\frac{\delta t}{\delta T}\rightarrow0$, the estimate reduces to $\psi=\frac{N}{k}$. It is thus clear that the number of iterations $k$ for the algorithm to converge poses an upper bound on obtainable parallel efficiency. The number of iterations needed for convergence is intimately coupled with the ratio $R$ between the speed of the fine and the coarse integrators $\frac{\mathcal{C}_\mathcal{F}}{\mathcal{C}_\mathcal{G}}\frac{\delta T}{\delta t}$. Using a slow, but more accurate coarse integrator will lead to convergence in fewer iterations $k$, but at the same time it also makes $R$ smaller. Ultimately, this will degrade the obtained speedup as can be deduced from \eqref{ch5:eq:EstiSpeedBasicPara}, and by Amdahl's law it will also lower the upper bound on possible attainable speedup. Thus, $R$ \emph{cannot} be made arbitrarily large since the ratio is inversely proportional to the number of iterations $k$ needed for convergence. This poses a challenge in obtaining speedup and is a trade-off between time spent on the fundamentally sequential part of the algorithm and the number of iterations needed for convergence. It is particularly important to consider this trade-off in the choice of stopping strategy; a more thorough discussion on this topic is available in \cite{ch5:ASNP12} for the interested reader. Measurements on parallel efficiency are typically observed in the literature to be in the range of 20--50\%, depending on the problem and the number of time subdomains, which is also confirmed by our measurements using GPUs. Here we include a demonstration of the obtained speedup of parareal applied to the two-dimensional heat problem \eqref{ch5:eq:heateq}. 
In Figure \ref{ch5:fig:pararealRvsGPUs} the iterations needed for convergence using the forward Euler method for both fine and coarse integration are presented. $R$ is regulated by changing the time step size for the coarse integrator. In Figure \ref{ch5:fig:pararealGPUs} speedup and parallel efficiency measurements are presented. Notice, when using many GPUs it is advantageous to use a faster, less accurate coarse propagator, despite it requires an extra parareal iteration that increases the total computational complexity.
 
+
+%\clearpage
 \begin{figure}[!htb]
     \setlength\figureheight{0.32\textwidth}
     \setlength\figurewidth{0.35\textwidth}
@@ -561,9 +563,8 @@ If we additionally assume that the time spent on coarse propagation is negligibl
     \label{ch5:fig:pararealRvsGPUs:b}
     }
     \end{center}
-    \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
+    \caption[Parareal convergence properties as a function of $R$ and number of GPUs used.]{Parareal convergence properties as a function of $R$ and number of GPUs used. The error is measured as the relative difference between the purely sequential solution and the parareal solution.}\label{ch5:fig:pararealRvsGPUs}
 \end{figure}
-
 \begin{figure}[!htb]
     \setlength\figureheight{0.32\textwidth}
     \setlength\figurewidth{0.34\textwidth}
@@ -579,7 +580,6 @@ If we additionally assume that the time spent on coarse propagation is negligibl
     \end{center}
     \caption[Parareal performance properties as a function of $R$ and number of GPUs used.]{Parareal performance properties as a function of $R$ and number GPUs used. Notice how the obtained performance depends greatly on the choice of $R$ as a function of the number of GPUs. Executed on test environment 3.}\label{ch5:fig:pararealGPUs}
 \end{figure}
-
 %TODO: Do we make this into a subsubsection:
 %\subsubsection{Library Implementation}\label{ch5:subsec:libimpl}
 %Describe library features. Stopping criteria. Usage of different numerical integrators. Examples including c++ code. Speed-up measurements on test cases.
@@ -599,7 +599,6 @@ If we additionally assume that the time spent on coarse propagation is negligibl
 %    \end{center}
 %    \caption{Parareal performance and convergence properties as a function of $R$ and number GPUs used. $R$ is regulated by the time discretization in the coarse propagator. }\label{ch5:fig:pararealGPUs}
 %\end{figure}
-
 \section{Conclusion and outlook}
 
 Massively parallel heterogeneous systems continue to enter the consumer market, and there has been no identification that this trend will stop for years to come. However, these parallel architectures require software vendors to adjust to new programming models and optimization strategies. Good software libraries are important tools for reducing the time and complexity of adjusting to new architectures, and they provide the user with an intuitive programming interface.
diff --git a/BookGPU/Chapters/chapter6/PartieAsync.tex b/BookGPU/Chapters/chapter6/PartieAsync.tex
index d8da9a4..1617ffb 100644
--- a/BookGPU/Chapters/chapter6/PartieAsync.tex
+++ b/BookGPU/Chapters/chapter6/PartieAsync.tex
@@ -1174,7 +1174,7 @@ account in  the main computations  when it is  relevant. So, the  Newton process
 should be  accelerated a little bit.
 
 We  compare the  performance obtained  with overlapped  Jacobian  updatings and
-non overlapped ones for several problem sizes (see~\Fig{fig:ch6p2aux}).
+nonoverlapped ones for several problem sizes (see~\Fig{fig:ch6p2aux}).
 \begin{figure}[h]
   \centering
   \includegraphics[width=.75\columnwidth]{Chapters/chapter6/curves/recouvs.pdf}
diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex
index 6bd4bee..ff9c80d 100644
--- a/BookGPU/Chapters/chapter8/ch8.tex
+++ b/BookGPU/Chapters/chapter8/ch8.tex
@@ -78,7 +78,7 @@ Set the Best\_Solution to $\emptyset$; \\
     }
 }
 
-\caption{general template of the branch-and-bound algorithm.}
+\caption{general template of the branch-and-bound algorithm}
 \label{ch8:algoBB}
 \end{algorithm}
 
@@ -533,7 +533,7 @@ Using the approach defined in \cite{ch8:Mezmaz_2007}, it is possible to obtain a
 \begin{itemize}
 \item compute, using the approach defined in \cite{ch8:Mezmaz_2007}, a list $L$ of subproblems such as the resolution of $L$ lasts $T$ minutes with a sequential B\&B;
 \item initialize the pool of our sequential B\&B with the subproblems of this list $L$;
-\item solve the subproblems of this pool with our sequential B\&B ,
+\item solve the subproblems of this pool with our sequential B\&B;
 \item get the sequential resolution time $T{cpu}$ and the number of explored subproblems $N{cpu}$;
 \item check that $T{cpu}$ is approximately equal to $T$;
 \item initialize the pool of our GPU B\&B with the subproblems of the list $L$;
diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex
index 0fe38c4..0294f48 100644
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -99,7 +99,7 @@ solutions. The process is repeated until a stopping criterion is
 satisfied. \emph{Evolutionary algorithms}, \emph{swarm
 optimization}, and \emph{ant colonies} fall into this class.
 
-
+\clearpage
 \section{Parallel models for metaheuristics}\label{ch8:sec:paraMeta}
 Optimization problems, whether real-life or academic, are more
 often NP-hard and CPU time and/or memory consuming. Metaheuristics
@@ -188,7 +188,7 @@ solution-level\index{metaheuristics!solution-level parallelism}
 parallel model is problem-dependent.}
 \end{itemize}
 \clearpage
-\section{Challenges for the design of GPU-based metaheuristics}
+\section[Challenges for the design of GPU-based  metaheuristics]{Challenges for the design of GPU-based\hfill\break  metaheuristics}
 \label{ch8:sec:challenges}
 
 Developing GPU-based parallel
@@ -501,7 +501,7 @@ QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
 achieved by the GPU implementation compared
 to the same sequential implementation on CPU using SA-matrix.
 
-\subsection[Implementing population-based metaheuristics\hfill\break on GPUs]{Implementing population-based metaheuristics on GPUs}
+\subsection[Implementing population-based metaheuristics on GPUs]{Implementing population-based metaheuristics on GPUs}
 
 State-of-the-art works dealing with the implementation of
 p-metaheuristics on GPUs generally rely on parallel models and
diff --git a/BookGPU/Makefile b/BookGPU/Makefile
index 401df60..a444ead 100644
--- a/BookGPU/Makefile
+++ b/BookGPU/Makefile
@@ -31,8 +31,10 @@ all:
 	makeindex  ${BOOK}.idx
 	pdflatex ${BOOK}
 	pdflatex ${BOOK}
+	cp BookGPU.toc  BookGPU.toc_old   
 	cp BookGPU.toc_new  BookGPU.toc   #copy the corrected toc
 
+
 	pdflatex ${BOOK}
 #	dvipdf ${BOOK}
 
-- 
2.39.5