ch15

[book_gpu.git] / BookGPU / Chapters / chapter15 / ch15.tex
diff --git a/BookGPU/Chapters/chapter15/ch15.tex b/BookGPU/Chapters/chapter15/ch15.tex

index 9b0bf27bfa807183128f49b1abd9dc086e95f8b9..7e25220e38354c0472ab23827a8bf24f8ecbb005 100644 (file)
--- a/BookGPU/Chapters/chapter15/ch15.tex
+++ b/BookGPU/Chapters/chapter15/ch15.tex
@@ -1,13 +1,14 @@
-\chapterauthor{Pierre Fortin}{Laboratoire d'Informatique de Paris 6, University Paris 6}
-\chapterauthor{Rachid Habel}{T\'el\'ecom SudParis}
-\chapterauthor{Fabienne J\'ez\'equel}{Laboratoire d'Informatique de Paris 6, University Paris 6}
-\chapterauthor{Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, University Paris 6}
+\chapterauthor{Rachid Habel}{T\'el\'ecom SudParis, France}
+\chapterauthor{Pierre Fortin, Fabienne J\'ez\'equel, and Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, France}
+
+%\chapterauthor{Fabienne J\'ez\'equel}{Laboratoire d'Informatique de Paris 6, University Paris 6}
+%\chapterauthor{Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, University Paris 6}
  \chapterauthor{Stan Scott}{School of Electronics, Electrical Engineering \& Computer Science,
-The Queen's University of Belfast}
+The Queen's University of Belfast, United Kingdom}
  
  %\newcommand{\fixme}[1]{{\bf #1}}
  
-\chapter[Numerical validation and performance optimization on GPUs in atomic physics]{Numerical validation and performance optimization on GPUs of an application in atomic physics} 
+\chapter[Numerical validation and GPU performance in atomic physics]{Numerical validation and performance optimization on GPUs of an application in atomic physics} 
  \label{chapter15}
  
  \section{Introduction}\label{ch15:intro}
@@ -19,32 +20,32 @@ data transfers between CPU memory and GPU memory. Thus, to have good
  performance on
  GPUs, applications should be coarse-grained and have a high arithmetic
  intensity 
-($i.e.$ the ratio of arithmetic operations to memory operations). 
+(i.e., the ratio of arithmetic operations to memory operations). 
  Another important aspect of GPU programming is that floating-point
  operations are preferably performed in single precision\index{precision!single precision}, if the
   validity of results is not impacted by that format.   
  The GPU compute power for floating-point operations is indeed greater in
  single precision\index{precision!single precision} than in double precision\index{precision!double precision}.  
  The peak performance ratio between single precision\index{precision!single precision} and double
-precision varies for example for NVIDIA GPUs from $12$ for the first Tesla
+precision varies, for example, for NVIDIA GPUs from $12$ for the first Tesla
  GPUs (C1060), 
-to $2$ for the Fermi GPUs (C2050 and C2070)  
+to $2$ for the Fermi GPUs (C2050 and C2070),   
  and to $3$ for the latest Kepler architecture (K20/K20X).  
  As far as AMD GPUs are concerned, the latest AMD GPU (Tahiti HD 7970)
  presents a ratio of $4$.  
  Moreover, GPU internal memory accesses and CPU-GPU data transfers are
-faster in single precision\index{precision!single precision} than in double precision\index{precision!double precision},  
+faster in single precision\index{precision!single precision} than in double precision\index{precision!double precision} 
  because of the different format lengths. 
  
  This chapter describes the deployment on GPUs of PROP, a program of the
  2DRMP~\cite{FARM_2DRMP,2DRMP} suite which models electron collisions
  with H-like atoms and ions at intermediate energies. 2DRMP operates successfully on serial
-computers, high performance clusters and supercomputers.  The primary
+computers, high performance clusters, and supercomputers.  The primary
  purpose of the PROP program is to propagate a global
-R-matrix~\cite{Burke_1987}, $\Re$, in the two-electron configuration
+$R$-matrix~\cite{Burke_1987}, $\Re$, in the two-electron configuration
  space.
  The propagation needs to be performed for all collision energies, 
-for instance hundreds of energies,
+for instance, hundreds of energies,
  which are independent.
  Propagation equations are dominated by matrix multiplications involving sub-matrices of $\Re$.
  However, the matrix multiplications are not
@@ -53,10 +54,10 @@ columns and increases in size as the propagation proceeds \cite{VECPAR}.
  
  In a preliminary investigation PROP was selected by GENCI\footnote{GENCI: Grand Equipement National
    de Calcul Intensif, \url{www.genci.fr}} and
-CAPS\footnote{CAPS is a software company providing products and solutions
-  for manycore application programming and deployment,
-  \url{www.caps-entreprise.com}},
-following their first call for projects in 2009-2010 
+CAPS,\footnote{CAPS is a software company providing products and solutions
+  for many-core application programming and deployment,
+  \url{www.caps-entreprise.com}} 
+following their first call for projects in 2009--2010 
  aimed at
  deploying applications on hybrid systems based on GPUs.
  First CAPS  
@@ -65,11 +66,11 @@ For matrix products the GPU performance gain over CPU increases indeed
  with the matrix size, since the 
  CPU-GPU transfer overhead becomes less significant and since CPUs are
  still more efficient for fine computation grains. 
-Then, using HMPP\index{HMPP}\footnote{
-HMPP or {\em CAPS compiler}, see: \url{www.caps-entreprise.com/hmpp.html}},
+Then, using HMPP,\index{HMPP}\footnote{
+HMPP (Hybrid Multicore Parallel Programming) or {CAPS compiler}, see: \url{www.caps-entreprise.com/hmpp.html}} 
  a commercial 
  hybrid and parallel compiler, CAPS 
-developed a version of PROP, in
+developed a version of PROP in 
  which matrix multiplications are performed on
  the GPU or the CPU, depending on the matrix size.
  Unfortunately this partial GPU implementation of PROP does not offer
@@ -85,7 +86,7 @@ computation code of PROP on
  GPUs to avoid the overhead generated by
  data transfers 
  and we propose successive improvements
-(including a specific one to the Fermi architecture) 
+(including one specific to the Fermi architecture) 
  in order to optimize the GPU code.
  
  
@@ -93,10 +94,10 @@ in order to optimize the GPU code.
  
  \section{2DRMP and the PROP program}
  \label{s:2DRMP_PROP}
-\subsection{Principles of R-matrix propagation}
-2DRMP~\cite{FARM_2DRMP,2DRMP} is part of the CPC library\footnote{CPC:
+\subsection{Principles of $R$-matrix propagation}
+2DRMP~\cite{FARM_2DRMP,2DRMP} is part of the CPC library.\footnote{CPC:
  Computer Physics Communications, 
-\url{http://cpc.cs.qub.ac.uk/}}.  
+\url{http://cpc.cs.qub.ac.uk/}} 
  It is a suite of seven 
  programs aimed at creating virtual experiments on high performance and grid
  architectures to enable the study of electron scattering from H-like
@@ -121,22 +122,22 @@ $R$-matrix, $\Re$, from sector to sector across the internal region, as shown in
  \begin{figure}[h]
  \begin{center}
  \includegraphics*[width=0.8\linewidth]{Chapters/chapter15/figures/Domain.pdf} 
-\caption{\label{domain} Propagation of the R-matrix from domain D to domain D'.}
+\caption{\label{domain} Propagation of the $R$-matrix from domain $D$ to domain $D'$.}
  \end{center}
  \end{figure}
  
  We consider the general situation in
  Fig.~\ref{domain} where we assume that we already know
  the global $R$-matrix, $\Re^{I}$, associated with the boundary defined
-by edges 5, 2, 1 and 6 
+by edges 5, 2, 1, and 6 
  in domain $D$ and we wish to
-evaluate the new global $R$-matrix, $\Re^{O}$, associated with edges 5, 3, 4 and 6 
+evaluate the new global $R$-matrix, $\Re^{O}$, associated with edges 5, 3, 4, and 6 
  in domain $D'$ following propagation across subregion $d$.
  Input edges are denoted by I (edges 1 and~2), output edges by O (edges 3 and 4) and
  common edges by X (edges 5 and~6).
  Because of symmetry, only the lower half of domains $D$ and $D'$ has to be considered. 
  The global $R$-matrices, $\Re^{I}$ in domain $D$ and $\Re^{O}$ in
-domain $D'$, can be written as:
+domain $D'$, can be written as 
  \begin{equation}
  \Re^{I} = \left(\begin{array}{cc}
        \Re_{II}^{I} & \Re_{IX}^{I}\\
@@ -152,7 +153,7 @@ domain $D'$, can be written as:
  
  
  
-From the set of local $R$-matrices, $\mathbf{R}_{ij}$ ($i,j\in \{1,2,3,4\}$)
+From the set of local $R$-matrices, $\mathbf{R}_{ij}$ ($i,j\in \{1,2,3,4\}$), 
  associated 
  with subregion $d$, we can define
  \begin{subequations}
@@ -194,7 +195,7 @@ two linear systems are solved.
  
  \medskip
  
-While equations (\ref{eq1})-(\ref{eq4}) can be applied to the
+While equations (\ref{eq1})--(\ref{eq4}) can be applied to the
  propagation across a general subregion two special situations should be
  noted: propagation across a diagonal subregion and propagation across
  a subregion bounded by the $r_{1}$-axis at the beginning of a new
@@ -203,7 +204,7 @@ strip.
  In the case of a diagonal subregion, from symmetry considerations,
  edge 2 is identical to edge 1 and edge 3 is identical to edge~4. 
  Accordingly, with only one input edge and one output edge equations
-(\ref{eqaa})-(\ref{eqdd}) become: 
+(\ref{eqaa})--(\ref{eqdd}) become 
  \begin{subequations}
  \begin{eqnarray}
  \mathbf{r}_{II} = 2\mathbf{R}_{11}, \ 
@@ -236,24 +237,24 @@ diagonal.
  \begin{center}
  \begin{tabular}{|c|c|c|c|c|c|}
   \hline
- \multirow{2}{0.09\linewidth}{\centering Data set} &
+ \multirow{2}{0.09\linewidth}{\centering Data Set} &
   \multirow{2}{0.15\linewidth}{\centering
-  Local $R$-\\matrix size} &
+  Local $R$-\\Matrix Size} &
   \multirow{2}{0.07\linewidth}{\centering Strips} &
   \multirow{2}{0.09\linewidth}{\centering Sectors} &
- \multirow{2}{0.19\linewidth}{\centering Final global \\$R$-matrix size} &
- \multirow{2}{0.15\linewidth}{\centering Scattering\\energies} \\
+ \multirow{2}{0.19\linewidth}{\centering Final Global \\$R$-Matrix Size} &
+ \multirow{2}{0.15\linewidth}{\centering Scattering\\Energies}\\
    & & & & & \\
    \hline
-  Small & 90x90  & 4 & 10 & 360x360 & 6\\
+  Small & $90\times90$  & 4 & 10 & $360\times360$ & 6\\
   \hline
-  Medium  & 90x90  & 4 & 10 & 360x360 & 64\\
+  Medium  & $90\times90$  & 4 & 10 & $360\times360$ & 64\\
   \hline
-  Large  & 383x383  & 20 & 210 &  7660x7660 & 6\\
+  Large  & $383\times383$  & 20 & 210 &  $7660\times7660$ & 6\\
   \hline
-  Huge  & 383x383  & 20 & 210 &  7660x7660 & 64\\ \hline
+  Huge  & $383\times383$  & 20 & 210 &  $7660\times7660$ & 64\\ \hline
  \end{tabular}
-\caption{\label{data-sets}Characteristics of four data sets}
+\caption{\label{data-sets}Characteristics of four data sets.}
  \end{center}
  \end{table}
  
@@ -302,7 +303,7 @@ for the next evaluation.
    From $\Re^{I}$ and local $R$-matrices, compute $\Re^{O}$\;
   $\Re^{O}$ becomes $\Re^{I}$ for the next sector\;
   }
- Compute physical $R$-Matrix \;
+ Compute physical $R$-Matrix\;
  }
  %\end{algorithmic}
  \end{algorithm}
@@ -317,15 +318,15 @@ $R$-matrix is computed and stored into an output file.
  In the PROP program, sectors are characterized into four types,
  depending on the computation performed: 
  \begin{itemize}
-\item the starting sector (labeled 0 in Fig.~\ref{prop})
-\item the axis sectors (labeled 1, 3 and 6 in Fig.~\ref{prop})
-\item the diagonal sectors (labeled 2, 5 and 9 in Fig.~\ref{prop})
-\item the off-diagonal sectors (labeled 4, 7 and 8 in Fig.~\ref{prop}).
+\item the starting sector (labeled 0 in Fig.~\ref{prop}); 
+\item the axis sectors (labeled 1, 3, and 6 in Fig.~\ref{prop}); 
+\item the diagonal sectors (labeled 2, 5,  and 9 in Fig.~\ref{prop}); 
+\item the off-diagonal sectors (labeled 4, 7, and 8 in Fig.~\ref{prop}).
  \end{itemize}
  
  
-The serial version of PROP is implemented in Fortran~90 and uses
-for linear algebra operations BLAS\index{BLAS} and LAPACK\index{LAPACK} routines
+The serial version of PROP is implemented in Fortran~90 and 
+for linear algebra operations uses BLAS\index{BLAS} and LAPACK\index{LAPACK} routines
  which are fully optimized for x86 architecture.
  This 
  program 
@@ -347,7 +348,7 @@ MPI.
  
  In order to handle larger matrices, and thus obtain better GPU  speedup, CAPS  
  recast equations (\ref{eq1}) to (\ref{eq4}) into one equation.
-The output $R$-matrix $\Re^{O}$ defined by equation~(\ref{eq:RI_RO}) is now computed as follows. 
+The output $R$-matrix $\Re^{O}$ defined by equation~(\ref{eq:RI_RO}) is now computed as follows: 
  \begin{equation}\label{eq_CAPS_1}
  \Re^{O} = \Re^{O^{\ \prime}} + U A^{-1} V, 
  \end{equation}
@@ -370,39 +371,39 @@ system $AW=V$ is solved.
  This reimplementation of PROP reduces the number of equations to be
  solved and the number of matrix copies for evaluating each sector.
  For instance, for an off-diagonal sector, 
-copies fall from 22 to 5, matrix multiplications from 4 to~1 and calls
+copies fall from 22 to 5, matrix multiplications from 4 to~1, and calls
  to a linear equation solver from 2 to 1. 
  
  To implement this version, CAPS 
  used HMPP\index{HMPP}, a 
  commercial hybrid and parallel compiler, 
-based on compiler directives like the new OpenACC\index{OpenACC} standard\footnote{See: \url{www.openacc-standard.org}}.  
+based on compiler directives such as the new OpenACC\index{OpenACC} standard.\footnote{See: \url{www.openacc-standard.org}} 
  If the matrices are large enough (the limit sizes are experimental parameters), 
  they are multiplied on the GPU, otherwise on the CPU. 
  CAPS 
- used the MKL BLAS\index{BLAS} implementation on an Intel Xeon
+ used Intel's MKL (Math Kernel Library) BLAS\index{BLAS} implementation on an Intel Xeon
  x5560 quad core CPU (2.8 GHz) 
  and the CUBLAS\index{CUBLAS} library (CUDA 2.2) on one Tesla C1060 GPU. 
  On the large data set (see Table~\ref{data-sets}), CAPS 
   obtained a speedup of 1.15 for the GPU 
-version over the CPU one (with multi-threaded MKL calls on the four
+version over the CPU one (with multithreaded MKL calls on the four
  CPU cores). This limited gain in performance is mainly
  due to the use of double precision\index{precision!double precision} computation 
  and to the small or medium sizes of most matrices.
  For these matrices, the computation gain on  
-the GPU is indeed
+the GPU is indeed 
  strongly affected by the overhead 
  generated by transferring these matrices from  
  the CPU memory to the GPU memory to perform each matrix multiplication and then
  transferring the result back to the CPU memory. 
  
-Our goal is to speedup PROP more significantly by porting the whole
+Our goal is to speed up PROP more significantly by porting the whole
  code to the GPU and therefore avoiding 
  the 
  intermediate data transfers between
  the host (CPU) and the GPU. We will also study the
  stability of PROP in single precision\index{precision!single precision} because 
-single precision\index{precision!single precision} computation is faster on the GPU  
+single-precision\index{precision!single precision} computation is faster on the GPU  
  and CPU-GPU data transfers are twice as fast as those performed in
  double precision\index{precision!double precision}. 
  
@@ -439,16 +440,16 @@ double precision\index{precision!double precision}.
  \end{comment}
  
  
-Floating-point input data, computation and output data of PROP are 
-originally in double precision\index{precision!double precision} format.
+Floating-point input data, computation, and output data of PROP are 
+originally in double-precision\index{precision!double precision} format.
  PROP produces a standard $R$-matrix H-file \cite{FARM_2DRMP}
- and a collection of Rmat00X files (where X
-ranges from 0 to the number of scattering energies - 1)
-holding the physical R-matrix for each 
+ and a collection of {Rmat}00X files (where X
+ranges from 0 to the number of scattering energies $-$ 1)
+holding the physical $R$-matrix for each 
  energy.
-The  H-file  and the Rmat00X files are binary input files of the FARM program \cite{FARM_2DRMP}
+The  H-file  and the {Rmat}00X files are binary input files of the FARM program \cite{FARM_2DRMP}
  (last program of the 2DRMP suite).
-Their text equivalent are the  prop.out 
+Their text equivalents are the  prop.out 
  and the prop00X.out files. 
  To study the validity of PROP results in  single precision\index{precision!single precision},
  first,
@@ -464,7 +465,7 @@ into files in double precision\index{precision!double precision}.
  \begin{figure}[h]
  \begin{center}
  \includegraphics*[width=0.9\linewidth]{Chapters/chapter15/figures/error.pdf} 
-\caption{\label{fig:sp-distrib} Error distribution for medium case in single precision\index{precision!single precision}}
+\caption{\label{fig:sp-distrib} Error distribution for medium case in single precision.\index{precision!single precision}}
  \end{center}
  \end{figure}
  
@@ -476,23 +477,26 @@ error distribution is
  given in Fig.~\ref{fig:sp-distrib}. 
  We focus on the largest errors. 
  \begin{itemize}
-\item Errors greater than 100: the only impacted value is of order 1.E-6
+\item Errors greater than $10^{2}$: %100: 
+the only impacted value is of order $10^{-6}$ %1.E-6
  and is negligible compared to the other ones 
  in the same prop00X.out file.
  
-\item Errors between 1 and 100: the values corresponding to the
-  largest errors are of order 1.E-3 and are negligible compared to
-  the majority of the other values which range between 1.E-2 and
-  1.E-1.
+\item Errors between 1 and $10^{2}$: %100: 
+the values corresponding to the
+  largest errors are of order $10^{-3}$ %1.E-3 
+and are negligible compared to
+  the majority of the other values which range between $10^{-2}$ %1.E-2 
+and $10^{-1}$. 
  
-\item Errors between 1.E-2 and 1: the largest errors ($\ge$ 6\%)
-  impact values the order of magnitude of which is at most 1.E-1.
+\item Errors between $10^{-2}$ %1.E-2 
+and 1: the largest errors ($\ge$ 6\%)
+  impact values the order of magnitude of which is at most $10^{-1}$. %1.E-1.
    These values are negligible. 
    Relative errors of approximately 5\% impact values the order of
-  magnitude of which is at most 1.E2. 
+  magnitude of which is at most $10^{2}$. %1.E2. 
    For instance, the value 164 produced by the reference version of
-  PROP becomes 172 in the single precision\index{precision!single precision} version.
-
+  PROP becomes 172 in the single precision\index{precision!single precision} version. 
  \end{itemize}
  
  To study the impact of the single precision\index{precision!single precision} version of PROP on the
@@ -505,72 +509,76 @@ transitions
  Table~\ref{sp-farm} shows that all cross-section files are impacted by
  errors.  Indeed in the  {2p4d} file,  four relative errors are 
  greater than one and the maximum relative error is 1.60. 
-However the largest errors impact negligible values. For example, the maximum
-error (1.60) impacts a reference value which is 4.5E-4.  The largest 
+However, the largest errors impact negligible values. For example, the maximum
+error (1.60) impacts a reference value which is $4.5\ 10^{-4}$.  %4.5E-4.  
+The largest 
  values are impacted by low errors. For instance, the maximum value
-(1.16) is impacted by a relative error of the order 1.E-3. 
+(1.16) is impacted by a relative error of the order $10^{-3}$. %1.E-3. 
  
  \begin{table}[t] 
  \begin{center}
  \begin{tabular}{|c|c||c|c|} \hline
-  file & largest relative error & file & largest relative error\\ \hline
+ File & Largest Relative Error & File & Largest Relative Error\\ \hline
   {1s1s} & 0.02& {1s3p} & 0.11  \\ \hline
   {1s2s} & 0.06 &  {1s3d} &  0.22 \\ \hline
   {1s2p} &  0.08 & {1s4s} &  0.20  \\ \hline
   {1s3s} &  0.17 &2p4d & 1.60  \\ \hline
  \end{tabular}
-\caption{\label{sp-farm}Impact  on FARM  of the single precision\index{precision!single precision} version of PROP}
+\caption{\label{sp-farm}Impact  on FARM  of the single precision\index{precision!single precision} version of PROP.}
  \end{center}
  \end{table}
  
  To examine in more detail the impact of PROP on FARM, 
-cross sections above the ionization threshold (1 Ryd)
+cross-sections above the ionization threshold (1 Ryd)
  are compared in single and
  double precision\index{precision!double precision}  for 
-transitions amongst the 1s, \dots 4s, 2p, \dots 4p, 3d, 4d target states.  
+transitions among the 1s, \dots 4s, 2p, \dots 4p, 3d, 4d target states.  
  This comparison is carried out by generating 45 plots.  In all the
- plots, results in single and double precision\index{precision!double precision} match except for few
+ plots, results in single and double precision\index{precision!double precision} match except for a few
   scattering energies which are very close to pseudo-state thresholds. 
-For example Fig.~\ref{1s2p} and \ref{1s4d} present the scattering energies corresponding to the
+For example, Fig.~\ref{1s2p} and \ref{1s4d} present the scattering energies corresponding to the
  {1s2p} and {1s4d} cross-sections computed in single and double precision\index{precision!double precision}.   For some cross-sections, 
-increasing a threshold parameter from 1.E-4 to 1.E-3 in the FARM
-program
-results in energies close to threshold being avoided
- and therefore
+increasing a threshold parameter from $10^{-4}$ %1.E-4 
+to $10^{-3}$ %1.E-3 
+in the FARM
+program 
+results in energies close to threshold being avoided, 
+ and therefore, 
  the cross-sections in double and single precision\index{precision!single precision} match more
  accurately. 
  This is the case for instance for cross-section 1s2p (see Fig.~\ref{1s2p3}). 
  However for other cross-sections (such as 1s4d) some problematic energies remain even if the 
  threshold parameter in  the FARM
-program is increased to 1.E-3 (see Fig.~\ref{1s4d3}).  A higher 
+program is increased to  $10^{-3}$ %1.E-3
+ (see Fig.~\ref{1s4d3}).  A higher 
  threshold parameter would be required for such cross-sections. 
  
  \begin{figure}[t]
  \centering
-\subfigure[threshold = 1.E-4]{ 
+\subfigure[threshold = $10^{-4}$]{ %1.E-4]{ 
  \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p.pdf}
     \label{1s2p}
   }
-\subfigure[threshold = 1.E-3]{
+\subfigure[threshold = $10^{-3}$]{ %1.E-3]{
  \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p3.pdf}
   \label{1s2p3}
   }
  \label{fig:1s2p_10sectors}
-\caption{1s2p cross-section, 10 sectors}
+\caption{1s2p cross-section, 10 sectors.}
  \end{figure}
  
  \begin{figure}[t]
  \centering
-\subfigure[threshold = 1.E-4]{
+\subfigure[threshold = $10^{-4}$]{ %1.E-4]{
  \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d.pdf}
     \label{1s4d}
   }
-\subfigure[threshold = 1.E-3]{
+\subfigure[threshold = $10^{-3}$]{ %1.E-3]{
  \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d3.pdf}
   \label{1s4d3}
   }
  \label{fig:1s4d_10sectors}
-\caption{1s4d cross-section, 10 sectors}
+\caption{1s4d cross-section, 10 sectors.}
  \end{figure}
  
  As a conclusion, the medium case study shows that the execution of
@@ -578,8 +586,8 @@ PROP in single precision\index{precision!single precision} leads to a few inexac
  be computed by the FARM program for some cross-sections.
  Thanks to a suitable threshold parameter in the FARM program these problematic energies may possibly 
  be skipped. 
-Instead of investigating deeper the choice of such a parameter for the medium case, we analyze the 
-single precision\index{precision!single precision} computation  in a more
+Instead of investigating more deeply the choice of such a parameter for the medium case, we analyze the 
+single-precision\index{precision!single precision} computation  in a more
  realistic case in Sect.~\ref{huge}. 
  \begin{comment}
  The conclusion of the medium case study is that running PROP in single
@@ -597,18 +605,18 @@ realistic case in Sect.~\ref{huge}.
  \begin{figure}[t] 
    \centering
  \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
-\caption{\label{1s2pHT}1s2p cross-section, threshold = 1.E-4, 210 sectors}
+\caption{\label{1s2pHT}1s2p cross-section, threshold = $10^{-4}$, 210 sectors.} %1.E-4, 210 sectors.}
  \end{figure}
  
  \begin{figure}[t] 
    \centering
  \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
-\caption{\label{1s4dHT}1s4d cross-section, threshold = 1.E-4, 210 sectors}
+\caption{\label{1s4dHT}1s4d cross-section, threshold = $10^{-4}$, 210 sectors.} %1.E-4, 210 sectors.}
  \end{figure}
  
  We study here the impact on FARM of the PROP program run in
  single precision\index{precision!single precision} for the huge case (see Table~\ref{data-sets}).
-The cross-sections
+The cross-sections 
  corresponding to all
  atomic target states 1s \dots 7i are explored, which
  leads to 
@@ -617,10 +625,10 @@ It should be noted that in this case, over the same energy range above the ioniz
  As expected, all the plots exhibit large differences between single and double
  precision cross-sections. 
  For example Fig.~\ref{1s2pHT} and  \ref{1s4dHT} present the 1s2p and 1s4d cross-sections computed in
-single and double precision\index{precision!double precision} for the huge case.  
+single---and double---precision\index{precision!double precision} for the huge case.  
  We can conclude that PROP in single precision\index{precision!single precision} gives invalid results 
  for realistic simulation cases above the ionization threshold.
-Therefore the  deployment of PROP on GPU, described in Sect.~\ref{gpu-implem},
+Therefore, the  deployment of PROP on GPU, described in Sect.~\ref{gpu-implem},
  has been carried out in double precision\index{precision!double precision}. 
  
  \section{Towards a complete deployment of PROP on GPUs} 
@@ -648,12 +656,12 @@ enable calls to CUDA kernels from PROP routines.
  \end{figure}
  
  As mentioned in Algorithm~\ref{prop-algo}, evaluating a sector
-mainly consists in constructing local $R$-matrices and in computing
+mainly consists of constructing local $R$-matrices and computing
  one output $R$-matrix, $\Re^{O}$. In this first step of the porting
  process, referred to as GPU V1\label{gpuv1},
-we only consider the computation of $\Re^{O}$ on the GPU.
+we consider only the computation of $\Re^{O}$ on the GPU.
  We distinguish the following six steps, related to equations
-(\ref{eq_CAPS_1}), (\ref{eq_CAPS_2}) and (\ref{eq_CAPS_3}), and illustrated in
+(\ref{eq_CAPS_1}), (\ref{eq_CAPS_2}), and (\ref{eq_CAPS_3}), and illustrated in
  Fig.~\ref{offdiagonal} for an off-diagonal sector.
  
  \begin{description}
@@ -661,7 +669,7 @@ Fig.~\ref{offdiagonal} for an off-diagonal sector.
    to temporary arrays ($A$, $U$, $V$) and to $\Re^{O}$.
    These copies, along with possible scalings or transpositions, are
    implemented as CUDA kernels which can be applied to two
-  matrices of any size and starting at any offset. 
+  matrices of any size starting at any offset. 
    Memory accesses are coalesced\index{coalesced memory accesses} \cite{CUDA_ProgGuide} in order to
    provide the best performance for such memory-bound kernels.
  \item[Step 2] (``Local copies''):~data are copied from
@@ -673,7 +681,7 @@ Fig.~\ref{offdiagonal} for an off-diagonal sector.
  \item[Step 3] (``Linear system solving''):~matrix $A$ is factorized
    using the MAGMA DGETRF\index{MAGMA functions!DGETRF} 
     routine and the result is stored in-place.
-\item[Step 4] (``Linear system solving'' cont.):~the matrix system
+\item[Step 4] (``Linear system solving,'' cont.):~the matrix system
   of linear equations  $AW$ = $V$ is solved using the MAGMA DGETRS\index{MAGMA functions!DGETRS} 
  routine. The  solution is stored in matrix $V$.
  \item[Step 5] (``Output matrix product''):~matrix $U$
@@ -692,7 +700,7 @@ symmetry considerations \cite{2DRMP}.
  
  \subsection{Constructing the local $R$-matrices on GPU}
  
-\begin{figure}[t]
+\begin{figure}[h]
   \centering
    \includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/amplitudes_nb.pdf} 
   \caption{\label{amplitudes} Constructing the local $R$-matrix R34
@@ -700,8 +708,8 @@ symmetry considerations \cite{2DRMP}.
   amplitude array associated with edge~3.}
  \end{figure}
  
-Local $R$-matrices are constructed using two three dimensional arrays,
-$i$ and $j$. Each three dimensional array contains four
+Local $R$-matrices are constructed using two three-dimensional arrays,
+$i$ and $j$. Each three-dimensional array contains four
  matrices corresponding to the surface amplitudes associated with the
  four edges of a sector. Those matrices are named {\em amplitude arrays}.
   $j$ amplitude arrays are read from data files and $i$ amplitude arrays
@@ -718,8 +726,8 @@ GPU memory and the required matrix multiplications are performed on
  the GPU (via CUBLAS\index{CUBLAS} routines).
  
  
-The involved matrices having medium sizes (either $3066 \times 383$ or
-$5997 \times 383$),
+The involved matrices have medium sizes (either $3066 \times 383$ or
+$5997 \times 383$), and
  performing these matrix multiplications
  on the GPU is expected to be faster than on the CPU.
  However, this implies a greater communication volume
@@ -747,7 +755,7 @@ Therefore, we save the transfer of four $i$ amplitude arrays on
  each sector by transferring  only this 1D array of scaling factors. 
  Moreover, scaling $j$ amplitude arrays is expected to be faster on the
  GPU than on the CPU, thanks to the massively parallel architecture of
-the GPU and thanks to its higher internal memory bandwidth.
+the GPU and its higher internal memory bandwidth.
  
  \subsection{Using double-buffering\index{double-buffering} to overlap I/O and computation}
  
@@ -773,7 +781,7 @@ intermediate.
  The I/O times are roughly constant among all strips.
  The evaluation time is equivalent to the I/O
  time for the first sectors. But this evaluation time grows 
-linearly with the strip number, and rapidly exceeds the I/O 
+linearly with the strip number and rapidly exceeds the I/O 
  time.
  
  It is thus interesting to use a double-buffering\index{double-buffering} technique to overlap the 
@@ -798,9 +806,9 @@ thread mechanisms.
  
  
  \subsection{Matrix padding\index{padding}}
-The CUBLAS DGEMM 
-performance and the MAGMA DGETRF\index{MAGMA functions!DGETRF}/DGETRS\index{MAGMA functions!DGETRS} 
-performance is reduced when the sizes (or
+The  MAGMA DGETRF\index{MAGMA functions!DGETRF}/DGETRS\index{MAGMA functions!DGETRS} 
+performance and the CUBLAS DGEMM performance 
+are reduced when the sizes (or
  the leading dimensions) of the matrix are not multiples of the inner blocking size \cite{NTD10a}.
  This inner blocking size can be 32 or 64, depending on the computation
  and on the underlying  
@@ -810,33 +818,31 @@ the matrices are therefore padded with $0.0$ (and $1.0$ on the diagonal for the
  so that their sizes are 
  multiples of 64.
  This corresponds indeed to the optimal size for the matrix product on the
-Fermi architecture \cite{NTD10b}. And as far as linear system solving is
+Fermi architecture \cite{NTD10b}. And as far as linear system solving is 
  concerned, all the matrices have sizes which are multiples of 383: we
  therefore use padding\index{padding} to obtain multiples of 384 (which are 
  again  multiples of 64). 
  It can be noticed that this padding\index{padding} has to be performed dynamically
  as the matrices increase in size during the propagation 
  (when possible the
- maximum required storage space is however allocated only once in the
+ maximum required storage space is, however, allocated only once in the
   GPU memory). 
  
-
-
  \section{Performance results}
  \subsection{PROP deployment on GPU}
  
-\begin{table*}[ht]
+\begin{table}[ht]
  \begin{center}
  \begin{tabular}{|c||c|c||}
   \hline
-  PROP version & \multicolumn{2}{c|}{Execution time} \\
+  PROP Version & \multicolumn{2}{c|}{Execution Time} \\
    \hline
    \hline
-  CPU version: 1  core & \multicolumn{2}{c|}{201m32s} \\
+  CPU Version: 1 Core & \multicolumn{2}{c|}{201m32s} \\
    \hline
-  CPU version: 4  cores &  \multicolumn{2}{c|}{113m28s} \\
+  CPU Version: 4 Cores &  \multicolumn{2}{c|}{113m28s} \\
    \hline \hline
-GPU version  & C1060 & C2050 \\
+GPU Version  & C1060 & C2050 \\
    \hline\hline
    GPU V1 (\S~\ref{gpuv1}) & 79m25s & 66m22s  \\
    \hline
@@ -849,38 +855,37 @@ GPU version  & C1060 & C2050 \\
    GPU V5 (\S~\ref{gpuv5}) & 24m27s & 12m39s  \\
    \hline
  \end{tabular}
-\caption{\label{table:time} 
-Execution time of PROP on CPU and GPU}
  \end{center}
-\end{table*}
+\caption{Execution time of PROP on CPU and GPU.}
+\label{table:time} 
+\end{table}
  
-\begin{comment}
-\begin{table*}[ht]
-\begin{center}
-\begin{tabular}{|c||c|c||}
- \hline
-  PROP version & \multicolumn{2}{c|}{Execution time} \\
-  \hline  \hline
-CPU version & 1 core & 4 cores  \\\hline
-& {201m32s} & {113m28s} \\ \hline  \hline
-GPU version  & C1060 & C2050 \\
-  \hline\hline
-  GPU V1 (\ref{gpuv1}) & 79m25s & 66m22s  \\
-  \hline
-  GPU V2 (\ref{gpuv2}) & 47m58s & 29m52s \\
-  \hline
-  GPU V3 (\ref{gpuv3}) & 41m28s & 23m46s \\
-  \hline
-  GPU V4 (\ref{gpuv4}) & 27m21s & 13m55s\\
-  \hline
-  GPU V5 (\ref{gpuv5}) & 24m27s & 12m39s  \\
-  \hline
-\end{tabular}
-\caption{\label{table:time} 
-Execution time of the successive GPU versions}
-\end{center}
-\end{table*}
-\end{comment}
+
+%% \begin{table}[ht]
+%% \begin{center}
+%% \begin{tabular}{|c||c|c||}
+%%  \hline
+%%   PROP version & \multicolumn{2}{c|}{Execution time} \\
+%%   \hline  \hline
+%% CPU version & 1 core & 4 cores  \\\hline
+%% & {201m32s} & {113m28s} \\ \hline  \hline
+%% GPU version  & C1060 & C2050 \\
+%%   \hline\hline
+%%   GPU V1 (\ref{gpuv1}) & 79m25s & 66m22s  \\
+%%   \hline
+%%   GPU V2 (\ref{gpuv2}) & 47m58s & 29m52s \\
+%%   \hline
+%%   GPU V3 (\ref{gpuv3}) & 41m28s & 23m46s \\
+%%   \hline
+%%   GPU V4 (\ref{gpuv4}) & 27m21s & 13m55s\\
+%%   \hline
+%%   GPU V5 (\ref{gpuv5}) & 24m27s & 12m39s  \\
+%%   \hline
+%% \end{tabular}
+%% \end{center}
+%% \caption{Execution time of the successive GPU versions}
+%% \label{table:time} 
+%% \end{table}
  
  \begin{figure}[h]
  \centering
@@ -904,10 +909,10 @@ the execution times
  of PROP on CPUs and GPUs, 
  each version solves the propagation equations in the 
  form~(\ref{eq_CAPS_1}-\ref{eq_CAPS_3}) as proposed by CAPS. 
-Fig.~\ref{fig:speedup_1core} (respectively \ref{fig:speedup_4cores})
+Figure~\ref{fig:speedup_1core} (respectively, \ref{fig:speedup_4cores})
  shows the speedup of the successive GPU versions
-over one CPU core (respectively four CPU cores). 
-We use here Intel Q8200 quad-core CPUs (2.33 GHz), one C1060 GPU and
+over one CPU core (respectively, four CPU cores). 
+We use here Intel Q8200 quad-core CPUs (2.33 GHz), one C1060 GPU, and
  one C2050 (Fermi) GPU, located at 
   UPMC (Universit\'e Pierre et Marie Curie, Paris, France). 
  As a remark, the execution times measured on the C2050 would be the same 
@@ -921,7 +926,7 @@ Since energies are independent, execution times for more energies
  (e.g. the huge case) should be proportional
  to those reported in Table~\ref{table:time}.  
  
-These tests, which have been performed with CUDA 3.2, CUBLAS\index{CUBLAS} 3.2 and 
+These tests, which have been performed with CUDA 3.2, CUBLAS\index{CUBLAS} 3.2, and 
  MAGMA 0.2, 
  show that the successive GPU versions of PROP offer 
  increasing, and at the end interesting, speedups.
@@ -943,10 +948,11 @@ is noticed that the padding\index{padding}
  does offer much more performance gain with,  
  for example,  CUDA 3.1 and the MAGMA DGEMM\index{MAGMA functions!DGEMM}~\cite{NTD10b}:  the
  speedup with respect to one 
-CPU core was increased from 6.3 to 8.1 on C1060, and from 9.5 to 14.3
+CPU core was increased from 6.3 to 8.1 on C1060 and from 9.5 to 14.3
  on C2050. 
-Indeed since CUBLAS\index{CUBLAS} 3.2 performance has been improved for non block multiple
-matrix sizes through MAGMA code~\cite{NTD10a}. 
+Indeed  CUBLAS\index{CUBLAS} 3.2 performance has been improved through MAGMA code %~\cite{NTD10a}. 
+%for non block multiple matrix sizes through MAGMA code~\cite{NTD10a}. 
+for matrix sizes which are not  multiples of the inner blocking size~\cite{NTD10a}.  
  Although for all versions the C2050 (with its improved
  double precision\index{precision!double precision} performance) offers up to almost 
  double speedup compared to 
@@ -980,19 +986,21 @@ the GPU for such an application.
  We detail here the execution profile on 
  the CPU and the GPU for the evaluation of all off-diagonal sectors 
  (the most representative ones) for a complete energy propagation. 
- Fig.~\ref{fig:CPU-timing} and \ref{fig:profileGPU} show CPU and GPU execution times for the 
+ Figures~\ref{fig:CPU-timing} and \ref{fig:profileGPU} show CPU and GPU execution times for the 
  171 off-diagonal sectors of  the large case (see Table \ref{data-sets}). 
  ``Copying, adding, scaling'' corresponds to the amplitude
-  array construction (scaling) as well as to steps 1, 2 and 6 in
-  Sect.~\ref{gpu-RO}, all implemented via CUDA kernels.
-``CPU-GPU transfers'' 
-aggregate transfer times for the $j$ amplitude
-arrays and the scaling factors, as well as for the correction data.
+  array construction (scaling) as well as to Steps 1, 2, and 6 in
+  Sect.~\ref{gpu-RO}, all implemented via CUDA kernels. 
   ``Amplitude matrix product'' corresponds to the DGEMM call to
- construct the local R-matrices from the $i$ and $j$ amplitude
+ construct the local $R$-matrices from the $i$ and $j$ amplitude
   arrays. 
  ``Linear system solving'' and ``Output matrix product'' correspond
-respectively to steps 3-4 and to step 5 in Sect.~\ref{gpu-RO}.
+respectively to steps 3-4 and to step 5 in Sect.~\ref{gpu-RO}.   
+``CPU-GPU transfers''  in Fig.~\ref{fig:profileGPU} 
+aggregate transfer times for the $j$ amplitude
+arrays and the scaling factors, as well as for the correction data.
+
+
  
  On one CPU core (see  Fig.~\ref{fig:CPU-timing}), 
   matrix products for the construction of the local
@@ -1001,14 +1009,14 @@ computation time during the whole propagation. Likewise the CPU time required by
   matrix products for the output $R$-matrix is constant within each
   strip. But as the global $R$-matrix is propagated from strip to
   strip, the sizes of
-the matrices $U$ and $V$ increase, so does their multiplication time.
+the matrices $U$ and $V$ increase, and so does their multiplication time.
  The time required to solve the linear system increases
  slightly during the propagation.
-These three operations (``Amplitude matrix product'', ``Output matrix
-product'' and ``Linear system solving'') are clearly dominant in terms
+These three operations (``Amplitude matrix product,'' ``Output matrix
+product,'' and ``Linear system solving'') are clearly dominant in terms
  of computation
  time compared to the other remaining operations, which justify our
-primary focus on such three linear algebra operations.
+primary focus on these three linear algebra operations.
  
  
  On the C1060 (see Fig.~\ref{GPU-timing}), we have
@@ -1046,7 +1054,7 @@ launched by the same host thread. We therefore rely here on the {\em legacy
  API} of CUBLAS\index{CUBLAS}~\cite{CUBLAS} (like MAGMA)
  without thread safety problems. 
  A {\em breadth first} issue order is used for kernel
-launchs \cite{CUDA_stream}: for a given GPU kernel, all kernel launchs
+launches \cite{CUDA_stream}: for a given GPU kernel, all kernel launchs
  are indeed issued together in the host thread, using one stream for each
  concurrent energy, in order to maximize concurrent kernel
  execution\index{concurrent kernel execution}.  
@@ -1057,28 +1065,28 @@ propagation of each single energy
  in order to enable concurrent executions among the required kernels.
  
  
-\begin{table}[t]
+\begin{table}[h]
  \begin{center}
  \begin{tabular}{|c|c||c|c|c|c|c|}
    \hline
-  \multirow{4}{0.18\linewidth}{Medium case} & Number of & 
+  \multirow{4}{0.18\linewidth}{Medium Case} & Number of & 
    \multirow{2}{0.07\linewidth}{\centering 1} & 
    \multirow{2}{0.07\linewidth}{\centering 2} & 
    \multirow{2}{0.07\linewidth}{\centering 4} & 
    \multirow{2}{0.07\linewidth}{\centering 8} & 
    \multirow{2}{0.07\linewidth}{\centering 16} \\  
-  & energies & & & & & \\  
+  & Energies & & & & & \\  
    \cline{2-7}
    & Time (s) &  11.18 & 6.87 & 5.32 & 4.96 & 4.76 \\
    \cline{2-7}
    & Speedup & - & 1.63 & 2.10 & 2.26 & 2.35 \\
    \hline
    \hline
-  \multirow{4}{0.18\linewidth}{Large case} & Number of & 
+  \multirow{4}{0.18\linewidth}{Large Case} & Number of & 
    \multirow{2}{0.07\linewidth}{\centering 1} & 
    \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 2}} & 
    \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 3}} \\
-  & energies & & \multicolumn{2}{c|}{~} & \multicolumn{2}{c|}{~}  \\  
+  & Energies & & \multicolumn{2}{c|}{~} & \multicolumn{2}{c|}{~}  \\  
    \cline{2-7}
    & Time (s) & 509.51 & \multicolumn{2}{c|}{451.49} & \multicolumn{2}{c|}{436.72}  \\  
    \cline{2-7}
@@ -1110,10 +1118,10 @@ with only one energy on the GPU. The concurrent execution of multiple
  kernels\index{concurrent kernel execution} is also limited by other operations on the
  GPU \cite{CUDA_ProgGuide,CUDA_stream} and by the current MAGMA code which
  prevents concurrent MAGMA calls in different streams. 
-Better speedups can be here expected on the latest Kepler GPUs which
+Better speedups can be expected on the latest Kepler GPUs which
  offer additional compute power, and whose {\em Hyper-Q} feature may help
  improve further the GPU utilization with concurrent energies. 
-On the contrary, the same code on the C1060 shows no speedup
+To the contrary, the same code on the C1060 shows no speedup
   since the concurrent kernel launches are serialized on this previous GPU architecture. 
  
  
@@ -1144,11 +1152,11 @@ We have
  therefore deployed the PROP code in double precision\index{precision!double precision} on 
  a GPU, with successive improvements. The different GPU versions 
  each offer increasing speedups over the CPU version.
-Compared to the single (respectively four) core(s) CPU version, the
+Compared to the single (respectively, four) core(s) CPU version, the
  optimal GPU implementation 
-gives a speedup of 8.2 (resp. 4.6) on one C1060 GPU,
-and a speedup of 15.9 (resp. 9.0) on one 
-C2050 GPU with improved double precision\index{precision!double precision} performance.
+gives a speedup of 8.2 (resp., 4.6) on one C1060 GPU,
+and a speedup of 15.9 (resp., 9.0) on one 
+C2050 GPU with improved double-precision\index{precision!double precision} performance.
  An additional gain of around 15\% 
  can also be obtained on one Fermi GPU
  with large memory (C2070) thanks to concurrent kernel execution. 
@@ -1159,21 +1167,21 @@ obtained with the HMPP\index{HMPP} version, since in our tests the CPUs are
  different and the CUBLAS\index{CUBLAS} versions are more recent.  
  However, the programming effort required 
  progressively to deploy PROP on GPUs clearly offers improved and interesting speedups for this 
-real-life   application in double precision\index{precision!double precision} with varying-size matrices. 
+real-life   application in double precision\index{precision!double precision} with varying-sized matrices. 
  
  
  We are currently working on a hybrid CPU-GPU version that spreads the
  computations of the independent energies on both the CPU
  and the GPU. This will enable 
  multiple energy execution on the CPU, with 
-one or several core(s) dedicated to each energy (via multi-threaded
+one or several core(s) dedicated to each energy (via multithreaded
  BLAS\index{BLAS} libraries). Multiple 
  concurrent energies may also be propagated on each Fermi GPU. 
  By merging this work with the current MPI PROP program, we will
  obtain a scalable hybrid CPU-GPU version.
-This final version will offer an additional level of parallelism
+This final version will offer an additional level of parallelism, 
  thanks to the MPI
-standard in order to exploit multiple
+standard, in order to exploit multiple
  nodes with multiple CPU cores and possibly multiple GPU cards. 
  
  \putbib[Chapters/chapter15/biblio]