BookGPU/Chapters/chapter15/ch15.tex

   1 \chapterauthor{Rachid Habel}{T\'el\'ecom SudParis, France}
   2 \chapterauthor{Pierre Fortin, Fabienne J\'ez\'equel, and Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, Université Pierre et Marie Curie, France}
   3
   4 %\chapterauthor{Fabienne J\'ez\'equel}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   5 %\chapterauthor{Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   6 \chapterauthor{Stan Scott}{School of Electronics, Electrical Engineering \& Computer Science,
   7 The Queen's University of Belfast, United Kingdom}
   8
   9 %\newcommand{\fixme}[1]{{\bf #1}}
  10
  11 \chapter[Numerical validation and GPU performance in atomic physics]{Numerical validation and performance optimization on GPUs of an application in atomic physics}
  12 \label{chapter15}
  13
  14 \section{Introduction}\label{ch15:intro}
  15 As described in Chapter~\ref{chapter1}, GPUs are characterized by hundreds
  16 of cores and theoretically perform one order of magnitude better than CPUs.
  17 An important factor to consider when programming on GPUs
  18  is the cost of
  19 data transfers between CPU memory and GPU memory. Thus, to have good
  20 performance on
  21 GPUs, applications should be coarse-grained and have a high arithmetic
  22 intensity
  23 (i.e., the ratio of arithmetic operations to memory operations).
  24 Another important aspect of GPU programming is that floating-point
  25 operations are preferably performed in single precision\index{precision!single precision}, if the
  26  validity of results is not impacted by that format.
  27 The GPU compute power for floating-point operations is indeed greater in
  28 single precision\index{precision!single precision} than in double precision\index{precision!double precision}.
  29 The peak performance ratio between single precision\index{precision!single precision} and double
  30 precision varies, for example, for NVIDIA GPUs from $12$ for the first Tesla
  31 GPUs (C1060),
  32 to $2$ for the Fermi GPUs (C2050 and C2070),
  33 and to $3$ for the latest Kepler architecture (K20/K20X).
  34 As far as AMD GPUs are concerned, the latest AMD GPU (Tahiti HD 7970)
  35 presents a ratio of $4$.
  36 Moreover, GPU internal memory accesses and CPU-GPU data transfers are
  37 faster in single precision\index{precision!single precision} than in double precision\index{precision!double precision}
  38 because of the different format lengths.
  39
  40 This chapter describes the deployment on GPUs of PROP, a program of the
  41 2DRMP~\cite{FARM_2DRMP,2DRMP} suite which models electron collisions
  42 with H-like atoms and ions at intermediate energies. 2DRMP operates successfully on serial
  43 computers, high performance clusters, and supercomputers.  The primary
  44 purpose of the PROP program is to propagate a global
  45 $R$-matrix~\cite{Burke_1987}, $\Re$, in the two-electron configuration
  46 space.
  47 The propagation needs to be performed for all collision energies,
  48 for instance, hundreds of energies,
  49 which are independent.
  50 Propagation equations are dominated by matrix multiplications involving sub-matrices of $\Re$.
  51 However, the matrix multiplications are not
  52 straightforward in the sense that $\Re$ dynamically changes the designation of its rows and
  53 columns and increases in size as the propagation proceeds \cite{VECPAR}.
  54
  55 In a preliminary investigation PROP was selected by GENCI\footnote{GENCI: Grand Equipement National
  56   de Calcul Intensif, \url{www.genci.fr}} and
  57 CAPS,\footnote{CAPS is a software company providing products and solutions
  58   for many-core application programming and deployment,
  59   \url{www.caps-entreprise.com}}
  60 following their first call for projects in 2009--2010
  61 aimed at
  62 deploying applications on hybrid systems based on GPUs.
  63 First CAPS
  64 recast the propagation equations with larger matrices.
  65 For matrix products the GPU performance gain over CPU increases indeed
  66 with the matrix size, since the
  67 CPU-GPU transfer overhead becomes less significant and since CPUs are
  68 still more efficient for fine computation grains.
  69 Then, using HMPP,\index{HMPP}\footnote{
  70 HMPP (Hybrid Multicore Parallel Programming) or {CAPS compiler}, see: \url{www.caps-entreprise.com/hmpp.html}}
  71 a commercial
  72 hybrid and parallel compiler, CAPS
  73 developed a version of PROP in
  74 which matrix multiplications are performed on
  75 the GPU or the CPU, depending on the matrix size.
  76 Unfortunately this partial GPU implementation of PROP does not offer
  77 significant acceleration.
  78
  79 The work described in this chapter, which is based on a study presented in \cite{PF_PDSEC2011}, aims at
  80  improving PROP performance on
  81 GPUs by exploring two directions. First, because the original version of PROP is written
  82 in double precision\index{precision!double precision},
  83 we study the numerical stability of PROP in single precision\index{precision!single precision}.
  84 Second, we deploy  the whole
  85 computation code of PROP on
  86 GPUs to avoid the overhead generated by
  87 data transfers
  88 and we propose successive improvements
  89 (including one specific to the Fermi architecture)
  90 in order to optimize the GPU code.
  91
  92
  93
  94
  95 \section{2DRMP and the PROP program}
  96 \label{s:2DRMP_PROP}
  97 \subsection{Principles of $R$-matrix propagation}
  98 2DRMP~\cite{FARM_2DRMP,2DRMP} is part of the CPC library.\footnote{CPC:
  99 Computer Physics Communications,
 100 \url{http://cpc.cs.qub.ac.uk/}}
 101 It is a suite of seven
 102 programs aimed at creating virtual experiments on high performance and grid
 103 architectures to enable the study of electron scattering from H-like
 104 atoms and ions at intermediate energies.  The 2DRMP suite  uses the
 105 two-dimensional $R$-matrix propagation approach~\cite{Burke_1987}.
 106 In 2DRMP the two-electron configuration  space  ($r_1$,$r_2$) is
 107 divided into sectors.
 108 Figure~\ref{prop} shows the division of the two-electron configuration
 109 space ($r_1$,$r_2$) into 4 vertical $strips$ representing 10 $sectors$.
 110 The key computation in 2DRMP, performed by the PROP program, is the
 111 propagation of a global
 112 $R$-matrix, $\Re$, from sector to sector across the internal region, as shown in Fig.~\ref{prop}.
 113
 114 \begin{figure}[h]
 115 \begin{center}
 116 \includegraphics*[width=0.65\linewidth]{Chapters/chapter15/figures/prop.pdf}
 117 \caption{\label{prop} Subdivision of the configuration space
 118 ($r_1$,$r_2$) into a set of connected sectors.}
 119 \end{center}
 120 \end{figure}
 121
 122 \begin{figure}[h]
 123 \begin{center}
 124 \includegraphics*[width=0.8\linewidth]{Chapters/chapter15/figures/Domain.pdf}
 125 \caption{\label{domain} Propagation of the $R$-matrix from domain $D$ to domain $D'$.}
 126 \end{center}
 127 \end{figure}
 128
 129 We consider the general situation in
 130 Fig.~\ref{domain} where we assume that we already know
 131 the global $R$-matrix, $\Re^{I}$, associated with the boundary defined
 132 by edges 5, 2, 1, and 6
 133 in domain $D$ and we wish to
 134 evaluate the new global $R$-matrix, $\Re^{O}$, associated with edges 5, 3, 4, and 6
 135 in domain $D'$ following propagation across subregion $d$.
 136 Input edges are denoted by I (edges 1 and~2), output edges by O (edges 3 and 4) and
 137 common edges by X (edges 5 and~6).
 138 Because of symmetry, only the lower half of domains $D$ and $D'$ has to be considered.
 139 The global $R$-matrices, $\Re^{I}$ in domain $D$ and $\Re^{O}$ in
 140 domain $D'$, can be written as
 141 \begin{equation}
 142 \Re^{I} = \left(\begin{array}{cc}
 143       \Re_{II}^{I} & \Re_{IX}^{I}\\
 144       \Re_{XI}^{I} & \Re_{XX}^{I}
 145     \end{array}\right)
 146 \ \
 147 \Re^{O} = \left(\begin{array}{cc}
 148       \Re_{OO}^{O} & \Re_{OX}^{O}\\
 149       \Re_{XO}^{O} & \Re_{XX}^{O}
 150     \end{array}\right).
 151 \label{eq:RI_RO}
 152 \end{equation}
 153
 154
 155
 156 From the set of local $R$-matrices, $\mathbf{R}_{ij}$ ($i,j\in \{1,2,3,4\}$),
 157 associated
 158 with subregion $d$, we can define
 159 \begin{subequations}
 160 \begin{eqnarray}
 161     \mathbf{r}_{II} = \left(\begin{array}{cc}
 162       \mathbf{R}_{11} & \mathbf{R}_{12}\\
 163       \mathbf{R}_{21} & \mathbf{R}_{22}
 164     \end{array}\right), \label{eqaa} &
 165     \mathbf{r}_{IO} = \left(\begin{array}{cc}
 166       \mathbf{R}_{13} & \mathbf{R}_{14}\\
 167       \mathbf{R}_{23} & \mathbf{R}_{24}
 168     \end{array}\right), \label{eqbb}\\
 169     \mathbf{r}_{OI} = \left(\begin{array}{cc}
 170       \mathbf{R}_{31} & \mathbf{R}_{32}\\
 171       \mathbf{R}_{41} & \mathbf{R}_{42}
 172     \end{array}\right), \label{eqcc} &
 173    \mathbf{r}_{OO} = \left(\begin{array}{cc}
 174       \mathbf{R}_{33} & \mathbf{R}_{34}\\
 175       \mathbf{R}_{43} & \mathbf{R}_{44}
 176     \end{array}\right),\label{eqdd}
 177 \end{eqnarray}
 178 \end{subequations}
 179 where $I$ represents the input edges 1 and 2, and $O$ represents
 180 the output edges 3 and 4 (see Fig.~\ref{domain}).
 181 The propagation across each sector is characterized by equations~(\ref{eq1}) to (\ref{eq4}).
 182 \begin{subequations}
 183 \begin{eqnarray}
 184 \Re^{O}_{OO} &=& \mathbf{r}_{OO} - \mathbf{r}_{IO}^T (r_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}, \label{eq1} \\
 185 \Re^{O}_{OX} &=& \mathbf{r}_{IO}^T (\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX},  \label{eq2} \\
 186    \Re^{O}_{XO} &=& \Re^{I}_{XI}(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO},  \label{eq3} \\
 187    \Re^{O}_{XX} &=& \Re^{I}_{XX} - \Re^{I}_{XI}(\mathbf{r}_{II} +\Re^{I}_{II})^{-1}\Re^{I}_{IX}. \label{eq4}
 188 \end{eqnarray}
 189 \end{subequations}
 190
 191 The matrix inversions are not explicitly performed. To compute
 192 $(r_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}$ and $(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX}$,
 193 two linear systems are solved.
 194
 195
 196 \medskip
 197
 198 While equations (\ref{eq1})--(\ref{eq4}) can be applied to the
 199 propagation across a general subregion two special situations should be
 200 noted: propagation across a diagonal subregion and propagation across
 201 a subregion bounded by the $r_{1}$-axis at the beginning of a new
 202 strip.
 203
 204 In the case of a diagonal subregion, from symmetry considerations,
 205 edge 2 is identical to edge 1 and edge 3 is identical to edge~4.
 206 Accordingly, with only one input edge and one output edge equations
 207 (\ref{eqaa})--(\ref{eqdd}) become
 208 \begin{subequations}
 209 \begin{eqnarray}
 210 \mathbf{r}_{II} = 2\mathbf{R}_{11}, \
 211 \mathbf{r}_{IO} = 2\mathbf{R}_{14}, \label{eq4b}\\
 212 \mathbf{r}_{OI} = 2\mathbf{R}_{41}, \
 213 \mathbf{r}_{OO} = 2\mathbf{R}_{44}. \label{eq4d}
 214 \end{eqnarray}
 215 \end{subequations}
 216 In the case of a subregion bounded by the $r_1$-axis at the beginning
 217 of a new strip, we note that the input boundary $I$ consists of only
 218 one edge. When propagating across the first subregion in the second
 219 strip there is no common boundary $X$: in this case only equation
 220 (\ref{eq1}) needs to be solved.
 221
 222 \medskip
 223
 224 Having obtained the global $R$-matrix $\Re$ on the boundary of the
 225 innermost subregion (labeled $0$ in Fig.~\ref{prop}), $\Re$ is propagated across
 226 each subregion in the order indicated in Fig.~\ref{prop},
 227 working systematically from the
 228 $r_1$-axis at the bottom of each strip across all subregions to the
 229 diagonal.
 230
 231
 232
 233 \subsection{Description of the PROP program}
 234 \label{sec:PROP}
 235
 236 \begin{table}[t]
 237 \begin{center}
 238 \begin{tabular}{|c|c|c|c|c|c|}
 239  \hline
 240  \multirow{2}{0.09\linewidth}{\centering Data Set} &
 241  \multirow{2}{0.15\linewidth}{\centering
 242   Local $R$-\\Matrix Size} &
 243  \multirow{2}{0.07\linewidth}{\centering Strips} &
 244  \multirow{2}{0.09\linewidth}{\centering Sectors} &
 245  \multirow{2}{0.19\linewidth}{\centering Final Global \\$R$-Matrix Size} &
 246  \multirow{2}{0.15\linewidth}{\centering Scattering\\Energies}\\
 247   & & & & & \\
 248   \hline
 249   Small & $90\times90$  & 4 & 10 & $360\times360$ & 6\\
 250  \hline
 251   Medium  & $90\times90$  & 4 & 10 & $360\times360$ & 64\\
 252  \hline
 253   Large  & $383\times383$  & 20 & 210 &  $7660\times7660$ & 6\\
 254  \hline
 255   Huge  & $383\times383$  & 20 & 210 &  $7660\times7660$ & 64\\ \hline
 256 \end{tabular}
 257 \caption{\label{data-sets}Characteristics of four data sets.}
 258 \end{center}
 259 \end{table}
 260
 261 The PROP program computes the propagation of the $R$-matrix across the sectors of the internal region.
 262 Table~\ref{data-sets} shows four different
 263 data sets used in this study and highlights the principal parameters of the PROP program.
 264  PROP execution can be described by Algorithm~\ref{prop-algo}.
 265 First, amplitude arrays and
 266 correction data are read from data files generated by the preceding
 267 program of the 2DRMP suite.
 268 Then, local $R$-matrices are constructed from amplitude arrays.
 269 Correction data is used to compute correction vectors added to the diagonal of the local
 270 $R$-matrices. The local $R$-matrices, together with the input $R$-matrix,
 271  $\Re^{I}$,
 272 computed on the previous sector, are used to compute the current output
 273 $R$-matrix,
 274 $\Re^{O}$.
 275  At the end of a sector evaluation,
 276 the output $R$-matrix becomes  the input $R$-matrix
 277 for the next evaluation.
 278
 279 %% \begin{algorithm}
 280 %% \caption{\label{prop-algo}PROP algorithm}
 281 %% \begin{algorithmic}
 282 %% \FOR{all scattering energies}
 283 %%  \FOR{all sectors}
 284 %%  \STATE Read amplitude arrays
 285 %%  \STATE Read correction data
 286 %% \STATE Construct local $R$-matrices
 287 %% \STATE From $\Re^{I}$ and local $R$-matrices, compute $\Re^{O}$
 288 %% \STATE $\Re^{O}$ becomes $\Re^{I}$ for the next sector
 289 %%  \ENDFOR
 290 %%  \STATE Compute physical $R$-Matrix
 291 %% \ENDFOR
 292 %% \end{algorithmic}
 293 %% \end{algorithm}
 294
 295 \begin{algorithm}
 296 \caption{\label{prop-algo}PROP algorithm}
 297 %\begin{algorithmic}
 298 \For{all scattering energies} {
 299  \For{all sectors}{
 300   Read amplitude arrays\;
 301   Read correction data\;
 302   Construct local $R$-matrices\;
 303   From $\Re^{I}$ and local $R$-matrices, compute $\Re^{O}$\;
 304  $\Re^{O}$ becomes $\Re^{I}$ for the next sector\;
 305  }
 306  Compute physical $R$-Matrix\;
 307 }
 308 %\end{algorithmic}
 309 \end{algorithm}
 310
 311
 312 On the first sector, there is no input $R$-matrix yet. To bootstrap
 313 the propagation, the first output $R$-matrix is constructed using only
 314 one local $R$-matrix.  On the last sector, that is, on the boundary of
 315 the inner region, a physical $R$-matrix corresponding to the output
 316 $R$-matrix is computed and stored into an output file.
 317
 318 In the PROP program, sectors are characterized into four types,
 319 depending on the computation performed:
 320 \begin{itemize}
 321 \item the starting sector (labeled 0 in Fig.~\ref{prop});
 322 \item the axis sectors (labeled 1, 3, and 6 in Fig.~\ref{prop});
 323 \item the diagonal sectors (labeled 2, 5,  and 9 in Fig.~\ref{prop});
 324 \item the off-diagonal sectors (labeled 4, 7, and 8 in Fig.~\ref{prop}).
 325 \end{itemize}
 326
 327
 328 The serial version of PROP is implemented in Fortran~90 and
 329 for linear algebra operations uses BLAS\index{BLAS} and LAPACK\index{LAPACK} routines
 330 which are fully optimized for x86 architecture.
 331 This
 332 program
 333 serially propagates the $R$-matrix for
 334 all scattering energies.
 335 Since the propagations for these different
 336 energies are independent, there also
 337 exists an embarrassingly parallel version of
 338 PROP
 339 that spreads the computations of
 340 several energies
 341 among multiple CPU nodes via
 342 MPI.
 343
 344
 345
 346 \subsection{CAPS implementation}
 347 \label{caps}
 348
 349 In order to handle larger matrices, and thus obtain better GPU  speedup, CAPS
 350 recast equations (\ref{eq1}) to (\ref{eq4}) into one equation.
 351 The output $R$-matrix $\Re^{O}$ defined by equation~(\ref{eq:RI_RO}) is now computed as follows:
 352 \begin{equation}\label{eq_CAPS_1}
 353 \Re^{O} = \Re^{O^{\ \prime}} + U A^{-1} V,
 354 \end{equation}
 355 \begin{equation}\label{eq_CAPS_2}
 356 {\rm with} \
 357 \Re^{O^{\ \prime}}= \left(\begin{array}{cc}
 358       \mathbf{r}_{OO} & 0\\
 359       0 & \Re^I_{XX}    \end{array}\right), \
 360 U= \left(\begin{array}{c}
 361       \mathbf-{r}_{IO}^{T}\\
 362       \Re^I_{XI}    \end{array}\right),
 363 \end{equation}
 364 \begin{equation}\label{eq_CAPS_3}
 365 A= \mathbf{r}_{II} + \Re^I_{II} \ {\rm and}  \
 366 V= (\mathbf{r}_{IO}\ \ \ -\Re^I_{IX}).
 367 \end{equation}
 368
 369 To compute $W=A^{-1}V$, no matrix inversion is performed. The matrix
 370 system $AW=V$ is solved.
 371 This reimplementation of PROP reduces the number of equations to be
 372 solved and the number of matrix copies for evaluating each sector.
 373 For instance, for an off-diagonal sector,
 374 copies fall from 22 to 5, matrix multiplications from 4 to~1, and calls
 375 to a linear equation solver from 2 to 1.
 376
 377 To implement this version, CAPS
 378 used HMPP\index{HMPP}, a
 379 commercial hybrid and parallel compiler,
 380 based on compiler directives such as the new OpenACC\index{OpenACC} standard.\footnote{See: \url{www.openacc-standard.org}}
 381 If the matrices are large enough (the limit sizes are experimental parameters),
 382 they are multiplied on the GPU, otherwise on the CPU.
 383 CAPS
 384  used Intel's MKL (Math Kernel Library) BLAS\index{BLAS} implementation on an Intel Xeon
 385 x5560 quad core CPU (2.8 GHz)
 386 and the CUBLAS\index{CUBLAS} library (CUDA 2.2) on one Tesla C1060 GPU.
 387 On the large data set (see Table~\ref{data-sets}), CAPS
 388  obtained a speedup of 1.15 for the GPU
 389 version over the CPU one (with multithreaded MKL calls on the four
 390 CPU cores). This limited gain in performance is mainly
 391 due to the use of double precision\index{precision!double precision} computation
 392 and to the small or medium sizes of most matrices.
 393 For these matrices, the computation gain on
 394 the GPU is indeed
 395 strongly affected by the overhead
 396 generated by transferring these matrices from
 397 the CPU memory to the GPU memory to perform each matrix multiplication and then
 398 transferring the result back to the CPU memory.
 399
 400 Our goal is to speed up PROP more significantly by porting the whole
 401 code to the GPU and therefore avoiding
 402 the
 403 intermediate data transfers between
 404 the host (CPU) and the GPU. We will also study the
 405 stability of PROP in single precision\index{precision!single precision} because
 406 single-precision\index{precision!single precision} computation is faster on the GPU
 407 and CPU-GPU data transfers are twice as fast as those performed in
 408 double precision\index{precision!double precision}.
 409
 410
 411
 412 \section{Numerical validation\index{numerical validation} of PROP in single precision\index{precision!single precision}}
 413 \label{single-precision}
 414
 415 \begin{comment}
 416 \begin{table}[h]
 417 \begin{center}
 418 \begin{tabular}{|c|c|}
 419   \hline
 420    relative error interval & \# occurrences \\
 421   \hline
 422    [0, 1.E-8) & 18 \\
 423   \hline
 424    [1.E-8, 1.E-6) & 1241 \\
 425   \hline
 426    [1.E-6, 1.E-4) & 48728 \\
 427   \hline
 428    [1.E-4, 1.E-2) & 184065 \\
 429   \hline
 430    [1.E-2, 1) & 27723 \\
 431   \hline
 432    [1, 100) & 304 \\
 433   \hline
 434    [100, $+\infty$) & 1 \\
 435   \hline
 436 \end{tabular}
 437 \end{center}
 438 \caption{\label{sp-distrib}Error distribution for medium case in single precision\index{precision!single precision}}
 439 \end{table}
 440 \end{comment}
 441
 442
 443 Floating-point input data, computation, and output data of PROP are
 444 originally in double-precision\index{precision!double precision} format.
 445 PROP produces a standard $R$-matrix H-file \cite{FARM_2DRMP}
 446  and a collection of {Rmat}00X files (where X
 447 ranges from 0 to the number of scattering energies $-$ 1)
 448 holding the physical $R$-matrix for each
 449 energy.
 450 The  H-file  and the {Rmat}00X files are binary input files of the FARM program \cite{FARM_2DRMP}
 451 (last program of the 2DRMP suite).
 452 Their text equivalents are the  prop.out
 453 and the prop00X.out files.
 454 To study the validity of PROP results in  single precision\index{precision!single precision},
 455 first,
 456 reference results are
 457  generated by running the serial version of PROP in double precision\index{precision!double precision}.
 458 Data used in the most costly computation parts are read from input files in
 459 double precision\index{precision!double precision} format and then
 460 cast to single precision\index{precision!single precision} format.
 461 PROP results  (input of FARM) are computed in single precision\index{precision!single precision} and  written
 462 into files in double precision\index{precision!double precision}.
 463
 464 \subsection{Medium case study}
 465 \begin{figure}[h]
 466 \begin{center}
 467 \includegraphics*[width=0.9\linewidth]{Chapters/chapter15/figures/error.pdf}
 468 \caption{\label{fig:sp-distrib} Error distribution for medium case in single precision.\index{precision!single precision}}
 469 \end{center}
 470 \end{figure}
 471
 472 The physical $R$-matrices, in
 473 the prop00X.out files, are compared to the
 474 reference ones for the medium case (see Table~\ref{data-sets}).
 475  The relative
 476 error distribution is
 477 given in Fig.~\ref{fig:sp-distrib}.
 478 We focus on the largest errors.
 479 \begin{itemize}
 480 \item Errors greater than $10^{2}$: %100:
 481 the only impacted value is of order $10^{-6}$ %1.E-6
 482 and is negligible compared to the other ones
 483 in the same prop00X.out file.
 484
 485 \item Errors between 1 and $10^{2}$: %100:
 486 the values corresponding to the
 487   largest errors are of order $10^{-3}$ %1.E-3
 488 and are negligible compared to
 489   the majority of the other values which range between $10^{-2}$ %1.E-2
 490 and $10^{-1}$.
 491
 492 \item Errors between $10^{-2}$ %1.E-2
 493 and 1: the largest errors ($\ge$ 6\%)
 494   impact values the order of magnitude of which is at most $10^{-1}$. %1.E-1.
 495   These values are negligible.
 496   Relative errors of approximately 5\% impact values the order of
 497   magnitude of which is at most $10^{2}$. %1.E2.
 498   For instance, the value 164 produced by the reference version of
 499   PROP becomes 172 in the single precision\index{precision!single precision} version.
 500 \end{itemize}
 501
 502 To study the impact of the single precision\index{precision!single precision} version of PROP on the
 503 FARM program, the cross-section
 504 results files corresponding to
 505 transitions
 506 {1s1s},
 507 {1s2s}, {1s2p}, {1s3s}, {1s3p}, {1s3d},
 508 {1s4s}, {2p4d} are compared to the reference ones.
 509 Table~\ref{sp-farm} shows that all cross-section files are impacted by
 510 errors.  Indeed in the  {2p4d} file,  four relative errors are
 511 greater than one and the maximum relative error is 1.60.
 512 However, the largest errors impact negligible values. For example, the maximum
 513 error (1.60) impacts a reference value which is $4.5\ 10^{-4}$.  %4.5E-4.
 514 The largest
 515 values are impacted by low errors. For instance, the maximum value
 516 (1.16) is impacted by a relative error of the order $10^{-3}$. %1.E-3.
 517
 518 \begin{table}[t]
 519 \begin{center}
 520 \begin{tabular}{|c|c||c|c|} \hline
 521  File & Largest Relative Error & File & Largest Relative Error\\ \hline
 522  {1s1s} & 0.02& {1s3p} & 0.11  \\ \hline
 523  {1s2s} & 0.06 &  {1s3d} &  0.22 \\ \hline
 524  {1s2p} &  0.08 & {1s4s} &  0.20  \\ \hline
 525  {1s3s} &  0.17 &2p4d & 1.60  \\ \hline
 526 \end{tabular}
 527 \caption{\label{sp-farm}Impact  on FARM  of the single precision\index{precision!single precision} version of PROP.}
 528 \end{center}
 529 \end{table}
 530
 531 To examine in more detail the impact of PROP on FARM,
 532 cross-sections above the ionization threshold (1 Ryd)
 533 are compared in single and
 534 double precision\index{precision!double precision}  for
 535 transitions among the 1s, \dots 4s, 2p, \dots 4p, 3d, 4d target states.
 536 This comparison is carried out by generating 45 plots.  In all the
 537  plots, results in single and double precision\index{precision!double precision} match except for a few
 538  scattering energies which are very close to pseudo-state thresholds.
 539 For example, Fig.~\ref{1s2p} and \ref{1s4d} present the scattering energies corresponding to the
 540 {1s2p} and {1s4d} cross-sections computed in single and double precision\index{precision!double precision}.   For some cross-sections,
 541 increasing a threshold parameter from $10^{-4}$ %1.E-4
 542 to $10^{-3}$ %1.E-3
 543 in the FARM
 544 program
 545 results in energies close to threshold being avoided,
 546  and therefore,
 547 the cross-sections in double and single precision\index{precision!single precision} match more
 548 accurately.
 549 This is the case for instance for cross-section 1s2p (see Fig.~\ref{1s2p3}).
 550 However for other cross-sections (such as 1s4d) some problematic energies remain even if the
 551 threshold parameter in  the FARM
 552 program is increased to  $10^{-3}$ %1.E-3
 553  (see Fig.~\ref{1s4d3}).  A higher
 554 threshold parameter would be required for such cross-sections.
 555
 556 \begin{figure}[t]
 557 \centering
 558 \subfigure[threshold = $10^{-4}$]{ %1.E-4]{
 559 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p.pdf}
 560    \label{1s2p}
 561  }
 562 \subfigure[threshold = $10^{-3}$]{ %1.E-3]{
 563 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p3.pdf}
 564  \label{1s2p3}
 565  }
 566 \label{fig:1s2p_10sectors}
 567 \caption{1s2p cross-section, 10 sectors.}
 568 \end{figure}
 569
 570 \begin{figure}[t]
 571 \centering
 572 \subfigure[threshold = $10^{-4}$]{ %1.E-4]{
 573 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d.pdf}
 574    \label{1s4d}
 575  }
 576 \subfigure[threshold = $10^{-3}$]{ %1.E-3]{
 577 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d3.pdf}
 578  \label{1s4d3}
 579  }
 580 \label{fig:1s4d_10sectors}
 581 \caption{1s4d cross-section, 10 sectors.}
 582 \end{figure}
 583
 584 As a conclusion, the medium case study shows that the execution of
 585 PROP in single precision\index{precision!single precision} leads to a few inexact scattering energies to
 586 be computed by the FARM program for some cross-sections.
 587 Thanks to a suitable threshold parameter in the FARM program these problematic energies may possibly
 588 be skipped.
 589 Instead of investigating more deeply the choice of such a parameter for the medium case, we analyze the
 590 single-precision\index{precision!single precision} computation  in a more
 591 realistic case in Sect.~\ref{huge}.
 592 \begin{comment}
 593 The conclusion of the medium case study is that running PROP in single
 594 precision gives relatively stable results provided that suitable
 595 parameter values are used in the FARM program in order to skip the
 596 problematic energies that are too close to the pseudo-state
 597 thresholds.  To verify if this conclusion is still valid with a larger
 598 data set, the single precision\index{precision!single precision} computation is analyzed in a more
 599 realistic case in Sect.~\ref{huge}.
 600 \end{comment}
 601
 602 \subsection{Huge case study}\label{huge}
 603
 604
 605 \begin{figure}[t]
 606   \centering
 607 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
 608 \caption{\label{1s2pHT}1s2p cross-section, threshold = $10^{-4}$, 210 sectors.} %1.E-4, 210 sectors.}
 609 \end{figure}
 610
 611 \begin{figure}[t]
 612   \centering
 613 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
 614 \caption{\label{1s4dHT}1s4d cross-section, threshold = $10^{-4}$, 210 sectors.} %1.E-4, 210 sectors.}
 615 \end{figure}
 616
 617 We study here the impact on FARM of the PROP program run in
 618 single precision\index{precision!single precision} for the huge case (see Table~\ref{data-sets}).
 619 The cross-sections
 620 corresponding to all
 621 atomic target states 1s \dots 7i are explored, which
 622 leads to
 623 406 comparison plots.
 624 It should be noted that in this case, over the same energy range above the ionization threshold, the density of pseudo-state thresholds is significantly increased compared to the medium case.
 625 As expected, all the plots exhibit large differences between single and double
 626 precision cross-sections.
 627 For example Fig.~\ref{1s2pHT} and  \ref{1s4dHT} present the 1s2p and 1s4d cross-sections computed in
 628 single---and double---precision\index{precision!double precision} for the huge case.
 629 We can conclude that PROP in single precision\index{precision!single precision} gives invalid results
 630 for realistic simulation cases above the ionization threshold.
 631 Therefore, the  deployment of PROP on GPU, described in Sect.~\ref{gpu-implem},
 632 has been carried out in double precision\index{precision!double precision}.
 633
 634 \section{Towards a complete deployment of PROP on GPUs}
 635 \label{gpu-implem}
 636
 637 We now detail how PROP has been progressively deployed on
 638 GPUs in double precision\index{precision!double precision} in order to avoid the
 639 expensive memory transfers between the host and the GPU.
 640 Different versions with successive improvements and optimizations are presented.
 641 We use CUDA~\cite{CUDA_ProgGuide} for GPU programming, as well as the
 642 CUBLAS\index{CUBLAS}~\cite{CUBLAS}
 643 and MAGMA \cite{MAGMA} libraries for linear algebra operations.
 644 Since PROP is written in Fortran 90, {\em wrappers\index{wrapper}} in C are used to
 645 enable calls to CUDA kernels from PROP routines.
 646
 647
 648 \subsection{Computing the output $R$-matrix on GPU}
 649 \label{gpu-RO}
 650
 651 \begin{figure}[h]
 652   \centering
 653   \includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/offdiagonal_nb.pdf}
 654   \caption{\label{offdiagonal} The six steps of an off-diagonal sector
 655     evaluation.}
 656 \end{figure}
 657
 658 As mentioned in Algorithm~\ref{prop-algo}, evaluating a sector
 659 mainly consists of constructing local $R$-matrices and computing
 660 one output $R$-matrix, $\Re^{O}$. In this first step of the porting
 661 process, referred to as GPU V1\label{gpuv1},
 662 we consider only the computation of $\Re^{O}$ on the GPU.
 663 We distinguish the following six steps, related to equations
 664 (\ref{eq_CAPS_1}), (\ref{eq_CAPS_2}), and (\ref{eq_CAPS_3}), and illustrated in
 665 Fig.~\ref{offdiagonal} for an off-diagonal sector.
 666
 667 \begin{description}
 668 \item[Step 1] (``Input copies''):~data are copied from $\Re^{I}$
 669   to temporary arrays ($A$, $U$, $V$) and to $\Re^{O}$.
 670   These copies, along with possible scalings or transpositions, are
 671   implemented as CUDA kernels which can be applied to two
 672   matrices of any size starting at any offset.
 673   Memory accesses are coalesced\index{GPU!coalesced memory accesses} \cite{CUDA_ProgGuide} in order to
 674   provide the best performance for such memory-bound kernels.
 675 \item[Step 2] (``Local copies''):~data are copied from
 676   local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$.
 677   Moreover data from local $R$-matrix
 678   $\mathbf{r}_{II}$
 679   is added to matrix $A$ (via a CUDA kernel) and zeroes are written in
 680    $\Re^{O}$  where required.
 681 \item[Step 3] (``Linear system solving''):~matrix $A$ is factorized
 682   using the MAGMA DGETRF\index{MAGMA functions!DGETRF}
 683    routine and the result is stored in-place.
 684 \item[Step 4] (``Linear system solving,'' cont.):~the matrix system
 685  of linear equations  $AW$ = $V$ is solved using the MAGMA DGETRS\index{MAGMA functions!DGETRS}
 686 routine. The  solution is stored in matrix $V$.
 687 \item[Step 5] (``Output matrix product''):~matrix $U$
 688   is multiplied by matrix $V$ using the CUBLAS\index{CUBLAS} DGEMM
 689   routine. The result is stored in a temporary matrix~$t$.
 690 \item[Step 6] (``Output add''):~$t$ is added to $\Re^{O}$ (CUDA
 691   kernel).
 692 \end{description}
 693
 694 All the involved matrices are stored in the GPU memory. Only the
 695 local $R$-matrices are first constructed on the host and then sent
 696 to the GPU memory, since these matrices vary from sector to sector.
 697 The evaluation of the axis and diagonal sectors is similar.
 698 However, fewer operations and copies are required because of
 699 symmetry considerations \cite{2DRMP}.
 700
 701 \subsection{Constructing the local $R$-matrices on GPU}
 702
 703 \begin{figure}[h]
 704  \centering
 705   \includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/amplitudes_nb.pdf}
 706  \caption{\label{amplitudes} Constructing the local $R$-matrix R34
 707  from the $j$ amplitude array associated with edge 4 and the $i$
 708  amplitude array associated with edge~3.}
 709 \end{figure}
 710
 711 Local $R$-matrices are constructed using two three-dimensional arrays,
 712 $i$ and $j$. Each three-dimensional array contains four
 713 matrices corresponding to the surface amplitudes associated with the
 714 four edges of a sector. Those matrices are named {\em amplitude arrays}.
 715  $j$ amplitude arrays are read from data files and $i$ amplitude arrays
 716 are obtained by scaling each row of the $j$ amplitude arrays.
 717 The main part of the construction of a local $R$-matrix,
 718 presented in Fig.~\ref{amplitudes},
 719 is a matrix product between
 720 one $i$ amplitude array and one transposed $j$ amplitude array
 721 which is performed by a single DGEMM
 722 BLAS\index{BLAS} call.
 723 In this version, hereafter referred to as GPU
 724 V2\label{gpuv2}, $i$ and $j$ amplitude arrays are transferred to the
 725 GPU memory and the required matrix multiplications are performed on
 726 the GPU (via CUBLAS\index{CUBLAS} routines).
 727
 728
 729 The involved matrices have medium sizes (either $3066 \times 383$ or
 730 $5997 \times 383$), and
 731 performing these matrix multiplications
 732 on the GPU is expected to be faster than on the CPU.
 733 However, this implies a greater communication volume
 734 between the CPU and the GPU
 735 since the $i$ and $j$ amplitude arrays are larger than the local
 736 $R$-matrices.
 737 It can be noticed that correction data are also used in the
 738 construction of a local $R$-matrix,
 739 but this is a minor part in the
 740 computation. However, these correction data also have to be
 741 transferred from the CPU to the GPU for each sector.
 742
 743 \subsection{Scaling amplitude arrays on GPU}
 744 It
 745 should be worthwhile to try to reduce the CPU-GPU data
 746 transfers of the GPU V2, where the $i$ and $j$ amplitude arrays are
 747 constructed on the host and then sent to the GPU memory for each sector.
 748 In this new version, hereafter referred to as GPU V3\label{gpuv3}, we
 749 transfer only the $j$ amplitude arrays and the
 750 required scaling factors (stored in one 1D array) to the GPU memory,
 751 so that the $i$ amplitude arrays are
 752 then directly computed on the GPU by multiplying the $j$ amplitude
 753 arrays by these scaling factors (via a CUDA kernel).
 754 Therefore, we save the transfer of four $i$ amplitude arrays on
 755 each sector by transferring  only this 1D array of scaling factors.
 756 Moreover, scaling $j$ amplitude arrays is expected to be faster on the
 757 GPU than on the CPU, thanks to the massively parallel architecture of
 758 the GPU and its higher internal memory bandwidth.
 759
 760 \subsection{Using double-buffering\index{double-buffering} to overlap I/O and computation}
 761
 762 \begin{figure}[t]
 763   \centering
 764   \includegraphics[width=0.8\linewidth]{Chapters/chapter15/figures/C1060_V3_IO_COMP.pdf}
 765   \caption{\label{overlapping} Compute and I/O times for the GPU V3 on
 766     one C1060.}
 767 \end{figure}
 768
 769 As described in Algorithm~\ref{prop-algo}, there are two main steps in
 770 the propagation across a sector: reading  amplitude arrays
 771 and correction data from I/O files and
 772 evaluating the current sector.
 773 Fig.~\ref{overlapping} shows the I/O times and the evaluation times
 774 of each sector for the huge case execution (210 sectors, 20 strips) of the GPU V3
 775 on one C1060.
 776 Whereas the times required by the off-diagonal sectors are similar
 777 within each of the 20 strips,
 778 the times for diagonal sectors of each strip
 779 are the shortest ones, the times for the axis sectors being
 780 intermediate.
 781 The I/O times are roughly constant among all strips.
 782 The evaluation time is equivalent to the I/O
 783 time for the first sectors. But this evaluation time grows
 784 linearly with the strip number and rapidly exceeds the I/O
 785 time.
 786
 787 It is thus interesting to use a double-buffering\index{double-buffering} technique to overlap the
 788 I/O time with the evaluation time:
 789 for each sector, the evaluation of sector $n$ is performed
 790 (on GPU) simultaneously with the reading of data for sector
 791 $n+1$ (on CPU). This requires the duplication in the CPU memory of all the
 792 data structures
 793 used for storing data read from I/O files for each sector.
 794 This version, hereafter referred to as GPU
 795 V4\label{gpuv4}, uses POSIX threads\index{POSIX threads}. Two threads are
 796 executed concurrently: an I/O thread that reads data from I/O files
 797 for each sector, and a computation thread, dedicated to the propagation
 798 of the global $R$-matrix, that performs successively for each sector
 799 all necessary computations on GPU,
 800 as well as all required CPU-GPU data transfers.
 801 The evaluation of a sector uses the data read for this sector as well
 802 as the global $R$-matrix computed on the previous sector.
 803 This dependency requires synchronizations between the I/O thread and
 804 the computation thread which are implemented through standard POSIX
 805 thread mechanisms.
 806
 807
 808 \subsection{Matrix padding\index{padding}}
 809 The  MAGMA DGETRF\index{MAGMA functions!DGETRF}/DGETRS\index{MAGMA functions!DGETRS}
 810 performance and the CUBLAS DGEMM performance
 811 are reduced when the sizes (or
 812 the leading dimensions) of the matrix are not multiples of the inner blocking size \cite{NTD10a}.
 813 This inner blocking size can be 32 or 64, depending on the computation
 814 and on the underlying
 815 GPU architecture \cite{MAGMA,NTD10b}.
 816 In this version (GPU V5\label{gpuv5}),
 817 the matrices are therefore padded with $0.0$ (and $1.0$ on the diagonal for the linear systems)
 818 so that their sizes are
 819 multiples of 64.
 820 This corresponds indeed to the optimal size for the matrix product on the
 821 Fermi architecture \cite{NTD10b}. And as far as linear system solving is
 822 concerned, all the matrices have sizes which are multiples of 383: we
 823 therefore use padding\index{padding} to obtain multiples of 384 (which are
 824 again  multiples of 64).
 825 It can be noticed that this padding\index{padding} has to be performed dynamically
 826 as the matrices increase in size during the propagation
 827 (when possible the
 828  maximum required storage space is, however, allocated only once in the
 829  GPU memory).
 830
 831 \section{Performance results}
 832 \subsection{PROP deployment on GPU}
 833
 834 \begin{table}[ht]
 835 \begin{center}
 836 \begin{tabular}{|c||c|c||}
 837  \hline
 838   PROP Version & \multicolumn{2}{c|}{Execution Time} \\
 839   \hline
 840   \hline
 841   CPU Version: 1 Core & \multicolumn{2}{c|}{201m32s} \\
 842   \hline
 843   CPU Version: 4 Cores &  \multicolumn{2}{c|}{113m28s} \\
 844   \hline \hline
 845 GPU Version  & C1060 & C2050 \\
 846   \hline\hline
 847   GPU V1 (\S~\ref{gpuv1}) & 79m25s & 66m22s  \\
 848   \hline
 849   GPU V2 (\S~\ref{gpuv2}) & 47m58s & 29m52s \\
 850   \hline
 851   GPU V3 (\S~\ref{gpuv3}) & 41m28s & 23m46s \\
 852   \hline
 853   GPU V4 (\S~\ref{gpuv4}) & 27m21s & 13m55s\\
 854   \hline
 855   GPU V5 (\S~\ref{gpuv5}) & 24m27s & 12m39s  \\
 856   \hline
 857 \end{tabular}
 858 \end{center}
 859 \caption{Execution time of PROP on CPU and GPU.}
 860 \label{table:time}
 861 \end{table}
 862
 863
 864 %% \begin{table}[ht]
 865 %% \begin{center}
 866 %% \begin{tabular}{|c||c|c||}
 867 %%  \hline
 868 %%   PROP version & \multicolumn{2}{c|}{Execution time} \\
 869 %%   \hline  \hline
 870 %% CPU version & 1 core & 4 cores  \\\hline
 871 %% & {201m32s} & {113m28s} \\ \hline  \hline
 872 %% GPU version  & C1060 & C2050 \\
 873 %%   \hline\hline
 874 %%   GPU V1 (\ref{gpuv1}) & 79m25s & 66m22s  \\
 875 %%   \hline
 876 %%   GPU V2 (\ref{gpuv2}) & 47m58s & 29m52s \\
 877 %%   \hline
 878 %%   GPU V3 (\ref{gpuv3}) & 41m28s & 23m46s \\
 879 %%   \hline
 880 %%   GPU V4 (\ref{gpuv4}) & 27m21s & 13m55s\\
 881 %%   \hline
 882 %%   GPU V5 (\ref{gpuv5}) & 24m27s & 12m39s  \\
 883 %%   \hline
 884 %% \end{tabular}
 885 %% \end{center}
 886 %% \caption{Execution time of the successive GPU versions}
 887 %% \label{table:time}
 888 %% \end{table}
 889
 890 \begin{figure}[h]
 891 \centering
 892 \subfigure[Speedup over 1 CPU core]{
 893 \includegraphics*[width=0.76
 894         \linewidth]{Chapters/chapter15/figures/histo_speedup_1core.pdf}
 895    \label{fig:speedup_1core}
 896  }
 897
 898 \subfigure[Speedup over 4 CPU cores]{
 899 \includegraphics*[width=0.76
 900         \linewidth]{Chapters/chapter15/figures/histo_speedup_4cores.pdf}
 901  \label{fig:speedup_4cores}
 902  }
 903 \label{fig:speedup}
 904 \caption{Speedup of the successive GPU versions.}
 905 \end{figure}
 906
 907 Table~\ref{table:time} presents
 908 the execution times
 909 of PROP on CPUs and GPUs,
 910 each version solves the propagation equations in the
 911 form~(\ref{eq_CAPS_1}-\ref{eq_CAPS_3}) as proposed by CAPS.
 912 Figure~\ref{fig:speedup_1core} (respectively, \ref{fig:speedup_4cores})
 913 shows the speedup of the successive GPU versions
 914 over one CPU core (respectively, four CPU cores).
 915 We use here Intel Q8200 quad-core CPUs (2.33 GHz), one C1060 GPU, and
 916 one C2050 (Fermi) GPU, located at
 917  UPMC (Universit\'e Pierre et Marie Curie, Paris, France).
 918 As a remark, the execution times measured on the C2050 would be the same
 919 on the C2070 and on  the C2075, the only difference between these GPUs
 920 being their memory size and their TDP (Thermal Design Power)\index{TDP (thermal design power)}.
 921 We emphasize that the execution times correspond to the
 922 complete propagation for all six energies of the large case (see
 923 Table~\ref{data-sets}), that is to say to the complete execution of
 924 the PROP program.
 925 Since energies are independent, execution times for more energies
 926 (e.g. the huge case) should be proportional
 927 to those reported in Table~\ref{table:time}.
 928
 929 These tests, which have been performed with CUDA 3.2, CUBLAS\index{CUBLAS} 3.2, and
 930 MAGMA 0.2,
 931 show that the successive GPU versions of PROP offer
 932 increasing, and at the end interesting, speedups.
 933 More precisely,
 934 V2 shows that it is worth increasing slightly the
 935 CPU-GPU communication volume in order to perform
 936 large enough matrix products on the GPU.
 937  This communication volume can fortunately be
 938 reduced thanks to~V3,
 939 which also accelerates the computation of
 940 amplitude arrays thanks to the GPU.
 941 The
 942 double-buffering\index{double-buffering} technique implemented in V4
 943  effectively enables the overlapping of
 944 I/O operations with computation, while the
 945 padding\index{padding} implemented in V5 also improves the computation times.
 946 It
 947 is noticed that the padding\index{padding}
 948 does offer much more performance gain with,
 949 for example,  CUDA 3.1 and the MAGMA DGEMM\index{MAGMA functions!DGEMM}~\cite{NTD10b}:  the
 950 speedup with respect to one
 951 CPU core was increased from 6.3 to 8.1 on C1060 and from 9.5 to 14.3
 952 on C2050.
 953 Indeed  CUBLAS\index{CUBLAS} 3.2 performance has been improved through MAGMA code %~\cite{NTD10a}.
 954 %for non block multiple matrix sizes through MAGMA code~\cite{NTD10a}.
 955 for matrix sizes which are not  multiples of the inner blocking size~\cite{NTD10a}.
 956 Although for all versions the C2050 (with its improved
 957 double precision\index{precision!double precision} performance) offers up to almost
 958 double speedup compared to
 959 the C1060, the performance obtained with both architectures justifies the use of
 960 the GPU for such an application.
 961
 962 \subsection{PROP execution profile}
 963
 964 \begin{figure}[t]
 965   \centering
 966  \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/CPU_1_CORE_TIMES.pdf}
 967 \caption{CPU (1 core)  execution times for the off-diagonal sectors of the large case.}
 968 \label{fig:CPU-timing}
 969 \end{figure}
 970
 971 \begin{figure}[t]
 972   \centering
 973   \subfigure[GPU V5 on one C1060]{
 974 \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C1060_TIMES.pdf}
 975 \label{GPU-timing}}
 976
 977   \subfigure[GPU V5 on one C2050]{
 978 \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C2050_TIMES.pdf}
 979 \label{fermi-timing}}
 980
 981 \caption{GPU execution times for the off-diagonal sectors of
 982   the large case.}
 983 \label{fig:profileGPU}
 984 \end{figure}
 985
 986 We detail here the execution profile on
 987 the CPU and the GPU for the evaluation of all off-diagonal sectors
 988 (the most representative ones) for a complete energy propagation.
 989  Figures~\ref{fig:CPU-timing} and \ref{fig:profileGPU} show CPU and GPU execution times for the
 990 171 off-diagonal sectors of  the large case (see Table \ref{data-sets}).
 991 ``Copying, adding, scaling'' corresponds to the amplitude
 992   array construction (scaling) as well as to Steps 1, 2, and 6 in
 993   Sect.~\ref{gpu-RO}, all implemented via CUDA kernels.
 994  ``Amplitude matrix product'' corresponds to the DGEMM call to
 995  construct the local $R$-matrices from the $i$ and $j$ amplitude
 996  arrays.
 997 ``Linear system solving'' and ``Output matrix product'' correspond
 998 respectively to steps 3-4 and to step 5 in Sect.~\ref{gpu-RO}.
 999 ``CPU-GPU transfers''  in Fig.~\ref{fig:profileGPU}
1000 aggregate transfer times for the $j$ amplitude
1001 arrays and the scaling factors, as well as for the correction data.
1002
1003
1004
1005 On one CPU core (see  Fig.~\ref{fig:CPU-timing}),
1006  matrix products for the construction of the local
1007 $R$-matrices require the same
1008 computation time during the whole propagation. Likewise the CPU time required by
1009  matrix products for the output $R$-matrix is constant within each
1010  strip. But as the global $R$-matrix is propagated from strip to
1011  strip, the sizes of
1012 the matrices $U$ and $V$ increase, and so does their multiplication time.
1013 The time required to solve the linear system increases
1014 slightly during the propagation.
1015 These three operations (``Amplitude matrix product,'' ``Output matrix
1016 product,'' and ``Linear system solving'') are clearly dominant in terms
1017 of computation
1018 time compared to the other remaining operations, which justify our
1019 primary focus on these three linear algebra operations.
1020
1021
1022 On the C1060 (see Fig.~\ref{GPU-timing}), we have
1023 generally managed to obtain a similar speedup for all operations
1024 (around 8, which corresponds to Fig.~\ref{fig:speedup_1core}). Only the linear system solving
1025 presents a lower speedup (around~4).
1026 The CUDA kernels and the remaining CPU-GPU transfers make a minor contribution
1027 to the overall computation time and do not require
1028 additional improvements.
1029
1030 On the C2050 (see Fig.~\ref{fermi-timing}), additional speedup is
1031 obtained for all operations, except for the
1032 CPU-GPU transfers and the linear system solving.
1033 The CPU-GPU transfers are mainly due to the $j$ amplitude arrays, and
1034 currently still correspond to minor times. When required, the
1035 double-buffering\index{double-buffering} technique may also be used to overlap such transfers with computation on the GPU.
1036
1037
1038
1039 \section{Propagation of multiple concurrent energies on GPU}\index{concurrent kernel execution}
1040
1041 Finally, we present here an improvement that can
1042 benefit from the Fermi architecture, as well as from the newest Kepler
1043 architecture,
1044 both of which enable the concurrent execution of multiple
1045 CUDA kernels\index{concurrent kernel execution}, thus offering additional speedup on
1046 GPUs for small or medium computation grain kernels.
1047 In our case, the performance gain on the GPU is indeed limited
1048 since most matrices have small or medium sizes.
1049 By using multiple streams within one CUDA context~\cite{CUDA_ProgGuide},
1050 we can propagate multiple energies
1051 concurrently\index{concurrent kernel execution} on the Fermi GPU.
1052 It can be noticed that all GPU computations for all streams are
1053 launched by the same host thread. We therefore rely here on the {\em legacy
1054 API} of CUBLAS\index{CUBLAS}~\cite{CUBLAS} (like MAGMA)
1055 without thread safety problems.
1056 A {\em breadth first} issue order is used for kernel
1057 launches \cite{CUDA_stream}: for a given GPU kernel, all kernel launchs
1058 are indeed issued together in the host thread, using one stream for each
1059 concurrent energy, in order to maximize concurrent kernel
1060 execution\index{concurrent kernel execution}.
1061 Of course, the memory available on the GPU must be large enough to
1062 store all data structures required by each energy.
1063 Moreover, multiple streams are also used within the
1064 propagation of each single energy
1065 in order to enable concurrent executions among the required kernels.
1066
1067
1068 \begin{table}[h]
1069 \begin{center}
1070 \begin{tabular}{|c|c||c|c|c|c|c|}
1071   \hline
1072   \multirow{4}{0.18\linewidth}{Medium Case} & Number of &
1073   \multirow{2}{0.07\linewidth}{\centering 1} &
1074   \multirow{2}{0.07\linewidth}{\centering 2} &
1075   \multirow{2}{0.07\linewidth}{\centering 4} &
1076   \multirow{2}{0.07\linewidth}{\centering 8} &
1077   \multirow{2}{0.07\linewidth}{\centering 16} \\
1078   & Energies & & & & & \\
1079   \cline{2-7}
1080   & Time (s) &  11.18 & 6.87 & 5.32 & 4.96 & 4.76 \\
1081   \cline{2-7}
1082   & Speedup & - & 1.63 & 2.10 & 2.26 & 2.35 \\
1083   \hline
1084   \hline
1085   \multirow{4}{0.18\linewidth}{Large Case} & Number of &
1086   \multirow{2}{0.07\linewidth}{\centering 1} &
1087   \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 2}} &
1088   \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 3}} \\
1089   & Energies & & \multicolumn{2}{c|}{~} & \multicolumn{2}{c|}{~}  \\
1090   \cline{2-7}
1091   & Time (s) & 509.51 & \multicolumn{2}{c|}{451.49} & \multicolumn{2}{c|}{436.72}  \\
1092   \cline{2-7}
1093   & Speedup & - & \multicolumn{2}{c|}{1.13} & \multicolumn{2}{c|}{1.17}  \\
1094   \hline
1095 \end{tabular}
1096 \caption{\label{t:perfs_V6} Performance results with multiple
1097   concurrent energies
1098   on one C2070 GPU. GPU initialization times are not considered here. }
1099 \end{center}
1100 \end{table}
1101
1102 In order to have enough GPU memory to run two or three concurrent
1103 energies for the large case, we use one C2070 GPU
1104 (featuring 6GB of memory)
1105 with one Intel X5650 hex-core CPU, CUDA 4.1 and CUBLAS\index{CUBLAS} 4.1, as
1106 well as the latest MAGMA release (version 1.3.0).
1107 Substantial changes have been required
1108 in the MAGMA calls with respect to the previous versions of PROP that were using MAGMA 0.2.
1109  Table~\ref{t:perfs_V6} presents the speedups
1110 obtained on the Fermi GPU for multiple concurrent
1111 energies (up to sixteen since this is the maximum number of concurrent
1112 kernel launches currently supported \cite{CUDA_ProgGuide}).
1113 With the medium case, speedups greater than 2 are obtained with four
1114 concurrent energies or more.
1115 With the more realistic large case, the performance gain is lower mainly because of
1116 the increase in matrix sizes, which implies a better GPU usage
1117 with only one energy on the GPU. The concurrent execution of multiple
1118 kernels\index{concurrent kernel execution} is also limited by other operations on the
1119 GPU \cite{CUDA_ProgGuide,CUDA_stream} and by the current MAGMA code which
1120 prevents concurrent MAGMA calls in different streams.
1121 Better speedups can be expected on the latest Kepler GPUs which
1122 offer additional compute power, and whose {\em Hyper-Q} feature may help
1123 improve further the GPU utilization with concurrent energies.
1124 To the contrary, the same code on the C1060 shows no speedup
1125  since the concurrent kernel launches are serialized on this previous GPU architecture.
1126
1127
1128
1129
1130
1131
1132
1133
1134 \section{Conclusion and future work}
1135 \label{conclusion}
1136
1137 In this chapter, we have presented our methodology and our results in
1138 the deployment on
1139 a GPU of an application (the PROP program) in atomic physics.
1140
1141 We have started by studying the numerical stability of PROP using
1142 single precision\index{precision!single precision} arithmetic. This has shown that PROP
1143 using single precision\index{precision!single precision}, while relatively stable for some small cases,
1144 gives unsatisfactory results for realistic simulation cases above the
1145 ionization threshold where there is a
1146 significant density of pseudo-states. It is
1147 expected that this may not be the case below the ionization threshold
1148 where the actual target states are less dense. This requires further
1149 investigation.
1150
1151 We have
1152 therefore deployed the PROP code in double precision\index{precision!double precision} on
1153 a GPU, with successive improvements. The different GPU versions
1154 each offer increasing speedups over the CPU version.
1155 Compared to the single (respectively, four) core(s) CPU version, the
1156 optimal GPU implementation
1157 gives a speedup of 8.2 (resp., 4.6) on one C1060 GPU,
1158 and a speedup of 15.9 (resp., 9.0) on one
1159 C2050 GPU with improved double-precision\index{precision!double precision} performance.
1160 An additional gain of around 15\%
1161 can also be obtained on one Fermi GPU
1162 with large memory (C2070) thanks to concurrent kernel execution.
1163
1164 Such speedups
1165 cannot be directly compared with the 1.15 speedup
1166 obtained with the HMPP\index{HMPP} version, since in our tests the CPUs are
1167 different and the CUBLAS\index{CUBLAS} versions are more recent.
1168 However, the programming effort required
1169 progressively to deploy PROP on GPUs clearly offers improved and interesting speedups for this
1170 real-life   application in double precision\index{precision!double precision} with varying-sized matrices.
1171
1172
1173 We are currently working on a hybrid CPU-GPU version that spreads the
1174 computations of the independent energies on both the CPU
1175 and the GPU. This will enable
1176 multiple energy execution on the CPU, with
1177 one or several core(s) dedicated to each energy (via multithreaded
1178 BLAS\index{BLAS} libraries). Multiple
1179 concurrent energies may also be propagated on each Fermi GPU.
1180 By merging this work with the current MPI PROP program, we will
1181 obtain a scalable hybrid CPU-GPU version.
1182 This final version will offer an additional level of parallelism,
1183 thanks to the MPI
1184 standard, in order to exploit multiple
1185 nodes with multiple CPU cores and possibly multiple GPU cards.
1186
1187 \putbib[Chapters/chapter15/biblio]
1188