BookGPU/Chapters/chapter15/ch15.tex

   1 \chapterauthor{Pierre Fortin}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   2 \chapterauthor{Rachid Habel}{T\'el\'ecom SudParis}
   3 \chapterauthor{Fabienne J\'ez\'equel}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   4 \chapterauthor{Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   5 \chapterauthor{Stan Scott}{School of Electronics, Electrical Engineering \& Computer Science,
   6 The Queen's University of Belfast}
   7
   8 \newcommand{\fixme}[1]{{\bf #1}}
   9
  10 \chapter[Numerical validation and performance optimization on GPUs in atomic physics]{Numerical validation and performance optimization on GPUs of an application in atomic physics}
  11 \label{chapter15}
  12
  13 \section{Introduction}\label{ch15:intro}
  14 As described in Chapter~\ref{chapter1}, GPUs are characterized by hundreds
  15 of cores and theoretically perform one order of magnitude better than CPUs.
  16 An important factor to consider when programming on GPUs
  17  is the cost of
  18 data transfers between CPU memory and GPU memory. Thus, to have good
  19 performance on
  20 GPUs, applications should be coarse-grained and have a high arithmetic
  21 intensity
  22 ($i.e.$ the ratio of arithmetic operations to memory operations).
  23 Another important aspect of GPU programming is that floating-point
  24 operations are preferably performed in single precision\index{precision!single precision}, if the
  25  validity of results is not impacted by that format.
  26 The GPU compute power for floating-point operations is indeed greater in
  27 single precision\index{precision!single precision} than in double precision\index{precision!double precision}.
  28 The peak performance ratio between single precision\index{precision!single precision} and double
  29 precision varies for example for NVIDIA GPUs from $12$ for the first Tesla
  30 GPUs (C1060),
  31 to $2$ for the Fermi GPUs (C2050 and C2070)
  32 and to $3$ for the latest Kepler architecture (K20/K20X).
  33 As far as AMD GPUs are concerned, the latest AMD GPU (Tahiti HD 7970)
  34 presents a ratio of $4$.
  35 Moreover, GPU internal memory accesses and CPU-GPU data transfers are
  36 faster in single precision\index{precision!single precision} than in double precision\index{precision!double precision},
  37 because of the different format lengths.
  38
  39 This chapter describes the deployment on GPUs of PROP, a program of the
  40 2DRMP~\cite{FARM_2DRMP,2DRMP} suite which models electron collisions
  41 with H-like atoms and ions at intermediate energies. 2DRMP operates successfully on serial
  42 computers, high performance clusters and supercomputers.  The primary
  43 purpose of the PROP program is to propagate a global
  44 R-matrix~\cite{Burke_1987}, $\Re$, in the two-electron configuration
  45 space.
  46 The propagation needs to be performed for all collision energies,
  47 for instance hundreds of energies,
  48 which are independent.
  49 Propagation equations are dominated by matrix multiplications involving sub-matrices of $\Re$.
  50 However, the matrix multiplications are not
  51 straightforward in the sense that $\Re$ dynamically changes the designation of its rows and
  52 columns and increases in size as the propagation proceeds \cite{VECPAR}.
  53
  54 In a preliminary investigation PROP was selected by GENCI\footnote{GENCI: Grand Equipement National
  55   de Calcul Intensif, \url{www.genci.fr}} and
  56 CAPS\footnote{CAPS is a software company providing products and solutions
  57   for manycore application programming and deployment,
  58   \url{www.caps-entreprise.com}},
  59 following their first call for projects in 2009-2010
  60 aimed at
  61 deploying applications on hybrid systems based on GPUs.
  62 First CAPS
  63 recast the propagation equations with larger matrices.
  64 For matrix products the GPU performance gain over CPU increases indeed
  65 with the matrix size, since the
  66 CPU-GPU transfer overhead becomes less significant and since CPUs are
  67 still more efficient for fine computation grains.
  68 Then, using HMPP\index{HMPP}\footnote{
  69 HMPP or {\em CAPS compiler}, see: \url{www.caps-entreprise.com/hmpp.html}},
  70 a commercial
  71 hybrid and parallel compiler, CAPS
  72 developed a version of PROP, in
  73 which matrix multiplications are performed on
  74 the GPU or the CPU, depending on the matrix size.
  75 Unfortunately this partial GPU implementation of PROP does not offer
  76 significant acceleration.
  77
  78 The work described in this chapter, which is based on a study presented in \cite{PF_PDSEC2011}, aims at
  79  improving PROP performance on
  80 GPUs by exploring two directions. First, because the original version of PROP is written
  81 in double precision\index{precision!double precision},
  82 we study the numerical stability of PROP in single precision\index{precision!single precision}.
  83 Second, we deploy  the whole
  84 computation code of PROP on
  85 GPUs to avoid the overhead generated by
  86 data transfers
  87 and we propose successive improvements
  88 (including a specific one to the Fermi architecture)
  89 in order to optimize the GPU code.
  90
  91
  92
  93
  94 \section{2DRMP and the PROP program}
  95 \label{s:2DRMP_PROP}
  96 \subsection{Principles of R-matrix propagation}
  97 2DRMP~\cite{FARM_2DRMP,2DRMP} is part of the CPC library\footnote{CPC:
  98 Computer Physics Communications,
  99 \url{http://cpc.cs.qub.ac.uk/}}.
 100 It is a suite of seven
 101 programs aimed at creating virtual experiments on high performance and grid
 102 architectures to enable the study of electron scattering from H-like
 103 atoms and ions at intermediate energies.  The 2DRMP suite  uses the
 104 two-dimensional $R$-matrix propagation approach~\cite{Burke_1987}.
 105 In 2DRMP the two-electron configuration  space  ($r_1$,$r_2$) is
 106 divided into sectors.
 107 Figure~\ref{prop} shows the division of the two-electron configuration
 108 space ($r_1$,$r_2$) into 4 vertical $strips$ representing 10 $sectors$.
 109 The key computation in 2DRMP, performed by the PROP program, is the
 110 propagation of a global
 111 $R$-matrix, $\Re$, from sector to sector across the internal region, as shown in Fig.~\ref{prop}.
 112
 113 \begin{figure}[h]
 114 \begin{center}
 115 \includegraphics*[width=0.65\linewidth]{Chapters/chapter15/figures/prop.pdf}
 116 \caption{\label{prop} Subdivision of the configuration space
 117 ($r_1$,$r_2$) into a set of connected sectors.}
 118 \end{center}
 119 \end{figure}
 120
 121 \begin{figure}[h]
 122 \begin{center}
 123 \includegraphics*[width=0.8\linewidth]{Chapters/chapter15/figures/Domain.pdf}
 124 \caption{\label{domain} Propagation of the R-matrix from domain D to domain D'.}
 125 \end{center}
 126 \end{figure}
 127
 128 We consider the general situation in
 129 Fig.~\ref{domain} where we assume that we already know
 130 the global $R$-matrix, $\Re^{I}$, associated with the boundary defined
 131 by edges 5, 2, 1 and 6
 132 in domain $D$ and we wish to
 133 evaluate the new global $R$-matrix, $\Re^{O}$, associated with edges 5, 3, 4 and 6
 134 in domain $D'$ following propagation across subregion $d$.
 135 Input edges are denoted by I (edges 1 and~2), output edges by O (edges 3 and 4) and
 136 common edges by X (edges 5 and~6).
 137 Because of symmetry, only the lower half of domains $D$ and $D'$ has to be considered.
 138 The global $R$-matrices, $\Re^{I}$ in domain $D$ and $\Re^{O}$ in
 139 domain $D'$, can be written as:
 140 \begin{equation}
 141 \Re^{I} = \left(\begin{array}{cc}
 142       \Re_{II}^{I} & \Re_{IX}^{I}\\
 143       \Re_{XI}^{I} & \Re_{XX}^{I}
 144     \end{array}\right)
 145 \ \
 146 \Re^{O} = \left(\begin{array}{cc}
 147       \Re_{OO}^{O} & \Re_{OX}^{O}\\
 148       \Re_{XO}^{O} & \Re_{XX}^{O}
 149     \end{array}\right).
 150 \label{eq:RI_RO}
 151 \end{equation}
 152
 153
 154
 155 From the set of local $R$-matrices, $\mathbf{R}_{ij}$ ($i,j\in \{1,2,3,4\}$)
 156 associated
 157 with subregion $d$, we can define
 158 \begin{subequations}
 159 \begin{eqnarray}
 160     \mathbf{r}_{II} = \left(\begin{array}{cc}
 161       \mathbf{R}_{11} & \mathbf{R}_{12}\\
 162       \mathbf{R}_{21} & \mathbf{R}_{22}
 163     \end{array}\right), \label{eqaa} &
 164     \mathbf{r}_{IO} = \left(\begin{array}{cc}
 165       \mathbf{R}_{13} & \mathbf{R}_{14}\\
 166       \mathbf{R}_{23} & \mathbf{R}_{24}
 167     \end{array}\right), \label{eqbb}\\
 168     \mathbf{r}_{OI} = \left(\begin{array}{cc}
 169       \mathbf{R}_{31} & \mathbf{R}_{32}\\
 170       \mathbf{R}_{41} & \mathbf{R}_{42}
 171     \end{array}\right), \label{eqcc} &
 172    \mathbf{r}_{OO} = \left(\begin{array}{cc}
 173       \mathbf{R}_{33} & \mathbf{R}_{34}\\
 174       \mathbf{R}_{43} & \mathbf{R}_{44}
 175     \end{array}\right),\label{eqdd}
 176 \end{eqnarray}
 177 \end{subequations}
 178 where $I$ represents the input edges 1 and 2, and $O$ represents
 179 the output edges 3 and 4 (see Fig.~\ref{domain}).
 180 The propagation across each sector is characterized by equations~(\ref{eq1}) to (\ref{eq4}).
 181 \begin{subequations}
 182 \begin{eqnarray}
 183 \Re^{O}_{OO} &=& \mathbf{r}_{OO} - \mathbf{r}_{IO}^T (r_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}, \label{eq1} \\
 184 \Re^{O}_{OX} &=& \mathbf{r}_{IO}^T (\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX},  \label{eq2} \\
 185    \Re^{O}_{XO} &=& \Re^{I}_{XI}(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO},  \label{eq3} \\
 186    \Re^{O}_{XX} &=& \Re^{I}_{XX} - \Re^{I}_{XI}(\mathbf{r}_{II} +\Re^{I}_{II})^{-1}\Re^{I}_{IX}. \label{eq4}
 187 \end{eqnarray}
 188 \end{subequations}
 189
 190 The matrix inversions are not explicitly performed. To compute
 191 $(r_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}$ and $(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX}$,
 192 two linear systems are solved.
 193
 194
 195 \medskip
 196
 197 While equations (\ref{eq1})-(\ref{eq4}) can be applied to the
 198 propagation across a general subregion two special situations should be
 199 noted: propagation across a diagonal subregion and propagation across
 200 a subregion bounded by the $r_{1}$-axis at the beginning of a new
 201 strip.
 202
 203 In the case of a diagonal subregion, from symmetry considerations,
 204 edge 2 is identical to edge 1 and edge 3 is identical to edge~4.
 205 Accordingly, with only one input edge and one output edge equations
 206 (\ref{eqaa})-(\ref{eqdd}) become:
 207 \begin{subequations}
 208 \begin{eqnarray}
 209 \mathbf{r}_{II} = 2\mathbf{R}_{11}, \
 210 \mathbf{r}_{IO} = 2\mathbf{R}_{14}, \label{eq4b}\\
 211 \mathbf{r}_{OI} = 2\mathbf{R}_{41}, \
 212 \mathbf{r}_{OO} = 2\mathbf{R}_{44}. \label{eq4d}
 213 \end{eqnarray}
 214 \end{subequations}
 215 In the case of a subregion bounded by the $r_1$-axis at the beginning
 216 of a new strip, we note that the input boundary $I$ consists of only
 217 one edge. When propagating across the first subregion in the second
 218 strip there is no common boundary $X$: in this case only equation
 219 (\ref{eq1}) needs to be solved.
 220
 221 \medskip
 222
 223 Having obtained the global $R$-matrix $\Re$ on the boundary of the
 224 innermost subregion (labeled $0$ in Fig.~\ref{prop}), $\Re$ is propagated across
 225 each subregion in the order indicated in Fig.~\ref{prop},
 226 working systematically from the
 227 $r_1$-axis at the bottom of each strip across all subregions to the
 228 diagonal.
 229
 230
 231
 232 \subsection{Description of the PROP program}
 233 \label{sec:PROP}
 234
 235 \begin{table}[t]
 236 \begin{center}
 237 \begin{tabular}{|c|c|c|c|c|c|}
 238  \hline
 239  \multirow{2}{0.09\linewidth}{\centering Data set} &
 240  \multirow{2}{0.15\linewidth}{\centering
 241   Local $R$-\\matrix size} &
 242  \multirow{2}{0.07\linewidth}{\centering Strips} &
 243  \multirow{2}{0.09\linewidth}{\centering Sectors} &
 244  \multirow{2}{0.19\linewidth}{\centering Final global \\$R$-matrix size} &
 245  \multirow{2}{0.15\linewidth}{\centering Scattering\\energies} \\
 246   & & & & & \\
 247   \hline
 248   Small & 90x90  & 4 & 10 & 360x360 & 6\\
 249  \hline
 250   Medium  & 90x90  & 4 & 10 & 360x360 & 64\\
 251  \hline
 252   Large  & 383x383  & 20 & 210 &  7660x7660 & 6\\
 253  \hline
 254   Huge  & 383x383  & 20 & 210 &  7660x7660 & 64\\ \hline
 255 \end{tabular}
 256 \caption{\label{data-sets}Characteristics of four data sets}
 257 \end{center}
 258 \end{table}
 259
 260 The PROP program computes the propagation of the $R$-matrix across the sectors of the internal region.
 261 Table~\ref{data-sets} shows four different
 262 data sets used in this study and highlights the principal parameters of the PROP program.
 263  PROP execution can be described by Algorithm~\ref{prop-algo}.
 264 First, amplitude arrays and
 265 correction data are read from data files generated by the preceding
 266 program of the 2DRMP suite.
 267 Then, local $R$-matrices are constructed from amplitude arrays.
 268 Correction data is used to compute correction vectors added to the diagonal of the local
 269 $R$-matrices. The local $R$-matrices, together with the input $R$-matrix,
 270  $\Re^{I}$,
 271 computed on the previous sector, are used to compute the current output
 272 $R$-matrix,
 273 $\Re^{O}$.
 274  At the end of a sector evaluation,
 275 the output $R$-matrix becomes  the input $R$-matrix
 276 for the next evaluation.
 277
 278 \begin{algorithm}
 279 \caption{\label{prop-algo}PROP algorithm}
 280 \begin{algorithmic}
 281 \FOR{all scattering energies}
 282  \FOR{all sectors}
 283  \STATE Read amplitude arrays
 284  \STATE Read correction data
 285 \STATE Construct local $R$-matrices
 286 \STATE From $\Re^{I}$ and local $R$-matrices, compute $\Re^{O}$
 287 \STATE $\Re^{O}$ becomes $\Re^{I}$ for the next sector
 288  \ENDFOR
 289  \STATE Compute physical $R$-Matrix
 290 \ENDFOR
 291 \end{algorithmic}
 292 \end{algorithm}
 293
 294
 295 On the first sector, there is no input $R$-matrix yet. To bootstrap
 296 the propagation, the first output $R$-matrix is constructed using only
 297 one local $R$-matrix.  On the last sector, that is, on the boundary of
 298 the inner region, a physical $R$-matrix corresponding to the output
 299 $R$-matrix is computed and stored into an output file.
 300
 301 In the PROP program, sectors are characterized into four types,
 302 depending on the computation performed:
 303 \begin{itemize}
 304 \item the starting sector (labeled 0 in Fig.~\ref{prop})
 305 \item the axis sectors (labeled 1, 3 and 6 in Fig.~\ref{prop})
 306 \item the diagonal sectors (labeled 2, 5 and 9 in Fig.~\ref{prop})
 307 \item the off-diagonal sectors (labeled 4, 7 and 8 in Fig.~\ref{prop}).
 308 \end{itemize}
 309
 310
 311 The serial version of PROP is implemented in Fortran~90 and uses
 312 for linear algebra operations BLAS\index{BLAS} and LAPACK\index{LAPACK} routines
 313 which are fully optimized for x86 architecture.
 314 This
 315 program
 316 serially propagates the $R$-matrix for
 317 all scattering energies.
 318 Since the propagations for these different
 319 energies are independent, there also
 320 exists an embarrassingly parallel version of
 321 PROP
 322 that spreads the computations of
 323 several energies
 324 among multiple CPU nodes via
 325 MPI.
 326
 327
 328
 329 \subsection{CAPS implementation}
 330 \label{caps}
 331
 332 In order to handle larger matrices, and thus obtain better GPU  speedup, CAPS
 333 recast equations (\ref{eq1}) to (\ref{eq4}) into one equation.
 334 The output $R$-matrix $\Re^{O}$ defined by equation~(\ref{eq:RI_RO}) is now computed as follows.
 335 \begin{equation}\label{eq_CAPS_1}
 336 \Re^{O} = \Re^{O^{\ \prime}} + U A^{-1} V,
 337 \end{equation}
 338 \begin{equation}\label{eq_CAPS_2}
 339 {\rm with} \
 340 \Re^{O^{\ \prime}}= \left(\begin{array}{cc}
 341       \mathbf{r}_{OO} & 0\\
 342       0 & \Re^I_{XX}    \end{array}\right), \
 343 U= \left(\begin{array}{c}
 344       \mathbf-{r}_{IO}^{T}\\
 345       \Re^I_{XI}    \end{array}\right),
 346 \end{equation}
 347 \begin{equation}\label{eq_CAPS_3}
 348 A= \mathbf{r}_{II} + \Re^I_{II} \ {\rm and}  \
 349 V= (\mathbf{r}_{IO}\ \ \ -\Re^I_{IX}).
 350 \end{equation}
 351
 352 To compute $W=A^{-1}V$, no matrix inversion is performed. The matrix
 353 system $AW=V$ is solved.
 354 This reimplementation of PROP reduces the number of equations to be
 355 solved and the number of matrix copies for evaluating each sector.
 356 For instance, for an off-diagonal sector,
 357 copies fall from 22 to 5, matrix multiplications from 4 to~1 and calls
 358 to a linear equation solver from 2 to 1.
 359
 360 To implement this version, CAPS
 361 used HMPP\index{HMPP}, a
 362 commercial hybrid and parallel compiler,
 363 based on compiler directives like the new OpenACC\index{OpenACC} standard\footnote{See: \url{www.openacc-standard.org}}.
 364 If the matrices are large enough (the limit sizes are experimental parameters),
 365 they are multiplied on the GPU, otherwise on the CPU.
 366 CAPS
 367  used the MKL BLAS\index{BLAS} implementation on an Intel Xeon
 368 x5560 quad core CPU (2.8 GHz)
 369 and the CUBLAS\index{CUBLAS} library (CUDA 2.2) on one Tesla C1060 GPU.
 370 On the large data set (see Table~\ref{data-sets}), CAPS
 371  obtained a speedup of 1.15 for the GPU
 372 version over the CPU one (with multi-threaded MKL calls on the four
 373 CPU cores). This limited gain in performance is mainly
 374 due to the use of double precision\index{precision!double precision} computation
 375 and to the small or medium sizes of most matrices.
 376 For these matrices, the computation gain on
 377 the GPU is indeed
 378 strongly affected by the overhead
 379 generated by transferring these matrices from
 380 the CPU memory to the GPU memory to perform each matrix multiplication and then
 381 transferring the result back to the CPU memory.
 382
 383 Our goal is to speedup PROP more significantly by porting the whole
 384 code to the GPU and therefore avoiding
 385 the
 386 intermediate data transfers between
 387 the host (CPU) and the GPU. We will also study the
 388 stability of PROP in single precision\index{precision!single precision} because
 389 single precision\index{precision!single precision} computation is faster on the GPU
 390 and CPU-GPU data transfers are twice as fast as those performed in
 391 double precision\index{precision!double precision}.
 392
 393
 394
 395 \section{Numerical validation\index{numerical validation} of PROP in single precision\index{precision!single precision}}
 396 \label{single-precision}
 397
 398 \begin{comment}
 399 \begin{table}[h]
 400 \begin{center}
 401 \begin{tabular}{|c|c|}
 402   \hline
 403    relative error interval & \# occurrences \\
 404   \hline
 405    [0, 1.E-8) & 18 \\
 406   \hline
 407    [1.E-8, 1.E-6) & 1241 \\
 408   \hline
 409    [1.E-6, 1.E-4) & 48728 \\
 410   \hline
 411    [1.E-4, 1.E-2) & 184065 \\
 412   \hline
 413    [1.E-2, 1) & 27723 \\
 414   \hline
 415    [1, 100) & 304 \\
 416   \hline
 417    [100, $+\infty$) & 1 \\
 418   \hline
 419 \end{tabular}
 420 \end{center}
 421 \caption{\label{sp-distrib}Error distribution for medium case in single precision\index{precision!single precision}}
 422 \end{table}
 423 \end{comment}
 424
 425
 426 Floating-point input data, computation and output data of PROP are
 427 originally in double precision\index{precision!double precision} format.
 428 PROP produces a standard $R$-matrix H-file \cite{FARM_2DRMP}
 429  and a collection of Rmat00X files (where X
 430 ranges from 0 to the number of scattering energies - 1)
 431 holding the physical R-matrix for each
 432 energy.
 433 The  H-file  and the Rmat00X files are binary input files of the FARM program \cite{FARM_2DRMP}
 434 (last program of the 2DRMP suite).
 435 Their text equivalent are the  prop.out
 436 and the prop00X.out files.
 437 To study the validity of PROP results in  single precision\index{precision!single precision},
 438 first,
 439 reference results are
 440  generated by running the serial version of PROP in double precision\index{precision!double precision}.
 441 Data used in the most costly computation parts are read from input files in
 442 double precision\index{precision!double precision} format and then
 443 cast to single precision\index{precision!single precision} format.
 444 PROP results  (input of FARM) are computed in single precision\index{precision!single precision} and  written
 445 into files in double precision\index{precision!double precision}.
 446
 447 \subsection{Medium case study}
 448 \begin{figure}[h]
 449 \begin{center}
 450 \includegraphics*[width=0.9\linewidth]{Chapters/chapter15/figures/error.pdf}
 451 \caption{\label{fig:sp-distrib} Error distribution for medium case in single precision\index{precision!single precision}}
 452 \end{center}
 453 \end{figure}
 454
 455 The physical $R$-matrices, in
 456 the prop00X.out files, are compared to the
 457 reference ones for the medium case (see Table~\ref{data-sets}).
 458  The relative
 459 error distribution is
 460 given in Fig.~\ref{fig:sp-distrib}.
 461 We focus on the largest errors.
 462 \begin{itemize}
 463 \item Errors greater than 100: the only impacted value is of order 1.E-6
 464 and is negligible compared to the other ones
 465 in the same prop00X.out file.
 466
 467 \item Errors between 1 and 100: the values corresponding to the
 468   largest errors are of order 1.E-3 and are negligible compared to
 469   the majority of the other values which range between 1.E-2 and
 470   1.E-1.
 471
 472 \item Errors between 1.E-2 and 1: the largest errors ($\ge$ 6\%)
 473   impact values the order of magnitude of which is at most 1.E-1.
 474   These values are negligible.
 475   Relative errors of approximately 5\% impact values the order of
 476   magnitude of which is at most 1.E2.
 477   For instance, the value 164 produced by the reference version of
 478   PROP becomes 172 in the single precision\index{precision!single precision} version.
 479
 480 \end{itemize}
 481
 482 To study the impact of the single precision\index{precision!single precision} version of PROP on the
 483 FARM program, the cross-section
 484 results files corresponding to
 485 transitions
 486 {1s1s},
 487 {1s2s}, {1s2p}, {1s3s}, {1s3p}, {1s3d},
 488 {1s4s}, {2p4d} are compared to the reference ones.
 489 Table~\ref{sp-farm} shows that all cross-section files are impacted by
 490 errors.  Indeed in the  {2p4d} file,  four relative errors are
 491 greater than one and the maximum relative error is 1.60.
 492 However the largest errors impact negligible values. For example, the maximum
 493 error (1.60) impacts a reference value which is 4.5E-4.  The largest
 494 values are impacted by low errors. For instance, the maximum value
 495 (1.16) is impacted by a relative error of the order 1.E-3.
 496
 497 \begin{table}[t]
 498 \begin{center}
 499 \begin{tabular}{|c|c||c|c|} \hline
 500   file & largest relative error & file & largest relative error\\ \hline
 501  {1s1s} & 0.02& {1s3p} & 0.11  \\ \hline
 502  {1s2s} & 0.06 &  {1s3d} &  0.22 \\ \hline
 503  {1s2p} &  0.08 & {1s4s} &  0.20  \\ \hline
 504  {1s3s} &  0.17 &2p4d & 1.60  \\ \hline
 505 \end{tabular}
 506 \caption{\label{sp-farm}Impact  on FARM  of the single precision\index{precision!single precision} version of PROP}
 507 \end{center}
 508 \end{table}
 509
 510 To examine in more detail the impact of PROP on FARM,
 511 cross sections above the ionization threshold (1 Ryd)
 512 are compared in single and
 513 double precision\index{precision!double precision}  for
 514 transitions amongst the 1s, \dots 4s, 2p, \dots 4p, 3d, 4d target states.
 515 This comparison is carried out by generating 45 plots.  In all the
 516  plots, results in single and double precision\index{precision!double precision} match except for few
 517  scattering energies which are very close to pseudo-state thresholds.
 518 For example Fig.~\ref{1s2p} and \ref{1s4d} present the scattering energies corresponding to the
 519 {1s2p} and {1s4d} cross-sections computed in single and double precision\index{precision!double precision}.   For some cross-sections,
 520 increasing a threshold parameter from 1.E-4 to 1.E-3 in the FARM
 521 program
 522 results in energies close to threshold being avoided
 523  and therefore
 524 the cross-sections in double and single precision\index{precision!single precision} match more
 525 accurately.
 526 This is the case for instance for cross-section 1s2p (see Fig.~\ref{1s2p3}).
 527 However for other cross-sections (such as 1s4d) some problematic energies remain even if the
 528 threshold parameter in  the FARM
 529 program is increased to 1.E-3 (see Fig.~\ref{1s4d3}).  A higher
 530 threshold parameter would be required for such cross-sections.
 531
 532 \begin{figure}[t]
 533 \centering
 534 \subfigure[threshold = 1.E-4]{
 535 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p.pdf}
 536    \label{1s2p}
 537  }
 538 \subfigure[threshold = 1.E-3]{
 539 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p3.pdf}
 540  \label{1s2p3}
 541  }
 542 \label{fig:1s2p_10sectors}
 543 \caption{1s2p cross-section, 10 sectors}
 544 \end{figure}
 545
 546 \begin{figure}[t]
 547 \centering
 548 \subfigure[threshold = 1.E-4]{
 549 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d.pdf}
 550    \label{1s4d}
 551  }
 552 \subfigure[threshold = 1.E-3]{
 553 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d3.pdf}
 554  \label{1s4d3}
 555  }
 556 \label{fig:1s4d_10sectors}
 557 \caption{1s4d cross-section, 10 sectors}
 558 \end{figure}
 559
 560 As a conclusion, the medium case study shows that the execution of
 561 PROP in single precision\index{precision!single precision} leads to a few inexact scattering energies to
 562 be computed by the FARM program for some cross-sections.
 563 Thanks to a suitable threshold parameter in the FARM program these problematic energies may possibly
 564 be skipped.
 565 Instead of investigating deeper the choice of such a parameter for the medium case, we analyze the
 566 single precision\index{precision!single precision} computation  in a more
 567 realistic case in Sect.~\ref{huge}.
 568 \begin{comment}
 569 The conclusion of the medium case study is that running PROP in single
 570 precision gives relatively stable results provided that suitable
 571 parameter values are used in the FARM program in order to skip the
 572 problematic energies that are too close to the pseudo-state
 573 thresholds.  To verify if this conclusion is still valid with a larger
 574 data set, the single precision\index{precision!single precision} computation is analyzed in a more
 575 realistic case in Sect.~\ref{huge}.
 576 \end{comment}
 577
 578 \subsection{Huge case study}\label{huge}
 579
 580
 581 \begin{figure}[t]
 582   \centering
 583 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
 584 \caption{\label{1s2pHT}1s2p cross-section, threshold = 1.E-4, 210 sectors}
 585 \end{figure}
 586
 587 \begin{figure}[t]
 588   \centering
 589 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
 590 \caption{\label{1s4dHT}1s4d cross-section, threshold = 1.E-4, 210 sectors}
 591 \end{figure}
 592
 593 We study here the impact on FARM of the PROP program run in
 594 single precision\index{precision!single precision} for the huge case (see Table~\ref{data-sets}).
 595 The cross-sections
 596 corresponding to all
 597 atomic target states 1s \dots 7i are explored, which
 598 leads to
 599 406 comparison plots.
 600 It should be noted that in this case, over the same energy range above the ionization threshold, the density of pseudo-state thresholds is significantly increased compared to the medium case.
 601 As expected, all the plots exhibit large differences between single and double
 602 precision cross-sections.
 603 For example Fig.~\ref{1s2pHT} and  \ref{1s4dHT} present the 1s2p and 1s4d cross-sections computed in
 604 single and double precision\index{precision!double precision} for the huge case.
 605 We can conclude that PROP in single precision\index{precision!single precision} gives invalid results
 606 for realistic simulation cases above the ionization threshold.
 607 Therefore the  deployment of PROP on GPU, described in Sect.~\ref{gpu-implem},
 608 has been carried out in double precision\index{precision!double precision}.
 609
 610 \section{Towards a complete deployment of PROP on GPUs}
 611 \label{gpu-implem}
 612
 613 We now detail how PROP has been progressively deployed on
 614 GPUs in double precision\index{precision!double precision} in order to avoid the
 615 expensive memory transfers between the host and the GPU.
 616 Different versions with successive improvements and optimizations are presented.
 617 We use CUDA~\cite{CUDA_ProgGuide} for GPU programming, as well as the
 618 CUBLAS\index{CUBLAS}~\cite{CUBLAS}
 619 and MAGMA \cite{MAGMA} libraries for linear algebra operations.
 620 Since PROP is written in Fortran 90, {\em wrappers\index{wrapper}} in C are used to
 621 enable calls to CUDA kernels from PROP routines.
 622
 623
 624 \subsection{Computing the output $R$-matrix on GPU}
 625 \label{gpu-RO}
 626
 627 \begin{figure}[h]
 628   \centering
 629   \includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/offdiagonal_nb.pdf}
 630   \caption{\label{offdiagonal} The six steps of an off-diagonal sector
 631     evaluation.}
 632 \end{figure}
 633
 634 As mentioned in Algorithm~\ref{prop-algo}, evaluating a sector
 635 mainly consists in constructing local $R$-matrices and in computing
 636 one output $R$-matrix, $\Re^{O}$. In this first step of the porting
 637 process, referred to as GPU V1\label{gpuv1},
 638 we only consider the computation of $\Re^{O}$ on the GPU.
 639 We distinguish the following six steps, related to equations
 640 (\ref{eq_CAPS_1}), (\ref{eq_CAPS_2}) and (\ref{eq_CAPS_3}), and illustrated in
 641 Fig.~\ref{offdiagonal} for an off-diagonal sector.
 642
 643 \begin{description}
 644 \item[Step 1] (``Input copies''):~data are copied from $\Re^{I}$
 645   to temporary arrays ($A$, $U$, $V$) and to $\Re^{O}$.
 646   These copies, along with possible scalings or transpositions, are
 647   implemented as CUDA kernels which can be applied to two
 648   matrices of any size and starting at any offset.
 649   Memory accesses are coalesced\index{coalesced memory accesses} \cite{CUDA_ProgGuide} in order to
 650   provide the best performance for such memory-bound kernels.
 651 \item[Step 2] (``Local copies''):~data are copied from
 652   local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$.
 653   Moreover data from local $R$-matrix
 654   $\mathbf{r}_{II}$
 655   is added to matrix $A$ (via a CUDA kernel) and zeroes are written in
 656    $\Re^{O}$  where required.
 657 \item[Step 3] (``Linear system solving''):~matrix $A$ is factorized
 658   using the MAGMA DGETRF\index{MAGMA functions!DGETRF}
 659    routine and the result is stored in-place.
 660 \item[Step 4] (``Linear system solving'' cont.):~the matrix system
 661  of linear equations  $AW$ = $V$ is solved using the MAGMA DGETRS\index{MAGMA functions!DGETRS}
 662 routine. The  solution is stored in matrix $V$.
 663 \item[Step 5] (``Output matrix product''):~matrix $U$
 664   is multiplied by matrix $V$ using the CUBLAS\index{CUBLAS} DGEMM
 665   routine. The result is stored in a temporary matrix~$t$.
 666 \item[Step 6] (``Output add''):~$t$ is added to $\Re^{O}$ (CUDA
 667   kernel).
 668 \end{description}
 669
 670 All the involved matrices are stored in the GPU memory. Only the
 671 local $R$-matrices are first constructed on the host and then sent
 672 to the GPU memory, since these matrices vary from sector to sector.
 673 The evaluation of the axis and diagonal sectors is similar.
 674 However, fewer operations and copies are required because of
 675 symmetry considerations \cite{2DRMP}.
 676
 677 \subsection{Constructing the local $R$-matrices on GPU}
 678
 679 \begin{figure}[t]
 680  \centering
 681   \includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/amplitudes_nb.pdf}
 682  \caption{\label{amplitudes} Constructing the local $R$-matrix R34
 683  from the $j$ amplitude array associated with edge 4 and the $i$
 684  amplitude array associated with edge~3.}
 685 \end{figure}
 686
 687 Local $R$-matrices are constructed using two three dimensional arrays,
 688 $i$ and $j$. Each three dimensional array contains four
 689 matrices corresponding to the surface amplitudes associated with the
 690 four edges of a sector. Those matrices are named {\em amplitude arrays}.
 691  $j$ amplitude arrays are read from data files and $i$ amplitude arrays
 692 are obtained by scaling each row of the $j$ amplitude arrays.
 693 The main part of the construction of a local $R$-matrix,
 694 presented in Fig.~\ref{amplitudes},
 695 is a matrix product between
 696 one $i$ amplitude array and one transposed $j$ amplitude array
 697 which is performed by a single DGEMM
 698 BLAS\index{BLAS} call.
 699 In this version, hereafter referred to as GPU
 700 V2\label{gpuv2}, $i$ and $j$ amplitude arrays are transferred to the
 701 GPU memory and the required matrix multiplications are performed on
 702 the GPU (via CUBLAS\index{CUBLAS} routines).
 703
 704
 705 The involved matrices having medium sizes (either $3066 \times 383$ or
 706 $5997 \times 383$),
 707 performing these matrix multiplications
 708 on the GPU is expected to be faster than on the CPU.
 709 However, this implies a greater communication volume
 710 between the CPU and the GPU
 711 since the $i$ and $j$ amplitude arrays are larger than the local
 712 $R$-matrices.
 713 It can be noticed that correction data are also used in the
 714 construction of a local $R$-matrix,
 715 but this is a minor part in the
 716 computation. However, these correction data also have to be
 717 transferred from the CPU to the GPU for each sector.
 718
 719 \subsection{Scaling amplitude arrays on GPU}
 720 It
 721 should be worthwhile to try to reduce the CPU-GPU data
 722 transfers of the GPU V2, where the $i$ and $j$ amplitude arrays are
 723 constructed on the host and then sent to the GPU memory for each sector.
 724 In this new version, hereafter referred to as GPU V3\label{gpuv3}, we
 725 transfer only the $j$ amplitude arrays and the
 726 required scaling factors (stored in one 1D array) to the GPU memory,
 727 so that the $i$ amplitude arrays are
 728 then directly computed on the GPU by multiplying the $j$ amplitude
 729 arrays by these scaling factors (via a CUDA kernel).
 730 Therefore, we save the transfer of four $i$ amplitude arrays on
 731 each sector by transferring  only this 1D array of scaling factors.
 732 Moreover, scaling $j$ amplitude arrays is expected to be faster on the
 733 GPU than on the CPU, thanks to the massively parallel architecture of
 734 the GPU and thanks to its higher internal memory bandwidth.
 735
 736 \subsection{Using double-buffering\index{double-buffering} to overlap I/O and computation}
 737
 738 \begin{figure}[t]
 739   \centering
 740   \includegraphics[width=0.8\linewidth]{Chapters/chapter15/figures/C1060_V3_IO_COMP.pdf}
 741   \caption{\label{overlapping} Compute and I/O times for the GPU V3 on
 742     one C1060.}
 743 \end{figure}
 744
 745 As described in Algorithm~\ref{prop-algo}, there are two main steps in
 746 the propagation across a sector: reading  amplitude arrays
 747 and correction data from I/O files and
 748 evaluating the current sector.
 749 Fig.~\ref{overlapping} shows the I/O times and the evaluation times
 750 of each sector for the huge case execution (210 sectors, 20 strips) of the GPU V3
 751 on one C1060.
 752 Whereas the times required by the off-diagonal sectors are similar
 753 within each of the 20 strips,
 754 the times for diagonal sectors of each strip
 755 are the shortest ones, the times for the axis sectors being
 756 intermediate.
 757 The I/O times are roughly constant among all strips.
 758 The evaluation time is equivalent to the I/O
 759 time for the first sectors. But this evaluation time grows
 760 linearly with the strip number, and rapidly exceeds the I/O
 761 time.
 762
 763 It is thus interesting to use a double-buffering\index{double-buffering} technique to overlap the
 764 I/O time with the evaluation time:
 765 for each sector, the evaluation of sector $n$ is performed
 766 (on GPU) simultaneously with the reading of data for sector
 767 $n+1$ (on CPU). This requires the duplication in the CPU memory of all the
 768 data structures
 769 used for storing data read from I/O files for each sector.
 770 This version, hereafter referred to as GPU
 771 V4\label{gpuv4}, uses POSIX threads\index{POSIX threads}. Two threads are
 772 executed concurrently: an I/O thread that reads data from I/O files
 773 for each sector, and a computation thread, dedicated to the propagation
 774 of the global $R$-matrix, that performs successively for each sector
 775 all necessary computations on GPU,
 776 as well as all required CPU-GPU data transfers.
 777 The evaluation of a sector uses the data read for this sector as well
 778 as the global $R$-matrix computed on the previous sector.
 779 This dependency requires synchronizations between the I/O thread and
 780 the computation thread which are implemented through standard POSIX
 781 thread mechanisms.
 782
 783
 784 \subsection{Matrix padding\index{padding}}
 785 The CUBLAS DGEMM
 786 performance and the MAGMA DGETRF\index{MAGMA functions!DGETRF}/DGETRS\index{MAGMA functions!DGETRS}
 787 performance is reduced when the sizes (or
 788 the leading dimensions) of the matrix are not multiples of the inner blocking size \cite{NTD10a}.
 789 This inner blocking size can be 32 or 64, depending on the computation
 790 and on the underlying
 791 GPU architecture \cite{MAGMA,NTD10b}.
 792 In this version (GPU V5\label{gpuv5}),
 793 the matrices are therefore padded with $0.0$ (and $1.0$ on the diagonal for the linear systems)
 794 so that their sizes are
 795 multiples of 64.
 796 This corresponds indeed to the optimal size for the matrix product on the
 797 Fermi architecture \cite{NTD10b}. And as far as linear system solving is
 798 concerned, all the matrices have sizes which are multiples of 383: we
 799 therefore use padding\index{padding} to obtain multiples of 384 (which are
 800 again  multiples of 64).
 801 It can be noticed that this padding\index{padding} has to be performed dynamically
 802 as the matrices increase in size during the propagation
 803 (when possible the
 804  maximum required storage space is however allocated only once in the
 805  GPU memory).
 806
 807
 808
 809 \section{Performance results}
 810 \subsection{PROP deployment on GPU}
 811
 812 \begin{table*}[ht]
 813 \begin{center}
 814 \begin{tabular}{|c||c|c||}
 815  \hline
 816   PROP version & \multicolumn{2}{c|}{Execution time} \\
 817   \hline
 818   \hline
 819   CPU version: 1  core & \multicolumn{2}{c|}{201m32s} \\
 820   \hline
 821   CPU version: 4  cores &  \multicolumn{2}{c|}{113m28s} \\
 822   \hline \hline
 823 GPU version  & C1060 & C2050 \\
 824   \hline\hline
 825   GPU V1 (\S~\ref{gpuv1}) & 79m25s & 66m22s  \\
 826   \hline
 827   GPU V2 (\S~\ref{gpuv2}) & 47m58s & 29m52s \\
 828   \hline
 829   GPU V3 (\S~\ref{gpuv3}) & 41m28s & 23m46s \\
 830   \hline
 831   GPU V4 (\S~\ref{gpuv4}) & 27m21s & 13m55s\\
 832   \hline
 833   GPU V5 (\S~\ref{gpuv5}) & 24m27s & 12m39s  \\
 834   \hline
 835 \end{tabular}
 836 \caption{\label{table:time}
 837 Execution time of PROP on CPU and GPU}
 838 \end{center}
 839 \end{table*}
 840
 841 \begin{comment}
 842 \begin{table*}[ht]
 843 \begin{center}
 844 \begin{tabular}{|c||c|c||}
 845  \hline
 846   PROP version & \multicolumn{2}{c|}{Execution time} \\
 847   \hline  \hline
 848 CPU version & 1 core & 4 cores  \\\hline
 849 & {201m32s} & {113m28s} \\ \hline  \hline
 850 GPU version  & C1060 & C2050 \\
 851   \hline\hline
 852   GPU V1 (\ref{gpuv1}) & 79m25s & 66m22s  \\
 853   \hline
 854   GPU V2 (\ref{gpuv2}) & 47m58s & 29m52s \\
 855   \hline
 856   GPU V3 (\ref{gpuv3}) & 41m28s & 23m46s \\
 857   \hline
 858   GPU V4 (\ref{gpuv4}) & 27m21s & 13m55s\\
 859   \hline
 860   GPU V5 (\ref{gpuv5}) & 24m27s & 12m39s  \\
 861   \hline
 862 \end{tabular}
 863 \caption{\label{table:time}
 864 Execution time of the successive GPU versions}
 865 \end{center}
 866 \end{table*}
 867 \end{comment}
 868
 869 \begin{figure}[h]
 870 \centering
 871 \subfigure[Speedup over 1 CPU core]{
 872 \includegraphics*[width=0.76
 873         \linewidth]{Chapters/chapter15/figures/histo_speedup_1core.pdf}
 874    \label{fig:speedup_1core}
 875  }
 876
 877 \subfigure[Speedup over 4 CPU cores]{
 878 \includegraphics*[width=0.76
 879         \linewidth]{Chapters/chapter15/figures/histo_speedup_4cores.pdf}
 880  \label{fig:speedup_4cores}
 881  }
 882 \label{fig:speedup}
 883 \caption{Speedup of the successive GPU versions.}
 884 \end{figure}
 885
 886 Table~\ref{table:time} presents
 887 the execution times
 888 of PROP on CPUs and GPUs,
 889 each version solves the propagation equations in the
 890 form~(\ref{eq_CAPS_1}-\ref{eq_CAPS_3}) as proposed by CAPS.
 891 Fig.~\ref{fig:speedup_1core} (respectively \ref{fig:speedup_4cores})
 892 shows the speedup of the successive GPU versions
 893 over one CPU core (respectively four CPU cores).
 894 We use here Intel Q8200 quad-core CPUs (2.33 GHz), one C1060 GPU and
 895 one C2050 (Fermi) GPU, located at
 896  UPMC (Universit\'e Pierre et Marie Curie, Paris, France).
 897 As a remark, the execution times measured on the C2050 would be the same
 898 on the C2070 and on  the C2075, the only difference between these GPUs
 899 being their memory size and their TDP (Thermal Design Power)\index{TDP (Thermal Design Power)}.
 900 We emphasize that the execution times correspond to the
 901 complete propagation for all six energies of the large case (see
 902 Table~\ref{data-sets}), that is to say to the complete execution of
 903 the PROP program.
 904 Since energies are independent, execution times for more energies
 905 (e.g. the huge case) should be proportional
 906 to those reported in Table~\ref{table:time}.
 907
 908 These tests, which have been performed with CUDA 3.2, CUBLAS\index{CUBLAS} 3.2 and
 909 MAGMA 0.2,
 910 show that the successive GPU versions of PROP offer
 911 increasing, and at the end interesting, speedups.
 912 More precisely,
 913 V2 shows that it is worth increasing slightly the
 914 CPU-GPU communication volume in order to perform
 915 large enough matrix products on the GPU.
 916  This communication volume can fortunately be
 917 reduced thanks to~V3,
 918 which also accelerates the computation of
 919 amplitude arrays thanks to the GPU.
 920 The
 921 double-buffering\index{double-buffering} technique implemented in V4
 922  effectively enables the overlapping of
 923 I/O operations with computation, while the
 924 padding\index{padding} implemented in V5 also improves the computation times.
 925 It
 926 is noticed that the padding\index{padding}
 927 does offer much more performance gain with,
 928 for example,  CUDA 3.1 and the MAGMA DGEMM\index{MAGMA functions!DGEMM}~\cite{NTD10b}:  the
 929 speedup with respect to one
 930 CPU core was increased from 6.3 to 8.1 on C1060, and from 9.5 to 14.3
 931 on C2050.
 932 Indeed since CUBLAS\index{CUBLAS} 3.2 performance has been improved for non block multiple
 933 matrix sizes through MAGMA code~\cite{NTD10a}.
 934 Although for all versions the C2050 (with its improved
 935 double precision\index{precision!double precision} performance) offers up to almost
 936 double speedup compared to
 937 the C1060, the performance obtained with both architectures justifies the use of
 938 the GPU for such an application.
 939
 940 \subsection{PROP execution profile}
 941
 942 \begin{figure}[t]
 943   \centering
 944  \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/CPU_1_CORE_TIMES.pdf}
 945 \caption{CPU (1 core)  execution times for the off-diagonal sectors of the large case.}
 946 \label{fig:CPU-timing}
 947 \end{figure}
 948
 949 \begin{figure}[t]
 950   \centering
 951   \subfigure[GPU V5 on one C1060]{
 952 \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C1060_TIMES.pdf}
 953 \label{GPU-timing}}
 954
 955   \subfigure[GPU V5 on one C2050]{
 956 \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C2050_TIMES.pdf}
 957 \label{fermi-timing}}
 958
 959 \caption{GPU execution times for the off-diagonal sectors of
 960   the large case.}
 961 \label{fig:profileGPU}
 962 \end{figure}
 963
 964 We detail here the execution profile on
 965 the CPU and the GPU for the evaluation of all off-diagonal sectors
 966 (the most representative ones) for a complete energy propagation.
 967  Fig.~\ref{fig:CPU-timing} and \ref{fig:profileGPU} show CPU and GPU execution times for the
 968 171 off-diagonal sectors of  the large case (see Table \ref{data-sets}).
 969 ``Copying, adding, scaling'' corresponds to the amplitude
 970   array construction (scaling) as well as to steps 1, 2 and 6 in
 971   Sect.~\ref{gpu-RO}, all implemented via CUDA kernels.
 972 ``CPU-GPU transfers''
 973 aggregate transfer times for the $j$ amplitude
 974 arrays and the scaling factors, as well as for the correction data.
 975  ``Amplitude matrix product'' corresponds to the DGEMM call to
 976  construct the local R-matrices from the $i$ and $j$ amplitude
 977  arrays.
 978 ``Linear system solving'' and ``Output matrix product'' correspond
 979 respectively to steps 3-4 and to step 5 in Sect.~\ref{gpu-RO}.
 980
 981 On one CPU core (see  Fig.~\ref{fig:CPU-timing}),
 982  matrix products for the construction of the local
 983 $R$-matrices require the same
 984 computation time during the whole propagation. Likewise the CPU time required by
 985  matrix products for the output $R$-matrix is constant within each
 986  strip. But as the global $R$-matrix is propagated from strip to
 987  strip, the sizes of
 988 the matrices $U$ and $V$ increase, so does their multiplication time.
 989 The time required to solve the linear system increases
 990 slightly during the propagation.
 991 These three operations (``Amplitude matrix product'', ``Output matrix
 992 product'' and ``Linear system solving'') are clearly dominant in terms
 993 of computation
 994 time compared to the other remaining operations, which justify our
 995 primary focus on such three linear algebra operations.
 996
 997
 998 On the C1060 (see Fig.~\ref{GPU-timing}), we have
 999 generally managed to obtain a similar speedup for all operations
1000 (around 8, which corresponds to Fig.~\ref{fig:speedup_1core}). Only the linear system solving
1001 presents a lower speedup (around~4).
1002 The CUDA kernels and the remaining CPU-GPU transfers make a minor contribution
1003 to the overall computation time and do not require
1004 additional improvements.
1005
1006 On the C2050 (see Fig.~\ref{fermi-timing}), additional speedup is
1007 obtained for all operations, except for the
1008 CPU-GPU transfers and the linear system solving.
1009 The CPU-GPU transfers are mainly due to the $j$ amplitude arrays, and
1010 currently still correspond to minor times. When required, the
1011 double-buffering\index{double-buffering} technique may also be used to overlap such transfers with computation on the GPU.
1012
1013
1014
1015 \section{Propagation of multiple concurrent energies on GPU}\index{concurrent kernel execution}
1016
1017 Finally, we present here an improvement that can
1018 benefit from the Fermi architecture, as well as from the newest Kepler
1019 architecture,
1020 both of which enable the concurrent execution of multiple
1021 CUDA kernels\index{concurrent kernel execution}, thus offering additional speedup on
1022 GPUs for small or medium computation grain kernels.
1023 In our case, the performance gain on the GPU is indeed limited
1024 since most matrices have small or medium sizes.
1025 By using multiple streams within one CUDA context~\cite{CUDA_ProgGuide},
1026 we can propagate multiple energies
1027 concurrently\index{concurrent kernel execution} on the Fermi GPU.
1028 It can be noticed that all GPU computations for all streams are
1029 launched by the same host thread. We therefore rely here on the {\em legacy
1030 API} of CUBLAS\index{CUBLAS}~\cite{CUBLAS} (like MAGMA)
1031 without thread safety problems.
1032 A {\em breadth first} issue order is used for kernel
1033 launchs \cite{CUDA_stream}: for a given GPU kernel, all kernel launchs
1034 are indeed issued together in the host thread, using one stream for each
1035 concurrent energy, in order to maximize concurrent kernel
1036 execution\index{concurrent kernel execution}.
1037 Of course, the memory available on the GPU must be large enough to
1038 store all data structures required by each energy.
1039 Moreover, multiple streams are also used within the
1040 propagation of each single energy
1041 in order to enable concurrent executions among the required kernels.
1042
1043
1044 \begin{table}[t]
1045 \begin{center}
1046 \begin{tabular}{|c|c||c|c|c|c|c|}
1047   \hline
1048   \multirow{4}{0.18\linewidth}{Medium case} & Number of &
1049   \multirow{2}{0.07\linewidth}{\centering 1} &
1050   \multirow{2}{0.07\linewidth}{\centering 2} &
1051   \multirow{2}{0.07\linewidth}{\centering 4} &
1052   \multirow{2}{0.07\linewidth}{\centering 8} &
1053   \multirow{2}{0.07\linewidth}{\centering 16} \\
1054   & energies & & & & & \\
1055   \cline{2-7}
1056   & Time (s) &  11.18 & 6.87 & 5.32 & 4.96 & 4.76 \\
1057   \cline{2-7}
1058   & Speedup & - & 1.63 & 2.10 & 2.26 & 2.35 \\
1059   \hline
1060   \hline
1061   \multirow{4}{0.18\linewidth}{Large case} & Number of &
1062   \multirow{2}{0.07\linewidth}{\centering 1} &
1063   \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 2}} &
1064   \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 3}} \\
1065   & energies & & \multicolumn{2}{c|}{~} & \multicolumn{2}{c|}{~}  \\
1066   \cline{2-7}
1067   & Time (s) & 509.51 & \multicolumn{2}{c|}{451.49} & \multicolumn{2}{c|}{436.72}  \\
1068   \cline{2-7}
1069   & Speedup & - & \multicolumn{2}{c|}{1.13} & \multicolumn{2}{c|}{1.17}  \\
1070   \hline
1071 \end{tabular}
1072 \caption{\label{t:perfs_V6} Performance results with multiple
1073   concurrent energies
1074   on one C2070 GPU. GPU initialization times are not considered here. }
1075 \end{center}
1076 \end{table}
1077
1078 In order to have enough GPU memory to run two or three concurrent
1079 energies for the large case, we use one C2070 GPU
1080 (featuring 6GB of memory)
1081 with one Intel X5650 hex-core CPU, CUDA 4.1 and CUBLAS\index{CUBLAS} 4.1, as
1082 well as the latest MAGMA release (version 1.3.0).
1083 Substantial changes have been required
1084 in the MAGMA calls with respect to the previous versions of PROP that were using MAGMA 0.2.
1085  Table~\ref{t:perfs_V6} presents the speedups
1086 obtained on the Fermi GPU for multiple concurrent
1087 energies (up to sixteen since this is the maximum number of concurrent
1088 kernel launches currently supported \cite{CUDA_ProgGuide}).
1089 With the medium case, speedups greater than 2 are obtained with four
1090 concurrent energies or more.
1091 With the more realistic large case, the performance gain is lower mainly because of
1092 the increase in matrix sizes, which implies a better GPU usage
1093 with only one energy on the GPU. The concurrent execution of multiple
1094 kernels\index{concurrent kernel execution} is also limited by other operations on the
1095 GPU \cite{CUDA_ProgGuide,CUDA_stream} and by the current MAGMA code which
1096 prevents concurrent MAGMA calls in different streams.
1097 Better speedups can be here expected on the latest Kepler GPUs which
1098 offer additional compute power, and whose {\em Hyper-Q} feature may help
1099 improve further the GPU utilization with concurrent energies.
1100 On the contrary, the same code on the C1060 shows no speedup
1101  since the concurrent kernel launches are serialized on this previous GPU architecture.
1102
1103
1104
1105
1106
1107
1108
1109
1110 \section{Conclusion and future work}
1111 \label{conclusion}
1112
1113 In this chapter, we have presented our methodology and our results in
1114 the deployment on
1115 a GPU of an application (the PROP program) in atomic physics.
1116
1117 We have started by studying the numerical stability of PROP using
1118 single precision\index{precision!single precision} arithmetic. This has shown that PROP
1119 using single precision\index{precision!single precision}, while relatively stable for some small cases,
1120 gives unsatisfactory results for realistic simulation cases above the
1121 ionization threshold where there is a
1122 significant density of pseudo-states. It is
1123 expected that this may not be the case below the ionization threshold
1124 where the actual target states are less dense. This requires further
1125 investigation.
1126
1127 We have
1128 therefore deployed the PROP code in double precision\index{precision!double precision} on
1129 a GPU, with successive improvements. The different GPU versions
1130 each offer increasing speedups over the CPU version.
1131 Compared to the single (respectively four) core(s) CPU version, the
1132 optimal GPU implementation
1133 gives a speedup of 8.2 (resp. 4.6) on one C1060 GPU,
1134 and a speedup of 15.9 (resp. 9.0) on one
1135 C2050 GPU with improved double precision\index{precision!double precision} performance.
1136 An additional gain of around 15\%
1137 can also be obtained on one Fermi GPU
1138 with large memory (C2070) thanks to concurrent kernel execution.
1139
1140 Such speedups
1141 cannot be directly compared with the 1.15 speedup
1142 obtained with the HMPP\index{HMPP} version, since in our tests the CPUs are
1143 different and the CUBLAS\index{CUBLAS} versions are more recent.
1144 However, the programming effort required
1145 progressively to deploy PROP on GPUs clearly offers improved and interesting speedups for this
1146 real-life   application in double precision\index{precision!double precision} with varying-size matrices.
1147
1148
1149 We are currently working on a hybrid CPU-GPU version that spreads the
1150 computations of the independent energies on both the CPU
1151 and the GPU. This will enable
1152 multiple energy execution on the CPU, with
1153 one or several core(s) dedicated to each energy (via multi-threaded
1154 BLAS\index{BLAS} libraries). Multiple
1155 concurrent energies may also be propagated on each Fermi GPU.
1156 By merging this work with the current MPI PROP program, we will
1157 obtain a scalable hybrid CPU-GPU version.
1158 This final version will offer an additional level of parallelism
1159 thanks to the MPI
1160 standard in order to exploit multiple
1161 nodes with multiple CPU cores and possibly multiple GPU cards.
1162
1163 \putbib[Chapters/chapter15/biblio]
1164