BookGPU/Chapters/chapter15/ch15.tex

   1 \chapterauthor{Pierre Fortin}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   2 \chapterauthor{Rachid Habel}{T\'el\'ecom SudParis}
   3 \chapterauthor{Fabienne J\'ez\'equel}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   4 \chapterauthor{Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, University Paris 6}
   5 \chapterauthor{Stan Scott}{School of Electronics, Electrical Engineering \& Computer Science,
   6 The Queen's University of Belfast}
   7
   8 \newcommand{\fixme}[1]{{\bf #1}}
   9
  10 \chapter{Numerical validation and performance optimization on GPUs of
  11 an application in atomic physics}
  12 \label{chapter15}
  13
  14 \section{Introduction}\label{ch15:intro}
  15 As described in Chapter~\ref{chapter1}, GPUs are characterized by hundreds
  16 of cores and theoretically perform one order of magnitude better than CPUs.
  17 An important factor to consider when programming on GPUs
  18  is the cost of
  19 data transfers between CPU memory and GPU memory. Thus, to have good
  20 performance on
  21 GPUs, applications should be coarse-grained and have a high arithmetic
  22 intensity
  23 ($i.e.$ the ratio of arithmetic operations to memory operations).
  24 Another important aspect of GPU programming is that floating-point
  25 operations are preferably performed in single precision, if the
  26  validity of results is not impacted by that format.
  27 The GPU compute power for floating-point operations is indeed greater in
  28 single precision than in double precision.
  29 The peak performance ratio between single precision and double
  30 precision varies for example for NVIDIA GPUs from $12$ for the first Tesla
  31 GPUs (C1060),
  32 to $2$ for the Fermi GPUs (C2050 and C2070)
  33 and to $3$ for the latest Kepler architecture (K20/K20X).
  34 As far as AMD GPUs are concerned, the latest AMD GPU (Tahiti HD 7970)
  35 presents a ratio of $4$.
  36 Moreover, GPU internal memory accesses and CPU-GPU data transfers are
  37 faster in single precision than in double precision,
  38 because of the different format lengths.
  39
  40 This chapter describes the deployment on GPUs of PROP, a program of the
  41 2DRMP~\cite{FARM_2DRMP,2DRMP} suite which models electron collisions
  42 with H-like atoms and ions at intermediate energies. 2DRMP operates successfully on serial
  43 computers, high performance clusters and supercomputers.  The primary
  44 purpose of the PROP program is to propagate a global
  45 R-matrix~\cite{Burke_1987}, $\Re$, in the two-electron configuration
  46 space.
  47 The propagation needs to be performed for all collision energies,
  48 for instance hundreds of energies,
  49 which are independent.
  50 Propagation equations are dominated by matrix multiplications involving sub-matrices of $\Re$.
  51 However, the matrix multiplications are not
  52 straightforward in the sense that $\Re$ dynamically changes the designation of its rows and
  53 columns and increases in size as the propagation proceeds \cite{VECPAR}.
  54
  55 In a preliminary investigation PROP was selected by GENCI\footnote{GENCI: Grand Equipement National
  56   de Calcul Intensif, \url{www.genci.fr}} and
  57 CAPS\footnote{CAPS is a software company providing products and solutions
  58   for manycore application programming and deployment,
  59   \url{www.caps-entreprise.com}},
  60 following their first call for projects in 2009-2010
  61 aimed at
  62 deploying applications on hybrid systems based on GPUs.
  63 First CAPS
  64 recast the propagation equations with larger matrices.
  65 For matrix products the GPU performance gain over CPU increases indeed
  66 with the matrix size, since the
  67 CPU-GPU transfer overhead becomes less significant and since CPUs are
  68 still more efficient for fine computation grains.
  69 Then, using HMPP\footnote{
  70 HMPP or {\em CAPS compiler}, see: \url{www.caps-entreprise.com/hmpp.html}},
  71 a commercial
  72 hybrid and parallel compiler, CAPS
  73 developed a version of PROP, in
  74 which matrix multiplications are performed on
  75 the GPU or the CPU, depending on the matrix size.
  76 Unfortunately this partial GPU implementation of PROP does not offer
  77 significant acceleration.
  78
  79 The work described in this chapter, which is based on a study presented in \cite{PF_PDSEC2011}, aims at
  80  improving PROP performance on
  81 GPUs by exploring two directions. First, because the original version of PROP is written
  82 in double precision,
  83 we study the numerical stability of PROP in single precision.
  84 Second, we deploy  the whole
  85 computation code of PROP on
  86 GPUs to avoid the overhead generated by
  87 data transfers
  88 and we propose successive improvements
  89 (including a specific one to the Fermi architecture)
  90 in order to optimize the GPU code.
  91
  92
  93
  94
  95 \section{2DRMP and the PROP program}
  96 \label{s:2DRMP_PROP}
  97 \subsection{Principles of R-matrix propagation}
  98 2DRMP~\cite{FARM_2DRMP,2DRMP} is part of the CPC library\footnote{CPC:
  99 Computer Physics Communications,
 100 \url{http://cpc.cs.qub.ac.uk/}}.
 101 It is a suite of seven
 102 programs aimed at creating virtual experiments on high performance and grid
 103 architectures to enable the study of electron scattering from H-like
 104 atoms and ions at intermediate energies.  The 2DRMP suite  uses the
 105 two-dimensional $R$-matrix propagation approach~\cite{Burke_1987}.
 106 In 2DRMP the two-electron configuration  space  ($r_1$,$r_2$) is
 107 divided into sectors.
 108 Figure~\ref{prop} shows the division of the two-electron configuration
 109 space ($r_1$,$r_2$) into 4 vertical $strips$ representing 10 $sectors$.
 110 The key computation in 2DRMP, performed by the PROP program, is the
 111 propagation of a global
 112 $R$-matrix, $\Re$, from sector to sector across the internal region, as shown in Fig.~\ref{prop}.
 113
 114 \begin{figure}[h]
 115 \begin{center}
 116 \includegraphics*[width=0.65\linewidth]{Chapters/chapter15/figures/prop.pdf}
 117 \caption{\label{prop} Subdivision of the configuration space
 118 ($r_1$,$r_2$) into a set of connected sectors.}
 119 \end{center}
 120 \end{figure}
 121
 122 \begin{figure}[h]
 123 \begin{center}
 124 \includegraphics*[width=0.8\linewidth]{Chapters/chapter15/figures/Domain.pdf}
 125 \caption{\label{domain} Propagation of the R-matrix from domain D to domain D'.}
 126 \end{center}
 127 \end{figure}
 128
 129 We consider the general situation in
 130 Fig.~\ref{domain} where we assume that we already know
 131 the global $R$-matrix, $\Re^{I}$, associated with the boundary defined
 132 by edges 5, 2, 1 and 6
 133 in domain $D$ and we wish to
 134 evaluate the new global $R$-matrix, $\Re^{O}$, associated with edges 5, 3, 4 and 6
 135 in domain $D'$ following propagation across subregion $d$.
 136 Input edges are denoted by I (edges 1 and~2), output edges by O (edges 3 and 4) and
 137 common edges by X (edges 5 and~6).
 138 Because of symmetry, only the lower half of domains $D$ and $D'$ has to be considered.
 139 The global $R$-matrices, $\Re^{I}$ in domain $D$ and $\Re^{O}$ in
 140 domain $D'$, can be written as:
 141 \begin{equation}
 142 \Re^{I} = \left(\begin{array}{cc}
 143       \Re_{II}^{I} & \Re_{IX}^{I}\\
 144       \Re_{XI}^{I} & \Re_{XX}^{I}
 145     \end{array}\right)
 146 \ \
 147 \Re^{O} = \left(\begin{array}{cc}
 148       \Re_{OO}^{O} & \Re_{OX}^{O}\\
 149       \Re_{XO}^{O} & \Re_{XX}^{O}
 150     \end{array}\right).
 151 \label{eq:RI_RO}
 152 \end{equation}
 153
 154
 155
 156 From the set of local $R$-matrices, $\mathbf{R}_{ij}$ ($i,j\in \{1,2,3,4\}$)
 157 associated
 158 with subregion $d$, we can define
 159 \begin{subequations}
 160 \begin{eqnarray}
 161     \mathbf{r}_{II} = \left(\begin{array}{cc}
 162       \mathbf{R}_{11} & \mathbf{R}_{12}\\
 163       \mathbf{R}_{21} & \mathbf{R}_{22}
 164     \end{array}\right), \label{eqaa} &
 165     \mathbf{r}_{IO} = \left(\begin{array}{cc}
 166       \mathbf{R}_{13} & \mathbf{R}_{14}\\
 167       \mathbf{R}_{23} & \mathbf{R}_{24}
 168     \end{array}\right), \label{eqbb}\\
 169     \mathbf{r}_{OI} = \left(\begin{array}{cc}
 170       \mathbf{R}_{31} & \mathbf{R}_{32}\\
 171       \mathbf{R}_{41} & \mathbf{R}_{42}
 172     \end{array}\right), \label{eqcc} &
 173    \mathbf{r}_{OO} = \left(\begin{array}{cc}
 174       \mathbf{R}_{33} & \mathbf{R}_{34}\\
 175       \mathbf{R}_{43} & \mathbf{R}_{44}
 176     \end{array}\right),\label{eqdd}
 177 \end{eqnarray}
 178 \end{subequations}
 179 where $I$ represents the input edges 1 and 2, and $O$ represents
 180 the output edges 3 and 4 (see Fig.~\ref{domain}).
 181 The propagation across each sector is characterized by equations~(\ref{eq1}) to (\ref{eq4}).
 182 \begin{subequations}
 183 \begin{eqnarray}
 184 \Re^{O}_{OO} &=& \mathbf{r}_{OO} - \mathbf{r}_{IO}^T (r_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}, \label{eq1} \\
 185 \Re^{O}_{OX} &=& \mathbf{r}_{IO}^T (\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX},  \label{eq2} \\
 186    \Re^{O}_{XO} &=& \Re^{I}_{XI}(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO},  \label{eq3} \\
 187    \Re^{O}_{XX} &=& \Re^{I}_{XX} - \Re^{I}_{XI}(\mathbf{r}_{II} +\Re^{I}_{II})^{-1}\Re^{I}_{IX}. \label{eq4}
 188 \end{eqnarray}
 189 \end{subequations}
 190
 191 The matrix inversions are not explicitly performed. To compute
 192 $(r_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}$ and $(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX}$,
 193 two linear systems are solved.
 194
 195
 196 \medskip
 197
 198 While equations (\ref{eq1})-(\ref{eq4}) can be applied to the
 199 propagation across a general subregion two special situations should be
 200 noted: propagation across a diagonal subregion and propagation across
 201 a subregion bounded by the $r_{1}$-axis at the beginning of a new
 202 strip.
 203
 204 In the case of a diagonal subregion, from symmetry considerations,
 205 edge 2 is identical to edge 1 and edge 3 is identical to edge~4.
 206 Accordingly, with only one input edge and one output edge equations
 207 (\ref{eqaa})-(\ref{eqdd}) become:
 208 \begin{subequations}
 209 \begin{eqnarray}
 210 \mathbf{r}_{II} = 2\mathbf{R}_{11}, \
 211 \mathbf{r}_{IO} = 2\mathbf{R}_{14}, \label{eq4b}\\
 212 \mathbf{r}_{OI} = 2\mathbf{R}_{41}, \
 213 \mathbf{r}_{OO} = 2\mathbf{R}_{44}. \label{eq4d}
 214 \end{eqnarray}
 215 \end{subequations}
 216 In the case of a subregion bounded by the $r_1$-axis at the beginning
 217 of a new strip, we note that the input boundary $I$ consists of only
 218 one edge. When propagating across the first subregion in the second
 219 strip there is no common boundary $X$: in this case only equation
 220 (\ref{eq1}) needs to be solved.
 221
 222 \medskip
 223
 224 Having obtained the global $R$-matrix $\Re$ on the boundary of the
 225 innermost subregion (labeled $0$ in Fig.~\ref{prop}), $\Re$ is propagated across
 226 each subregion in the order indicated in Fig.~\ref{prop},
 227 working systematically from the
 228 $r_1$-axis at the bottom of each strip across all subregions to the
 229 diagonal.
 230
 231
 232
 233 \subsection{Description of the PROP program}
 234 \label{sec:PROP}
 235
 236 \begin{table}[t]
 237 \begin{center}
 238 \begin{tabular}{|c|c|c|c|c|c|}
 239  \hline
 240  \multirow{2}{0.09\linewidth}{\centering Data set} &
 241  \multirow{2}{0.15\linewidth}{\centering
 242   Local $R$-\\matrix size} &
 243  \multirow{2}{0.07\linewidth}{\centering Strips} &
 244  \multirow{2}{0.09\linewidth}{\centering Sectors} &
 245  \multirow{2}{0.19\linewidth}{\centering Final global \\$R$-matrix size} &
 246  \multirow{2}{0.15\linewidth}{\centering Scattering\\energies} \\
 247   & & & & & \\
 248   \hline
 249   Small & 90x90  & 4 & 10 & 360x360 & 6\\
 250  \hline
 251   Medium  & 90x90  & 4 & 10 & 360x360 & 64\\
 252  \hline
 253   Large  & 383x383  & 20 & 210 &  7660x7660 & 6\\
 254  \hline
 255   Huge  & 383x383  & 20 & 210 &  7660x7660 & 64\\ \hline
 256 \end{tabular}
 257 \caption{\label{data-sets}Characteristics of four data sets}
 258 \end{center}
 259 \end{table}
 260
 261 The PROP program computes the propagation of the $R$-matrix across the sectors of the internal region.
 262 Table~\ref{data-sets} shows four different
 263 data sets used in this study and highlights the principal parameters of the PROP program.
 264  PROP execution can be described by Algorithm~\ref{prop-algo}.
 265 First, amplitude arrays and
 266 correction data are read from data files generated by the preceding
 267 program of the 2DRMP suite.
 268 Then, local $R$-matrices are constructed from amplitude arrays.
 269 Correction data is used to compute correction vectors added to the diagonal of the local
 270 $R$-matrices. The local $R$-matrices, together with the input $R$-matrix,
 271  $\Re^{I}$,
 272 computed on the previous sector, are used to compute the current output
 273 $R$-matrix,
 274 $\Re^{O}$.
 275  At the end of a sector evaluation,
 276 the output $R$-matrix becomes  the input $R$-matrix
 277 for the next evaluation.
 278
 279 %% \begin{algorithm}
 280 %% \caption{\label{prop-algo}PROP algorithm}
 281 %% \begin{algorithmic}
 282 %% \FOR{all scattering energies}
 283 %%  \FOR{all sectors}
 284 %%  \STATE Read amplitude arrays
 285 %%  \STATE Read correction data
 286 %% \STATE Construct local $R$-matrices
 287 %% \STATE From $\Re^{I}$ and local $R$-matrices, compute $\Re^{O}$
 288 %% \STATE $\Re^{O}$ becomes $\Re^{I}$ for the next sector
 289 %%  \ENDFOR
 290 %%  \STATE Compute physical $R$-Matrix
 291 %% \ENDFOR
 292 %% \end{algorithmic}
 293 %% \end{algorithm}
 294
 295
 296 On the first sector, there is no input $R$-matrix yet. To bootstrap
 297 the propagation, the first output $R$-matrix is constructed using only
 298 one local $R$-matrix.  On the last sector, that is, on the boundary of
 299 the inner region, a physical $R$-matrix corresponding to the output
 300 $R$-matrix is computed and stored into an output file.
 301
 302 In the PROP program, sectors are characterized into four types,
 303 depending on the computation performed:
 304 \begin{itemize}
 305 \item the starting sector (labeled 0 in Fig.~\ref{prop})
 306 \item the axis sectors (labeled 1, 3 and 6 in Fig.~\ref{prop})
 307 \item the diagonal sectors (labeled 2, 5 and 9 in Fig.~\ref{prop})
 308 \item the off-diagonal sectors (labeled 4, 7 and 8 in Fig.~\ref{prop}).
 309 \end{itemize}
 310
 311
 312 The serial version of PROP is implemented in Fortran~90 and uses
 313 for linear algebra operations BLAS and LAPACK routines
 314 which are fully optimized for x86 architecture.
 315 This
 316 program
 317 serially propagates the $R$-matrix for
 318 all scattering energies.
 319 Since the propagations for these different
 320 energies are independent, there also
 321 exists an embarrassingly parallel version of
 322 PROP
 323 that spreads the computations of
 324 several energies
 325 among multiple CPU nodes via
 326 MPI.
 327
 328
 329
 330 \subsection{CAPS implementation}
 331 \label{caps}
 332
 333 In order to handle larger matrices, and thus obtain better GPU  speedup, CAPS
 334 recast equations (\ref{eq1}) to (\ref{eq4}) into one equation.
 335 The output $R$-matrix $\Re^{O}$ defined by equation~(\ref{eq:RI_RO}) is now computed as follows.
 336 \begin{equation}\label{eq_CAPS_1}
 337 \Re^{O} = \Re^{O^{\ \prime}} + U A^{-1} V,
 338 \end{equation}
 339 \begin{equation}\label{eq_CAPS_2}
 340 {\rm with} \
 341 \Re^{O^{\ \prime}}= \left(\begin{array}{cc}
 342       \mathbf{r}_{OO} & 0\\
 343       0 & \Re^I_{XX}    \end{array}\right), \
 344 U= \left(\begin{array}{c}
 345       \mathbf-{r}_{IO}^{T}\\
 346       \Re^I_{XI}    \end{array}\right),
 347 \end{equation}
 348 \begin{equation}\label{eq_CAPS_3}
 349 A= \mathbf{r}_{II} + \Re^I_{II} \ {\rm and}  \
 350 V= (\mathbf{r}_{IO}\ \ \ -\Re^I_{IX}).
 351 \end{equation}
 352
 353 To compute $W=A^{-1}V$, no matrix inversion is performed. The matrix
 354 system $AW=V$ is solved.
 355 This reimplementation of PROP reduces the number of equations to be
 356 solved and the number of matrix copies for evaluating each sector.
 357 For instance, for an off-diagonal sector,
 358 copies fall from 22 to 5, matrix multiplications from 4 to~1 and calls
 359 to a linear equation solver from 2 to 1.
 360
 361 To implement this version, CAPS
 362 used HMPP, a
 363 commercial hybrid and parallel compiler,
 364 based on compiler directives like the new OpenACC standard\footnote{See: \url{www.openacc-standard.org}}.
 365 If the matrices are large enough (the limit sizes are experimental parameters),
 366 they are multiplied on the GPU, otherwise on the CPU.
 367 CAPS
 368  used the MKL BLAS implementation on an Intel Xeon
 369 x5560 quad core CPU (2.8 GHz)
 370 and the CUBLAS library (CUDA 2.2) on one Tesla C1060 GPU.
 371 On the large data set (see Table~\ref{data-sets}), CAPS
 372  obtained a speedup of 1.15 for the GPU
 373 version over the CPU one (with multi-threaded MKL calls on the four
 374 CPU cores). This limited gain in performance is mainly
 375 due to the use of double precision computation
 376 and to the small or medium sizes of most matrices.
 377 For these matrices, the computation gain on
 378 the GPU is indeed
 379 strongly affected by the overhead
 380 generated by transferring these matrices from
 381 the CPU memory to the GPU memory to perform each matrix multiplication and then
 382 transferring the result back to the CPU memory.
 383
 384 Our goal is to speedup PROP more significantly by porting the whole
 385 code to the GPU and therefore avoiding
 386 the
 387 intermediate data transfers between
 388 the host (CPU) and the GPU. We will also study the
 389 stability of PROP in single precision because
 390 single precision computation is faster on the GPU
 391 and CPU-GPU data transfers are twice as fast as those performed in
 392 double precision.
 393
 394
 395
 396 \section{Numerical validation of PROP in single precision}
 397 \label{single-precision}
 398
 399 \begin{comment}
 400 \begin{table}[h]
 401 \begin{center}
 402 \begin{tabular}{|c|c|}
 403   \hline
 404    relative error interval & \# occurrences \\
 405   \hline
 406    [0, 1.E-8) & 18 \\
 407   \hline
 408    [1.E-8, 1.E-6) & 1241 \\
 409   \hline
 410    [1.E-6, 1.E-4) & 48728 \\
 411   \hline
 412    [1.E-4, 1.E-2) & 184065 \\
 413   \hline
 414    [1.E-2, 1) & 27723 \\
 415   \hline
 416    [1, 100) & 304 \\
 417   \hline
 418    [100, $+\infty$) & 1 \\
 419   \hline
 420 \end{tabular}
 421 \end{center}
 422 \caption{\label{sp-distrib}Error distribution for medium case in single precision}
 423 \end{table}
 424 \end{comment}
 425
 426
 427 Floating-point input data, computation and output data of PROP are
 428 originally in double precision format.
 429 PROP produces a standard $R$-matrix H-file \cite{FARM_2DRMP}
 430  and a collection of Rmat00X files (where X
 431 ranges from 0 to the number of scattering energies - 1)
 432 holding the physical R-matrix for each
 433 energy.
 434 The  H-file  and the Rmat00X files are binary input files of the FARM program \cite{FARM_2DRMP}
 435 (last program of the 2DRMP suite).
 436 Their text equivalent are the  prop.out
 437 and the prop00X.out files.
 438 To study the validity of PROP results in  single precision,
 439 first,
 440 reference results are
 441  generated by running the serial version of PROP in double precision.
 442 Data used in the most costly computation parts are read from input files in
 443 double precision format and then
 444 cast to single precision format.
 445 PROP results  (input of FARM) are computed in single precision and  written
 446 into files in double precision.
 447
 448 \subsection{Medium case study}
 449 \begin{figure}[h]
 450 \begin{center}
 451 \includegraphics*[width=0.9\linewidth]{Chapters/chapter15/figures/error.pdf}
 452 \caption{\label{fig:sp-distrib} Error distribution for medium case in single precision}
 453 \end{center}
 454 \end{figure}
 455
 456 The physical $R$-matrices, in
 457 the prop00X.out files, are compared to the
 458 reference ones for the medium case (see Table~\ref{data-sets}).
 459  The relative
 460 error distribution is
 461 given in Fig.~\ref{fig:sp-distrib}.
 462 We focus on the largest errors.
 463 \begin{itemize}
 464 \item Errors greater than 100: the only impacted value is of order 1.E-6
 465 and is negligible compared to the other ones
 466 in the same prop00X.out file.
 467
 468 \item Errors between 1 and 100: the values corresponding to the
 469   largest errors are of order 1.E-3 and are negligible compared to
 470   the majority of the other values which range between 1.E-2 and
 471   1.E-1.
 472
 473 \item Errors between 1.E-2 and 1: the largest errors ($\ge$ 6\%)
 474   impact values the order of magnitude of which is at most 1.E-1.
 475   These values are negligible.
 476   Relative errors of approximately 5\% impact values the order of
 477   magnitude of which is at most 1.E2.
 478   For instance, the value 164 produced by the reference version of
 479   PROP becomes 172 in the single precision version.
 480
 481 \end{itemize}
 482
 483 To study the impact of the single precision version of PROP on the
 484 FARM program, the cross-section
 485 results files corresponding to
 486 transitions
 487 {1s1s},
 488 {1s2s}, {1s2p}, {1s3s}, {1s3p}, {1s3d},
 489 {1s4s}, {2p4d} are compared to the reference ones.
 490 Table~\ref{sp-farm} shows that all cross-section files are impacted by
 491 errors.  Indeed in the  {2p4d} file,  four relative errors are
 492 greater than one and the maximum relative error is 1.60.
 493 However the largest errors impact negligible values. For example, the maximum
 494 error (1.60) impacts a reference value which is 4.5E-4.  The largest
 495 values are impacted by low errors. For instance, the maximum value
 496 (1.16) is impacted by a relative error of the order 1.E-3.
 497
 498 \begin{table}[t]
 499 \begin{center}
 500 \begin{tabular}{|c|c||c|c|} \hline
 501   file & largest relative error & file & largest relative error\\ \hline
 502  {1s1s} & 0.02& {1s3p} & 0.11  \\ \hline
 503  {1s2s} & 0.06 &  {1s3d} &  0.22 \\ \hline
 504  {1s2p} &  0.08 & {1s4s} &  0.20  \\ \hline
 505  {1s3s} &  0.17 &2p4d & 1.60  \\ \hline
 506 \end{tabular}
 507 \caption{\label{sp-farm}Impact  on FARM  of the single precision version of PROP}
 508 \end{center}
 509 \end{table}
 510
 511 To examine in more detail the impact of PROP on FARM,
 512 cross sections above the ionization threshold (1 Ryd)
 513 are compared in single and
 514 double precision  for
 515 transitions amongst the 1s, \dots 4s, 2p, \dots 4p, 3d, 4d target states.
 516 This comparison is carried out by generating 45 plots.  In all the
 517  plots, results in single and double precision match except for few
 518  scattering energies which are very close to pseudo-state thresholds.
 519 For example Fig.~\ref{1s2p} and \ref{1s4d} present the scattering energies corresponding to the
 520 {1s2p} and {1s4d} cross-sections computed in single and double precision.   For some cross-sections,
 521 increasing a threshold parameter from 1.E-4 to 1.E-3 in the FARM
 522 program
 523 results in energies close to threshold being avoided
 524  and therefore
 525 the cross-sections in double and single precision match more
 526 accurately.
 527 This is the case for instance for cross-section 1s2p (see Fig.~\ref{1s2p3}).
 528 However for other cross-sections (such as 1s4d) some problematic energies remain even if the
 529 threshold parameter in  the FARM
 530 program is increased to 1.E-3 (see Fig.~\ref{1s4d3}).  A higher
 531 threshold parameter would be required for such cross-sections.
 532
 533 \begin{figure}[t]
 534 \centering
 535 \subfigure[threshold = 1.E-4]{
 536 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p.pdf}
 537    \label{1s2p}
 538  }
 539 \subfigure[threshold = 1.E-3]{
 540 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p3.pdf}
 541  \label{1s2p3}
 542  }
 543 \label{fig:1s2p_10sectors}
 544 \caption{1s2p cross-section, 10 sectors}
 545 \end{figure}
 546
 547 \begin{figure}[t]
 548 \centering
 549 \subfigure[threshold = 1.E-4]{
 550 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d.pdf}
 551    \label{1s4d}
 552  }
 553 \subfigure[threshold = 1.E-3]{
 554 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d3.pdf}
 555  \label{1s4d3}
 556  }
 557 \label{fig:1s4d_10sectors}
 558 \caption{1s4d cross-section, 10 sectors}
 559 \end{figure}
 560
 561 As a conclusion, the medium case study shows that the execution of
 562 PROP in single precision leads to a few inexact scattering energies to
 563 be computed by the FARM program for some cross-sections.
 564 Thanks to a suitable threshold parameter in the FARM program these problematic energies may possibly
 565 be skipped.
 566 Instead of investigating deeper the choice of such a parameter for the medium case, we analyze the
 567 single precision computation  in a more
 568 realistic case in Sect.~\ref{huge}.
 569 \begin{comment}
 570 The conclusion of the medium case study is that running PROP in single
 571 precision gives relatively stable results provided that suitable
 572 parameter values are used in the FARM program in order to skip the
 573 problematic energies that are too close to the pseudo-state
 574 thresholds.  To verify if this conclusion is still valid with a larger
 575 data set, the single precision computation is analyzed in a more
 576 realistic case in Sect.~\ref{huge}.
 577 \end{comment}
 578
 579 \subsection{Huge case study}\label{huge}
 580
 581
 582 \begin{figure}[t]
 583   \centering
 584 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
 585 \caption{\label{1s2pHT}1s2p cross-section, threshold = 1.E-4, 210 sectors}
 586 \end{figure}
 587
 588 \begin{figure}[t]
 589   \centering
 590 \includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
 591 \caption{\label{1s4dHT}1s4d cross-section, threshold = 1.E-4, 210 sectors}
 592 \end{figure}
 593
 594 We study here the impact on FARM of the PROP program run in
 595 single precision for the huge case (see Table~\ref{data-sets}).
 596 The cross-sections
 597 corresponding to all
 598 atomic target states 1s \dots 7i are explored, which
 599 leads to
 600 406 comparison plots.
 601 It should be noted that in this case, over the same energy range above the ionization threshold, the density of pseudo-state thresholds is significantly increased compared to the medium case.
 602 As expected, all the plots exhibit large differences between single and double
 603 precision cross-sections.
 604 For example Fig.~\ref{1s2pHT} and  \ref{1s4dHT} present the 1s2p and 1s4d cross-sections computed in
 605 single and double precision for the huge case.
 606 We can conclude that PROP in single precision gives invalid results
 607 for realistic simulation cases above the ionization threshold.
 608 Therefore the  deployment of PROP on GPU, described in Sect.~\ref{gpu-implem},
 609 has been carried out in double precision.
 610
 611 \section{Towards a complete deployment of PROP on GPUs}
 612 \label{gpu-implem}
 613
 614 We now detail how PROP has been progressively deployed on
 615 GPUs in double precision in order to avoid the
 616 expensive memory transfers between the host and the GPU.
 617 Different versions with successive improvements and optimizations are presented.
 618 We use CUDA~\cite{CUDA_ProgGuide} for GPU programming, as well as the
 619 CUBLAS~\cite{CUBLAS}
 620 and MAGMA \cite{MAGMA} libraries for linear algebra operations.
 621 Since PROP is written in Fortran 90, {\em wrappers} in C are used to
 622 enable calls to CUDA kernels from PROP routines.
 623
 624
 625 \subsection{Computing the output $R$-matrix on GPU}
 626 \label{gpu-RO}
 627
 628 \begin{figure}[h]
 629   \centering
 630   \includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/offdiagonal_nb.pdf}
 631   \caption{\label{offdiagonal} The six steps of an off-diagonal sector
 632     evaluation.}
 633 \end{figure}
 634
 635 As mentioned in Algorithm~\ref{prop-algo}, evaluating a sector
 636 mainly consists in constructing local $R$-matrices and in computing
 637 one output $R$-matrix, $\Re^{O}$. In this first step of the porting
 638 process, referred to as GPU V1\label{gpuv1},
 639 we only consider the computation of $\Re^{O}$ on the GPU.
 640 We distinguish the following six steps, related to equations
 641 (\ref{eq_CAPS_1}), (\ref{eq_CAPS_2}) and (\ref{eq_CAPS_3}), and illustrated in
 642 Fig.~\ref{offdiagonal} for an off-diagonal sector.
 643
 644 \begin{description}
 645 \item[Step 1] (``Input copies''):~data are copied from $\Re^{I}$
 646   to temporary arrays ($A$, $U$, $V$) and to $\Re^{O}$.
 647   These copies, along with possible scalings or transpositions, are
 648   implemented as CUDA kernels which can be applied to two
 649   matrices of any size and starting at any offset.
 650   Memory accesses are coalesced \cite{CUDA_ProgGuide} in order to
 651   provide the best performance for such memory-bound kernels.
 652 \item[Step 2] (``Local copies''):~data are copied from
 653   local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$.
 654   Moreover data from local $R$-matrix
 655   $\mathbf{r}_{II}$
 656   is added to matrix $A$ (via a CUDA kernel) and zeroes are written in
 657    $\Re^{O}$  where required.
 658 \item[Step 3] (``Linear system solving''):~matrix $A$ is factorized
 659   using the MAGMA DGETRF
 660    routine and the result is stored in-place.
 661 \item[Step 4] (``Linear system solving'' cont.):~the matrix system
 662  of linear equations  $AW$ = $V$ is solved using the MAGMA DGETRS
 663 routine. The  solution is stored in matrix $V$.
 664 \item[Step 5] (``Output matrix product''):~matrix $U$
 665   is multiplied by matrix $V$ using the CUBLAS DGEMM
 666   routine. The result is stored in a temporary matrix~$t$.
 667 \item[Step 6] (``Output add''):~$t$ is added to $\Re^{O}$ (CUDA
 668   kernel).
 669 \end{description}
 670
 671 All the involved matrices are stored in the GPU memory. Only the
 672 local $R$-matrices are first constructed on the host and then sent
 673 to the GPU memory, since these matrices vary from sector to sector.
 674 The evaluation of the axis and diagonal sectors is similar.
 675 However, fewer operations and copies are required because of
 676 symmetry considerations \cite{2DRMP}.
 677
 678 \subsection{Constructing the local $R$-matrices on GPU}
 679
 680 \begin{figure}[t]
 681  \centering
 682   \includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/amplitudes_nb.pdf}
 683  \caption{\label{amplitudes} Constructing the local $R$-matrix R34
 684  from the $j$ amplitude array associated with edge 4 and the $i$
 685  amplitude array associated with edge~3.}
 686 \end{figure}
 687
 688 Local $R$-matrices are constructed using two three dimensional arrays,
 689 $i$ and $j$. Each three dimensional array contains four
 690 matrices corresponding to the surface amplitudes associated with the
 691 four edges of a sector. Those matrices are named {\em amplitude arrays}.
 692  $j$ amplitude arrays are read from data files and $i$ amplitude arrays
 693 are obtained by scaling each row of the $j$ amplitude arrays.
 694 The main part of the construction of a local $R$-matrix,
 695 presented in Fig.~\ref{amplitudes},
 696 is a matrix product between
 697 one $i$ amplitude array and one transposed $j$ amplitude array
 698 which is performed by a single DGEMM
 699 BLAS call.
 700 In this version, hereafter referred to as GPU
 701 V2\label{gpuv2}, $i$ and $j$ amplitude arrays are transferred to the
 702 GPU memory and the required matrix multiplications are performed on
 703 the GPU (via CUBLAS routines).
 704
 705
 706 The involved matrices having medium sizes (either $3066 \times 383$ or
 707 $5997 \times 383$),
 708 performing these matrix multiplications
 709 on the GPU is expected to be faster than on the CPU.
 710 However, this implies a greater communication volume
 711 between the CPU and the GPU
 712 since the $i$ and $j$ amplitude arrays are larger than the local
 713 $R$-matrices.
 714 It can be noticed that correction data are also used in the
 715 construction of a local $R$-matrix,
 716 but this is a minor part in the
 717 computation. However, these correction data also have to be
 718 transferred from the CPU to the GPU for each sector.
 719
 720 \subsection{Scaling amplitude arrays on GPU}
 721 It
 722 should be worthwhile to try to reduce the CPU-GPU data
 723 transfers of the GPU V2, where the $i$ and $j$ amplitude arrays are
 724 constructed on the host and then sent to the GPU memory for each sector.
 725 In this new version, hereafter referred to as GPU V3\label{gpuv3}, we
 726 transfer only the $j$ amplitude arrays and the
 727 required scaling factors (stored in one 1D array) to the GPU memory,
 728 so that the $i$ amplitude arrays are
 729 then directly computed on the GPU by multiplying the $j$ amplitude
 730 arrays by these scaling factors (via a CUDA kernel).
 731 Therefore, we save the transfer of four $i$ amplitude arrays on
 732 each sector by transferring  only this 1D array of scaling factors.
 733 Moreover, scaling $j$ amplitude arrays is expected to be faster on the
 734 GPU than on the CPU, thanks to the massively parallel architecture of
 735 the GPU and thanks to its higher internal memory bandwidth.
 736
 737 \subsection{Using double-buffering to overlap I/O and computation}
 738
 739 \begin{figure}[t]
 740   \centering
 741   \includegraphics[width=0.8\linewidth]{Chapters/chapter15/figures/C1060_V3_IO_COMP.pdf}
 742   \caption{\label{overlapping} Compute and I/O times for the GPU V3 on
 743     one C1060.}
 744 \end{figure}
 745
 746 As described in Algorithm~\ref{prop-algo}, there are two main steps in
 747 the propagation across a sector: reading  amplitude arrays
 748 and correction data from I/O files and
 749 evaluating the current sector.
 750 Fig.~\ref{overlapping} shows the I/O times and the evaluation times
 751 of each sector for the huge case execution (210 sectors, 20 strips) of the GPU V3
 752 on one C1060.
 753 Whereas the times required by the off-diagonal sectors are similar
 754 within each of the 20 strips,
 755 the times for diagonal sectors of each strip
 756 are the shortest ones, the times for the axis sectors being
 757 intermediate.
 758 The I/O times are roughly constant among all strips.
 759 The evaluation time is equivalent to the I/O
 760 time for the first sectors. But this evaluation time grows
 761 linearly with the strip number, and rapidly exceeds the I/O
 762 time.
 763
 764 It is thus interesting to use a double-buffering technique to overlap the
 765 I/O time with the evaluation time:
 766 for each sector, the evaluation of sector $n$ is performed
 767 (on GPU) simultaneously with the reading of data for sector
 768 $n+1$ (on CPU). This requires the duplication in the CPU memory of all the
 769 data structures
 770 used for storing data read from I/O files for each sector.
 771 This version, hereafter referred to as GPU
 772 V4\label{gpuv4}, uses POSIX threads. Two threads are
 773 executed concurrently: an I/O thread that reads data from I/O files
 774 for each sector, and a computation thread, dedicated to the propagation
 775 of the global $R$-matrix, that performs successively for each sector
 776 all necessary computations on GPU,
 777 as well as all required CPU-GPU data transfers.
 778 The evaluation of a sector uses the data read for this sector as well
 779 as the global $R$-matrix computed on the previous sector.
 780 This dependency requires synchronizations between the I/O thread and
 781 the computation thread which are implemented through standard POSIX
 782 thread mechanisms.
 783
 784
 785 \subsection{Matrix padding}
 786 The CUBLAS DGEMM
 787 performance and the MAGMA DGETRF/DGETRS
 788 performance is reduced when the sizes (or
 789 the leading dimensions) of the matrix are not multiples of the inner blocking size \cite{NTD10a}.
 790 This inner blocking size can be 32 or 64, depending on the computation
 791 and on the underlying
 792 GPU architecture \cite{MAGMA,NTD10b}.
 793 In this version (GPU V5\label{gpuv5}),
 794 the matrices are therefore padded with $0.0$ (and $1.0$ on the diagonal for the linear systems)
 795 so that their sizes are
 796 multiples of 64.
 797 This corresponds indeed to the optimal size for the matrix product on the
 798 Fermi architecture \cite{NTD10b}. And as far as linear system solving is
 799 concerned, all the matrices have sizes which are multiples of 383: we
 800 therefore use padding to obtain multiples of 384 (which are
 801 again  multiples of 64).
 802 It can be noticed that this padding has to be performed dynamically
 803 as the matrices increase in size during the propagation
 804 (when possible the
 805  maximum required storage space is however allocated only once in the
 806  GPU memory).
 807
 808
 809
 810 \section{Performance results}
 811 \subsection{PROP deployment on GPU}
 812
 813 \begin{table*}[ht]
 814 \begin{center}
 815 \begin{tabular}{|c||c|c||}
 816  \hline
 817   PROP version & \multicolumn{2}{c|}{Execution time} \\
 818   \hline
 819   \hline
 820   CPU version: 1  core & \multicolumn{2}{c|}{201m32s} \\
 821   \hline
 822   CPU version: 4  cores &  \multicolumn{2}{c|}{113m28s} \\
 823   \hline \hline
 824 GPU version  & C1060 & C2050 \\
 825   \hline\hline
 826   GPU V1 (\S~\ref{gpuv1}) & 79m25s & 66m22s  \\
 827   \hline
 828   GPU V2 (\S~\ref{gpuv2}) & 47m58s & 29m52s \\
 829   \hline
 830   GPU V3 (\S~\ref{gpuv3}) & 41m28s & 23m46s \\
 831   \hline
 832   GPU V4 (\S~\ref{gpuv4}) & 27m21s & 13m55s\\
 833   \hline
 834   GPU V5 (\S~\ref{gpuv5}) & 24m27s & 12m39s  \\
 835   \hline
 836 \end{tabular}
 837 \caption{\label{table:time}
 838 Execution time of PROP on CPU and GPU}
 839 \end{center}
 840 \end{table*}
 841
 842 \begin{comment}
 843 \begin{table*}[ht]
 844 \begin{center}
 845 \begin{tabular}{|c||c|c||}
 846  \hline
 847   PROP version & \multicolumn{2}{c|}{Execution time} \\
 848   \hline  \hline
 849 CPU version & 1 core & 4 cores  \\\hline
 850 & {201m32s} & {113m28s} \\ \hline  \hline
 851 GPU version  & C1060 & C2050 \\
 852   \hline\hline
 853   GPU V1 (\ref{gpuv1}) & 79m25s & 66m22s  \\
 854   \hline
 855   GPU V2 (\ref{gpuv2}) & 47m58s & 29m52s \\
 856   \hline
 857   GPU V3 (\ref{gpuv3}) & 41m28s & 23m46s \\
 858   \hline
 859   GPU V4 (\ref{gpuv4}) & 27m21s & 13m55s\\
 860   \hline
 861   GPU V5 (\ref{gpuv5}) & 24m27s & 12m39s  \\
 862   \hline
 863 \end{tabular}
 864 \caption{\label{table:time}
 865 Execution time of the successive GPU versions}
 866 \end{center}
 867 \end{table*}
 868 \end{comment}
 869
 870 \begin{figure}[h]
 871 \centering
 872 \subfigure[Speedup over 1 CPU core]{
 873 \includegraphics*[width=0.76
 874         \linewidth]{Chapters/chapter15/figures/histo_speedup_1core.pdf}
 875    \label{fig:speedup_1core}
 876  }
 877
 878 \subfigure[Speedup over 4 CPU cores]{
 879 \includegraphics*[width=0.76
 880         \linewidth]{Chapters/chapter15/figures/histo_speedup_4cores.pdf}
 881  \label{fig:speedup_4cores}
 882  }
 883 \label{fig:speedup}
 884 \caption{Speedup of the successive GPU versions.}
 885 \end{figure}
 886
 887 Table~\ref{table:time} presents
 888 the execution times
 889 of PROP on CPUs and GPUs,
 890 each version solves the propagation equations in the
 891 form~(\ref{eq_CAPS_1}-\ref{eq_CAPS_3}) as proposed by CAPS.
 892 Fig.~\ref{fig:speedup_1core} (respectively \ref{fig:speedup_4cores})
 893 shows the speedup of the successive GPU versions
 894 over one CPU core (respectively four CPU cores).
 895 We use here Intel Q8200 quad-core CPUs (2.33 GHz), one C1060 GPU and
 896 one C2050 (Fermi) GPU, located at
 897  UPMC (Universit\'e Pierre et Marie Curie, Paris, France).
 898 As a remark, the execution times measured on the C2050 would be the same
 899 on the C2070 and on  the C2075, the only difference between these GPUs
 900 being their memory size and their TDP (Thermal Design Power).
 901 We emphasize that the execution times correspond to the
 902 complete propagation for all six energies of the large case (see
 903 Table~\ref{data-sets}), that is to say to the complete execution of
 904 the PROP program.
 905 Since energies are independent, execution times for more energies
 906 (e.g. the huge case) should be proportional
 907 to those reported in Table~\ref{table:time}.
 908
 909 These tests, which have been performed with CUDA 3.2, CUBLAS 3.2 and
 910 MAGMA 0.2,
 911 show that the successive GPU versions of PROP offer
 912 increasing, and at the end interesting, speedups.
 913 More precisely,
 914 V2 shows that it is worth increasing slightly the
 915 CPU-GPU communication volume in order to perform
 916 large enough matrix products on the GPU.
 917  This communication volume can fortunately be
 918 reduced thanks to~V3,
 919 which also accelerates the computation of
 920 amplitude arrays thanks to the GPU.
 921 The
 922 double-buffering technique implemented in V4
 923  effectively enables the overlapping of
 924 I/O operations with computation, while the
 925 padding implemented in V5 also improves the computation times.
 926 It
 927 is noticed that the padding
 928 does offer much more performance gain with,
 929 for example,  CUDA 3.1 and the MAGMA DGEMM~\cite{NTD10b}:  the
 930 speedup with respect to one
 931 CPU core was increased from 6.3 to 8.1 on C1060, and from 9.5 to 14.3
 932 on C2050.
 933 Indeed since CUBLAS 3.2 performance has been improved for non block multiple
 934 matrix sizes through MAGMA code~\cite{NTD10a}.
 935 Although for all versions the C2050 (with its improved
 936 double precision performance) offers up to almost
 937 double speedup compared to
 938 the C1060, the performance obtained with both architectures justifies the use of
 939 the GPU for such an application.
 940
 941 \subsection{PROP execution profile}
 942
 943 \begin{figure}[t]
 944   \centering
 945  \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/CPU_1_CORE_TIMES.pdf}
 946 \caption{CPU (1 core)  execution times for the off-diagonal sectors of the large case.}
 947 \label{fig:CPU-timing}
 948 \end{figure}
 949
 950 \begin{figure}[t]
 951   \centering
 952   \subfigure[GPU V5 on one C1060]{
 953 \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C1060_TIMES.pdf}
 954 \label{GPU-timing}}
 955
 956   \subfigure[GPU V5 on one C2050]{
 957 \includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C2050_TIMES.pdf}
 958 \label{fermi-timing}}
 959
 960 \caption{GPU execution times for the off-diagonal sectors of
 961   the large case.}
 962 \label{fig:profileGPU}
 963 \end{figure}
 964
 965 We detail here the execution profile on
 966 the CPU and the GPU for the evaluation of all off-diagonal sectors
 967 (the most representative ones) for a complete energy propagation.
 968  Fig.~\ref{fig:CPU-timing} and \ref{fig:profileGPU} show CPU and GPU execution times for the
 969 171 off-diagonal sectors of  the large case (see Table \ref{data-sets}).
 970 ``Copying, adding, scaling'' corresponds to the amplitude
 971   array construction (scaling) as well as to steps 1, 2 and 6 in
 972   Sect.~\ref{gpu-RO}, all implemented via CUDA kernels.
 973 ``CPU-GPU transfers''
 974 aggregate transfer times for the $j$ amplitude
 975 arrays and the scaling factors, as well as for the correction data.
 976  ``Amplitude matrix product'' corresponds to the DGEMM call to
 977  construct the local R-matrices from the $i$ and $j$ amplitude
 978  arrays.
 979 ``Linear system solving'' and ``Output matrix product'' correspond
 980 respectively to steps 3-4 and to step 5 in Sect.~\ref{gpu-RO}.
 981
 982 On one CPU core (see  Fig.~\ref{fig:CPU-timing}),
 983  matrix products for the construction of the local
 984 $R$-matrices require the same
 985 computation time during the whole propagation. Likewise the CPU time required by
 986  matrix products for the output $R$-matrix is constant within each
 987  strip. But as the global $R$-matrix is propagated from strip to
 988  strip, the sizes of
 989 the matrices $U$ and $V$ increase, so does their multiplication time.
 990 The time required to solve the linear system increases
 991 slightly during the propagation.
 992 These three operations (``Amplitude matrix product'', ``Output matrix
 993 product'' and ``Linear system solving'') are clearly dominant in terms
 994 of computation
 995 time compared to the other remaining operations, which justify our
 996 primary focus on such three linear algebra operations.
 997
 998
 999 On the C1060 (see Fig.~\ref{GPU-timing}), we have
1000 generally managed to obtain a similar speedup for all operations
1001 (around 8, which corresponds to Fig.~\ref{fig:speedup_1core}). Only the linear system solving
1002 presents a lower speedup (around~4).
1003 The CUDA kernels and the remaining CPU-GPU transfers make a minor contribution
1004 to the overall computation time and do not require
1005 additional improvements.
1006
1007 On the C2050 (see Fig.~\ref{fermi-timing}), additional speedup is
1008 obtained for all operations, except for the
1009 CPU-GPU transfers and the linear system solving.
1010 The CPU-GPU transfers are mainly due to the $j$ amplitude arrays, and
1011 currently still correspond to minor times. When required, the
1012 double-buffering technique may also be used to overlap such transfers
1013 with computation on the GPU.
1014
1015
1016
1017 \section{Propagation of multiple concurrent energies on GPU}
1018
1019 Finally, we present here an improvement that can
1020 benefit from the Fermi architecture, as well as from the newest Kepler
1021 architecture,
1022 both of which enable the concurrent execution of multiple
1023 CUDA kernels, thus offering additional speedup on
1024 GPUs for small or medium computation grain kernels.
1025 In our case, the performance gain on the GPU is indeed limited
1026 since most matrices have small or medium sizes.
1027 By using multiple streams within one CUDA context~\cite{CUDA_ProgGuide},
1028 we can propagate multiple energies
1029 concurrently on the Fermi GPU.
1030 It can be noticed that all GPU computations for all streams are
1031 launched by the same host thread. We therefore rely here on the {\em legacy
1032 API} of CUBLAS~\cite{CUBLAS} (like MAGMA)
1033 without thread safety problems.
1034 A {\em breadth first} issue order is used for kernel
1035 launchs \cite{CUDA_stream}: for a given GPU kernel, all kernel launchs
1036 are indeed issued together in the host thread, using one stream for each
1037 concurrent energy, in order to maximize concurrent kernel
1038 execution.
1039 Of course, the memory available on the GPU must be large enough to
1040 store all data structures required by each energy.
1041 Moreover, multiple streams are also used within the
1042 propagation of each single energy
1043 in order to enable concurrent executions among the required kernels.
1044
1045
1046 \begin{table}[t]
1047 \begin{center}
1048 \begin{tabular}{|c|c||c|c|c|c|c|}
1049   \hline
1050   \multirow{4}{0.18\linewidth}{Medium case} & Number of &
1051   \multirow{2}{0.07\linewidth}{\centering 1} &
1052   \multirow{2}{0.07\linewidth}{\centering 2} &
1053   \multirow{2}{0.07\linewidth}{\centering 4} &
1054   \multirow{2}{0.07\linewidth}{\centering 8} &
1055   \multirow{2}{0.07\linewidth}{\centering 16} \\
1056   & energies & & & & & \\
1057   \cline{2-7}
1058   & Time (s) &  11.18 & 6.87 & 5.32 & 4.96 & 4.76 \\
1059   \cline{2-7}
1060   & Speedup & - & 1.63 & 2.10 & 2.26 & 2.35 \\
1061   \hline
1062   \hline
1063   \multirow{4}{0.18\linewidth}{Large case} & Number of &
1064   \multirow{2}{0.07\linewidth}{\centering 1} &
1065   \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 2}} &
1066   \multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 3}} \\
1067   & energies & & \multicolumn{2}{c|}{~} & \multicolumn{2}{c|}{~}  \\
1068   \cline{2-7}
1069   & Time (s) & 509.51 & \multicolumn{2}{c|}{451.49} & \multicolumn{2}{c|}{436.72}  \\
1070   \cline{2-7}
1071   & Speedup & - & \multicolumn{2}{c|}{1.13} & \multicolumn{2}{c|}{1.17}  \\
1072   \hline
1073 \end{tabular}
1074 \caption{\label{t:perfs_V6} Performance results with multiple
1075   concurrent energies
1076   on one C2070 GPU. GPU initialization times are not considered here. }
1077 \end{center}
1078 \end{table}
1079
1080 In order to have enough GPU memory to run two or three concurrent
1081 energies for the large case, we use one C2070 GPU
1082 (featuring 6GB of memory)
1083 with one Intel X5650 hex-core CPU, CUDA 4.1 and CUBLAS 4.1, as
1084 well as the latest MAGMA release (version 1.3.0).
1085 Substantial changes have been required
1086 in the MAGMA calls with respect to the previous versions of PROP that were using MAGMA 0.2.
1087  Table~\ref{t:perfs_V6} presents the speedups
1088 obtained on the Fermi GPU for multiple concurrent
1089 energies (up to sixteen since this is the maximum number of concurrent
1090 kernel launches currently supported \cite{CUDA_ProgGuide}).
1091 With the medium case, speedups greater than 2 are obtained with four
1092 concurrent energies or more.
1093 With the more realistic large case, the performance gain is lower mainly because of
1094 the increase in matrix sizes, which implies a better GPU usage
1095 with only one energy on the GPU. The concurrent execution of multiple
1096 kernels is also limited by other operations on the
1097 GPU \cite{CUDA_ProgGuide,CUDA_stream} and by the current MAGMA code which
1098 prevents concurrent MAGMA calls in different streams.
1099 Better speedups can be here expected on the latest Kepler GPUs which
1100 offer additional compute power, and whose {\em Hyper-Q} feature may help
1101 improve further the GPU utilization with concurrent energies.
1102 On the contrary, the same code on the C1060 shows no speedup
1103  since the concurrent kernel launches are
1104 serialized on this previous GPU architecture.
1105
1106
1107
1108
1109
1110
1111
1112
1113 \section{Conclusion and future work}
1114 \label{conclusion}
1115
1116
1117 In this chapter, we have presented our methodology and our results in
1118 the deployment on
1119 a GPU of an application (the PROP program) in atomic physics.
1120
1121 We have started by studying the numerical stability of PROP using
1122 single precision arithmetic. This has shown that PROP
1123 using single precision, while relatively stable for some small cases,
1124 gives unsatisfactory results for realistic simulation cases above the
1125 ionization threshold where there is a
1126 significant density of pseudo-states. It is
1127 expected that this may not be the case below the ionization threshold
1128 where the actual target states are less dense. This requires further
1129 investigation.
1130
1131 We have
1132 therefore deployed the PROP code in double precision on
1133 a GPU, with successive improvements. The different GPU versions
1134 each offer increasing speedups over the CPU version.
1135 Compared to the single (respectively four) core(s) CPU version, the
1136 optimal GPU implementation
1137 gives a speedup of 8.2 (resp. 4.6) on one C1060 GPU,
1138 and a speedup of 15.9 (resp. 9.0) on one
1139 C2050 GPU with improved double precision performance.
1140 An additional gain of around 15\%
1141 can also be obtained on one Fermi GPU
1142 with large memory (C2070) thanks to concurrent kernel execution.
1143
1144 Such speedups
1145 cannot be directly compared with the 1.15 speedup
1146 obtained with the HMPP version, since in our tests the CPUs are
1147 different and the CUBLAS versions are more recent.
1148 However, the programming effort required
1149 progressively to deploy PROP on GPUs clearly offers improved and interesting speedups for this
1150 real-life   application in double precision with varying-size matrices.
1151
1152
1153 We are currently working on a hybrid CPU-GPU version that spreads the
1154 computations of the independent energies on both the CPU
1155 and the GPU. This will enable
1156 multiple energy execution on the CPU, with
1157 one or several core(s) dedicated to each energy (via multi-threaded
1158 BLAS libraries). Multiple
1159 concurrent energies may also be propagated on each Fermi GPU.
1160 By merging this work with the current MPI PROP program, we will
1161 obtain a scalable hybrid CPU-GPU version.
1162 This final version will offer an additional level of parallelism
1163 thanks to the MPI
1164 standard in order to exploit multiple
1165 nodes with multiple CPU cores and possibly multiple GPU cards.
1166
1167 \putbib[Chapters/chapter15/biblio]
1168