\chapterauthor{Pierre Fortin}{Laboratoire d'Informatique de Paris 6, University Paris 6}
\chapterauthor{Rachid Habel}{T\'el\'ecom SudParis}
\chapterauthor{Fabienne J\'ez\'equel}{Laboratoire d'Informatique de Paris 6, University Paris 6}
\chapterauthor{Jean-Luc Lamotte}{Laboratoire d'Informatique de Paris 6, University Paris 6}
\chapterauthor{Stan Scott}{School of Electronics, Electrical Engineering \& Computer Science,
The Queen's University of Belfast}

%\newcommand{\fixme}[1]{{\bf #1}}

\chapter[Numerical validation and performance optimization on GPUs in atomic physics]{Numerical validation and performance optimization on GPUs of an application in atomic physics}

\section{Introduction}\label{ch15:intro}
As described in Chapter~\ref{chapter1}, GPUs are characterized by hundreds
of cores and theoretically perform one order of magnitude better than CPUs.
An important factor to consider when programming on GPUs
is the cost of
data transfers between CPU memory and GPU memory. Thus, to have good
performance on
GPUs, applications should be coarse-grained and have a high arithmetic
intensity
(i.e., the ratio of arithmetic operations to memory operations).
Another important aspect of GPU programming is that floating-point
operations are preferably performed in single precision\index{precision!single precision}, if the
validity of the results is not impacted by that format.
The GPU compute power for floating-point operations is indeed greater in
single precision\index{precision!single precision} than in double precision\index{precision!double precision}.
The peak performance ratio between single precision\index{precision!single precision} and double
precision varies for example for NVIDIA GPUs from $12$ for the first Tesla
GPUs,
to $2$ for the Fermi GPUs (C2050 and C2070)
and to $3$ for the latest Kepler architecture (K20/K20X).
As far as AMD GPUs are concerned, the latest AMD GPU (Tahiti HD 7970)
presents a ratio of $4$.
Moreover, GPU internal memory accesses and CPU-GPU data transfers are
faster in single precision\index{precision!single precision} than in double precision\index{precision!double precision},
because of the different format lengths.

This chapter describes the deployment on GPUs of PROP, a program of the
2DRMP~\cite{FARM_2DRMP,2DRMP} suite which models electron collisions
with H-like atoms and ions at intermediate energies. 2DRMP operates successfully on serial
computers, high performance clusters and supercomputers. The primary
purpose of the PROP program is to propagate a global
$R$-matrix~\cite{Burke_1987}, $\Re$, in the two-electron configuration
space.
The propagation needs to be performed for all collision energies,
for instance hundreds of energies,
which are independent.
Propagation equations are dominated by matrix multiplications involving sub-matrices of $\Re$.
However, the matrix multiplications are not
straightforward in the sense that $\Re$ dynamically changes the designation of its rows and
columns and increases in size as the propagation proceeds \cite{VECPAR}.

In a preliminary investigation PROP was selected by GENCI\footnote{GENCI: Grand Equipement National
de Calcul Intensif, \url{www.genci.fr}} and
CAPS\footnote{CAPS is a software company providing products and solutions
for manycore application programming and deployment,
\url{www.caps-entreprise.com}},
following their first call for projects in 2009-2010
aimed at deploying applications on hybrid systems based on GPUs.
In this early attempt, CAPS recast the propagation equations with larger matrices.
For matrix products the GPU performance gain over the CPU indeed increases
with the matrix size, since the
CPU-GPU transfer overhead becomes less significant and since CPUs are
still more efficient for fine computation grains.
Then, using HMPP\index{HMPP}\footnote{
HMPP or {\em CAPS compiler}, see: \url{www.caps-entreprise.com/hmpp.html}},
a commercial
hybrid and parallel compiler, CAPS
developed a version of PROP, in
which matrix multiplications are performed on
the GPU or the CPU, depending on the matrix size.
Unfortunately this partial GPU implementation of PROP does not offer
significant acceleration.

The work described in this chapter, which is based on a study presented in \cite{PF_PDSEC2011}, aims at
improving PROP performance on
GPUs by exploring two directions. First, because the original version of PROP is written
in double precision\index{precision!double precision},
we study the numerical stability of PROP in single precision\index{precision!single precision}.
Second, we deploy the whole
computation code of PROP on
GPUs to avoid the overhead generated by
CPU-GPU data transfers,
and we propose successive improvements
(including one specific to the Fermi architecture)
in order to optimize the GPU code.

\section{2DRMP and the PROP program}

\subsection{Principles of R-matrix propagation}
2DRMP~\cite{FARM_2DRMP,2DRMP} is part of the CPC library\footnote{CPC:
Computer Physics Communications,
\url{http://cpc.cs.qub.ac.uk/}}.
It is a suite of seven
programs aimed at creating virtual experiments on high performance and grid
architectures to enable the study of electron scattering from H-like
atoms and ions at intermediate energies. The 2DRMP suite uses the
two-dimensional $R$-matrix propagation approach~\cite{Burke_1987}.
In 2DRMP the two-electron configuration space ($r_1$,$r_2$) is
divided into sectors.
Figure~\ref{prop} shows the division of the two-electron configuration
space ($r_1$,$r_2$) into 4 vertical {\em strips} representing 10 {\em sectors}.
The key computation in 2DRMP, performed by the PROP program, is the
propagation of a global
$R$-matrix, $\Re$, from sector to sector across the internal region, as shown in Fig.~\ref{prop}.

\begin{figure}[h]
\begin{center}
\includegraphics*[width=0.65\linewidth]{Chapters/chapter15/figures/prop.pdf}
\caption{\label{prop} Subdivision of the configuration space
($r_1$,$r_2$) into a set of connected sectors.}
\end{center}
\end{figure}

\begin{figure}[h]
\begin{center}
\includegraphics*[width=0.8\linewidth]{Chapters/chapter15/figures/Domain.pdf}
\caption{\label{domain} Propagation of the $R$-matrix from domain $D$ to domain $D'$.}
\end{center}
\end{figure}

We consider the general situation in
Fig.~\ref{domain} where we assume that we already know
the global $R$-matrix, $\Re^{I}$, associated with the boundary defined
by edges 5, 2, 1 and 6
in domain $D$ and we wish to
evaluate the new global $R$-matrix, $\Re^{O}$, associated with edges 5, 3, 4 and 6
in domain $D'$ following propagation across subregion $d$.
Input edges are denoted by I (edges 1 and~2), output edges by O (edges 3 and 4) and
common edges by X (edges 5 and~6).
Because of symmetry, only the lower half of domains $D$ and $D'$ has to be considered.
The global $R$-matrices, $\Re^{I}$ in domain $D$ and $\Re^{O}$ in
domain $D'$, can be written as:
\begin{equation}\label{eq:RI_RO}
\Re^{I} = \left(\begin{array}{cc}
\Re_{II}^{I} & \Re_{IX}^{I}\\
\Re_{XI}^{I} & \Re_{XX}^{I}
\end{array}\right), \
\Re^{O} = \left(\begin{array}{cc}
\Re_{OO}^{O} & \Re_{OX}^{O}\\
\Re_{XO}^{O} & \Re_{XX}^{O}
\end{array}\right).
\end{equation}
From the set of local $R$-matrices, $\mathbf{R}_{ij}$ ($i,j\in \{1,2,3,4\}$),
associated
with subregion $d$, we can define
\begin{eqnarray}
\mathbf{r}_{II} = \left(\begin{array}{cc}
\mathbf{R}_{11} & \mathbf{R}_{12}\\
\mathbf{R}_{21} & \mathbf{R}_{22}
\end{array}\right), \label{eqaa} &
\mathbf{r}_{IO} = \left(\begin{array}{cc}
\mathbf{R}_{13} & \mathbf{R}_{14}\\
\mathbf{R}_{23} & \mathbf{R}_{24}
\end{array}\right), \label{eqbb}\\
\mathbf{r}_{OI} = \left(\begin{array}{cc}
\mathbf{R}_{31} & \mathbf{R}_{32}\\
\mathbf{R}_{41} & \mathbf{R}_{42}
\end{array}\right), \label{eqcc} &
\mathbf{r}_{OO} = \left(\begin{array}{cc}
\mathbf{R}_{33} & \mathbf{R}_{34}\\
\mathbf{R}_{43} & \mathbf{R}_{44}
\end{array}\right),\label{eqdd}
\end{eqnarray}
where $I$ represents the input edges 1 and 2, and $O$ represents
the output edges 3 and 4 (see Fig.~\ref{domain}).

The propagation across each sector is characterized by equations~(\ref{eq1}) to (\ref{eq4}):
\begin{eqnarray}
\Re^{O}_{OO} &=& \mathbf{r}_{OO} - \mathbf{r}_{IO}^T (\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}, \label{eq1} \\
\Re^{O}_{OX} &=& \mathbf{r}_{IO}^T (\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX}, \label{eq2} \\
\Re^{O}_{XO} &=& \Re^{I}_{XI}(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}, \label{eq3} \\
\Re^{O}_{XX} &=& \Re^{I}_{XX} - \Re^{I}_{XI}(\mathbf{r}_{II} +\Re^{I}_{II})^{-1}\Re^{I}_{IX}. \label{eq4}
\end{eqnarray}
The matrix inversions are not explicitly performed. To compute
$(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\mathbf{r}_{IO}$ and $(\mathbf{r}_{II} + \Re^{I}_{II})^{-1}\Re^{I}_{IX}$,
two linear systems are solved.
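
In practice, this amounts to one matrix factorization and two solves that share it. The following minimal C sketch illustrates the idea (it is our own illustration, not an excerpt of PROP, which relies on LAPACK through Fortran~90; column-major storage and the usual trailing-underscore naming convention are assumed):
\begin{verbatim}
/* Hedged sketch: solve A W1 = B1 and A W2 = B2 without forming A^{-1},
   where A = r_II + Re^I_II, B1 = r_IO and B2 = Re^I_IX.                 */
extern void dgetrf_(int *m, int *n, double *a, int *lda,
                    int *ipiv, int *info);
extern void dgetrs_(char *trans, int *n, int *nrhs, double *a, int *lda,
                    int *ipiv, double *b, int *ldb, int *info);

void solve_two_systems(int n, double *A, int *ipiv,
                       double *B1, int nrhs1,   /* overwritten by A^{-1} B1 */
                       double *B2, int nrhs2)   /* overwritten by A^{-1} B2 */
{
    int info;
    char trans = 'N';
    dgetrf_(&n, &n, A, &n, ipiv, &info);                      /* LU of A   */
    dgetrs_(&trans, &n, &nrhs1, A, &n, ipiv, B1, &n, &info);  /* A W1 = B1 */
    dgetrs_(&trans, &n, &nrhs2, A, &n, ipiv, B2, &n, &info);  /* A W2 = B2 */
}
\end{verbatim}
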
While equations (\ref{eq1})-(\ref{eq4}) can be applied to the
propagation across a general subregion, two special situations should be
noted: propagation across a diagonal subregion and propagation across
a subregion bounded by the $r_{1}$-axis at the beginning of a new
strip.

In the case of a diagonal subregion, from symmetry considerations,
edge 2 is identical to edge 1 and edge 3 is identical to edge~4.
Accordingly, with only one input edge and one output edge, equations
(\ref{eqaa})-(\ref{eqdd}) become:
\begin{eqnarray}
\mathbf{r}_{II} = 2\mathbf{R}_{11}, \
\mathbf{r}_{IO} = 2\mathbf{R}_{14}, \label{eq4b}\\
\mathbf{r}_{OI} = 2\mathbf{R}_{41}, \
\mathbf{r}_{OO} = 2\mathbf{R}_{44}. \label{eq4d}
\end{eqnarray}

In the case of a subregion bounded by the $r_1$-axis at the beginning
of a new strip, we note that the input boundary $I$ consists of only
one edge. When propagating across the first subregion in the second
strip there is no common boundary $X$: in this case only equation
(\ref{eq1}) needs to be solved.

Having obtained the global $R$-matrix $\Re$ on the boundary of the
innermost subregion (labeled $0$ in Fig.~\ref{prop}), $\Re$ is propagated across
each subregion in the order indicated in Fig.~\ref{prop},
working systematically from the
$r_1$-axis at the bottom of each strip across all subregions to the
diagonal.

\subsection{Description of the PROP program}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|c|c|c|c|}
\hline
\multirow{2}{0.09\linewidth}{\centering Data set} &
\multirow{2}{0.15\linewidth}{\centering
Local $R$-\\matrix size} &
\multirow{2}{0.07\linewidth}{\centering Strips} &
\multirow{2}{0.09\linewidth}{\centering Sectors} &
\multirow{2}{0.19\linewidth}{\centering Final global \\$R$-matrix size} &
\multirow{2}{0.15\linewidth}{\centering Scattering\\energies} \\
& & & & & \\ \hline
Small & $90\times 90$ & 4 & 10 & $360\times 360$ & 6\\ \hline
Medium & $90\times 90$ & 4 & 10 & $360\times 360$ & 64\\ \hline
Large & $383\times 383$ & 20 & 210 & $7660\times 7660$ & 6\\ \hline
Huge & $383\times 383$ & 20 & 210 & $7660\times 7660$ & 64\\ \hline
\end{tabular}
\end{center}
\caption{\label{data-sets}Characteristics of the four data sets.}
\end{table}

The PROP program computes the propagation of the $R$-matrix across the sectors of the internal region.
Table~\ref{data-sets} shows the four different
data sets used in this study and highlights the principal parameters of the PROP program.
PROP execution can be described by Algorithm~\ref{prop-algo}.
First, amplitude arrays and
correction data are read from data files generated by the preceding
program of the 2DRMP suite.
Then, local $R$-matrices are constructed from the amplitude arrays.
Correction data are used to compute correction vectors added to the diagonal of the local
$R$-matrices. The local $R$-matrices, together with the input $R$-matrix
$\Re^{I}$ computed on the previous sector, are used to compute the current output
$R$-matrix $\Re^{O}$.
At the end of a sector evaluation,
the output $R$-matrix becomes the input $R$-matrix
for the next evaluation.

\begin{algorithm}[h]
\caption{\label{prop-algo}PROP algorithm}
\For{all scattering energies} {
  \For{all sectors}{
    Read amplitude arrays\;
    Read correction data\;
    Construct local $R$-matrices\;
    From $\Re^{I}$ and local $R$-matrices, compute $\Re^{O}$\;
    $\Re^{O}$ becomes $\Re^{I}$ for the next sector\;
  }
  Compute physical $R$-matrix\;
}
\end{algorithm}

On the first sector, there is no input $R$-matrix yet. To bootstrap
the propagation, the first output $R$-matrix is constructed using only
one local $R$-matrix. On the last sector, that is, on the boundary of
the inner region, a physical $R$-matrix corresponding to the output
$R$-matrix is computed and stored into an output file.

In the PROP program, sectors are classified into four types,
depending on the computation performed:
\begin{itemize}
\item the starting sector (labeled 0 in Fig.~\ref{prop}),
\item the axis sectors (labeled 1, 3 and 6 in Fig.~\ref{prop}),
\item the diagonal sectors (labeled 2, 5 and 9 in Fig.~\ref{prop}),
\item the off-diagonal sectors (labeled 4, 7 and 8 in Fig.~\ref{prop}).
\end{itemize}

The serial version of PROP is implemented in Fortran~90 and uses
BLAS\index{BLAS} and LAPACK\index{LAPACK} routines, which are fully optimized
for the x86 architecture, for its linear algebra operations.
It serially propagates the $R$-matrix for
all scattering energies.
Since the propagations for these different
energies are independent, there also
exists an embarrassingly parallel version of
PROP
that spreads the computations of
the different energies
among multiple CPU nodes via
MPI.

\subsection{CAPS implementation}

In order to handle larger matrices, and thus obtain better GPU speedup, CAPS
recast equations (\ref{eq1}) to (\ref{eq4}) into one equation.
The output $R$-matrix $\Re^{O}$ defined by equation~(\ref{eq:RI_RO}) is now computed as follows:
\begin{equation}\label{eq_CAPS_1}
\Re^{O} = \Re^{O^{\ \prime}} + U A^{-1} V,
\end{equation}
\begin{equation}\label{eq_CAPS_2}
\Re^{O^{\ \prime}}= \left(\begin{array}{cc}
\mathbf{r}_{OO} & 0\\
0 & \Re^I_{XX} \end{array}\right), \
U= \left(\begin{array}{c}
-\mathbf{r}_{IO}^{T}\\
\Re^I_{XI} \end{array}\right),
\end{equation}
\begin{equation}\label{eq_CAPS_3}
A= \mathbf{r}_{II} + \Re^I_{II} \ {\rm and} \
V= (\mathbf{r}_{IO}\ \ \ -\Re^I_{IX}).
\end{equation}

To compute $W=A^{-1}V$, no matrix inversion is performed. The matrix
system $AW=V$ is solved.
This reimplementation of PROP reduces the number of equations to be
solved and the number of matrix copies for evaluating each sector.
For instance, for an off-diagonal sector,
copies fall from 22 to 5, matrix multiplications from 4 to~1 and calls
to a linear equation solver from 2 to 1.

To implement this version, CAPS
used HMPP\index{HMPP}, a
commercial hybrid and parallel compiler,
based on compiler directives like the new OpenACC\index{OpenACC} standard\footnote{See: \url{www.openacc-standard.org}}.
If the matrices are large enough (the limit sizes are experimental parameters),
they are multiplied on the GPU, otherwise on the CPU.
For these experiments, CAPS
used the MKL BLAS\index{BLAS} implementation on an Intel Xeon
x5560 quad core CPU (2.8 GHz)
and the CUBLAS\index{CUBLAS} library (CUDA 2.2) on one Tesla C1060 GPU.
On the large data set (see Table~\ref{data-sets}), CAPS
obtained a speedup of 1.15 for the GPU
version over the CPU one (with multi-threaded MKL calls on the four
CPU cores). This limited gain in performance is mainly
due to the use of double precision\index{precision!double precision} computation
and to the small or medium sizes of most matrices.
For these matrices, the computation gain on
the GPU is
strongly affected by the overhead
generated by transferring these matrices from
the CPU memory to the GPU memory to perform each matrix multiplication and then
transferring the result back to the CPU memory.
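
To make this dispatch and its transfer overhead concrete, the following hedged C sketch uses the legacy CUBLAS API of early CUDA releases (this is our own illustration, not the HMPP-generated code; the threshold value and function name are placeholders, and cublasInit() is assumed to have been called at program start):
\begin{verbatim}
#include <cblas.h>           /* CPU BLAS (e.g. provided by MKL)          */
#include <cublas.h>          /* legacy CUBLAS API, as in CUDA 2.x        */

#define SIZE_THRESHOLD 512   /* hypothetical experimental limit size     */

/* C = A * B for n x n column-major matrices, on CPU or GPU. */
void dgemm_dispatch(int n, const double *A, const double *B, double *C)
{
    if (n < SIZE_THRESHOLD) {
        /* small matrices: multiply on the CPU */
        cblas_dgemm(CblasColMajor, CblasNoTrans, CblasNoTrans,
                    n, n, n, 1.0, A, n, B, n, 0.0, C, n);
    } else {
        /* large matrices: transfer to the GPU, multiply with CUBLAS,
           transfer the result back -- the overhead discussed above     */
        double *dA, *dB, *dC;
        cublasAlloc(n * n, sizeof(double), (void **)&dA);
        cublasAlloc(n * n, sizeof(double), (void **)&dB);
        cublasAlloc(n * n, sizeof(double), (void **)&dC);
        cublasSetMatrix(n, n, sizeof(double), A, n, dA, n);
        cublasSetMatrix(n, n, sizeof(double), B, n, dB, n);
        cublasDgemm('N', 'N', n, n, n, 1.0, dA, n, dB, n, 0.0, dC, n);
        cublasGetMatrix(n, n, sizeof(double), dC, n, C, n);
        cublasFree(dA); cublasFree(dB); cublasFree(dC);
    }
}
\end{verbatim}
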
Our goal is to speed up PROP more significantly by porting the whole
code to the GPU and therefore avoiding
the
intermediate data transfers between
the host (CPU) and the GPU. We will also study the
stability of PROP in single precision\index{precision!single precision} because
single precision\index{precision!single precision} computation is faster on the GPU
and CPU-GPU data transfers are twice as fast as those performed in
double precision\index{precision!double precision}.

\section{Numerical validation\index{numerical validation} of PROP in single precision\index{precision!single precision}}
\label{single-precision}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c|}
\hline
relative error interval & \# occurrences \\
\hline
[1.E-8, 1.E-6) & 1241 \\
\hline
[1.E-6, 1.E-4) & 48728 \\
\hline
[1.E-4, 1.E-2) & 184065 \\
\hline
[1.E-2, 1) & 27723 \\
\hline
[100, $+\infty$) & 1 \\
\hline
\end{tabular}
\end{center}
\caption{\label{sp-distrib}Error distribution for the medium case in single precision\index{precision!single precision}}
\end{table}

Floating-point input data, computation and output data of PROP are
originally in double precision\index{precision!double precision} format.
PROP produces a standard $R$-matrix H-file \cite{FARM_2DRMP}
and a collection of Rmat00X files (where X
ranges from 0 to the number of scattering energies minus 1)
holding the physical $R$-matrix for each
energy.
The H-file and the Rmat00X files are binary input files of the FARM program \cite{FARM_2DRMP}
(the last program of the 2DRMP suite).
Their text equivalents are the prop.out
and the prop00X.out files.

To study the validity of PROP results in single precision\index{precision!single precision},
reference results are
generated by running the serial version of PROP in double precision\index{precision!double precision}.
Data used in the most costly computation parts are read from input files in
double precision\index{precision!double precision} format and then
cast to single precision\index{precision!single precision} format.
PROP results (input of FARM) are computed in single precision\index{precision!single precision} and written
into files in double precision\index{precision!double precision}.
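
The comparison with the double precision reference can then be summarized by binning the relative errors per decade, as reported in Table~\ref{sp-distrib} and Fig.~\ref{fig:sp-distrib}. A simple C sketch of such a binning (our illustration, not the actual validation tool used for this study) is:
\begin{verbatim}
#include <math.h>

/* Bin relative errors |x_sp - x_ref| / |x_ref| per decade:
   counts[k] collects errors in [10^(k-10), 10^(k-9)) for k = 1..10,
   while counts[0] and counts[11] also gather anything below or above. */
void error_distribution(int n, const double *ref, const double *sp,
                        long counts[12])
{
    for (int k = 0; k < 12; k++) counts[k] = 0;
    for (int i = 0; i < n; i++) {
        if (ref[i] == 0.0) continue;                 /* undefined ratio   */
        double err = fabs(sp[i] - ref[i]) / fabs(ref[i]);
        if (err == 0.0) { counts[0]++; continue; }   /* exact match       */
        int bin = (int)floor(log10(err)) + 10;       /* decade index      */
        if (bin < 0)  bin = 0;
        if (bin > 11) bin = 11;
        counts[bin]++;
    }
}
\end{verbatim}
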
\subsection{Medium case study}

\begin{figure}[h]
\begin{center}
\includegraphics*[width=0.9\linewidth]{Chapters/chapter15/figures/error.pdf}
\caption{\label{fig:sp-distrib} Error distribution for the medium case in single precision\index{precision!single precision}}
\end{center}
\end{figure}

The physical $R$-matrices, in
the prop00X.out files, are compared to the
reference ones for the medium case (see Table~\ref{data-sets}).
The relative
error distribution is
given in Fig.~\ref{fig:sp-distrib} and Table~\ref{sp-distrib}.
We focus on the largest errors.
\begin{itemize}
\item Errors greater than 100: the only impacted value is of order 1.E-6
and is negligible compared to the other ones
in the same prop00X.out file.

\item Errors between 1 and 100: the values corresponding to the
largest errors are of order 1.E-3 and are negligible compared to
the majority of the other values, which are several orders of
magnitude larger.

\item Errors between 1.E-2 and 1: the largest errors ($\ge$ 6\%)
impact values whose order of magnitude is at most 1.E-1.
These values are negligible.
Relative errors of approximately 5\% impact values whose order of
magnitude is at most 1.E2.
For instance, the value 164 produced by the reference version of
PROP becomes 172 in the single precision\index{precision!single precision} version.
\end{itemize}

To study the impact of the single precision\index{precision!single precision} version of PROP on the
FARM program, the cross-section
results files corresponding to the transitions
{1s1s},
{1s2s}, {1s2p}, {1s3s}, {1s3p}, {1s3d},
{1s4s}, {2p4d} are compared to the reference ones.
Table~\ref{sp-farm} shows that all cross-section files are impacted by
errors. Indeed in the {2p4d} file, four relative errors are
greater than one and the maximum relative error is 1.60.
However the largest errors impact negligible values. For example, the maximum
error (1.60) impacts a reference value which is 4.5E-4. The largest
values are impacted by low errors. For instance, the maximum value
(1.16) is impacted by a relative error of the order of 1.E-3.

\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c||c|c|} \hline
file & largest relative error & file & largest relative error\\ \hline
{1s1s} & 0.02 & {1s3p} & 0.11 \\ \hline
{1s2s} & 0.06 & {1s3d} & 0.22 \\ \hline
{1s2p} & 0.08 & {1s4s} & 0.20 \\ \hline
{1s3s} & 0.17 & {2p4d} & 1.60 \\ \hline
\end{tabular}
\end{center}
\caption{\label{sp-farm}Impact on FARM of the single precision\index{precision!single precision} version of PROP}
\end{table}

To examine in more detail the impact of PROP on FARM,
cross sections above the ionization threshold (1 Ryd)
are compared in single and
double precision\index{precision!double precision} for
transitions amongst the 1s, \dots, 4s, 2p, \dots, 4p, 3d, 4d target states.
This comparison is carried out by generating 45 plots. In all the
plots, results in single and double precision\index{precision!double precision} match except for a few
scattering energies which are very close to pseudo-state thresholds.
For example, Figs.~\ref{1s2p} and \ref{1s4d} present the
{1s2p} and {1s4d} cross-sections computed in single and double precision\index{precision!double precision}. For some cross-sections,
increasing a threshold parameter from 1.E-4 to 1.E-3 in the FARM
program
results in energies close to threshold being avoided,
so that
the cross-sections in double and single precision\index{precision!single precision} match more
closely.
This is the case for instance for the 1s2p cross-section (see Fig.~\ref{1s2p3}).
However for other cross-sections (such as 1s4d) some problematic energies remain even if the
threshold parameter in the FARM
program is increased to 1.E-3 (see Fig.~\ref{1s4d3}). A higher
threshold parameter would be required for such cross-sections.

\begin{figure}[h]
\centering
\subfigure[threshold = 1.E-4]{
\includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p.pdf}
\label{1s2p}}
\subfigure[threshold = 1.E-3]{
\includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2p3.pdf}
\label{1s2p3}}
\caption{1s2p cross-section, 10 sectors.}
\label{fig:1s2p_10sectors}
\end{figure}

\begin{figure}[h]
\centering
\subfigure[threshold = 1.E-4]{
\includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d.pdf}
\label{1s4d}}
\subfigure[threshold = 1.E-3]{
\includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4d3.pdf}
\label{1s4d3}}
\caption{1s4d cross-section, 10 sectors.}
\label{fig:1s4d_10sectors}
\end{figure}

The conclusion of the medium case study is that running PROP in single
precision gives relatively stable results provided that suitable
parameter values are used in the FARM program in order to skip the
problematic energies that are too close to the pseudo-state
thresholds. To verify if this conclusion is still valid with a larger
data set, the single precision\index{precision!single precision} computation is analyzed in a more
realistic case in Sect.~\ref{huge}.

\subsection{Huge case study}\label{huge}

\begin{figure}[h]
\centering
\includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s2pHT.pdf}
\caption{\label{1s2pHT}1s2p cross-section, threshold = 1.E-4, 210 sectors.}
\end{figure}

\begin{figure}[h]
\centering
\includegraphics*[width=.76\linewidth]{Chapters/chapter15/figures/1s4dHT.pdf}
\caption{\label{1s4dHT}1s4d cross-section, threshold = 1.E-4, 210 sectors.}
\end{figure}

We study here the impact on FARM of the PROP program run in
single precision\index{precision!single precision} for the huge case (see Table~\ref{data-sets}).
Cross-sections for all transitions amongst the
atomic target states 1s \dots 7i are explored, which
leads to
406 comparison plots.
It should be noted that in this case, over the same energy range above the ionization threshold, the density of pseudo-state thresholds is significantly increased compared to the medium case.
As expected, all the plots exhibit large differences between single and double
precision cross-sections.
For example, Figs.~\ref{1s2pHT} and \ref{1s4dHT} present the 1s2p and 1s4d cross-sections computed in
single and double precision\index{precision!double precision} for the huge case.
We can conclude that PROP in single precision\index{precision!single precision} gives invalid results
for realistic simulation cases above the ionization threshold.
Therefore the deployment of PROP on GPU, described in Sect.~\ref{gpu-implem},
has been carried out in double precision\index{precision!double precision}.

\section{Towards a complete deployment of PROP on GPUs}
\label{gpu-implem}

We now detail how PROP has been progressively deployed on
GPUs in double precision\index{precision!double precision} in order to avoid the
expensive memory transfers between the host and the GPU.
Different versions with successive improvements and optimizations are presented.
We use CUDA~\cite{CUDA_ProgGuide} for GPU programming, as well as the
CUBLAS\index{CUBLAS}~\cite{CUBLAS}
and MAGMA \cite{MAGMA} libraries for linear algebra operations.
Since PROP is written in Fortran 90, {\em wrappers\index{wrapper}} in C are used to
enable calls to CUDA kernels from PROP routines.

\subsection{Computing the output $R$-matrix on GPU}
\label{gpu-RO}

\begin{figure}[h]
\centering
\includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/offdiagonal_nb.pdf}
\caption{\label{offdiagonal} The six steps of an off-diagonal sector
evaluation.}
\end{figure}

As mentioned in Algorithm~\ref{prop-algo}, evaluating a sector
mainly consists in constructing local $R$-matrices and in computing
one output $R$-matrix, $\Re^{O}$. In this first step of the porting
process, referred to as GPU V1\label{gpuv1},
we only consider the computation of $\Re^{O}$ on the GPU.
We distinguish the following six steps, related to equations
(\ref{eq_CAPS_1}), (\ref{eq_CAPS_2}) and (\ref{eq_CAPS_3}), and illustrated in
Fig.~\ref{offdiagonal} for an off-diagonal sector.

\begin{description}
\item[Step 1] (``Input copies''):~data are copied from $\Re^{I}$
to temporary arrays ($A$, $U$, $V$) and to $\Re^{O}$.
These copies, along with possible scalings or transpositions, are
implemented as CUDA kernels which can be applied to two
matrices of any size and starting at any offset.
Memory accesses are coalesced\index{coalesced memory accesses} \cite{CUDA_ProgGuide} in order to
provide the best performance for such memory-bound kernels.
\item[Step 2] (``Local copies''):~data are copied from
local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$.
Moreover data from the local $R$-matrix
$\mathbf{r}_{II}$
is added to matrix $A$ (via a CUDA kernel) and zeroes are written in
$\Re^{O}$ where required.
\item[Step 3] (``Linear system solving''):~matrix $A$ is factorized
using the MAGMA DGETRF\index{MAGMA functions!DGETRF}
routine and the result is stored in-place.
\item[Step 4] (``Linear system solving'' cont.):~the matrix system
of linear equations $AW = V$ is solved using the MAGMA DGETRS\index{MAGMA functions!DGETRS}
routine. The solution is stored in matrix $V$.
\item[Step 5] (``Output matrix product''):~matrix $U$
is multiplied by matrix $V$ using the CUBLAS\index{CUBLAS} DGEMM
routine. The result is stored in a temporary matrix~$t$.
\item[Step 6] (``Output add''):~$t$ is added to $\Re^{O}$ (CUDA
kernel).
\end{description}

All the involved matrices are stored in the GPU memory. Only the
local $R$-matrices are first constructed on the host and then sent
to the GPU memory, since these matrices vary from sector to sector.
The evaluation of the axis and diagonal sectors is similar.
However, fewer operations and copies are required because of
symmetry considerations \cite{2DRMP}.
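
For illustration, steps 3 to 5 essentially reduce to the following sequence of MAGMA and legacy CUBLAS calls (a hedged sketch, not the actual C wrappers of PROP; the MAGMA signatures shown follow a recent release and may differ slightly from the version used at the time):
\begin{verbatim}
#include <cublas.h>   /* legacy CUBLAS API */
#include <magma.h>

/* dA, dV, dU, dT are column-major device arrays, ipiv is a host array.
   On exit, dA holds the LU factors of A, dV holds W = A^{-1} V and
   dT holds the product U * W (step 6, not shown, adds dT to Re^O).     */
void steps_3_to_5(magma_int_t nA, magma_int_t nV, magma_int_t nU,
                  double *dA, magma_int_t ldda,
                  double *dV, magma_int_t lddv,
                  double *dU, magma_int_t lddu,
                  double *dT, magma_int_t lddt,
                  magma_int_t *ipiv)
{
    magma_int_t info;

    /* Step 3: in-place LU factorization of A */
    magma_dgetrf_gpu(nA, nA, dA, ldda, ipiv, &info);

    /* Step 4: solve A W = V; the solution overwrites V */
    magma_dgetrs_gpu(MagmaNoTrans, nA, nV, dA, ldda, ipiv, dV, lddv, &info);

    /* Step 5: t = U * W with the CUBLAS DGEMM routine */
    cublasDgemm('N', 'N', nU, nV, nA, 1.0, dU, lddu, dV, lddv, 0.0, dT, lddt);
}
\end{verbatim}
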
\subsection{Constructing the local $R$-matrices on GPU}

\begin{figure}[h]
\centering
\includegraphics[width=0.7\linewidth]{Chapters/chapter15/figures/amplitudes_nb.pdf}
\caption{\label{amplitudes} Constructing the local $R$-matrix R34
from the $j$ amplitude array associated with edge 4 and the $i$
amplitude array associated with edge~3.}
\end{figure}

Local $R$-matrices are constructed using two three-dimensional arrays,
$i$ and $j$. Each three-dimensional array contains four
matrices corresponding to the surface amplitudes associated with the
four edges of a sector. Those matrices are named {\em amplitude arrays}.
The $j$ amplitude arrays are read from data files and the $i$ amplitude arrays
are obtained by scaling each row of the $j$ amplitude arrays.
The main part of the construction of a local $R$-matrix,
presented in Fig.~\ref{amplitudes},
is a matrix product between
one $i$ amplitude array and one transposed $j$ amplitude array,
which is performed by a single DGEMM
BLAS\index{BLAS} call.
In this version, hereafter referred to as GPU
V2\label{gpuv2}, the $i$ and $j$ amplitude arrays are transferred to the
GPU memory and the required matrix multiplications are performed on
the GPU (via CUBLAS\index{CUBLAS} routines).
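
To fix ideas, such a multiplication amounts to a single call of the following form (our sketch with the legacy CUBLAS API; the dimensions are symbolic and the device arrays are assumed to have already been transferred with cublasSetMatrix):
\begin{verbatim}
#include <cublas.h>

/* R = i_amp * j_amp^T, computed entirely on the GPU.
   dI is m x k, dJ is n x k and dR is m x n (column-major device arrays). */
void local_rmatrix(int m, int n, int k,
                   const double *dI, int ldi,   /* i amplitude array */
                   const double *dJ, int ldj,   /* j amplitude array */
                   double *dR, int ldr)         /* local R-matrix    */
{
    cublasDgemm('N', 'T', m, n, k,
                1.0, dI, ldi, dJ, ldj,
                0.0, dR, ldr);
}
\end{verbatim}
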
Since the involved matrices have medium sizes (e.g. $3066 \times 383$),
performing these matrix multiplications
on the GPU is expected to be faster than on the CPU.
However, this implies a greater communication volume
between the CPU and the GPU
since the $i$ and $j$ amplitude arrays are larger than the local
$R$-matrices.
It can be noticed that correction data are also used in the
construction of a local $R$-matrix,
but this is a minor part of the
computation. However, these correction data also have to be
transferred from the CPU to the GPU for each sector.

\subsection{Scaling amplitude arrays on GPU}
It should be worthwhile to try to reduce the CPU-GPU data
transfers of the GPU V2, where the $i$ and $j$ amplitude arrays are
constructed on the host and then sent to the GPU memory for each sector.
In this new version, hereafter referred to as GPU V3\label{gpuv3}, we
transfer only the $j$ amplitude arrays and the
required scaling factors (stored in one 1D array) to the GPU memory,
so that the $i$ amplitude arrays are
then directly computed on the GPU by multiplying the $j$ amplitude
arrays by these scaling factors (via a CUDA kernel).
Therefore, we save the transfer of four $i$ amplitude arrays on
each sector by transferring only this 1D array of scaling factors.
Moreover, scaling the $j$ amplitude arrays is expected to be faster on the
GPU than on the CPU, thanks to the massively parallel architecture of
the GPU and thanks to its higher internal memory bandwidth.
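
A minimal CUDA kernel implementing this row scaling could look as follows (our sketch, not the actual PROP kernel; the arrays are stored in column-major order and scale[r] is the factor of row r):
\begin{verbatim}
__global__ void scale_rows(int m, int n,
                           const double *j_amp, int ldj,  /* j amplitude array  */
                           const double *scale,           /* 1D scaling factors */
                           double *i_amp, int ldi)        /* i amplitude array  */
{
    int row = blockIdx.x * blockDim.x + threadIdx.x;
    int col = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < m && col < n)
        i_amp[row + col * ldi] = scale[row] * j_amp[row + col * ldj];
}

/* possible launch:
   scale_rows<<<dim3((m+127)/128, (n+3)/4), dim3(128, 4)>>>(
       m, n, d_j, ldj, d_scale, d_i, ldi);                               */
\end{verbatim}
Consecutive threads of a block access consecutive rows of a column, so the global memory accesses are coalesced.
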
\subsection{Using double-buffering\index{double-buffering} to overlap I/O and computation}

\begin{figure}[h]
\centering
\includegraphics[width=0.8\linewidth]{Chapters/chapter15/figures/C1060_V3_IO_COMP.pdf}
\caption{\label{overlapping} Compute and I/O times for the GPU V3 on
one C1060 (huge case).}
\end{figure}

As described in Algorithm~\ref{prop-algo}, there are two main steps in
the propagation across a sector: reading amplitude arrays
and correction data from I/O files and
evaluating the current sector.
Fig.~\ref{overlapping} shows the I/O times and the evaluation times
of each sector for the huge case execution (210 sectors, 20 strips) of the GPU V3
on one C1060.
Whereas the times required by the off-diagonal sectors are similar
within each of the 20 strips,
the times for the diagonal sectors of each strip
are the shortest ones, the times for the axis sectors being
intermediate.
The I/O times are roughly constant among all strips.
The evaluation time is equivalent to the I/O
time for the first sectors. But this evaluation time grows
linearly with the strip number, and rapidly exceeds the I/O
time.

It is thus interesting to use a double-buffering\index{double-buffering} technique to overlap the
I/O time with the evaluation time:
for each sector, the evaluation of sector $n$ is performed
(on GPU) simultaneously with the reading of data for sector
$n+1$ (on CPU). This requires the duplication in the CPU memory of all the
data structures
used for storing data read from I/O files for each sector.
This version, hereafter referred to as GPU
V4\label{gpuv4}, uses POSIX threads\index{POSIX threads}. Two threads are
executed concurrently: an I/O thread that reads data from I/O files
for each sector, and a computation thread, dedicated to the propagation
of the global $R$-matrix, that performs successively for each sector
all necessary computations on the GPU,
as well as all required CPU-GPU data transfers.
The evaluation of a sector uses the data read for this sector as well
as the global $R$-matrix computed on the previous sector.
This dependency requires synchronizations between the I/O thread and
the computation thread, which are implemented through standard POSIX
synchronization mechanisms.
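
The scheme can be sketched as follows with POSIX threads and semaphores (this is our minimal illustration, not the PROP code: the sector data type and the two helper functions are placeholders, and the real computation thread also carries the global $R$-matrix from one sector evaluation to the next):
\begin{verbatim}
#include <pthread.h>
#include <semaphore.h>

#define NSECTORS 210                     /* e.g. the huge case               */
typedef struct {                         /* placeholder for the per-sector   */
    double *amplitudes, *corrections;    /* host data structures             */
} sector_data_t;
extern void read_sector_files(int sector, sector_data_t *buf);
extern void evaluate_sector_on_gpu(int sector, const sector_data_t *buf);

static sector_data_t buffers[2];         /* duplicated host data structures  */
static sem_t filled[2], empty[2];        /* one semaphore pair per buffer    */

static void *io_thread(void *arg)        /* producer: reads sector files     */
{
    for (int s = 0; s < NSECTORS; s++) {
        int b = s % 2;
        sem_wait(&empty[b]);                 /* wait until buffer b is free  */
        read_sector_files(s, &buffers[b]);
        sem_post(&filled[b]);                /* data of sector s is ready    */
    }
    return NULL;
}

static void *compute_thread(void *arg)   /* consumer: GPU evaluation         */
{
    for (int s = 0; s < NSECTORS; s++) {
        int b = s % 2;
        sem_wait(&filled[b]);                /* wait for the data of sector s */
        evaluate_sector_on_gpu(s, &buffers[b]);
        sem_post(&empty[b]);                 /* buffer b can be reused        */
    }
    return NULL;
}

void run_propagation(void)
{
    pthread_t tio, tcomp;
    for (int b = 0; b < 2; b++) {
        sem_init(&empty[b], 0, 1);           /* both buffers initially free  */
        sem_init(&filled[b], 0, 0);
    }
    pthread_create(&tio, NULL, io_thread, NULL);
    pthread_create(&tcomp, NULL, compute_thread, NULL);
    pthread_join(tio, NULL);
    pthread_join(tcomp, NULL);
}
\end{verbatim}
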
\subsection{Matrix padding\index{padding}}
The CUBLAS\index{CUBLAS} DGEMM
performance and the MAGMA DGETRF\index{MAGMA functions!DGETRF}/DGETRS\index{MAGMA functions!DGETRS}
performance are reduced when the sizes (or
the leading dimensions) of the matrices are not multiples of the inner blocking size \cite{NTD10a}.
This inner blocking size can be 32 or 64, depending on the computation
and on the underlying
GPU architecture \cite{MAGMA,NTD10b}.
In this version (GPU V5\label{gpuv5}),
the matrices are therefore padded with $0.0$ (and $1.0$ on the diagonal for the linear systems)
so that their sizes are
multiples of 64.
This corresponds indeed to the optimal size for the matrix product on the
Fermi architecture \cite{NTD10b}. As far as linear system solving is
concerned, all the matrices have sizes which are multiples of 383: we
therefore use padding\index{padding} to obtain multiples of 384 (which are
again multiples of 64).
It can be noticed that this padding\index{padding} has to be performed dynamically
as the matrices increase in size during the propagation.
The maximum required storage space is however allocated only once in the
GPU memory.
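
The padding rule itself is simple; the following host-side C sketch (our illustration; in PROP the padded matrices reside in GPU memory) rounds sizes up to the next multiple of 64 and fills the extra entries:
\begin{verbatim}
/* Round a size up to the next multiple of 64 (the inner blocking size). */
static int pad64(int n) { return ((n + 63) / 64) * 64; }

/* Copy the n x n column-major matrix A (leading dimension lda) into a
   padded np x np matrix: the extra rows/columns are 0.0, with 1.0 on
   the diagonal so that the padded linear systems remain non-singular.  */
void pad_matrix(int n, const double *A, int lda, double *Apad)
{
    int np = pad64(n);
    for (int j = 0; j < np; j++)
        for (int i = 0; i < np; i++) {
            if (i < n && j < n)
                Apad[i + j * np] = A[i + j * lda];
            else
                Apad[i + j * np] = (i == j) ? 1.0 : 0.0;
        }
}
\end{verbatim}
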
\section{Performance results}
\subsection{PROP deployment on GPU}

\begin{table}[h]
\begin{center}
\begin{tabular}{|c||c|c||}
\hline
PROP version & \multicolumn{2}{c|}{Execution time} \\
\hline \hline
CPU version: 1 core & \multicolumn{2}{c|}{201m32s} \\
\hline
CPU version: 4 cores & \multicolumn{2}{c|}{113m28s} \\
\hline \hline
GPU version & C1060 & C2050 \\
\hline
GPU V1 (\S~\ref{gpuv1}) & 79m25s & 66m22s \\
\hline
GPU V2 (\S~\ref{gpuv2}) & 47m58s & 29m52s \\
\hline
GPU V3 (\S~\ref{gpuv3}) & 41m28s & 23m46s \\
\hline
GPU V4 (\S~\ref{gpuv4}) & 27m21s & 13m55s \\
\hline
GPU V5 (\S~\ref{gpuv5}) & 24m27s & 12m39s \\
\hline
\end{tabular}
\end{center}
\caption{\label{table:time}Execution time of PROP on CPU and GPU}
\end{table}

\begin{figure}[h]
\centering
\subfigure[Speedup over 1 CPU core]{
\includegraphics*[width=0.76\linewidth]{Chapters/chapter15/figures/histo_speedup_1core.pdf}
\label{fig:speedup_1core}
}
\subfigure[Speedup over 4 CPU cores]{
\includegraphics*[width=0.76\linewidth]{Chapters/chapter15/figures/histo_speedup_4cores.pdf}
\label{fig:speedup_4cores}
}
\caption{Speedup of the successive GPU versions.}
\end{figure}

Table~\ref{table:time} presents
the execution times
of PROP on CPUs and GPUs:
each version solves the propagation equations in the
form~(\ref{eq_CAPS_1})-(\ref{eq_CAPS_3}) as proposed by CAPS.
Fig.~\ref{fig:speedup_1core} (respectively \ref{fig:speedup_4cores})
shows the speedup of the successive GPU versions
over one CPU core (respectively four CPU cores).
We use here Intel Q8200 quad-core CPUs (2.33 GHz), one C1060 GPU and
one C2050 (Fermi) GPU, located at
UPMC (Universit\'e Pierre et Marie Curie, Paris, France).
As a remark, the execution times measured on the C2050 would be the same
on the C2070 and on the C2075, the only difference between these GPUs
being their memory size and their TDP (Thermal Design Power)\index{TDP (Thermal Design Power)}.
We emphasize that the execution times correspond to the
complete propagation for all six energies of the large case (see
Table~\ref{data-sets}), that is to say to the complete execution of
the PROP program.
Since energies are independent, execution times for more energies
(e.g. the huge case) should be proportional
to those reported in Table~\ref{table:time}.

These tests, which have been performed with CUDA 3.2, CUBLAS\index{CUBLAS} 3.2 and
MAGMA 0.2,
show that the successive GPU versions of PROP offer
increasing, and ultimately interesting, speedups.
Compared to V1,
V2 shows that it is worth increasing slightly the
CPU-GPU communication volume in order to perform
large enough matrix products on the GPU.
This communication volume can fortunately be
reduced thanks to~V3,
which also accelerates the computation of the
amplitude arrays thanks to the GPU.
The
double-buffering\index{double-buffering} technique implemented in V4
effectively enables the overlapping of
I/O operations with computation, while the
padding\index{padding} implemented in V5 also improves the computation times.
It can be
noticed that the padding\index{padding}
offers much more performance gain with,
for example, CUDA 3.1 and the MAGMA DGEMM\index{MAGMA functions!DGEMM}~\cite{NTD10b}: the
speedup with respect to one
CPU core was increased from 6.3 to 8.1 on the C1060, and from 9.5 to 14.3
on the C2050.
Indeed, since CUBLAS\index{CUBLAS} 3.2, performance has been improved for non-block-multiple
matrix sizes through MAGMA code~\cite{NTD10a}.
Although for all versions the C2050 (with its improved
double precision\index{precision!double precision} performance) offers a speedup of up to almost
$2$ compared to
the C1060, the performance obtained with both architectures justifies the use of
the GPU for such an application.

\subsection{PROP execution profile}

\begin{figure}[h]
\centering
\includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/CPU_1_CORE_TIMES.pdf}
\caption{CPU (1 core) execution times for the off-diagonal sectors of the large case.}
\label{fig:CPU-timing}
\end{figure}

\begin{figure}[h]
\centering
\subfigure[GPU V5 on one C1060]{
\includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C1060_TIMES.pdf}
\label{GPU-timing}}
\subfigure[GPU V5 on one C2050]{
\includegraphics*[width=0.64\linewidth]{Chapters/chapter15/figures/GPU_V5_C2050_TIMES.pdf}
\label{fermi-timing}}
\caption{GPU execution times for the off-diagonal sectors of
the large case.}
\label{fig:profileGPU}
\end{figure}

We detail here the execution profile on
the CPU and the GPU for the evaluation of all off-diagonal sectors
(the most representative ones) for a complete energy propagation.
Figs.~\ref{fig:CPU-timing} and \ref{fig:profileGPU} show the CPU and GPU execution times for the
171 off-diagonal sectors of the large case (see Table \ref{data-sets}).
``Copying, adding, scaling'' corresponds to the amplitude
array construction (scaling) as well as to steps 1, 2 and 6 in
Sect.~\ref{gpu-RO}, all implemented via CUDA kernels.
``CPU-GPU transfers''
aggregate transfer times for the $j$ amplitude
arrays and the scaling factors, as well as for the correction data.
``Amplitude matrix product'' corresponds to the DGEMM call to
construct the local $R$-matrices from the $i$ and $j$ amplitude
arrays.
``Linear system solving'' and ``Output matrix product'' correspond
respectively to steps 3-4 and to step 5 in Sect.~\ref{gpu-RO}.

On one CPU core (see Fig.~\ref{fig:CPU-timing}),
matrix products for the construction of the local
$R$-matrices require the same
computation time during the whole propagation. Likewise the CPU time required by
matrix products for the output $R$-matrix is constant within each
strip. But as the global $R$-matrix is propagated from strip to
strip,
the matrices $U$ and $V$ increase and so does their multiplication time.
The time required to solve the linear system increases
slightly during the propagation.
These three operations (``Amplitude matrix product'', ``Output matrix
product'' and ``Linear system solving'') are clearly dominant in terms
of computation
time compared to the other remaining operations, which justifies our
primary focus on these three linear algebra operations.

On the C1060 (see Fig.~\ref{GPU-timing}), we have
generally managed to obtain a similar speedup for all operations
(around 8, which corresponds to Fig.~\ref{fig:speedup_1core}). Only the linear system solving
presents a lower speedup (around~4).
The CUDA kernels and the remaining CPU-GPU transfers make a minor contribution
to the overall computation time and do not require
additional improvements.

On the C2050 (see Fig.~\ref{fermi-timing}), additional speedup is
obtained for all operations, except for the
CPU-GPU transfers and the linear system solving.
The CPU-GPU transfers are mainly due to the $j$ amplitude arrays, and
currently still correspond to minor times. When required, the
double-buffering\index{double-buffering} technique may also be used to overlap such transfers with computation on the GPU.

\section{Propagation of multiple concurrent energies on GPU}\index{concurrent kernel execution}

Finally, we present here an improvement that can
benefit from the Fermi architecture, as well as from the newest Kepler
architecture,
both of which enable the concurrent execution of multiple
CUDA kernels\index{concurrent kernel execution}, thus offering additional speedup on
GPUs for small or medium computation grain kernels.
In our case, the performance gain on the GPU is indeed limited
since most matrices have small or medium sizes.
By using multiple streams within one CUDA context~\cite{CUDA_ProgGuide},
we can propagate multiple energies
concurrently\index{concurrent kernel execution} on the Fermi GPU.
It can be noticed that all GPU computations for all streams are
launched by the same host thread. We therefore rely here on the {\em legacy
API} of CUBLAS\index{CUBLAS}~\cite{CUBLAS} (as does MAGMA)
without thread safety problems.
A {\em breadth-first} issue order is used for kernel
launches \cite{CUDA_stream}: for a given GPU kernel, all kernel launches
are issued together in the host thread, using one stream for each
concurrent energy, in order to maximize concurrent kernel
execution\index{concurrent kernel execution}.
Of course, the memory available on the GPU must be large enough to
store all data structures required by each energy.
Moreover, multiple streams are also used within the
propagation of each single energy
in order to enable concurrent executions among the required kernels.
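
The corresponding launch pattern can be sketched as follows (our illustration: NSTEPS and propagation_step are placeholders for the GPU operations of one sector evaluation, while cublasSetKernelStream is the legacy CUBLAS call that redirects subsequent BLAS calls to a given stream):
\begin{verbatim}
#include <cuda_runtime.h>
#include <cublas.h>                      /* legacy CUBLAS API                */

#define NENERGIES 16                     /* concurrent energies (streams)    */
#define NSTEPS    6                      /* placeholder number of GPU steps  */
extern void propagation_step(int step, int energy, cudaStream_t stream);

void propagate_concurrent_energies(void)
{
    cudaStream_t stream[NENERGIES];
    for (int e = 0; e < NENERGIES; e++)
        cudaStreamCreate(&stream[e]);

    /* breadth-first issue order: for each GPU step, launch the kernels of
       every energy (one stream per energy) before moving to the next step,
       so that kernels from different streams can execute concurrently      */
    for (int step = 0; step < NSTEPS; step++)
        for (int e = 0; e < NENERGIES; e++) {
            cublasSetKernelStream(stream[e]);   /* CUBLAS calls use stream[e] */
            propagation_step(step, e, stream[e]);
        }

    cudaDeviceSynchronize();
    for (int e = 0; e < NENERGIES; e++)
        cudaStreamDestroy(stream[e]);
}
\end{verbatim}
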
\begin{table}[h]
\begin{center}
\begin{tabular}{|c|c||c|c|c|c|c|}
\hline
\multirow{4}{0.18\linewidth}{Medium case} & Number of &
\multirow{2}{0.07\linewidth}{\centering 1} &
\multirow{2}{0.07\linewidth}{\centering 2} &
\multirow{2}{0.07\linewidth}{\centering 4} &
\multirow{2}{0.07\linewidth}{\centering 8} &
\multirow{2}{0.07\linewidth}{\centering 16} \\
& energies & & & & & \\
\cline{2-7}
& Time (s) & 11.18 & 6.87 & 5.32 & 4.96 & 4.76 \\
\cline{2-7}
& Speedup & - & 1.63 & 2.10 & 2.26 & 2.35 \\
\hline \hline
\multirow{4}{0.18\linewidth}{Large case} & Number of &
\multirow{2}{0.07\linewidth}{\centering 1} &
\multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 2}} &
\multicolumn{2}{c|}{\multirow{2}{0.07\linewidth}{\centering 3}} \\
& energies & & \multicolumn{2}{c|}{~} & \multicolumn{2}{c|}{~} \\
\cline{2-7}
& Time (s) & 509.51 & \multicolumn{2}{c|}{451.49} & \multicolumn{2}{c|}{436.72} \\
\cline{2-7}
& Speedup & - & \multicolumn{2}{c|}{1.13} & \multicolumn{2}{c|}{1.17} \\
\hline
\end{tabular}
\end{center}
\caption{\label{t:perfs_V6} Performance results with multiple
concurrent energies
on one C2070 GPU. GPU initialization times are not considered here.}
\end{table}

In order to have enough GPU memory to run two or three concurrent
energies for the large case, we use one C2070 GPU
(featuring 6~GB of memory)
with one Intel X5650 hex-core CPU, CUDA 4.1 and CUBLAS\index{CUBLAS} 4.1, as
well as the latest MAGMA release (version 1.3.0).
Substantial changes have been required
in the MAGMA calls with respect to the previous versions of PROP that were using MAGMA 0.2.
Table~\ref{t:perfs_V6} presents the speedups
obtained on the Fermi GPU for multiple concurrent
energies (up to sixteen since this is the maximum number of concurrent
kernel launches currently supported \cite{CUDA_ProgGuide}).
With the medium case, speedups greater than 2 are obtained with four
concurrent energies or more.
With the more realistic large case, the performance gain is lower, mainly because of
the increase in matrix sizes, which implies a better GPU usage
with only one energy on the GPU. The concurrent execution of multiple
kernels\index{concurrent kernel execution} is also limited by other operations on the
GPU \cite{CUDA_ProgGuide,CUDA_stream} and by the current MAGMA code, which
prevents concurrent MAGMA calls in different streams.
Better speedups can be expected on the latest Kepler GPUs, which
offer additional compute power and whose {\em Hyper-Q} feature may help
improve further the GPU utilization with concurrent energies.
On the contrary, the same code on the C1060 shows no speedup,
since the concurrent kernel launches are serialized on this previous GPU architecture.

\section{Conclusion and future work}

In this chapter, we have presented our methodology and our results in
the deployment on
a GPU of an application (the PROP program) in atomic physics.

We have started by studying the numerical stability of PROP using
single precision\index{precision!single precision} arithmetic. This has shown that PROP
using single precision\index{precision!single precision}, while relatively stable for some small cases,
gives unsatisfactory results for realistic simulation cases above the
ionization threshold where there is a
significant density of pseudo-states. It is
expected that this may not be the case below the ionization threshold
where the actual target states are less dense. This requires further
investigation.

We have
therefore deployed the PROP code in double precision\index{precision!double precision} on
a GPU, with successive improvements. The different GPU versions
each offer increasing speedups over the CPU version.
Compared to the single (respectively four) core(s) CPU version, the
optimal GPU implementation
gives a speedup of 8.2 (resp. 4.6) on one C1060 GPU,
and a speedup of 15.9 (resp. 9.0) on one
C2050 GPU with improved double precision\index{precision!double precision} performance.
An additional gain of around 15\%
can also be obtained on one Fermi GPU
with large memory (C2070) thanks to concurrent kernel execution.

These speedups
cannot be directly compared with the 1.15 speedup
obtained with the HMPP\index{HMPP} version, since in our tests the CPUs are
different and the CUBLAS\index{CUBLAS} versions are more recent.
However, the programming effort required
to progressively deploy PROP on GPUs is clearly worthwhile, given the improved and interesting speedups obtained for this
real-life application in double precision\index{precision!double precision} with varying-size matrices.

We are currently working on a hybrid CPU-GPU version that spreads the
computations of the independent energies over both the CPU
and the GPU. This will enable
multiple energy executions on the CPU, with
one or several core(s) dedicated to each energy (via multi-threaded
BLAS\index{BLAS} libraries). Multiple
concurrent energies may also be propagated on each Fermi GPU.
By merging this work with the current MPI PROP program, we will
obtain a scalable hybrid CPU-GPU version.
This final version will offer an additional level of parallelism,
thanks to the MPI
standard, in order to exploit multiple
nodes with multiple CPU cores and possibly multiple GPU cards.

\putbib[Chapters/chapter15/biblio]