\newcommand{\fixme}[1]{{\bf #1}}
-\chapter[Numerical validation and performance optimization on GPUs in atomic physics]{Numerical validation and performance optimization on GPUs of
-an application in atomic physics}
+\chapter[Numerical validation and performance optimization on GPUs in atomic physics]{Numerical validation and performance optimization on GPUs of an application in atomic physics}
\label{chapter15}
\section{Introduction}\label{ch15:intro}
intensity
($i.e.$ the ratio of arithmetic operations to memory operations).
Another important aspect of GPU programming is that floating-point
-operations are preferably performed in single precision, if the
+operations are preferably performed in single precision\index{precision!single precision}, if the
validity of results is not impacted by that format.
The GPU compute power for floating-point operations is indeed greater in
-single precision than in double precision.
-The peak performance ratio between single precision and double
+single precision\index{precision!single precision} than in double precision\index{precision!double precision}.
+The peak performance ratio between single precision\index{precision!single precision} and double
precision varies for example for NVIDIA GPUs from $12$ for the first Tesla
GPUs (C1060),
to $2$ for the Fermi GPUs (C2050 and C2070)
As far as AMD GPUs are concerned, the latest AMD GPU (Tahiti HD 7970)
presents a ratio of $4$.
Moreover, GPU internal memory accesses and CPU-GPU data transfers are
-faster in single precision than in double precision,
+faster in single precision\index{precision!single precision} than in double precision\index{precision!double precision},
because of the different format lengths.
This chapter describes the deployment on GPUs of PROP, a program of the
with the matrix size, since the
CPU-GPU transfer overhead becomes less significant and since CPUs are
still more efficient for fine computation grains.
-Then, using HMPP\footnote{
+Then, using HMPP\index{HMPP}\footnote{
HMPP or {\em CAPS compiler}, see: \url{www.caps-entreprise.com/hmpp.html}},
a commercial
hybrid and parallel compiler, CAPS
The work described in this chapter, which is based on a study presented in \cite{PF_PDSEC2011}, aims at
improving PROP performance on
GPUs by exploring two directions. First, because the original version of PROP is written
-in double precision,
-we study the numerical stability of PROP in single precision.
+in double precision\index{precision!double precision},
+we study the numerical stability of PROP in single precision\index{precision!single precision}.
Second, we deploy the whole
computation code of PROP on
GPUs to avoid the overhead generated by
The serial version of PROP is implemented in Fortran~90 and uses
-for linear algebra operations BLAS and LAPACK routines
+for linear algebra operations BLAS\index{BLAS} and LAPACK\index{LAPACK} routines
which are fully optimized for x86 architecture.
This
program
to a linear equation solver from 2 to 1.
To implement this version, CAPS
-used HMPP, a
+used HMPP\index{HMPP}, a
commercial hybrid and parallel compiler,
-based on compiler directives like the new OpenACC standard\footnote{See: \url{www.openacc-standard.org}}.
+based on compiler directives like the new OpenACC\index{OpenACC} standard\footnote{See: \url{www.openacc-standard.org}}.
If the matrices are large enough (the limit sizes are experimental parameters),
they are multiplied on the GPU, otherwise on the CPU.
CAPS
- used the MKL BLAS implementation on an Intel Xeon
+ used the MKL BLAS\index{BLAS} implementation on an Intel Xeon
x5560 quad core CPU (2.8 GHz)
-and the CUBLAS library (CUDA 2.2) on one Tesla C1060 GPU.
+and the CUBLAS\index{CUBLAS} library (CUDA 2.2) on one Tesla C1060 GPU.
On the large data set (see Table~\ref{data-sets}), CAPS
obtained a speedup of 1.15 for the GPU
version over the CPU one (with multi-threaded MKL calls on the four
CPU cores). This limited gain in performance is mainly
-due to the use of double precision computation
+due to the use of double precision\index{precision!double precision} computation
and to the small or medium sizes of most matrices.
For these matrices, the computation gain on
the GPU is indeed
the
intermediate data transfers between
the host (CPU) and the GPU. We will also study the
-stability of PROP in single precision because
-single precision computation is faster on the GPU
+stability of PROP in single precision\index{precision!single precision} because
+single precision\index{precision!single precision} computation is faster on the GPU
and CPU-GPU data transfers are twice as fast as those performed in
-double precision.
+double precision\index{precision!double precision}.
-\section{Numerical validation of PROP in single precision}
+\section{Numerical validation\index{numerical validation} of PROP in single precision\index{precision!single precision}}
\label{single-precision}
\begin{comment}
\hline
\end{tabular}
\end{center}
-\caption{\label{sp-distrib}Error distribution for medium case in single precision}
+\caption{\label{sp-distrib}Error distribution for medium case in single precision\index{precision!single precision}}
\end{table}
\end{comment}
Floating-point input data, computation and output data of PROP are
-originally in double precision format.
+originally in double precision\index{precision!double precision} format.
PROP produces a standard $R$-matrix H-file \cite{FARM_2DRMP}
and a collection of Rmat00X files (where X
ranges from 0 to the number of scattering energies - 1)
(last program of the 2DRMP suite).
Their text equivalent are the prop.out
and the prop00X.out files.
-To study the validity of PROP results in single precision,
+To study the validity of PROP results in single precision\index{precision!single precision},
first,
reference results are
- generated by running the serial version of PROP in double precision.
+ generated by running the serial version of PROP in double precision\index{precision!double precision}.
Data used in the most costly computation parts are read from input files in
-double precision format and then
-cast to single precision format.
-PROP results (input of FARM) are computed in single precision and written
-into files in double precision.
+double precision\index{precision!double precision} format and then
+cast to single precision\index{precision!single precision} format.
+PROP results (input of FARM) are computed in single precision\index{precision!single precision} and written
+into files in double precision\index{precision!double precision}.
\subsection{Medium case study}
\begin{figure}[h]
\begin{center}
\includegraphics*[width=0.9\linewidth]{Chapters/chapter15/figures/error.pdf}
-\caption{\label{fig:sp-distrib} Error distribution for medium case in single precision}
+\caption{\label{fig:sp-distrib} Error distribution for medium case in single precision\index{precision!single precision}}
\end{center}
\end{figure}
Relative errors of approximately 5\% impact values the order of
magnitude of which is at most 1.E2.
For instance, the value 164 produced by the reference version of
- PROP becomes 172 in the single precision version.
+ PROP becomes 172 in the single precision\index{precision!single precision} version.
\end{itemize}
-To study the impact of the single precision version of PROP on the
+To study the impact of the single precision\index{precision!single precision} version of PROP on the
FARM program, the cross-section
results files corresponding to
transitions
{1s2p} & 0.08 & {1s4s} & 0.20 \\ \hline
{1s3s} & 0.17 &2p4d & 1.60 \\ \hline
\end{tabular}
-\caption{\label{sp-farm}Impact on FARM of the single precision version of PROP}
+\caption{\label{sp-farm}Impact on FARM of the single precision\index{precision!single precision} version of PROP}
\end{center}
\end{table}
To examine in more detail the impact of PROP on FARM,
cross sections above the ionization threshold (1 Ryd)
are compared in single and
-double precision for
+double precision\index{precision!double precision} for
transitions amongst the 1s, \dots 4s, 2p, \dots 4p, 3d, 4d target states.
This comparison is carried out by generating 45 plots. In all the
- plots, results in single and double precision match except for few
+ plots, results in single and double precision\index{precision!double precision} match except for few
scattering energies which are very close to pseudo-state thresholds.
For example Fig.~\ref{1s2p} and \ref{1s4d} present the scattering energies corresponding to the
-{1s2p} and {1s4d} cross-sections computed in single and double precision. For some cross-sections,
+{1s2p} and {1s4d} cross-sections computed in single and double precision\index{precision!double precision}. For some cross-sections,
increasing a threshold parameter from 1.E-4 to 1.E-3 in the FARM
program
results in energies close to threshold being avoided
and therefore
-the cross-sections in double and single precision match more
+the cross-sections in double and single precision\index{precision!single precision} match more
accurately.
This is the case for instance for cross-section 1s2p (see Fig.~\ref{1s2p3}).
However for other cross-sections (such as 1s4d) some problematic energies remain even if the
\end{figure}
As a conclusion, the medium case study shows that the execution of
-PROP in single precision leads to a few inexact scattering energies to
+PROP in single precision\index{precision!single precision} leads to a few inexact scattering energies to
be computed by the FARM program for some cross-sections.
Thanks to a suitable threshold parameter in the FARM program these problematic energies may possibly
be skipped.
Instead of investigating deeper the choice of such a parameter for the medium case, we analyze the
-single precision computation in a more
+single precision\index{precision!single precision} computation in a more
realistic case in Sect.~\ref{huge}.
\begin{comment}
The conclusion of the medium case study is that running PROP in single
parameter values are used in the FARM program in order to skip the
problematic energies that are too close to the pseudo-state
thresholds. To verify if this conclusion is still valid with a larger
-data set, the single precision computation is analyzed in a more
+data set, the single precision\index{precision!single precision} computation is analyzed in a more
realistic case in Sect.~\ref{huge}.
\end{comment}
\end{figure}
We study here the impact on FARM of the PROP program run in
-single precision for the huge case (see Table~\ref{data-sets}).
+single precision\index{precision!single precision} for the huge case (see Table~\ref{data-sets}).
The cross-sections
corresponding to all
atomic target states 1s \dots 7i are explored, which
As expected, all the plots exhibit large differences between single and double
precision cross-sections.
For example Fig.~\ref{1s2pHT} and \ref{1s4dHT} present the 1s2p and 1s4d cross-sections computed in
-single and double precision for the huge case.
-We can conclude that PROP in single precision gives invalid results
+single and double precision\index{precision!double precision} for the huge case.
+We can conclude that PROP in single precision\index{precision!single precision} gives invalid results
for realistic simulation cases above the ionization threshold.
Therefore the deployment of PROP on GPU, described in Sect.~\ref{gpu-implem},
-has been carried out in double precision.
+has been carried out in double precision\index{precision!double precision}.
\section{Towards a complete deployment of PROP on GPUs}
\label{gpu-implem}
We now detail how PROP has been progressively deployed on
-GPUs in double precision in order to avoid the
+GPUs in double precision\index{precision!double precision} in order to avoid the
expensive memory transfers between the host and the GPU.
Different versions with successive improvements and optimizations are presented.
We use CUDA~\cite{CUDA_ProgGuide} for GPU programming, as well as the
-CUBLAS~\cite{CUBLAS}
+CUBLAS\index{CUBLAS}~\cite{CUBLAS}
and MAGMA \cite{MAGMA} libraries for linear algebra operations.
-Since PROP is written in Fortran 90, {\em wrappers} in C are used to
-enable calls to CUDA kernels from PROP routines.
+Since PROP is written in Fortran 90, {\em wrappers\index{wrapper}} in C are used to
+enable calls to CUDA kernels from PROP routines.
\subsection{Computing the output $R$-matrix on GPU}
These copies, along with possible scalings or transpositions, are
implemented as CUDA kernels which can be applied to two
matrices of any size and starting at any offset.
- Memory accesses are coalesced \cite{CUDA_ProgGuide} in order to
+ Memory accesses are coalesced\index{coalesced memory accesses} \cite{CUDA_ProgGuide} in order to
provide the best performance for such memory-bound kernels.
\item[Step 2] (``Local copies''):~data are copied from
local $R$-matrices to temporary arrays ($U$, $V$) and to $\Re^{O}$.
is added to matrix $A$ (via a CUDA kernel) and zeroes are written in
$\Re^{O}$ where required.
\item[Step 3] (``Linear system solving''):~matrix $A$ is factorized
- using the MAGMA DGETRF
+ using the MAGMA DGETRF\index{MAGMA functions!DGETRF}
routine and the result is stored in-place.
\item[Step 4] (``Linear system solving'' cont.):~the matrix system
- of linear equations $AW$ = $V$ is solved using the MAGMA DGETRS
+ of linear equations $AW$ = $V$ is solved using the MAGMA DGETRS\index{MAGMA functions!DGETRS}
routine. The solution is stored in matrix $V$.
\item[Step 5] (``Output matrix product''):~matrix $U$
- is multiplied by matrix $V$ using the CUBLAS DGEMM
+ is multiplied by matrix $V$ using the CUBLAS\index{CUBLAS} DGEMM
routine. The result is stored in a temporary matrix~$t$.
\item[Step 6] (``Output add''):~$t$ is added to $\Re^{O}$ (CUDA
kernel).
is a matrix product between
one $i$ amplitude array and one transposed $j$ amplitude array
which is performed by a single DGEMM
-BLAS call.
+BLAS\index{BLAS} call.
In this version, hereafter referred to as GPU
V2\label{gpuv2}, $i$ and $j$ amplitude arrays are transferred to the
GPU memory and the required matrix multiplications are performed on
-the GPU (via CUBLAS routines).
+the GPU (via CUBLAS\index{CUBLAS} routines).
The involved matrices having medium sizes (either $3066 \times 383$ or
GPU than on the CPU, thanks to the massively parallel architecture of
the GPU and thanks to its higher internal memory bandwidth.
-\subsection{Using double-buffering to overlap I/O and computation}
+\subsection{Using double-buffering\index{double-buffering} to overlap I/O and computation}
\begin{figure}[t]
\centering
linearly with the strip number, and rapidly exceeds the I/O
time.
-It is thus interesting to use a double-buffering technique to overlap the
+It is thus interesting to use a double-buffering\index{double-buffering} technique to overlap the
I/O time with the evaluation time:
for each sector, the evaluation of sector $n$ is performed
(on GPU) simultaneously with the reading of data for sector
data structures
used for storing data read from I/O files for each sector.
This version, hereafter referred to as GPU
-V4\label{gpuv4}, uses POSIX threads. Two threads are
+V4\label{gpuv4}, uses POSIX threads\index{POSIX threads}. Two threads are
executed concurrently: an I/O thread that reads data from I/O files
for each sector, and a computation thread, dedicated to the propagation
of the global $R$-matrix, that performs successively for each sector
thread mechanisms.
-\subsection{Matrix padding}
+\subsection{Matrix padding\index{padding}}
The CUBLAS DGEMM
-performance and the MAGMA DGETRF/DGETRS
+performance and the MAGMA DGETRF\index{MAGMA functions!DGETRF}/DGETRS\index{MAGMA functions!DGETRS}
performance is reduced when the sizes (or
the leading dimensions) of the matrix are not multiples of the inner blocking size \cite{NTD10a}.
This inner blocking size can be 32 or 64, depending on the computation
This corresponds indeed to the optimal size for the matrix product on the
Fermi architecture \cite{NTD10b}. And as far as linear system solving is
concerned, all the matrices have sizes which are multiples of 383: we
-therefore use padding to obtain multiples of 384 (which are
+therefore use padding\index{padding} to obtain multiples of 384 (which are
again multiples of 64).
-It can be noticed that this padding has to be performed dynamically
+It can be noticed that this padding\index{padding} has to be performed dynamically
as the matrices increase in size during the propagation
(when possible the
maximum required storage space is however allocated only once in the
UPMC (Universit\'e Pierre et Marie Curie, Paris, France).
As a remark, the execution times measured on the C2050 would be the same
on the C2070 and on the C2075, the only difference between these GPUs
-being their memory size and their TDP (Thermal Design Power).
+being their memory size and their TDP (Thermal Design Power)\index{TDP (Thermal Design Power)}.
We emphasize that the execution times correspond to the
complete propagation for all six energies of the large case (see
Table~\ref{data-sets}), that is to say to the complete execution of
(e.g. the huge case) should be proportional
to those reported in Table~\ref{table:time}.
-These tests, which have been performed with CUDA 3.2, CUBLAS 3.2 and
+These tests, which have been performed with CUDA 3.2, CUBLAS\index{CUBLAS} 3.2 and
MAGMA 0.2,
show that the successive GPU versions of PROP offer
increasing, and at the end interesting, speedups.
which also accelerates the computation of
amplitude arrays thanks to the GPU.
The
-double-buffering technique implemented in V4
+double-buffering\index{double-buffering} technique implemented in V4
effectively enables the overlapping of
I/O operations with computation, while the
-padding implemented in V5 also improves the computation times.
+padding\index{padding} implemented in V5 also improves the computation times.
It
-is noticed that the padding
+is noticed that the padding\index{padding}
does offer much more performance gain with,
-for example, CUDA 3.1 and the MAGMA DGEMM~\cite{NTD10b}: the
+for example, CUDA 3.1 and the MAGMA DGEMM\index{MAGMA functions!DGEMM}~\cite{NTD10b}: the
speedup with respect to one
CPU core was increased from 6.3 to 8.1 on C1060, and from 9.5 to 14.3
on C2050.
-Indeed since CUBLAS 3.2 performance has been improved for non block multiple
+Indeed since CUBLAS\index{CUBLAS} 3.2 performance has been improved for non block multiple
matrix sizes through MAGMA code~\cite{NTD10a}.
Although for all versions the C2050 (with its improved
-double precision performance) offers up to almost
+double precision\index{precision!double precision} performance) offers up to almost
double speedup compared to
the C1060, the performance obtained with both architectures justifies the use of
the GPU for such an application.
CPU-GPU transfers and the linear system solving.
The CPU-GPU transfers are mainly due to the $j$ amplitude arrays, and
currently still correspond to minor times. When required, the
-double-buffering technique may also be used to overlap such transfers
-with computation on the GPU.
+double-buffering\index{double-buffering} technique may also be used to overlap such transfers with computation on the GPU.
-\section{Propagation of multiple concurrent energies on GPU}
+\section{Propagation of multiple concurrent energies on GPU}\index{concurrent kernel execution}
Finally, we present here an improvement that can
benefit from the Fermi architecture, as well as from the newest Kepler
architecture,
both of which enable the concurrent execution of multiple
-CUDA kernels, thus offering additional speedup on
+CUDA kernels\index{concurrent kernel execution}, thus offering additional speedup on
GPUs for small or medium computation grain kernels.
In our case, the performance gain on the GPU is indeed limited
since most matrices have small or medium sizes.
By using multiple streams within one CUDA context~\cite{CUDA_ProgGuide},
we can propagate multiple energies
-concurrently on the Fermi GPU.
+concurrently\index{concurrent kernel execution} on the Fermi GPU.
It can be noticed that all GPU computations for all streams are
launched by the same host thread. We therefore rely here on the {\em legacy
-API} of CUBLAS~\cite{CUBLAS} (like MAGMA)
+API} of CUBLAS\index{CUBLAS}~\cite{CUBLAS} (like MAGMA)
without thread safety problems.
A {\em breadth first} issue order is used for kernel
launchs \cite{CUDA_stream}: for a given GPU kernel, all kernel launchs
are indeed issued together in the host thread, using one stream for each
concurrent energy, in order to maximize concurrent kernel
-execution.
+execution\index{concurrent kernel execution}.
Of course, the memory available on the GPU must be large enough to
store all data structures required by each energy.
Moreover, multiple streams are also used within the
In order to have enough GPU memory to run two or three concurrent
energies for the large case, we use one C2070 GPU
(featuring 6GB of memory)
-with one Intel X5650 hex-core CPU, CUDA 4.1 and CUBLAS 4.1, as
+with one Intel X5650 hex-core CPU, CUDA 4.1 and CUBLAS\index{CUBLAS} 4.1, as
well as the latest MAGMA release (version 1.3.0).
Substantial changes have been required
in the MAGMA calls with respect to the previous versions of PROP that were using MAGMA 0.2.
With the more realistic large case, the performance gain is lower mainly because of
the increase in matrix sizes, which implies a better GPU usage
with only one energy on the GPU. The concurrent execution of multiple
-kernels is also limited by other operations on the
+kernels\index{concurrent kernel execution} is also limited by other operations on the
GPU \cite{CUDA_ProgGuide,CUDA_stream} and by the current MAGMA code which
prevents concurrent MAGMA calls in different streams.
Better speedups can be here expected on the latest Kepler GPUs which
offer additional compute power, and whose {\em Hyper-Q} feature may help
improve further the GPU utilization with concurrent energies.
On the contrary, the same code on the C1060 shows no speedup
- since the concurrent kernel launches are
-serialized on this previous GPU architecture.
+ since the concurrent kernel launches are serialized on this previous GPU architecture.
\section{Conclusion and future work}
-\label{conclusion}
-
+\label{conclusion}
In this chapter, we have presented our methodology and our results in
the deployment on
a GPU of an application (the PROP program) in atomic physics.
We have started by studying the numerical stability of PROP using
-single precision arithmetic. This has shown that PROP
-using single precision, while relatively stable for some small cases,
+single precision\index{precision!single precision} arithmetic. This has shown that PROP
+using single precision\index{precision!single precision}, while relatively stable for some small cases,
gives unsatisfactory results for realistic simulation cases above the
ionization threshold where there is a
significant density of pseudo-states. It is
investigation.
We have
-therefore deployed the PROP code in double precision on
+therefore deployed the PROP code in double precision\index{precision!double precision} on
a GPU, with successive improvements. The different GPU versions
each offer increasing speedups over the CPU version.
Compared to the single (respectively four) core(s) CPU version, the
optimal GPU implementation
gives a speedup of 8.2 (resp. 4.6) on one C1060 GPU,
and a speedup of 15.9 (resp. 9.0) on one
-C2050 GPU with improved double precision performance.
+C2050 GPU with improved double precision\index{precision!double precision} performance.
An additional gain of around 15\%
can also be obtained on one Fermi GPU
with large memory (C2070) thanks to concurrent kernel execution.
Such speedups
cannot be directly compared with the 1.15 speedup
-obtained with the HMPP version, since in our tests the CPUs are
-different and the CUBLAS versions are more recent.
+obtained with the HMPP\index{HMPP} version, since in our tests the CPUs are
+different and the CUBLAS\index{CUBLAS} versions are more recent.
However, the programming effort required
progressively to deploy PROP on GPUs clearly offers improved and interesting speedups for this
-real-life application in double precision with varying-size matrices.
+real-life application in double precision\index{precision!double precision} with varying-size matrices.
We are currently working on a hybrid CPU-GPU version that spreads the
and the GPU. This will enable
multiple energy execution on the CPU, with
one or several core(s) dedicated to each energy (via multi-threaded
-BLAS libraries). Multiple
+BLAS\index{BLAS} libraries). Multiple
concurrent energies may also be propagated on each Fermi GPU.
By merging this work with the current MPI PROP program, we will
obtain a scalable hybrid CPU-GPU version.