branching instructions are used (if, while, ...), the better performance is
obtained on GPU. So with algorithm \ref{algo:seqCIprng} presented in the
previous section, it is possible to build a similar program which computes PRNG
-on GPU. In the CUDA [ref] environment, threads have a local identificator,
-called \texttt{ThreadIdx} relative to the block containing them.
+on GPU. In the CUDA~\cite{Nvid10} environment, threads have a local
+identifier, called \texttt{threadIdx}, relative to the block containing them.
\subsection{Naive version for GPU}
each thread of the GPU. Of course, it is essential that the three xor-like
PRNGs used for our computation have different parameters. So we chose them
randomly with another PRNG. As the initialisation is performed by the CPU, we
-have chosen to use the ISAAC PRNG [ref] to initalize all the parameters for the
-GPU version of our PRNG. The implementation of the three xor-like PRNGs is
-straightforward as soon as their parameters have been allocated in the GPU
-memory. Each xor-like PRNGs used works with an internal number $x$ which keeps
-the last generated random numbers. Other internal variables are also used by the
-xor-like PRNGs. More precisely, the implementation of the xor128, the xorshift
-and the xorwow respectively require 4, 5 and 6 unsigned long as internal
-variables.
+have chosen to use the ISAAC PRNG~\cite{Jenkins96} to initialize all the
+parameters for the GPU version of our PRNG. The implementation of the three
+xor-like PRNGs is straightforward once their parameters have been
+allocated in the GPU memory. Each xor-like PRNG used works with an internal
+number $x$ which keeps the last generated random number. Other internal
+variables are also used by the xor-like PRNGs. More precisely, the
+implementations of xor128, xorshift and xorwow require
+4, 5 and 6 unsigned long variables, respectively.
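+
+As an illustration of such an internal state, the classical xor128 generator of
+Marsaglia can be written with four words $x$, $y$, $z$, $w$ and one temporary
+$t$. The recurrence below is only a sketch: it assumes 32-bit words and
+Marsaglia's original shift constants $(11, 19, 8)$, which may differ from the
+variant and word size used in our implementation. Here $\oplus$ denotes the
+bitwise exclusive or, and $\ll$, $\gg$ denote logical left and right shifts.
+\[
+\begin{array}{rcl}
+t & \leftarrow & x \oplus (x \ll 11), \\
+(x, y, z) & \leftarrow & (y, z, w), \\
+w & \leftarrow & w \oplus (w \gg 19) \oplus t \oplus (t \gg 8).
+\end{array}
+\]
+In this notation, the new value of $w$ is the number returned by the generator.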
\begin{algorithm}
by the current thread. In the algorithm, we consider that a 64-bit xor-like
PRNG is used, which is why both 32-bit parts are used.
-This version also succeed to the BigCrush batteries of tests.
+This version also passes the {\it BigCrush} battery of tests.
\begin{algorithm}
\section{Experiments}
\label{sec:experiments}
-Different experiments have been performed in order to measure the generation
-speed. In Figure~\ref{fig:time_gpu} we compare the number of random numbers generated per second.
+Different experiments have been performed in order to measure the generation
+speed. We have used a computer equipped with an NVIDIA Tesla C1060 GPU card and
+an Intel Xeon E5530 clocked at 2.40 GHz for our experiments.
+
+In Figure~\ref{fig:time_gpu} we compare the number of random numbers generated
+per second. In order to obtain the optimal number, we remove the storage of
+random numbers in the GPU memory. This step is time-consuming and slows down
+the random number generation. Moreover, for applications that consume random
+numbers directly as they are generated, their storage is
+completely useless. In this figure we can see that when the number of threads is
+greater than approximately 30,000 and up to 5 million, the number of random
+numbers generated per second is almost constant. With the naive version, it is
+between 2.5 and 3~GSample/s. With the optimized version, it is almost equal to
+20~GSample/s.
\begin{figure}[htbp]
\begin{center}
\end{figure}
-First of all we have compared the time to generate X random numbers with both
-the CPU version and the GPU version.
+In comparison, Listing~\ref{algo:seqCIprng} allows us to generate about
+138~MSample/s using only one core of the Xeon E5530.
+
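+From the throughputs reported above, the speed-up of the GPU versions over one
+CPU core can be estimated. This is only a back-of-the-envelope computation from
+the rounded values given in this section, not an additional measurement:
+\[
+\frac{20 \times 10^{9}}{138 \times 10^{6}} \approx 145
+\qquad \mbox{and} \qquad
+\frac{2.5 \times 10^{9}}{138 \times 10^{6}} \approx 18, \quad
+\frac{3 \times 10^{9}}{138 \times 10^{6}} \approx 22,
+\]
+that is, roughly a factor of 145 for the optimized version and between 18 and
+22 for the naive version.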
-Plot the number of random numbers as a function of the number of threads,
-possibly also as a function of the number of threads per block.