\maketitle
\begin{abstract}
+In this paper we present a new produce pseudo-random numbers generator (PRNG) on
+graphics processing units (GPU). This PRNG is based on chaotic iterations. it
+is proven to be chaotic in the Devany's formulation. We propose an efficient
+implementation for GPU which succeeds to the {\it BigCrush}, the hardest
+batteries of test of TestU01. Experimentations show that this PRNG can generate
+about 20 billions of random numbers per second on Tesla C1060 and NVidia GTX280
+cards.
+
\end{abstract}
mathematical formulation of chaotic dynamical systems.
In a previous work~\cite{bgw09:ip} we have proposed a new familly of chaotic
-PRNG based on chaotic iterations (IC). We have proven that these PRNGs are
+PRNG based on chaotic iterations. We have proven that these PRNGs are
chaotic in the Devaney's sense. In this paper we propose a faster version which
is also proven to be chaotic.
allows us to generated almost 20 billions of random numbers per second.
In order to establish that our PRNGs are chaotic according to the Devaney's
-formulation, we extend what we have proposed in~\cite{guyeux10}. Moreover, we define a new distance to measure the disorder in the chaos and we prove some interesting properties with this distance.
+formulation, we extend what we have proposed in~\cite{guyeux10}. Moreover, we
+define a new distance to measure the disorder in the chaos and we prove some
+interesting properties with this distance.
The rest of this paper is organised as follows. In Section~\ref{section:related
- works} we review some GPU implementions of PRNG. Section~\ref{section:BASIC RECALLS} gives some basic recalls on Devanay's formation of chaos and
-chaotic iterations. In Section~\ref{sec:pseudo-random} the proof of chaos of our
-PRNGs is studied. Section~\ref{sec:efficient prng} presents an efficient
+ works} we review some GPU implementions of PRNG. Section~\ref{section:BASIC
+ RECALLS} gives some basic recalls on Devanay's formation of chaos and chaotic
+iterations. In Section~\ref{sec:pseudo-random} the proof of chaos of our PRNGs
+is studied. Section~\ref{sec:efficient prng} presents an efficient
implementation of our chaotic PRNG on a CPU. Section~\ref{sec:efficient prng
gpu} describes the GPU implementation of our chaotic PRNG. In
Section~\ref{sec:experiments} some experimentations are presented.
Section~\ref{sec:de la relativité du désordre} describes the relativity of
-disorder. In Section~\ref{sec: chaos order topology} the proof that chaotic
-iterations can be described by iterations on a real interval is established. Finally, we give a conclusion and some perspectives.
+disorder. In Section~\ref{sec: chaos order topology} the proof that chaotic
+iterations can be described by iterations on a real interval is
+established. Finally, we give a conclusion and some perspectives.
\end{array}
$$
-%% \begin{figure}[htbp]
-%% \begin{center}
-%% \fbox{
-%% \begin{minipage}{14cm}
-%% unsigned int CIprng() \{\\
-%% static unsigned int x = 123123123;\\
-%% unsigned long t1 = xorshift();\\
-%% unsigned long t2 = xor128();\\
-%% unsigned long t3 = xorwow();\\
-%% x = x\textasciicircum (unsigned int)t1;\\
-%% x = x\textasciicircum (unsigned int)(t2$>>$32);\\
-%% x = x\textasciicircum (unsigned int)(t3$>>$32);\\
-%% x = x\textasciicircum (unsigned int)t2;\\
-%% x = x\textasciicircum (unsigned int)(t1$>>$32);\\
-%% x = x\textasciicircum (unsigned int)t3;\\
-%% return x;\\
-%% \}
-%% \end{minipage}
-%% }
-%% \end{center}
-%% \caption{sequential Chaotic Iteration PRNG}
-%% \label{algo:seqCIprng}
-%% \end{figure}
+
each thread of the GPU. Of course, it is essential that the three xor-like
PRNGs used for our computation have different parameters. So we chose them
randomly with another PRNG. As the initialisation is performed by the CPU, we
-have chosen to use the ISAAC PRNG~\ref{Jenkins96} to initalize all the
+have chosen to use the ISAAC PRNG~\cite{Jenkins96} to initalize all the
parameters for the GPU version of our PRNG. The implementation of the three
xor-like PRNGs is straightforward as soon as their parameters have been
allocated in the GPU memory. Each xor-like PRNGs used works with an internal
\KwOut{NewNb: array containing random numbers in global memory}
\If{threadId is concerned} {
- retrieve data from InternalVarXorLikeArray[threadId] in local variables\;
+ retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory\;
offset = threadIdx\%permutation\_size\;
o1 = threadIdx-offset+tab1[offset]\;
o2 = threadIdx-offset+tab2[offset]\;
\For{i=1 to n} {
t=xor-like()\;
- shared\_mem[threadId]=(unsigned int)t\;
- x = x $\oplus$ (unsigned int) t\;
- x = x $\oplus$ (unsigned int) (t>>32)\;
- x = x $\oplus$ shared[o1]\;
- x = x $\oplus$ shared[o2]\;
+ t=t$\oplus$shmem[o1]$\oplus$shmem[o2]\;
+ shared\_mem[threadId]=t\;
+ x = x $\oplus$ t\;
store the new PRNG in NewNb[NumThreads*threadId+i]\;
}
\subsection{Theoretical Evaluation of the Improved Version}
-A run of Algorithm~\ref{algo:gpu_kernel2} consists in four operations having
+A run of Algorithm~\ref{algo:gpu_kernel2} consists in three operations having
the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
-system of Eq.~\ref{eq:generalIC}. That is, four iterations of the general chaotic
+system of Eq.~\ref{eq:generalIC}. That is, three iterations of the general chaotic
iterations are realized between two stored values of the PRNG.
To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
we must guarantee that this dynamical system iterates on the space
Different experiments have been performed in order to measure the generation
speed. We have used a computer equiped with Tesla C1060 NVidia GPU card and an
-Intel Xeon E5530 cadenced at 2.40 GHz for our experiments.
+Intel Xeon E5530 cadenced at 2.40 GHz for our experiments and we have used
+another one equipped with a less performant CPU and a GeForce GTX 280. Both
+cards have 240 cores.
In Figure~\ref{fig:time_gpu} we compare the number of random numbers generated
-per second. In order to obtain the optimal number we remove the storage of
-random numbers in the GPU memory. This step is time consumming and slows down
-the random number generation. Moreover, if you are interested by applications
-that consome random number directly when they are generated, their storage is
-completely useless. In this figure we can see that when the number of threads is
-greater than approximately 30,000 upto 5 millions the number of random numbers
-generated per second is almost constant. With the naive version, it is between
-2.5 and 3GSample/s. With the optimized version, it is almost equals to
-20GSample/s.
+per second. The xor-like prng is a xor64 described in~\cite{Marsaglia2003}. In
+order to obtain the optimal performance we remove the storage of random numbers
+in the GPU memory. This step is time consumming and slows down the random number
+generation. Moreover, if you are interested by applications that consome random
+numbers directly when they are generated, their storage is completely
+useless. In this figure we can see that when the number of threads is greater
+than approximately 30,000 upto 5 millions the number of random numbers generated
+per second is almost constant. With the naive version, it is between 2.5 and
+3GSample/s. With the optimized version, it is approximately equals to
+20GSample/s. Finally we can remark that both GPU cards are quite similar. In
+practice, the Tesla C1060 has more memory than the GTX 280 and this memory
+should be of better quality.
\begin{figure}[htbp]
\begin{center}