apply the xor operator between the current number and the strategy. In
order to obtain the strategy, we also use a classical PRNG.
Here is an example with 16-bit numbers showing how the bitwise operations are
applied. Suppose that $x$ and the strategy $S^i$ are written in binary notation.
Then the following shows the result of $x$ xor $S^i$.
$$
\begin{array}{r@{\;=\;}l}
x & 0100\,1001\,1010\,0011 \\
S^i & 0110\,1100\,0101\,1001 \\
x \oplus S^i & 0010\,0101\,1111\,1010
\end{array}
$$
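The same operation, written in C, is sketched below; the two 16-bit values are the
arbitrary ones used in the example above, and the variable names are ours, not those
of the actual implementation.

\begin{verbatim}
#include <stdint.h>
#include <stdio.h>

int main(void) {
  /* arbitrary 16-bit example values, as above */
  uint16_t x  = 0x49A3;   /* 0100 1001 1010 0011: current number */
  uint16_t Si = 0x6C59;   /* 0110 1100 0101 1001: strategy       */

  x = x ^ Si;             /* apply the xor operator bitwise      */

  printf("x xor S^i = 0x%04X\n", (unsigned int)x);  /* 0x25FA    */
  return 0;
}
\end{verbatim}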
In listing~\ref{algo:seqCIprng} a sequential version of our chaotic iterations
based PRNG is presented. The xor operator is represented by
\textasciicircum. This function uses three classical 64-bit PRNGs: the
\texttt{xorshift}, the \texttt{xor128} and the \texttt{xorwow}. In the
following, we call them xor-like PRNGs. These three PRNGs are presented
in~\cite{Marsaglia2003}. As each xor-like PRNG works with 64 bits whereas our
PRNG works with 32 bits, the cast \texttt{(unsigned int)} selects the 32 least
significant bits and \texttt{(unsigned int)(t3$>>$32)} selects the 32 most
significant bits of the variable \texttt{t3}. Producing one random number
therefore requires 6 xor operations with the 6 32-bit numbers obtained from the
3 64-bit PRNGs. This version passes the BigCrush tests of the TestU01 battery
[P. L'Ecuyer and R. Simard, TestU01].
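As an illustration of this construction, a minimal sequential sketch in C is given
below. It is not the exact code of listing~\ref{algo:seqCIprng}: for brevity a single
64-bit xorshift is used as a stand-in for the three xor-like PRNGs (\texttt{xorshift},
\texttt{xor128}, \texttt{xorwow}), and all identifiers are ours.

\begin{verbatim}
#include <stdint.h>

/* Stand-in 64-bit xor-like PRNG (Marsaglia's xorshift64); the real
   generator uses three different ones: xorshift, xor128 and xorwow. */
static uint64_t xorlike_state = 88172645463325252ULL;
static uint64_t xorlike64(void) {
  uint64_t t = xorlike_state;
  t ^= t << 13;  t ^= t >> 7;  t ^= t << 17;
  return xorlike_state = t;
}

/* One step of the chaotic iterations based PRNG: 6 xor operations
   with the 6 32-bit halves of 3 numbers produced by 64-bit PRNGs. */
static unsigned int ci_prng_step(unsigned int x) {
  uint64_t t1 = xorlike64();
  uint64_t t2 = xorlike64();
  uint64_t t3 = xorlike64();
  x ^= (unsigned int) t1;  x ^= (unsigned int)(t1 >> 32);
  x ^= (unsigned int) t2;  x ^= (unsigned int)(t2 >> 32);
  x ^= (unsigned int) t3;  x ^= (unsigned int)(t3 >> 32);
  return x;
}
\end{verbatim}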
\section{Efficient PRNG based on chaotic iterations on GPU}
On GPU, the fewer branching instructions (if, while, ...) are used, the better
performance is obtained. So with algorithm~\ref{algo:seqCIprng} presented in the
previous section, it is possible to build a similar program which computes
pseudorandom numbers on GPU. In the CUDA [ref] environment, threads have a local
identifier, called \texttt{threadIdx}, relative to the block containing them.
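As a reminder, the generic CUDA fragment below shows how a global thread index is
usually obtained from \texttt{threadIdx} and the block identifier \texttt{blockIdx};
it is not specific to our PRNG.

\begin{verbatim}
__global__ void example_kernel(unsigned int *out, int size) {
  /* threadIdx.x: local index of the thread inside its block,
     blockIdx.x : index of the block, blockDim.x: threads per block */
  int id = blockIdx.x * blockDim.x + threadIdx.x;
  if (id < size)
    out[id] = id;   /* each thread handles its own slot */
}
\end{verbatim}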
\subsection{Naive version for GPU}
From the CPU version, it is possible to obtain a quite similar version for GPU.
The principle consists in assigning the computation of a PRNG, as in the sequential
case, to each thread of the GPU.

\begin{algorithm}
\KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like PRNGs in global memory\;
NumThreads: Number of threads\;}
\KwOut{NewNb: array containing random numbers in global memory}
\If{threadIdx is concerned by the computation} {
	retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
\For{i=1 to n} {
		compute a new random number as in Listing~\ref{algo:seqCIprng}\;
		store the new number in NewNb[NumThreads*threadIdx+i]\;
}
	store internal variables in InternalVarXorLikeArray[threadIdx]\;
}
\caption{Main kernel of the naive GPU version of the chaotic iterations based PRNG}
\label{algo:gpu_kernel}
\end{algorithm}
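A possible CUDA realization of this kernel is sketched below. It follows the
structure of Algorithm~\ref{algo:gpu_kernel}, but the state layout (one 64-bit word
per xor-like PRNG), the separate \texttt{Seeds} array, the stand-in helper
\texttt{xorlike64} and the output indexing are illustrative assumptions, not the
exact code used in our experiments.

\begin{verbatim}
#include <stdint.h>

/* Stand-in for one of the three 64-bit xor-like PRNGs. */
__device__ uint64_t xorlike64(uint64_t *s) {
  uint64_t t = *s;
  t ^= t << 13;  t ^= t >> 7;  t ^= t << 17;
  return *s = t;
}

__global__ void ci_prng_naive(uint64_t *InternalVarXorLikeArray,
                              unsigned int *Seeds,
                              unsigned int *NewNb,
                              int NumThreads, int n) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < NumThreads) {                /* "threadIdx is concerned" */
    /* retrieve internal variables in local (register) variables */
    uint64_t s1 = InternalVarXorLikeArray[3 * tid];
    uint64_t s2 = InternalVarXorLikeArray[3 * tid + 1];
    uint64_t s3 = InternalVarXorLikeArray[3 * tid + 2];
    unsigned int x = Seeds[tid];         /* 32-bit internal state */

    for (int i = 0; i < n; i++) {
      uint64_t t1 = xorlike64(&s1);
      uint64_t t2 = xorlike64(&s2);
      uint64_t t3 = xorlike64(&s3);
      /* 6 xor operations, as in the sequential version */
      x ^= (unsigned int) t1;  x ^= (unsigned int)(t1 >> 32);
      x ^= (unsigned int) t2;  x ^= (unsigned int)(t2 >> 32);
      x ^= (unsigned int) t3;  x ^= (unsigned int)(t3 >> 32);
      NewNb[n * tid + i] = x;  /* n consecutive numbers per thread */
    }
    /* store internal variables back in global memory */
    InternalVarXorLikeArray[3 * tid]     = s1;
    InternalVarXorLikeArray[3 * tid + 1] = s2;
    InternalVarXorLikeArray[3 * tid + 2] = s3;
    Seeds[tid] = x;
  }
}
\end{verbatim}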
Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the PRNG on
GPU. According to the available memory in the GPU and the number of threads used
simultaneously, the number of random numbers that a thread can generate inside a
kernel is limited, i.e. the variable \texttt{n} in
algorithm~\ref{algo:gpu_kernel}. For example, if $100,000$ threads are used and
if $n=100$\footnote{In fact, we also need to add the initial seed (a 32-bit number).},
then the memory required to store the internal variables of the xor-like PRNGs and
the random numbers generated by our PRNG is equal to
$100,000\times((4+5+6)\times 2+(1+100))=13,100,000$ 32-bit numbers, i.e. about $52$MB.
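Spelled out, with 4 bytes per 32-bit number, this figure decomposes as
$$
\underbrace{(4+5+6)\times 2}_{\mbox{\scriptsize xor-like states}}
+ \underbrace{1}_{\mbox{\scriptsize seed}}
+ \underbrace{100}_{\mbox{\scriptsize generated numbers}}
= 131 \mbox{ numbers per thread},
$$
$$
100,000 \times 131 \times 4 \mbox{ bytes} = 52,400,000 \mbox{ bytes} \approx 52 \mbox{ MB}.
$$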
All the tests performed to pass the BigCrush of TestU01 succeeded. Different
numbers of threads, called \texttt{NumThreads} in our algorithm, have been tested,
up to $10$ million.
\begin{remark}
Algorithm~\ref{algo:gpu_kernel} has the advantage of manipulating independent
PRNGs, which makes it easy to obtain independent random streams
for all the different nodes involved in the computation.
\end{remark}
\subsection{Improved version for GPU}

As GPU cards using CUDA have shared memory between threads of the same block, it
is possible to use this feature in order to simplify the previous algorithm,
i.e. to use fewer than 3 xor-like PRNGs per thread. The solution consists in
computing only one xor-like PRNG per thread, saving its result into shared memory,
and using the results of some other threads in the same block. In order to define
which thread uses the result of which other one, we can use permutation arrays
which contain thread indexes on which a permutation has been applied. In
Algorithm~\ref{algo:gpu_kernel2}, two permutation arrays are used. The variable
\texttt{offset} is computed using the value of \texttt{permutation\_size}. Then we
can compute \texttt{o1} and \texttt{o2}, which represent the indexes of the other
threads whose results are used by the current thread. In the algorithm, we
consider that a 64-bit xor-like PRNG is used, which is why both 32-bit parts are
used.

This version also passes the BigCrush battery of tests.

\begin{algorithm}

\KwIn{InternalVarXorLikeArray: array with internal variables of the xor-like PRNG in global memory\;
NumThreads: Number of threads\;
tab1, tab2: Arrays containing permutations of size permutation\_size\;}

\KwOut{NewNb: array containing random numbers in global memory}
\If{threadIdx is concerned by the computation} {
	retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
	offset = threadIdx\%permutation\_size\;
	o1 = threadIdx-offset+tab1[offset]\;
	o2 = threadIdx-offset+tab2[offset]\;
	\For{i=1 to n} {
		t=xor-like()\;
		shared\_mem[threadIdx]=(unsigned int)t\;
		x = x $\oplus$ (unsigned int) t\;
		x = x $\oplus$ (unsigned int) (t$>>$32)\;
		x = x $\oplus$ shared\_mem[o1]\;
		x = x $\oplus$ shared\_mem[o2]\;
		store the new random number in NewNb[NumThreads*threadIdx+i]\;
	}
	store internal variables in InternalVarXorLikeArray[threadIdx]\;
}

\caption{Main kernel of the improved GPU version of the chaotic iterations based PRNG}
\label{algo:gpu_kernel2}
\end{algorithm}
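Again, a possible CUDA sketch of this kernel is given below; it follows
Algorithm~\ref{algo:gpu_kernel2}, but the block size, the shared-memory declaration,
the \texttt{Seeds} array, the synchronization points and the stand-in
\texttt{xorlike64} are illustrative assumptions (in particular, \texttt{NumThreads}
is assumed to be a multiple of the block size so that all threads of a block reach
the barriers).

\begin{verbatim}
#include <stdint.h>

#define BLOCK_SIZE 512        /* assumed number of threads per block */

/* Stand-in 64-bit xor-like PRNG, one instance per thread. */
__device__ uint64_t xorlike64(uint64_t *s) {
  uint64_t t = *s;
  t ^= t << 13;  t ^= t >> 7;  t ^= t << 17;
  return *s = t;
}

__global__ void ci_prng_improved(uint64_t *InternalVarXorLikeArray,
                                 unsigned int *Seeds,
                                 unsigned int *NewNb,
                                 const int *tab1, const int *tab2,
                                 int permutation_size,
                                 int NumThreads, int n) {
  __shared__ unsigned int shared_mem[BLOCK_SIZE];

  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < NumThreads) {
    uint64_t s = InternalVarXorLikeArray[tid];  /* one PRNG per thread */
    unsigned int x = Seeds[tid];

    /* in-block indexes of the two threads whose results are reused;
       tab1 and tab2 hold permutations of {0,...,permutation_size-1},
       and permutation_size is assumed to divide BLOCK_SIZE           */
    int offset = threadIdx.x % permutation_size;
    int o1 = threadIdx.x - offset + tab1[offset];
    int o2 = threadIdx.x - offset + tab2[offset];

    for (int i = 0; i < n; i++) {
      uint64_t t = xorlike64(&s);
      shared_mem[threadIdx.x] = (unsigned int) t;
      __syncthreads();          /* make the values visible to o1, o2 */

      x ^= (unsigned int) t;
      x ^= (unsigned int)(t >> 32);
      x ^= shared_mem[o1];
      x ^= shared_mem[o2];
      NewNb[n * tid + i] = x;

      __syncthreads();          /* before overwriting shared_mem */
    }
    InternalVarXorLikeArray[tid] = s;
    Seeds[tid] = x;
  }
}
\end{verbatim}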
\section{Experiments}
Different experiments have been performed in order to measure the generation speed.
\begin{figure}[t]
\begin{center}
  \includegraphics[scale=.5]{curve_time_gpu.pdf}
\end{center}
\caption{Number of random numbers generated per second}
\label{fig:time_naive_gpu}
\end{figure}
First of all, we have compared the time needed to generate X random numbers with both the CPU version and the GPU version.