branching instructions are used (if, while, ...), the better performance is
obtained on GPU. So with algorithm \ref{algo:seqCIprng} presented in the
previous section, it is possible to build a similar program which computes PRNG
-on GPU. The principe consists in assigning the computation of a PRNG as in
-sequential to each thread of the GPU. Of course, it is essential that the three
-xor-like PRNGs used for our computation have different parameters. So we chose
-them randomly with another PRNG. As the initialisation is performed by the CPU,
-we have chosen to use the ISAAC PRNG [ref] to initalize all the parameters for
-the GPU version of our PRNG. The implementation of the three xor-like PRNGs is
+on GPU.
+
+
+\subsection{Naive version for GPU}
+
+From the CPU version, it is possible to obtain a quite similar version for GPU.
+The principe consists in assigning the computation of a PRNG as in sequential to
+each thread of the GPU. Of course, it is essential that the three xor-like
+PRNGs used for our computation have different parameters. So we chose them
+randomly with another PRNG. As the initialisation is performed by the CPU, we
+have chosen to use the ISAAC PRNG [ref] to initalize all the parameters for the
+GPU version of our PRNG. The implementation of the three xor-like PRNGs is
straightforward as soon as their parameters have been allocated in the GPU
memory. Each xor-like PRNGs used works with an internal number $x$ which keeps
the last generated random numbers. Other internal variables are also used by the
store internal variables in InternalVarXorLikeArray[threadId]\;
}
-\caption{main kernel for the chaotic iterations based PRNG GPU version}
+\caption{main kernel for the chaotic iterations based PRNG GPU naive version}
\label{algo:gpu_kernel}
\end{algorithm}
All the tests performed to pass the BigCrush of TestU01 succeeded. Different
number of threads have been tested upto $10$ millions.
+\begin{remark}
+Algorithm~\ref{algo:gpu_kernel} has the advantage to manipulate independent
+PRNGs, so this version is easily usable on a cluster of computer. The only thing
+to ensure is to use a single ISAAC PRNG. For this, a simple solution consists in
+using a master node for the initialization which computes the initial parameters
+for all the differents nodes involves in the computation.
+\end{remark}
+
+\subsection{Improved version for GPU}
+
+As GPU cards using CUDA have shared memory between threads of the same block, it
+is possible to use this feature in order to simplify the previous algorithm,
+i.e. using less than 3 xor-like PRNGs. The solution consists in comuting only
+one xor-like PRNG by thread, saving in into shared memory and accessing result
+of some other threads in the same block of threads.
+
+\begin{algorithm}
+
+\KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs in global memory\;
+NumThreads: Number of threads\;
+tab1, tab2: Arrays containing permutations\;}
+
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadId is concerned} {
+ retrieve data from InternalVarXorLikeArray[threadId] in local variables\;
+ offset = threadId\%32;
+ \For{i=1 to n} {
+ t=xor-like()\;
+ shared\_mem[threadId]=(unsigned int)t\;
+ x = x$\oplus$ (unsigned int) t\;
+ x = x$\oplus$ (unsigned int) (t>>32)\;
+ x = x$\oplus$ shared[tab1[offset]]\;
+ x = x$\oplus$ shared[tab2[offset]]\;
+
+ store the new PRNG in NewNb[NumThreads*threadId+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadId]\;
+}
+
+\caption{main kernel for the chaotic iterations based PRNG GPU efficient version}
+\label{algo:gpu_kernel2}
+\end{algorithm}
\section{Experiments}
Differents experiments have been performed in order to measure the generation speed.