+In order to benefit from computing power of GPU, a program needs to define
+independent blocks of threads which can be computed simultaneously. In general,
+the larger the number of threads is, the more local memory is used and the less
+branching instructions are used (if, while, ...), the better performance is
+obtained on GPU. So with algorithm \ref{algo:seqCIprng} presented in the
+previous section, it is possible to build a similar program which computes PRNG
+on GPU.
+
+
+\subsection{Naive version}
+
+From the CPU version, it is possible to obtain a quite similar version for GPU.
+The principe consists in assigning the computation of a PRNG as in sequential to
+each thread of the GPU. Of course, it is essential that the three xor-like
+PRNGs used for our computation have different parameters. So we chose them
+randomly with another PRNG. As the initialisation is performed by the CPU, we
+have chosen to use the ISAAC PRNG [ref] to initalize all the parameters for the
+GPU version of our PRNG. The implementation of the three xor-like PRNGs is
+straightforward as soon as their parameters have been allocated in the GPU
+memory. Each xor-like PRNGs used works with an internal number $x$ which keeps
+the last generated random numbers. Other internal variables are also used by the
+xor-like PRNGs. More precisely, the implementation of the xor128, the xorshift
+and the xorwow respectively require 4, 5 and 6 unsigned long as internal
+variables.
+
+\begin{algorithm}
+
+\KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like PRNGs in global memory\;
+NumThreads: Number of threads\;}
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadId is concerned} {
+ retrieve data from InternalVarXorLikeArray[threadId] in local variables\;
+ \For{i=1 to n} {
+ compute a new PRNG as in Listing\ref{algo:seqCIprng}\;
+ store the new PRNG in NewNb[NumThreads*threadId+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadId]\;
+}
+
+\caption{main kernel for the chaotic iterations based PRNG GPU version}
+\label{algo:gpu_kernel}
+\end{algorithm}
+
+According to the available memory in the GPU and the number of threads used
+simultenaously, the number of random numbers that a thread can generate inside a
+kernel is limited, i.e. the variable \texttt{n} in
+algorithm~\ref{algo:gpu_kernel}. For example, if $100,000$ threads are used and
+if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)}
+then the memory required to store internals variables of xor-like
+PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
+and random number of our PRNG is equals to $100,000\times ((4+5+6)\times
+2+(1+100))=1,310,000$ 32-bits numbers, i.e. about $52$Mb.
+
+All the tests performed to pass the BigCrush of TestU01 succeeded. Different
+number of threads have been tested upto $10$ millions.
+
+\begin{remark}
+Algorithm~\ref{algo:gpu_kernel} has the advantage to manipulate independent
+PRNGs, so this version is easily usable on a cluster of computer. The only thing
+to ensure is to use a single ISAAC PRNG. For this, a simple solution consists in
+using a master node for the initialization which computes the initial parameters
+for all the differents nodes involves in the computation.
+\end{remark}
+
+\subsection{Version more suited to GPU}
+
+As GPU offers shared memory mechanism between threads of the same block, it is
+possible to use this in order to simplify the previous algorithm, i.e. using
+less than 3 xor-like PRNGs. The solution consists in
+
+ threads of the same block compute a random
+number and uses other random numbers of