+\begin{remark}
+Algorithm~\ref{algo:gpu_kernel} has the advantage of manipulating independent
+PRNGs, so this version is easily usable on a cluster of computers. The only
+requirement is to use a single ISAAC PRNG. A simple solution is to let a
+master node perform the initialization and compute the initial parameters
+for all the different nodes involved in the computation.
+\end{remark}
+
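+The initialization described in the remark can be sketched as follows. This is
+only an illustration under stated assumptions: \texttt{std::mt19937\_64} is
+used as a stand-in for ISAAC, which is not part of the C++ standard library,
+and the names \texttt{NodeParams} and \texttt{master\_init} are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <random>
#include <vector>

// Hypothetical container for the initial parameters sent to one node.
struct NodeParams {
    std::vector<std::uint64_t> seeds;  // one seed per xor-like PRNG on that node
};

// The master node draws every seed from ONE generator, so all nodes are
// initialized from a single stream. std::mt19937_64 is only a stand-in
// for ISAAC here.
std::vector<NodeParams> master_init(std::uint64_t master_seed,
                                    std::size_t nodes,
                                    std::size_t prngs_per_node) {
    assert(prngs_per_node > 0);
    std::mt19937_64 master(master_seed);
    std::vector<NodeParams> all(nodes);
    for (auto& node : all) {
        node.seeds.reserve(prngs_per_node);
        for (std::size_t j = 0; j < prngs_per_node; ++j) {
            std::uint64_t s = master();
            // Assumption: the per-thread generators need a nonzero state.
            while (s == 0) s = master();
            node.seeds.push_back(s);
        }
    }
    return all;
}
```

+In a real cluster, each \texttt{NodeParams} entry would then be sent to its
+node; the transport layer is deliberately left out of this sketch.
+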
+\subsection{Improved version for GPU}
+
+As GPU cards using CUDA have shared memory between threads of the same block, it
+is possible to use this feature in order to simplify the previous algorithm,
+i.e., to use fewer than 3 xor-like PRNGs. The solution consists in computing only
+one xor-like PRNG per thread, saving it into shared memory, and accessing the
+results of some other threads in the same block of threads.
+
+\begin{algorithm}
+
+\KwIn{InternalVarXorLikeArray: array with the internal variables of one xor-like PRNG per thread, in global memory\;
+NumThreads: Number of threads\;
+tab1, tab2: Arrays containing permutations\;}
+
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadId is concerned} {
+ retrieve data from InternalVarXorLikeArray[threadId] in local variables\;
+ offset = threadId\%32\;
+ \For{i=1 to n} {
+ t=xor-like()\;
+ shared\_mem[threadId]=(unsigned int)t\;
+ x = x$\oplus$ (unsigned int) t\;
+ x = x$\oplus$ (unsigned int) (t$>>$32)\;
+ x = x$\oplus$ shared\_mem[tab1[offset]]\;
+ x = x$\oplus$ shared\_mem[tab2[offset]]\;
+
+ store the new random number x in NewNb[NumThreads*threadId+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadId]\;
+}
+
+\caption{Main kernel for the GPU-efficient version of the chaotic iterations based PRNG}
+\label{algo:gpu_kernel2}
+\end{algorithm}
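+
+The data flow of Algorithm~\ref{algo:gpu_kernel2} can be checked on a CPU with
+the sketch below, which simulates a single block of 32 threads. Several points
+are assumptions made for the illustration: Marsaglia's \texttt{xorshift64}
+stands in for the xor-like PRNG, the permutations are simple rotations, each
+thread writes its \texttt{n} outputs contiguously, and the helpers
+\texttt{make\_state} and \texttt{rotation} are hypothetical. The two inner
+loops mimic the two phases that, on a GPU, would presumably be separated by a
+block-level barrier such as \texttt{\_\_syncthreads()}.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr int kBlock = 32;  // assumed block size, matching threadId % 32

// Stand-in xor-like generator: Marsaglia's xorshift64 (state must be nonzero).
inline std::uint64_t xorshift64(std::uint64_t& s) {
    s ^= s << 13;
    s ^= s >> 7;
    s ^= s << 17;
    return s;
}

struct BlockState {
    std::array<std::uint64_t, kBlock> prng;  // one xor-like state per thread
    std::array<std::uint32_t, kBlock> x;     // chaotic-iteration state per thread
};

// Hypothetical seeding helper; the odd multiplier keeps every state nonzero.
BlockState make_state(std::uint64_t seed) {
    BlockState st{};
    for (int tid = 0; tid < kBlock; ++tid) {
        st.prng[tid] = (seed + tid + 1) * 0x9E3779B97F4A7C15ULL;
        st.x[tid] = static_cast<std::uint32_t>(st.prng[tid] >> 32);
    }
    return st;
}

// Hypothetical permutation: rotation of the thread indices by k.
std::array<int, kBlock> rotation(int k) {
    std::array<int, kBlock> p{};
    for (int i = 0; i < kBlock; ++i) p[i] = (i + k) % kBlock;
    return p;
}

// Simulates n iterations of one block; thread tid writes to out[tid * n + i].
std::vector<std::uint32_t> run_block(BlockState& st,
                                     const std::array<int, kBlock>& tab1,
                                     const std::array<int, kBlock>& tab2,
                                     int n) {
    assert(n >= 0);
    std::vector<std::uint32_t> out(static_cast<std::size_t>(kBlock) * n);
    std::array<std::uint32_t, kBlock> shared_mem{};
    std::array<std::uint64_t, kBlock> t{};
    for (int i = 0; i < n; ++i) {
        // Phase 1: every thread draws t and publishes its low 32 bits.
        for (int tid = 0; tid < kBlock; ++tid) {
            t[tid] = xorshift64(st.prng[tid]);
            shared_mem[tid] = static_cast<std::uint32_t>(t[tid]);
        }
        // Phase 2: every thread mixes its own t with two values of other threads.
        for (int tid = 0; tid < kBlock; ++tid) {
            const int offset = tid % kBlock;
            std::uint32_t x = st.x[tid];
            x ^= static_cast<std::uint32_t>(t[tid]);
            x ^= static_cast<std::uint32_t>(t[tid] >> 32);
            x ^= shared_mem[tab1[offset]];
            x ^= shared_mem[tab2[offset]];
            st.x[tid] = x;
            out[static_cast<std::size_t>(tid) * n + i] = x;
        }
    }
    return out;
}
```

+Splitting each iteration into a write phase and a read phase makes explicit
+that every thread must have published its value before any other thread reads it.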