+\begin{table}
+$$
+\begin{array}{|cc|cccccccccccccccc|}
+\hline
+x &=&1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&0\\
+\hline
+S^i &=&0&1&1&0&0&1&1&0&1&1&1&0&0&1&1&1\\
+\hline
+x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
+\hline
+
+\hline
+ \end{array}
+$$
+\caption{Example of an arbitrary round of the proposed generator}
+\label{TableExemple}
+\end{table}
+
+
+
+\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIPRNG}
+\begin{lstlisting}
+unsigned int CIPRNG() {
+ static unsigned int x = 123123123;
+ unsigned long t1 = xorshift();
+ unsigned long t2 = xor128();
+ unsigned long t3 = xorwow();
+ x = x^(unsigned int)t1;
+ x = x^(unsigned int)(t2>>32);
+ x = x^(unsigned int)(t3>>32);
+ x = x^(unsigned int)t2;
+ x = x^(unsigned int)(t1>>32);
+ x = x^(unsigned int)t3;
+ return x;
+}
+\end{lstlisting}
+
+
+
+
+
+In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based on chaotic iterations
+ is presented. The xor operator is represented by \textasciicircum.
+This function uses three classical 64-bits PRNGs, namely the \texttt{xorshift}, the
+\texttt{xor128}, and the \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them
+``xor-like PRNGs''.
+As
+each xor-like PRNG uses 64-bits whereas our proposed generator works with 32-bits,
+we use the command \texttt{(unsigned int)}, that selects the 32 least significant bits of a given integer, and the code
+\texttt{(unsigned int)(t3$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}.
+
+So producing a pseudorandom number needs 6 xor operations
+with 6 32-bits numbers that are provided by 3 64-bits PRNGs. This version successfully passes the
+stringent BigCrush battery of tests~\cite{LEcuyerS07}.
+
+\section{Efficient PRNGs based on Chaotic Iterations on GPU}
+\label{sec:efficient PRNG gpu}
+
+In order to take benefits from the computing power of GPU, a program needs to have
+independent blocks of threads that can be computed simultaneously. In general,
+the larger the number of threads is, the more local memory is used, and the less
+branching instructions are used (if, while, ...), the better the performances on GPU is.
+Obviously, having these requirements in mind, it is possible to build a program similar to
+the one presented in Algorithm \ref{algo:seqCIPRNG}, which computes pseudorandom numbers
+on GPU.
+To do so, we must firstly recall that in
+ the CUDA~\cite{Nvid10} environment, threads have a local
+identifier called \texttt{ThreadIdx}, which is relative to the block containing them.
+
+
+\subsection{Naive Version for GPU}
+
+
+It is possible to deduce from the CPU version a quite similar version adapted to GPU.
+The simple principle consists to make each thread of the GPU computing the CPU version of our PRNG.
+Of course, the three xor-like
+PRNGs used in these computations must have different parameters.
+In a given thread, these lasts are
+randomly picked from another PRNGs.
+The initialization stage is performed by the CPU.
+To do it, the ISAAC PRNG~\cite{Jenkins96} is used to set all the
+parameters embedded into each thread.
+
+The implementation of the three
+xor-like PRNGs is straightforward when their parameters have been
+allocated in the GPU memory. Each xor-like works with an internal
+number $x$ that saves the last generated pseudorandom number. Additionally, the
+implementation of the xor128, the xorshift, and the xorwow respectively require
+4, 5, and 6 unsigned long as internal variables.
+
+\begin{algorithm}
+
+\KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
+PRNGs in global memory\;
+NumThreads: number of threads\;}
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadIdx is concerned by the computation} {
+ retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
+ \For{i=1 to n} {
+ compute a new PRNG as in Listing\ref{algo:seqCIPRNG}\;
+ store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadIdx]\;
+}
+
+\caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
+\label{algo:gpu_kernel}
+\end{algorithm}
+
+Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on
+GPU. Due to the available memory in the GPU and the number of threads
+used simultenaously, the number of random numbers that a thread can generate
+inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in
+algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and
+if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
+then the memory required to store all of the internals variables of both the xor-like
+PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
+and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times
+2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
+
+This generator is able to pass the whole BigCrush battery of tests, for all
+the versions that have been tested depending on their number of threads
+(called \texttt{NumThreads} in our algorithm, tested until $10$ millions).
+
+\begin{remark}
+The proposed algorithm has the advantage to manipulate independent
+PRNGs, so this version is easily adaptable on a cluster of computers too. The only thing
+to ensure is to use a single ISAAC PRNG. To achieve this requirement, a simple solution consists in
+using a master node for the initialization. This master node computes the initial parameters
+for all the differents nodes involves in the computation.
+\end{remark}
+
+\subsection{Improved Version for GPU}
+
+As GPU cards using CUDA have shared memory between threads of the same block, it
+is possible to use this feature in order to simplify the previous algorithm,
+i.e., to use less than 3 xor-like PRNGs. The solution consists in computing only
+one xor-like PRNG by thread, saving it into the shared memory, and then to use the results
+of some other threads in the same block of threads. In order to define which
+thread uses the result of which other one, we can use a permutation array that
+contains the indexes of all threads and for which a permutation has been
+performed.
+
+In Algorithm~\ref{algo:gpu_kernel2}, two permutations arrays are used.
+The variable \texttt{offset} is computed using the value of
+\texttt{permutation\_size}. Then we can compute \texttt{o1} and \texttt{o2}
+representing the indexes of the other threads whose results are used
+by the current one. In this algorithm, we consider that a 64-bits xor-like
+PRNG has been chosen, and so its two 32-bits parts are used.
+
+This version also can pass the whole {\it BigCrush} battery of tests.
+
+\begin{algorithm}
+
+\KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs
+in global memory\;
+NumThreads: Number of threads\;
+tab1, tab2: Arrays containing permutations of size permutation\_size\;}
+
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadId is concerned} {
+ retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory and x\;
+ offset = threadIdx\%permutation\_size\;
+ o1 = threadIdx-offset+tab1[offset]\;
+ o2 = threadIdx-offset+tab2[offset]\;
+ \For{i=1 to n} {
+ t=xor-like()\;
+ t=t$\oplus$shmem[o1]$\oplus$shmem[o2]\;
+ shared\_mem[threadId]=t\;
+ x = x $\oplus$ t\;
+
+ store the new PRNG in NewNb[NumThreads*threadId+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadId]\;
+}
+
+\caption{main kernel for the chaotic iterations based PRNG GPU efficient
+version}
+\label{algo:gpu_kernel2}
+\end{algorithm}
+
+\subsection{Theoretical Evaluation of the Improved Version}
+
+A run of Algorithm~\ref{algo:gpu_kernel2} consists in an operation ($x=x\oplus t$) having
+the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
+system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
+iterations is realized between the last stored value $x$ of the thread and a strategy $t$
+(obtained by a bitwise exclusive or between a value provided by a xor-like() call
+and two values previously obtained by two other threads).
+To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
+we must guarantee that this dynamical system iterates on the space
+$\mathcal{X} = \mathcal{P}\left(\llbracket 1, \mathsf{N} \rrbracket\right)^\mathds{N}\times\mathds{B}^\mathsf{N}$.
+The left term $x$ obviously belongs into $\mathds{B}^ \mathsf{N}$.
+To prevent from any flaws of chaotic properties, we must check that the right
+term (the last $t$), corresponding to the strategies, can possibly be equal to any
+integer of $\llbracket 1, \mathsf{N} \rrbracket$.
+
+Such a result is obvious, as for the xor-like(), all the
+integers belonging into its interval of definition can occur at each iteration, and thus the
+last $t$ respects the requirement. Furthermore, it is possible to
+prove by an immediate mathematical induction that, as the initial $x$
+is uniformly distributed (it is provided by a cryptographically secure PRNG),
+the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
+(this can be stated by an immediate mathematical
+induction), and thus the next $x$ is finally uniformly distributed.
+
+Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
+chaotic iterations presented previously, and for this reason, it satisfies the
+Devaney's formulation of a chaotic behavior.
+
+\section{Experiments}
+\label{sec:experiments}
+
+Different experiments have been performed in order to measure the generation
+speed. We have used a first computer equipped with a Tesla C1060 NVidia GPU card
+and an
+Intel Xeon E5530 cadenced at 2.40 GHz, and
+a second computer equipped with a smaller CPU and a GeForce GTX 280.
+All the
+cards have 240 cores.
+
+In Figure~\ref{fig:time_xorlike_gpu} we compare the quantity of pseudorandom numbers
+generated per second with various xor-like based PRNG. In this figure, the optimized
+versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions
+embed the three xor-like PRNGs described in Listing~\ref{algo:seqCIPRNG}. In
+order to obtain the optimal performances, the storage of pseudorandom numbers
+into the GPU memory has been removed. This step is time consuming and slows down the numbers
+generation. Moreover this storage is completely
+useless, in case of applications that consume the pseudorandom
+numbers directly after generation. We can see that when the number of threads is greater
+than approximately 30,000 and lower than 5 millions, the number of pseudorandom numbers generated
+per second is almost constant. With the naive version, this value ranges from 2.5 to
+3GSamples/s. With the optimized version, it is approximately equal to
+20GSamples/s. Finally we can remark that both GPU cards are quite similar, but in
+practice, the Tesla C1060 has more memory than the GTX 280, and this memory
+should be of better quality.
+As a comparison, Listing~\ref{algo:seqCIPRNG} leads to the generation of about
+138MSample/s when using one core of the Xeon E5530.