+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
+\chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comte}
+
+
+\chapter{Pseudo Random Number Generator on GPU}
+\label{chapter18}
+
+\section{Introduction}
+
+Randomness is of importance in many fields such as scientific
+simulations or cryptography. ``Random numbers'' can mainly be
+generated either by a deterministic and reproducible algorithm called
+a pseudorandom number generator (PRNG). In this paper, we focus on
+reproducible generators, useful for instance in Monte-Carlo based
+simulators. These domains need PRNGs that are statistically
+irreproachable. In some fields such as in numerical simulations,
+speed is a strong requirement that is usually attained by using
+parallel architectures. In that case, a recurrent problem is that a
+deflation of the statistical qualities is often reported, when the
+parallelization of a good PRNG is realized. In some fields such as in
+numerical simulations, speed is a strong requirement that is usually
+attained by using parallel architectures. In that case, a recurrent
+problem is that a deflation of the statistical qualities is often
+reported, when the parallelization of a good PRNG is realized. This
+is why ad-hoc PRNGs for each possible architecture must be found to
+achieve both speed and randomness. On the other side, speed is not
+the main requirement in cryptography: the great need is to
+define \emph{secure} generators able to withstand malicious
+attacks. Roughly speaking, an attacker should not be able in practice
+to make the distinction between numbers obtained with the secure
+generator and a true random sequence. Or, in an equivalent
+formulation, he or she should not be able (in practice) to predict the
+next bit of the generator, having the knowledge of all the binary
+digits that have been already released. ``Being able in practice''
+refers here to the possibility to achieve this attack in polynomial
+time, and to the exponential growth of the difficulty of this
+challenge when the size of the parameters of the PRNG increases.
+
+
+Finally, a small part of the community working in this domain focuses on a
+third requirement, that is to define chaotic generators.
+The main idea is to take benefits from a chaotic dynamical system to obtain a
+generator that is unpredictable, disordered, sensible to its seed, or in other word chaotic.
+Their desire is to map a given chaotic dynamics into a sequence that seems random
+and unassailable due to chaos.
+However, the chaotic maps used as a pattern are defined in the real line
+whereas computers deal with finite precision numbers.
+This distortion leads to a deflation of both chaotic properties and speed.
+Furthermore, authors of such chaotic generators often claim their PRNG
+as secure due to their chaos properties, but there is no obvious relation
+between chaos and security as it is understood in cryptography.
+This is why the use of chaos for PRNG still remains marginal and disputable.
+
+
+The remainder of this paper is organized as follows.
+A COMPLETER
+
+
+\section{Basic Recalls}
+\label{section:BASIC RECALLS}
+
+This section is devoted to basic definitions and terminologies in the fields of
+topological chaos and chaotic iterations. We assume the reader is familiar
+with basic notions on topology (see for instance~\cite{Devaney}).
+
+
+\subsection{Devaney's Chaotic Dynamical Systems}
+
+
+A COMPLETER
+
+
+
+\section{Toward Efficiency and Improvement for CI PRNG}
+\label{sec:efficient PRNG}
+
+\subsection{First Efficient Implementation of a PRNG based on Chaotic Iterations}
+%
+%Based on the proof presented in the previous section, it is now possible to
+%improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}.
+%The first idea is to consider
+%that the provided strategy is a pseudorandom Boolean vector obtained by a
+%given PRNG.
+%An iteration of the system is simply the bitwise exclusive or between
+%the last computed state and the current strategy.
+%Topological properties of disorder exhibited by chaotic
+%iterations can be inherited by the inputted generator, we hope by doing so to
+%obtain some statistical improvements while preserving speed.
+%
+%%RAPH : j'ai viré tout ca
+%% Let us give an example using 16-bits numbers, to clearly understand how the bitwise xor operations
+%% are
+%% done.
+%% Suppose that $x$ and the strategy $S^i$ are given as
+%% binary vectors.
+%% Table~\ref{TableExemple} shows the result of $x \oplus S^i$.
+
+%% \begin{table}
+%% \begin{scriptsize}
+%% $$
+%% \begin{array}{|cc|cccccccccccccccc|}
+%% \hline
+%% x &=&1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&0\\
+%% \hline
+%% S^i &=&0&1&1&0&0&1&1&0&1&1&1&0&0&1&1&1\\
+%% \hline
+%% x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
+%% \hline
+
+%% \hline
+%% \end{array}
+%% $$
+%% \end{scriptsize}
+%% \caption{Example of an arbitrary round of the proposed generator}
+%% \label{TableExemple}
+%% \end{table}
+
+
+
+
+\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label={algo:seqCIPRNG}}
+\begin{small}
+\begin{lstlisting}
+
+unsigned int CIPRNG() {
+ static unsigned int x = 123123123;
+ unsigned long t1 = xorshift();
+ unsigned long t2 = xor128();
+ unsigned long t3 = xorwow();
+ x = x^(unsigned int)t1;
+ x = x^(unsigned int)(t2>>32);
+ x = x^(unsigned int)(t3>>32);
+ x = x^(unsigned int)t2;
+ x = x^(unsigned int)(t1>>32);
+ x = x^(unsigned int)t3;
+ return x;
+}
+\end{lstlisting}
+\end{small}
+
+
+
+In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based
+on chaotic iterations is presented. The xor operator is represented by
+\textasciicircum. This function uses three classical 64-bits PRNGs, namely the
+\texttt{xorshift}, the \texttt{xor128}, and the
+\texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them ``xor-like
+PRNGs''. As each xor-like PRNG uses 64-bits whereas our proposed generator
+works with 32-bits, we use the command \texttt{(unsigned int)}, that selects the
+32 least significant bits of a given integer, and the code \texttt{(unsigned
+ int)(t$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}.
+
+Thus producing a pseudorandom number needs 6 xor operations with 6 32-bits numbers
+that are provided by 3 64-bits PRNGs. This version successfully passes the
+stringent BigCrush battery of tests~\cite{LEcuyerS07}.
+At this point, we thus
+have defined an efficient and statistically unbiased generator. Its speed is
+directly related to the use of linear operations, but for the same reason,
+this fast generator cannot be proven as secure.
+
+
+
+\subsection{Efficient PRNGs based on Chaotic Iterations on GPU}
+\label{sec:efficient PRNG gpu}
+
+In order to take benefits from the computing power of GPU, a program
+needs to have independent blocks of threads that can be computed
+simultaneously. In general, the larger the number of threads is, the
+more local memory is used, and the less branching instructions are
+used (if, while, ...), the better the performances on GPU is.
+Obviously, having these requirements in mind, it is possible to build
+a program similar to the one presented in Listing
+\ref{algo:seqCIPRNG}, which computes pseudorandom numbers on GPU. To
+do so, we must firstly recall that in the CUDA~\cite{Nvid10}
+environment, threads have a local identifier called
+\texttt{ThreadIdx}, which is relative to the block containing
+them. Furthermore, in CUDA, parts of the code that are executed by the GPU, are
+called {\it kernels}.
+
+
+\subsection{Naive Version for GPU}
+
+
+It is possible to deduce from the CPU version a quite similar version adapted to GPU.
+The simple principle consists in making each thread of the GPU computing the CPU version of our PRNG.
+Of course, the three xor-like
+PRNGs used in these computations must have different parameters.
+In a given thread, these parameters are
+randomly picked from another PRNGs.
+The initialization stage is performed by the CPU.
+To do it, the ISAAC PRNG~\cite{Jenkins96} is used to set all the
+parameters embedded into each thread.
+
+The implementation of the three
+xor-like PRNGs is straightforward when their parameters have been
+allocated in the GPU memory. Each xor-like works with an internal
+number $x$ that saves the last generated pseudorandom number. Additionally, the
+implementation of the xor128, the xorshift, and the xorwow respectively require
+4, 5, and 6 unsigned long as internal variables.
+
+
+\begin{algorithm}
+\begin{small}
+\KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
+PRNGs in global memory\;
+NumThreads: number of threads\;}
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadIdx is concerned by the computation} {
+ retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
+ \For{i=1 to n} {
+ compute a new PRNG as in Listing\ref{algo:seqCIPRNG}\;
+ store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadIdx]\;
+}
+\end{small}
+\caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
+\label{algo:gpu_kernel}
+\end{algorithm}
+
+
+
+Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on
+GPU. Due to the available memory in the GPU and the number of threads
+used simultaneously, the number of random numbers that a thread can generate
+inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in
+algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and
+if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
+then the memory required to store all of the internals variables of both the xor-like
+PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
+and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times
+2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
+
+This generator is able to pass the whole BigCrush battery of tests, for all
+the versions that have been tested depending on their number of threads
+(called \texttt{NumThreads} in our algorithm, tested up to $5$ million).
+
+
+\subsection{Improved Version for GPU}
+
+As GPU cards using CUDA have shared memory between threads of the same block, it
+is possible to use this feature in order to simplify the previous algorithm,
+i.e., to use less than 3 xor-like PRNGs. The solution consists in computing only
+one xor-like PRNG by thread, saving it into the shared memory, and then to use the results
+of some other threads in the same block of threads. In order to define which
+thread uses the result of which other one, we can use a combination array that
+contains the indexes of all threads and for which a combination has been
+performed.
+
+In Algorithm~\ref{algo:gpu_kernel2}, two combination arrays are used. The
+variable \texttt{offset} is computed using the value of
+\texttt{combination\_size}. Then we can compute \texttt{o1} and \texttt{o2}
+representing the indexes of the other threads whose results are used by the
+current one. In this algorithm, we consider that a 32-bits xor-like PRNG has
+been chosen. In practice, we use the xor128 proposed in~\cite{Marsaglia2003} in
+which unsigned longs (64 bits) have been replaced by unsigned integers (32
+bits).
+
+This version can also pass the whole {\it BigCrush} battery of tests.
+
+\begin{algorithm}
+\begin{small}
+\KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs
+in global memory\;
+NumThreads: Number of threads\;
+array\_comb1, array\_comb2: Arrays containing combinations of size combination\_size\;}
+
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadId is concerned} {
+ retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory and x\;
+ offset = threadIdx\%combination\_size\;
+ o1 = threadIdx-offset+array\_comb1[offset]\;
+ o2 = threadIdx-offset+array\_comb2[offset]\;
+ \For{i=1 to n} {
+ t=xor-like()\;
+ t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
+ shared\_mem[threadId]=t\;
+ x = x\textasciicircum t\;
+
+ store the new PRNG in NewNb[NumThreads*threadId+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadId]\;
+}
+\end{small}
+\caption{Main kernel for the chaotic iterations based PRNG GPU efficient
+version\label{IR}}
+\label{algo:gpu_kernel2}
+\end{algorithm}
+
+\subsection{Chaos Evaluation of the Improved Version}
+
+A run of Algorithm~\ref{algo:gpu_kernel2} consists in an operation ($x=x\oplus t$) having
+the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
+system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
+iterations is realized between the last stored value $x$ of the thread and a strategy $t$
+(obtained by a bitwise exclusive or between a value provided by a xor-like() call
+and two values previously obtained by two other threads).
+To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
+we must guarantee that this dynamical system iterates on the space
+$\mathcal{X} = \mathcal{P}\left(\llbracket 1, \mathsf{N} \rrbracket\right)^\mathds{N}\times\mathds{B}^\mathsf{N}$.
+The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
+To prevent from any flaws of chaotic properties, we must check that the right
+term (the last $t$), corresponding to the strategies, can possibly be equal to any
+integer of $\llbracket 1, \mathsf{N} \rrbracket$.
+
+Such a result is obvious, as for the xor-like(), all the
+integers belonging into its interval of definition can occur at each iteration, and thus the
+last $t$ respects the requirement. Furthermore, it is possible to
+prove by an immediate mathematical induction that, as the initial $x$
+is uniformly distributed (it is provided by a cryptographically secure PRNG),
+the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
+(this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
+
+Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
+chaotic iterations presented previously, and for this reason, it satisfies the
+Devaney's formulation of a chaotic behavior.
+
+\section{Experiments}
+\label{sec:experiments}
+
+Different experiments have been performed in order to measure the generation
+speed. We have used a first computer equipped with a Tesla C1060 NVidia GPU card
+and an
+Intel Xeon E5530 cadenced at 2.40 GHz, and
+a second computer equipped with a smaller CPU and a GeForce GTX 280.
+All the
+cards have 240 cores.
+
+In Figure~\ref{fig:time_xorlike_gpu} we compare the quantity of pseudorandom numbers
+generated per second with various xor-like based PRNGs. In this figure, the optimized
+versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions
+embed the three xor-like PRNGs described in Listing~\ref{algo:seqCIPRNG}. In
+order to obtain the optimal performances, the storage of pseudorandom numbers
+into the GPU memory has been removed. This step is time consuming and slows down the numbers
+generation. Moreover this storage is completely
+useless, in case of applications that consume the pseudorandom
+numbers directly after generation. We can see that when the number of threads is greater
+than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated
+per second is almost constant. With the naive version, this value ranges from 2.5 to
+3GSamples/s. With the optimized version, it is approximately equal to
+20GSamples/s. Finally we can remark that both GPU cards are quite similar, but in
+practice, the Tesla C1060 has more memory than the GTX 280, and this memory
+should be of better quality.
+As a comparison, Listing~\ref{algo:seqCIPRNG} leads to the generation of about
+138MSample/s when using one core of the Xeon E5530.
+
+\begin{figure}[htbp]
+\begin{center}
+ \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_xorlike_gpu.pdf}
+\end{center}
+\caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG}
+\label{fig:time_xorlike_gpu}
+\end{figure}
+
+
+
+
+
+In Figure~\ref{fig:time_bbs_gpu} we highlight the performances of the optimized
+BBS-based PRNG on GPU. On the Tesla C1060 we obtain approximately 700MSample/s
+and on the GTX 280 about 670MSample/s, which is obviously slower than the
+xorlike-based PRNG on GPU. However, we will show in the next sections that this
+new PRNG has a strong level of security, which is necessarily paid by a speed
+reduction.
+
+\begin{figure}[htbp]
+\begin{center}
+ \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_bbs_gpu.pdf}
+\end{center}
+\caption{Quantity of pseudorandom numbers generated per second using the BBS-based PRNG}
+\label{fig:time_bbs_gpu}
+\end{figure}
+
+All these experiments allow us to conclude that it is possible to
+generate a very large quantity of pseudorandom numbers statistically perfect with the xor-like version.
+To a certain extend, it is also the case with the secure BBS-based version, the speed deflation being
+explained by the fact that the former version has ``only''
+chaotic properties and statistical perfection, whereas the latter is also cryptographically secure,
+as it is shown in the next sections.
+
+
+
+
+
+
+
+
+
+
+
+\putbib[Chapters/chapter18/biblio]
+