BookGPU/Chapters/chapter18/ch18.tex

   1 \chapterauthor{Raphaël Couturier and Christophe Guyeux}{Femto-ST Institute, University of Franche-Comte, France}
   2 %\chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comt\'{e}}
   3
   4
   5 \chapter{Pseudorandom number generator on GPU}
   6 \label{chapter18}
   7
   8 \section{Introduction}
   9
  10 Randomness is of importance in many fields such as scientific
  11 simulations or cryptography.  Random numbers can mainly be
  12 generated by either a deterministic and reproducible algorithm called
  13 a pseudorandom number generator (PRNG)\index{PRNG}, or by a physical nondeterministic
  14 process having all the characteristics of a random noise, called a truly random number
  15 generator (TRNG). In this chapter, we focus on
  16 reproducible generators, useful for instance in MonteCarlo-based
  17 simulators.  These domains need PRNGs that are statistically
  18 irreproachable.  In some fields such as in numerical simulations,
  19 speed is a strong requirement that is usually attained by using
  20 parallel architectures. In that case, a recurrent problem is that a
  21 deflation of the statistical qualities is often reported, when the
  22 parallelization of a good PRNG is realized.  This
  23 is why adhoc PRNGs for each possible architecture must be found to
  24 achieve both speed and randomness.  On the other hand, speed is not
  25 the main requirement in cryptography: the most important point is to
  26 define \emph{secure} generators able to withstand malicious
  27 attacks. Roughly speaking, an attacker should not be able in practice
  28 to make the distinction between numbers obtained with the secure
  29 generator and a true random sequence.  Or, in an equivalent
  30 formulation, he or she should not be able (in practice) to predict the
  31 next bit of the generator, having the knowledge of all the binary
  32 digits that have been already released~\cite{Goldreich}. ``Being able in practice''
  33 refers here to the possibility to achieve this attack in polynomial
  34 time and to the exponential growth of the difficulty of this
  35 challenge when the size of the parameters of the PRNG increases.
  36
  37
  38 Finally, a small part of the community working in this domain focuses on a
  39 third requirement: to define chaotic generators~\cite{kellert1994wake,  Wu20051195,gleick2011chaos}.
  40 The main idea is to  benefit from a chaotic dynamical system to obtain a
  41 generator that is unpredictable, disordered, sensible to its seed, or in other words, chaotic.
  42 These scientists' desire is to map a given chaotic dynamics into a sequence that seems random
  43 and unassailable due to chaos.
  44 However, the chaotic maps used as a pattern are defined in the real line
  45 whereas computers deal with finite precision numbers.
  46 This distortion leads to a deflation of both chaotic properties and speed.
  47 Furthermore, authors of such chaotic generators often claim their PRNG
  48 as secure due to their chaos properties, but there is no obvious relation
  49 between chaos and security as it is understood in cryptography.
  50 This is why the use of chaos for PRNG still remains marginal and disputable.
  51 However, we have established in previous researches that these flaws can
  52 be circumvented by using a tool called choatic iterations.
  53 Such investigations have led to the definition of a new
  54 family of PRNGs that are chaotic while being fast and statistically perfect,
  55 or cryptographically secure~\cite{bgw09:ip,bgw10:ip,bfgw11:ij,bfg12a:ip}. This family is improved and adapted for GPU
  56 architectures in this chapter.
  57
  58
  59 Let us finish this introduction by noticing that, in this chapter,
  60 statistical perfection refers to the ability to pass the whole
  61 {\it BigCrush} battery of tests, which is widely considered as the most
  62 stringent statistical evaluation of a sequence claimed as random.
  63 This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}\index{TestU01}.
  64 More precisely, each time we performed a test on a PRNG, we ran it
  65 twice in order to observe if all $p$-values were inside [0.01, 0.99]. In
  66 fact, we observed that few $p$-values (fewer than 10 out of 160) are sometimes
  67 outside this interval but inside [0.001, 0.999], so that is why a
  68 second run has allowed us to confirm that the values outside are not for
  69 the same test. With this approach all our PRNGs pass the {\it
  70   BigCrush} successfully and all $p$-values are at least once inside
  71 [0.01, 0.99].
  72 Chaos, for its part, refers to the well-established definition of a
  73 chaotic dynamical system defined by Devaney~\cite{Devaney}.
  74
  75 The remainder of this chapter is organized as follows.
  76 Basic definitions and terminologies about both topological chaos
  77 and chaotic iterations are provided in the next section. Various
  78 chaotic iterations based on pseudorandom number generators are then
  79 presented in Section~\ref{sec:efficient PRNG}. They encompass
  80 naive and improved efficient generators for CPU and for GPU.
  81 These generators are finally experimented in Section~\ref{sec:experiments}.
  82
  83
  84 \section{Basic remindees}
  85 \label{section:BASIC RECALLS}
  86
  87 This section is devoted to basic definitions and terminologies in the fields of
  88 topological chaos and chaotic iterations. We assume the reader is familiar
  89 with basic notions on topology (see for instance~\cite{Devaney}).
  90
  91
  92 \subsection{A short presentation of chaos}
  93
  94
  95 Chaos theory studies the behavior of dynamical systems that are perfectly predictable, yet appear to be wildly amorphous and meaningless.
  96 Chaotic systems\index{chaotic systems} are highly sensitive to initial conditions,
  97 which is popularly referred to as the butterfly effect.
  98 In other words, small differences in initial conditions (such as those due to rounding errors in numerical computation) yield widely diverging outcomes,
  99 in general rendering long-term prediction impossible  \cite{kellert1994wake}. This happens even though these systems are deterministic, meaning that their future behavior is fully determined by their initial conditions, with no random elements involved \cite{kellert1994wake}. That is, the deterministic nature of these systems does not make them predictable \cite{kellert1994wake,Werndl01032009}. This behavior is known as deterministic chaos, or simply chaos. It has been well-studied in mathematics and
 100 physics, leading among other things to the well-established definition of Devaney which can be found next.
 101
 102
 103
 104
 105
 106 \subsection{On Devaney's definition of chaos}\index{chaos}
 107 \label{sec:dev}
 108 Consider a metric space $(\mathcal{X},d)$ and a continuous function $f:\mathcal{X}\longrightarrow \mathcal{X}$, for one-dimensional dynamical systems of the form:
 109 \begin{equation}
 110 x^0 \in \mathcal{X} \textrm{  and } \forall n \in \mathds{N}^*, x^n=f(x^{n-1}),
 111 \label{Devaney}
 112 \end{equation}
 113 the following definition of chaotic behavior, formulated by Devaney~\cite{Devaney}, is widely accepted.
 114
 115 \begin{definition}
 116  A dynamical system of Form~(\ref{Devaney}) is said to be chaotic if the following conditions hold.
 117 \begin{itemize}
 118 \item Topological transitivity\index{topological transitivity}:
 119
 120 \begin{equation}
 121 \forall U,V \textrm{ open sets of } \mathcal{X},~\exists k>0, f^k(U) \cap V \neq \varnothing .
 122 \end{equation}
 123
 124 Intuitively, a topologically transitive map has points that eventually move under iteration from one arbitrarily small neighborhood to any other. Consequently, the dynamical system cannot be decomposed into two disjoint open sets that are invariant under the map. Note that if a map possesses a dense orbit, then it is clearly topologically transitive.
 125 \item Density of periodic points in $\mathcal{X}$\index{density of periodic points}:
 126
 127 Let $P=\{p\in \mathcal{X}|\exists n \in \mathds{N}^{\ast}:f^n(p)=p\}$ the set of periodic points of $f$. Then $P$ is dense in $\mathcal{X}$:
 128
 129 \begin{equation}
 130  \overline{P}=\mathcal{X} .
 131 \end{equation}
 132
 133 The density of periodic orbits means that every point in  space is closely approached  by periodic orbits in an arbitrary way. Topologically mixing systems failing this condition may not display sensitivity to initial conditions presented below and, hence,may not be chaotic.
 134 \item Sensitive dependence on initial conditions\index{sensitive dependence on initial conditions}:
 135
 136 $\exists \varepsilon>0,$ $\forall x \in \mathcal{X},$ $\forall \delta >0,$ $\exists y \in \mathcal{X},$ $\exists n \in \mathbb{N},$ $d(x,y)<\delta$ and $d\left(f^n(x),f^n(y)\right) \geqslant \varepsilon.$
 137
 138 Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under the iteration of $f$. Not all points near $x$ need to be eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit.
 139 \end{itemize}
 140
 141 \end{definition}
 142
 143 When $f$ is chaotic, then the system $(\mathcal{X}, f)$ is chaotic and quoting Devaney~\cite[p. 50]{Devaney}: ``it is unpredictable because of the sensitive dependence on initial conditions. It cannot be broken down or decomposed into two subsystems which do not interact because of topological transitivity. And, in the midst of this random behavior, we nevertheless have an element of regularity.'' Fundamentally different behaviors are consequently possible and occur in an unpredictable way.
 144
 145
 146
 147
 148
 149
 150
 151
 152 \subsection{Chaotic iterations}\index{chaotic iterations}
 153 \label{subsection:Chaotic iterations}
 154
 155 Let us now introduce an example of a dynamical systems family that has
 156 the potentiality to become chaotic, depending on the choice of the iteration
 157 function. This family is the basis of the PRNGs we will
 158 develop during this chapter.
 159
 160 \begin{definition}
 161 \label{Chaotic iterations}
 162 The set $\mathds{B}$ denoting $\{0,1\}$, let $f:\mathds{B}^{\mathsf{N}%
 163 }\longrightarrow \mathds{B}^{\mathsf{N}}$ be an ``iteration'' function and $S$ be a
 164 sequence of subsets of $\llbracket 1, \mathsf{N}\rrbracket$ called a chaotic strategy. Then, the so-called \emph{chaotic iterations} are defined by~\cite{Robert1986}:
 165
 166 \begin{equation}
 167 \label{eq:generalIC}
 168 \left\{\begin{array}{l}
 169 x^0\in \mathds{B}^{\mathsf{N}}, \\
 170 \forall n\in \mathds{N}^{\ast },\forall i\in \llbracket1;\mathsf{N}\rrbracket%
 171 ,x_i^n=
 172 \left\{
 173 \begin{array}{ll}
 174 x_i^{n-1} & \text{if}~ i\notin S^n  \\
 175 f(x^{n-1})_{i}  & \text{if}~ i \in S^n.
 176 \end{array}
 177 \right.
 178 \end{array}
 179 \right.
 180 \end{equation}
 181 \end{definition}
 182
 183 In other words, at the $nth$ iteration, only the  cells of $S^{n}$ are
 184 iterated.
 185 Chaotic iterations generate a set of vectors;
 186 they are defined by an initial state $x^{0}$, an iteration function $f$, and a chaotic strategy $S$~\cite{bg10:ij}.
 187 These ``chaotic iterations'' can behave chaotically as defined by Devaney,
 188 depending on the choice of $f$~\cite{bg10:ij}. For instance,
 189 chaos is obtained when $f$ is the vectorial negation.
 190 Note that, with this example of function, chaotic iterations
 191 defined above can be rewritten as
 192 \begin{equation}
 193 \label{equation Oplus}
 194 x^0 \in \llbracket 0,2^\mathsf{N}-1\rrbracket,~\mathcal{S}^n \in \mathcal{P}\left(\llbracket 1,2^\mathsf{N}-1\rrbracket\right)^\mathds{N},~~ x^{n+1}=x^n \oplus \mathcal{S}^n,
 195 \end{equation}
 196 where $\mathcal{P}(X)$ stands for the set of subsets of $X$, whereas
 197 $a\oplus b$ is the bitwise exclusive or operation between the binary
 198 representation of the integers $a$ and $b$. Note that the term
 199  $\mathcal{S}^n$ is directly and obviously linked to the $S^n$ of
 200 Eq.~\ref{eq:generalIC}. Such an iterative sequence
 201 satisfies the Devaney's definition of chaos.
 202
 203 \section{Toward efficiency and improvement for CI PRNG}
 204 \label{sec:efficient PRNG}
 205
 206 \subsection{First efficient implementation of a PRNG based on chaotic iterations}
 207 %
 208 %Based on the proof presented in the previous section, it is now possible to
 209 %improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}.
 210 %The first idea is to consider
 211 %that the provided strategy is a pseudorandom Boolean vector obtained by a
 212 %given PRNG.
 213 %An iteration of the system is simply the bitwise exclusive or between
 214 %the last computed state and the current strategy.
 215 %Topological properties of disorder exhibited by chaotic
 216 %iterations can be inherited by the inputted generator, we hope by doing so to
 217 %obtain some statistical improvements while preserving speed.
 218 %
 219 %%RAPH : j'ai viré tout ca
 220 %% Let us give an example using 16-bits numbers, to clearly understand how the bitwise xor operations
 221 %% are
 222 %% done.
 223 %% Suppose  that $x$ and the  strategy $S^i$ are given as
 224 %% binary vectors.
 225 %% Table~\ref{TableExemple} shows the result of $x \oplus S^i$.
 226
 227 %% \begin{table}
 228 %% \begin{scriptsize}
 229 %% $$
 230 %% \begin{array}{|cc|cccccccccccccccc|}
 231 %% \hline
 232 %% x      &=&1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&0\\
 233 %% \hline
 234 %% S^i      &=&0&1&1&0&0&1&1&0&1&1&1&0&0&1&1&1\\
 235 %% \hline
 236 %% x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
 237 %% \hline
 238
 239 %% \hline
 240 %%  \end{array}
 241 %% $$
 242 %% \end{scriptsize}
 243 %% \caption{Example of an arbitrary round of the proposed generator}
 244 %% \label{TableExemple}
 245 %% \end{table}
 246
 247
 248
 249
 250 \lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label={algo:seqCIPRNG}}
 251 \begin{small}
 252 \begin{lstlisting}
 253
 254 unsigned int CIPRNG() {
 255   static unsigned int x = 123123123;
 256   unsigned long t1 = xorshift();
 257   unsigned long t2 = xor128();
 258   unsigned long t3 = xorwow();
 259   x = x^(unsigned int)t1;
 260   x = x^(unsigned int)(t2>>32);
 261   x = x^(unsigned int)(t3>>32);
 262   x = x^(unsigned int)t2;
 263   x = x^(unsigned int)(t1>>32);
 264   x = x^(unsigned int)t3;
 265   return x;
 266 }
 267 \end{lstlisting}
 268 \end{small}
 269
 270
 271
 272 In Listing~\ref{algo:seqCIPRNG} a sequential  version of the proposed PRNG based
 273 on  chaotic  iterations  is  presented, which extends the generator family
 274 formerly presented in~\cite{bgw09:ip,guyeux10}.   The xor  operator  is  represented  by
 275 \textasciicircum.  This function uses  three classical 64-bit PRNGs, namely the
 276 \texttt{xorshift},         the          \texttt{xor128},         and         the
 277 \texttt{xorwow}~\cite{Marsaglia2003}.  In the following, we call them ``xor-like
 278 PRNGs''.   As each  xor-like PRNG  uses 64-bits  whereas our  proposed generator
 279 works with 32-bits, we use the command \texttt{(unsigned int)}, which selects the
 280 32 least  significant bits  of a given  integer, and the  code \texttt{(unsigned
 281   int)(t$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}.
 282
 283 Thus producing a pseudorandom number needs 6 xor operations with 6 32-bit numbers
 284 that  are provided by  3 64-bit  PRNGs.  This  version successfully  passes the
 285 stringent {\it BigCrush} battery of tests~\cite{LEcuyerS07}.
 286 At this point, we have defined an efficient and statistically unbiased generator. Its speed is
 287 directly related to the use of linear operations, but for the same reason,
 288 this fast generator cannot be proven as secure.
 289
 290
 291
 292 \subsection{Efficient PRNGs based on chaotic iterations on GPU}
 293 \label{sec:efficient PRNG gpu}
 294
 295 In order to  benefit from the computing power  of GPU, a program
 296 needs  to have  independent blocks  of  threads that  can be  computed
 297 simultaneously. In general,  the larger the number of  threads is, the
 298 more local  memory is  used, and the  less branching  instructions are
 299 used  (if,  while,  etc.) and so,  the  better the  performances  on  GPU  are.
 300 Obviously, having these requirements in  mind, it is possible to build
 301 a   program    similar   to    the   one   presented    in  Listing
 302 \ref{algo:seqCIPRNG}, which computes  pseudorandom numbers on GPU.  To
 303 do  so,  we  must   first  recall  that  in  the  CUDA~\cite{Nvid10}
 304 environment,    threads    have     a    local    identifier    called
 305 \texttt{ThreadIdx},  which   is  relative  to   the  block  containing
 306 them. Furthermore, in  CUDA, parts of  the code that are executed by the  GPU are
 307 called {\it kernels}.
 308
 309
 310 \subsection{Naive version for GPU}
 311
 312
 313 It is possible to deduce from the CPU version a quite similar version adapted to GPU.
 314 The simple principle consists of making each thread of the GPU compute the CPU version of our PRNG.
 315 Of course,  the  three xor-like
 316 PRNGs  used in these computations must have different  parameters.
 317 In a given thread, these parameters are
 318 randomly picked from another PRNGs.
 319 The  initialization stage is performed by  the CPU.
 320 To do this, the   Indirection, Shift, Accumulate, Add, and Count (ISAAC)  PRNG~\cite{Jenkins96} is used to  set  all  the
 321 parameters embedded into each thread.
 322
 323 The implementation of  the three
 324 xor-like  PRNGs  is  straightforward  when  their  parameters  have  been
 325 allocated in  the GPU memory.  Each xor-like  works with  an internal
 326 number  $x$  that saves  the  last  generated  pseudorandom number. Additionally,  the
 327 implementation of the  {\it xor128}, the {\it xorshift}, and the  {\it xorwow}, respectively, require
 328 4, 5, and 6 unsigned long as internal variables.
 329
 330
 331 \begin{algorithm}
 332 \begin{small}
 333 \KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
 334 PRNGs in global memory\;
 335 NumThreads: number of threads\;}
 336 \KwOut{NewNb: array containing random numbers in global memory}
 337 \If{threadIdx is concerned by the computation} {
 338   retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
 339   \For{i=1 to n} {
 340     compute a new PRNG as in Listing\ref{algo:seqCIPRNG}\;
 341     store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
 342   }
 343   store internal variables in InternalVarXorLikeArray[threadIdx]\;
 344 }
 345 \end{small}
 346 \caption{main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
 347 \label{algo:gpu_kernel}
 348 \end{algorithm}
 349
 350
 351
 352 Algorithm~\ref{algo:gpu_kernel}  presents a naive  implementation of the proposed  PRNG on
 353 GPU.  Due to the available  memory in the  GPU and the number  of threads
 354 used simultaneously,  the number  of random numbers  that a thread  can generate
 355 inside   a    kernel   is   limited  (i.e.,    the    variable   {\it n}   in
 356 Algorithm~\ref{algo:gpu_kernel}). For instance, if  $100,000$ threads are used and
 357 if $n=100$\footnote{In fact, we need to add the initial seed (a 32-bit number).},
 358 then   the  memory   required   to  store all of the  internal   variables  of both the  xor-like
 359 PRNGs\footnote{We multiply this number by $2$ in order to count 32-bit numbers.}
 360 and  the pseudorandom  numbers generated by  our  PRNG,  is  equal to  $100,000\times  ((4+5+6)\times
 361 2+(1+100))=1,310,000$ 32-bit numbers, that is, approximately $52$Mb.
 362
 363 This generator is able to pass the whole {\it BigCrush} battery of tests, for all
 364 the versions that have been tested depend on their number of threads
 365 (called NumThreads in our algorithm, tested up to $5$ million).
 366
 367
 368 \subsection{Improved version for GPU}
 369
 370 As GPU cards using CUDA have shared memory between threads of the same block, it
 371 is possible  to use this  feature in order  to simplify the  previous algorithm,
 372 i.e., to use fewer  than 3 xor-like PRNGs. The solution  consists in computing only
 373 one xor-like PRNG by thread, saving  it into the shared memory, and then  using the results
 374 of some  other threads in the  same block of  threads. In order to  define which
 375 thread uses the result of which other  one, we can use a combination array that
 376 contains  the indexes  of  all threads  and  for which  a combination has  been
 377 performed.
 378
 379 In  Algorithm~\ref{algo:gpu_kernel2},  two  combination  arrays are  used.   The
 380 variable     offset    is     computed    using     the     value    of
 381 combination\_size.   Then we  can compute  o1  and o2
 382 representing the  indexes of  the other  threads whose results  are used  by the
 383 current one.   In this algorithm, we  consider that a 32-bit  xor-like PRNG has
 384 been chosen. In practice, we  use the xor128 proposed in~\cite{Marsaglia2003} in
 385 which  unsigned longs  (64 bits)  have been  replaced by  unsigned  integers (32
 386 bits).
 387
 388 This version  can also pass the whole {\it BigCrush} battery of tests.
 389
 390 \begin{algorithm}
 391 \begin{small}
 392 \KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs
 393 in global memory\;
 394 NumThreads: Number of threads\;
 395 array\_comb1, array\_comb2: Arrays containing combinations of size combination\_size\;}
 396
 397 \KwOut{NewNb: array containing random numbers in global memory}
 398 \If{threadId is concerned} {
 399   retrieve data from InternalVarXorLikeArray[threadIdx] in local variables including shared memory and x\;
 400   offset = threadIdx\%combination\_size\;
 401   o1 = threadIdx-offset+array\_comb1[offset]\;
 402   o2 = threadIdx-offset+array\_comb2[offset]\;
 403   \For{i=1 to n} {
 404     t=xor-like()\;
 405     t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
 406     shared\_mem[threadId]=t\;
 407     x = x\textasciicircum t\;
 408
 409     store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
 410   }
 411   store internal variables in InternalVarXorLikeArray[threadIdx]\;
 412 }
 413 \end{small}
 414 \caption{Main kernel for the chaotic iterations based PRNG GPU efficient
 415 version\label{IR}}
 416 \label{algo:gpu_kernel2}
 417 \end{algorithm}
 418
 419 \subsection{Chaos evaluation of the improved version}
 420
 421 A run of Algorithm~\ref{algo:gpu_kernel2} consists of an operation ($x=x\oplus t$) having
 422 the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
 423 system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
 424 iterations is realized between the last stored value $x$ of the thread and a strategy $t$
 425 (obtained by a bitwise exclusive or between a value provided by a xor-like() call
 426 and two values previously obtained by two other threads).
 427 To be certain that such iterations correspond to the chaotic one recalled at the
 428 end of  Section~\ref{sec:dev},
 429 we must guarantee that this dynamical system iterates on the space
 430 $\mathcal{X} =\mathds{B}^\mathsf{N} \times \mathcal{P}\left(\llbracket 1, 2^\mathsf{N} \rrbracket\right)^\mathds{N}$.
 431 The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
 432 To prevent any flaws of chaotic properties, we must check that the right
 433 term (the last $t$ in Algorithm~\ref{algo:gpu_kernel2}), corresponding to the strategies,  can possibly be equal to any
 434 integer of $\llbracket 1, 2^\mathsf{N} \rrbracket$.
 435
 436 Such a result is obvious; as for the xor-like(), all the
 437 integers belonging to its interval of definition can occur at each iteration, and thus the
 438 last $t$ respects the requirement. Furthermore, it is possible to
 439 prove by an immediate mathematical induction that, supposing that the initial $x$
 440 is uniformly distributed, %(it is provided by a cryptographically secure PRNG),
 441 the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too
 442 (this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
 443
 444 Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
 445 chaotic iterations presented previously, and for this reason, it satisfies the
 446 Devaney's formulation of  chaotic behavior.
 447
 448 \section{Experiments}
 449 \label{sec:experiments}
 450
 451 Different experiments  have been  performed in order  to measure  the generation
 452 speed. We have used one computer equipped with a Tesla C1060 NVIDIA  GPU card
 453 and an
 454 Intel  Xeon E5530 cadenced  at 2.40  GHz,  and
 455 a second computer  equipped with a smaller  CPU and  a GeForce GTX  280.
 456 All the
 457 cards have 240 cores.
 458
 459 \begin{figure}[htpb]
 460 \begin{center}
 461   \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_xorlike_gpu.pdf}
 462 \end{center}
 463 \caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG.}
 464 \label{fig:time_xorlike_gpu}
 465 \end{figure}
 466
 467 In  Figure~\ref{fig:time_xorlike_gpu} we  compare the  quantity of  pseudorandom numbers
 468 generated per second with various xor-like based PRNGs. In this figure, the optimized
 469 versions use the  xor64 described in~\cite{Marsaglia2003}, whereas the naive versions
 470 embed  the three  xor-like  PRNGs described  in Listing~\ref{algo:seqCIPRNG}.   In
 471 order to obtain the optimal performances, the storage of pseudorandom numbers
 472 to the GPU memory has been removed. This step is time-consuming and slows down the numbers
 473 generation.  Moreover this   storage  is  completely
 474 useless, in case of applications that consume the pseudorandom
 475 numbers  directly   after they have been generated. We can see  that when the number of  threads is greater
 476 than approximately 30,000 and less than 5 million, the number of pseudorandom numbers generated
 477 per second  is almost constant.  With the  naive version, this value ranges from 2.5 to
 478 3GSamples/s.   With  the  optimized   version,  it  is  approximately  equal to
 479 20GSamples/s. Finally  we can remark  that both GPU  cards are quite  similar, but in
 480 practice,  the Tesla C1060  has more  memory than  the GTX  280, and  this memory
 481 should be of better quality.
 482 As a  comparison,   Listing~\ref{algo:seqCIPRNG}  leads   to the  generation of  about
 483 138MSample/s when using one core of the Xeon E5530.
 484
 485
 486
 487
 488
 489
 490
 491
 492 These  experiments allow  us  to conclude  that  it  is possible  to
 493 generate a very large quantity of pseudorandom  numbers statistically perfect with the  xor-like version.
 494
 495
 496 \section{Summary}
 497
 498
 499 In this chapter,  a PRNG based on chaotic iterations is  presented.   It is proven to be
 500 chaotic according to Devaney.   Efficient implementations  on GPU
 501 using xor-like PRNGs  as input generators have shown that  a very large quantity
 502 of pseudorandom  numbers can be  generated per second (about  20Gsamples/s on a Tesla C1060), and
 503 that these proposed PRNGs succeed in passing the hardest battery in TestU01, namely, the {\it BigCrush}.
 504
 505
 506
 507
 508
 509
 510 \putbib[Chapters/chapter18/biblio18]
 511