1 \chapterauthor{Raphaël Couturier and Christophe Guyeux}{Femto-ST Institute, University of Franche-Comte, France}
2 %\chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comt\'{e}}
5 \chapter{Pseudorandom number generator on GPU}
10 Randomness is of importance in many fields such as scientific
11 simulations or cryptography. Random numbers can mainly be
12 generated by either a deterministic and reproducible algorithm called
13 a pseudorandom number generator (PRNG)\index{PRNG}, or by a physical nondeterministic
14 process having all the characteristics of a random noise, called a truly random number
15 generator (TRNG). In this chapter, we focus on
16 reproducible generators, useful for instance in MonteCarlo-based
17 simulators. These domains need PRNGs that are statistically
18 irreproachable. In some fields such as in numerical simulations,
19 speed is a strong requirement that is usually attained by using
20 parallel architectures. In that case, a recurrent problem is that a
21 deflation of the statistical qualities is often reported, when the
22 parallelization of a good PRNG is realized. This
23 is why adhoc PRNGs for each possible architecture must be found to
24 achieve both speed and randomness. On the other hand, speed is not
25 the main requirement in cryptography: the most important point is to
26 define \emph{secure} generators able to withstand malicious
27 attacks. Roughly speaking, an attacker should not be able in practice
28 to make the distinction between numbers obtained with the secure
29 generator and a true random sequence. Or, in an equivalent
30 formulation, he or she should not be able (in practice) to predict the
31 next bit of the generator, having the knowledge of all the binary
32 digits that have been already released~\cite{Goldreich}. ``Being able in practice''
33 refers here to the possibility to achieve this attack in polynomial
34 time and to the exponential growth of the difficulty of this
35 challenge when the size of the parameters of the PRNG increases.
38 Finally, a small part of the community working in this domain focuses on a
39 third requirement: to define chaotic generators~\cite{kellert1994wake, Wu20051195,gleick2011chaos}.
40 The main idea is to benefit from a chaotic dynamical system to obtain a
41 generator that is unpredictable, disordered, sensible to its seed, or in other words, chaotic.
42 These scientists' desire is to map a given chaotic dynamics into a sequence that seems random
43 and unassailable due to chaos.
44 However, the chaotic maps used as a pattern are defined in the real line
45 whereas computers deal with finite precision numbers.
46 This distortion leads to a deflation of both chaotic properties and speed.
47 Furthermore, authors of such chaotic generators often claim their PRNG
48 as secure due to their chaos properties, but there is no obvious relation
49 between chaos and security as it is understood in cryptography.
50 This is why the use of chaos for PRNG still remains marginal and disputable.
51 However, we have established in previous researches that these flaws can
52 be circumvented by using a tool called choatic iterations.
53 Such investigations have led to the definition of a new
54 family of PRNGs that are chaotic while being fast and statistically perfect,
55 or cryptographically secure~\cite{bgw09:ip,bgw10:ip,bfgw11:ij,bfg12a:ip}. This family is improved and adapted for GPU
56 architectures in this chapter.
59 Let us finish this introduction by noticing that, in this chapter,
60 statistical perfection refers to the ability to pass the whole
61 {\it BigCrush} battery of tests, which is widely considered as the most
62 stringent statistical evaluation of a sequence claimed as random.
63 This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}\index{TestU01}.
64 More precisely, each time we performed a test on a PRNG, we ran it
65 twice in order to observe if all $p$-values were inside [0.01, 0.99]. In
66 fact, we observed that few $p$-values (fewer than 10 out of 160) are sometimes
67 outside this interval but inside [0.001, 0.999], so that is why a
68 second run has allowed us to confirm that the values outside are not for
69 the same test. With this approach all our PRNGs pass the {\it
70 BigCrush} successfully and all $p$-values are at least once inside
72 Chaos, for its part, refers to the well-established definition of a
73 chaotic dynamical system defined by Devaney~\cite{Devaney}.
75 The remainder of this chapter is organized as follows.
76 Basic definitions and terminologies about both topological chaos
77 and chaotic iterations are provided in the next section. Various
78 chaotic iterations based on pseudorandom number generators are then
79 presented in Section~\ref{sec:efficient PRNG}. They encompass
80 naive and improved efficient generators for CPU and for GPU.
81 These generators are finally experimented in Section~\ref{sec:experiments}.
84 \section{Basic remindees}
85 \label{section:BASIC RECALLS}
87 This section is devoted to basic definitions and terminologies in the fields of
88 topological chaos and chaotic iterations. We assume the reader is familiar
89 with basic notions on topology (see for instance~\cite{Devaney}).
92 \subsection{A short presentation of chaos}
95 Chaos theory studies the behavior of dynamical systems that are perfectly predictable, yet appear to be wildly amorphous and meaningless.
96 Chaotic systems\index{chaotic systems} are highly sensitive to initial conditions,
97 which is popularly referred to as the butterfly effect.
98 In other words, small differences in initial conditions (such as those due to rounding errors in numerical computation) yield widely diverging outcomes,
99 in general rendering long-term prediction impossible \cite{kellert1994wake}. This happens even though these systems are deterministic, meaning that their future behavior is fully determined by their initial conditions, with no random elements involved \cite{kellert1994wake}. That is, the deterministic nature of these systems does not make them predictable \cite{kellert1994wake,Werndl01032009}. This behavior is known as deterministic chaos, or simply chaos. It has been well-studied in mathematics and
100 physics, leading among other things to the well-established definition of Devaney which can be found next.
106 \subsection{On Devaney's definition of chaos}\index{chaos}
108 Consider a metric space $(\mathcal{X},d)$ and a continuous function $f:\mathcal{X}\longrightarrow \mathcal{X}$, for one-dimensional dynamical systems of the form:
110 x^0 \in \mathcal{X} \textrm{ and } \forall n \in \mathds{N}^*, x^n=f(x^{n-1}),
113 the following definition of chaotic behavior, formulated by Devaney~\cite{Devaney}, is widely accepted.
116 A dynamical system of Form~(\ref{Devaney}) is said to be chaotic if the following conditions hold.
118 \item Topological transitivity\index{topological transitivity}:
121 \forall U,V \textrm{ open sets of } \mathcal{X},~\exists k>0, f^k(U) \cap V \neq \varnothing .
124 Intuitively, a topologically transitive map has points that eventually move under iteration from one arbitrarily small neighborhood to any other. Consequently, the dynamical system cannot be decomposed into two disjoint open sets that are invariant under the map. Note that if a map possesses a dense orbit, then it is clearly topologically transitive.
125 \item Density of periodic points in $\mathcal{X}$\index{density of periodic points}:
127 Let $P=\{p\in \mathcal{X}|\exists n \in \mathds{N}^{\ast}:f^n(p)=p\}$ the set of periodic points of $f$. Then $P$ is dense in $\mathcal{X}$:
130 \overline{P}=\mathcal{X} .
133 The density of periodic orbits means that every point in space is closely approached by periodic orbits in an arbitrary way. Topologically mixing systems failing this condition may not display sensitivity to initial conditions presented below and, hence,may not be chaotic.
134 \item Sensitive dependence on initial conditions\index{sensitive dependence on initial conditions}:
136 $\exists \varepsilon>0,$ $\forall x \in \mathcal{X},$ $\forall \delta >0,$ $\exists y \in \mathcal{X},$ $\exists n \in \mathbb{N},$ $d(x,y)<\delta$ and $d\left(f^n(x),f^n(y)\right) \geqslant \varepsilon.$
138 Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under the iteration of $f$. Not all points near $x$ need to be eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit.
143 When $f$ is chaotic, then the system $(\mathcal{X}, f)$ is chaotic and quoting Devaney~\cite[p. 50]{Devaney}: ``it is unpredictable because of the sensitive dependence on initial conditions. It cannot be broken down or decomposed into two subsystems which do not interact because of topological transitivity. And, in the midst of this random behavior, we nevertheless have an element of regularity.'' Fundamentally different behaviors are consequently possible and occur in an unpredictable way.
152 \subsection{Chaotic iterations}\index{chaotic iterations}
153 \label{subsection:Chaotic iterations}
155 Let us now introduce an example of a dynamical systems family that has
156 the potentiality to become chaotic, depending on the choice of the iteration
157 function. This family is the basis of the PRNGs we will
158 develop during this chapter.
161 \label{Chaotic iterations}
162 The set $\mathds{B}$ denoting $\{0,1\}$, let $f:\mathds{B}^{\mathsf{N}%
163 }\longrightarrow \mathds{B}^{\mathsf{N}}$ be an ``iteration'' function and $S$ be a
164 sequence of subsets of $\llbracket 1, \mathsf{N}\rrbracket$ called a chaotic strategy. Then, the so-called \emph{chaotic iterations} are defined by~\cite{Robert1986}:
168 \left\{\begin{array}{l}
169 x^0\in \mathds{B}^{\mathsf{N}}, \\
170 \forall n\in \mathds{N}^{\ast },\forall i\in \llbracket1;\mathsf{N}\rrbracket%
174 x_i^{n-1} & \text{if}~ i\notin S^n \\
175 f(x^{n-1})_{i} & \text{if}~ i \in S^n.
183 In other words, at the $nth$ iteration, only the cells of $S^{n}$ are
185 Chaotic iterations generate a set of vectors;
186 they are defined by an initial state $x^{0}$, an iteration function $f$, and a chaotic strategy $S$~\cite{bg10:ij}.
187 These ``chaotic iterations'' can behave chaotically as defined by Devaney,
188 depending on the choice of $f$~\cite{bg10:ij}. For instance,
189 chaos is obtained when $f$ is the vectorial negation.
190 Note that, with this example of function, chaotic iterations
191 defined above can be rewritten as
193 \label{equation Oplus}
194 x^0 \in \llbracket 0,2^\mathsf{N}-1\rrbracket,~\mathcal{S}^n \in \mathcal{P}\left(\llbracket 1,2^\mathsf{N}-1\rrbracket\right)^\mathds{N},~~ x^{n+1}=x^n \oplus \mathcal{S}^n,
196 where $\mathcal{P}(X)$ stands for the set of subsets of $X$, whereas
197 $a\oplus b$ is the bitwise exclusive or operation between the binary
198 representation of the integers $a$ and $b$. Note that the term
199 $\mathcal{S}^n$ is directly and obviously linked to the $S^n$ of
200 Eq.~\ref{eq:generalIC}. Such an iterative sequence
201 satisfies the Devaney's definition of chaos.
203 \section{Toward efficiency and improvement for CI PRNG}
204 \label{sec:efficient PRNG}
206 \subsection{First efficient implementation of a PRNG based on chaotic iterations}
208 %Based on the proof presented in the previous section, it is now possible to
209 %improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}.
210 %The first idea is to consider
211 %that the provided strategy is a pseudorandom Boolean vector obtained by a
213 %An iteration of the system is simply the bitwise exclusive or between
214 %the last computed state and the current strategy.
215 %Topological properties of disorder exhibited by chaotic
216 %iterations can be inherited by the inputted generator, we hope by doing so to
217 %obtain some statistical improvements while preserving speed.
219 %%RAPH : j'ai viré tout ca
220 %% Let us give an example using 16-bits numbers, to clearly understand how the bitwise xor operations
223 %% Suppose that $x$ and the strategy $S^i$ are given as
225 %% Table~\ref{TableExemple} shows the result of $x \oplus S^i$.
228 %% \begin{scriptsize}
230 %% \begin{array}{|cc|cccccccccccccccc|}
232 %% x &=&1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&0\\
234 %% S^i &=&0&1&1&0&0&1&1&0&1&1&1&0&0&1&1&1\\
236 %% x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
243 %% \caption{Example of an arbitrary round of the proposed generator}
244 %% \label{TableExemple}
250 \lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label={algo:seqCIPRNG}}
254 unsigned int CIPRNG() {
255 static unsigned int x = 123123123;
256 unsigned long t1 = xorshift();
257 unsigned long t2 = xor128();
258 unsigned long t3 = xorwow();
259 x = x^(unsigned int)t1;
260 x = x^(unsigned int)(t2>>32);
261 x = x^(unsigned int)(t3>>32);
262 x = x^(unsigned int)t2;
263 x = x^(unsigned int)(t1>>32);
264 x = x^(unsigned int)t3;
272 In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based
273 on chaotic iterations is presented, which extends the generator family
274 formerly presented in~\cite{bgw09:ip,guyeux10}. The xor operator is represented by
275 \textasciicircum. This function uses three classical 64-bit PRNGs, namely the
276 \texttt{xorshift}, the \texttt{xor128}, and the
277 \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them ``xor-like
278 PRNGs''. As each xor-like PRNG uses 64-bits whereas our proposed generator
279 works with 32-bits, we use the command \texttt{(unsigned int)}, which selects the
280 32 least significant bits of a given integer, and the code \texttt{(unsigned
281 int)(t$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}.
283 Thus producing a pseudorandom number needs 6 xor operations with 6 32-bit numbers
284 that are provided by 3 64-bit PRNGs. This version successfully passes the
285 stringent {\it BigCrush} battery of tests~\cite{LEcuyerS07}.
286 At this point, we have defined an efficient and statistically unbiased generator. Its speed is
287 directly related to the use of linear operations, but for the same reason,
288 this fast generator cannot be proven as secure.
292 \subsection{Efficient PRNGs based on chaotic iterations on GPU}
293 \label{sec:efficient PRNG gpu}
295 In order to benefit from the computing power of GPU, a program
296 needs to have independent blocks of threads that can be computed
297 simultaneously. In general, the larger the number of threads is, the
298 more local memory is used, and the less branching instructions are
299 used (if, while, etc.) and so, the better the performances on GPU are.
300 Obviously, having these requirements in mind, it is possible to build
301 a program similar to the one presented in Listing
302 \ref{algo:seqCIPRNG}, which computes pseudorandom numbers on GPU. To
303 do so, we must first recall that in the CUDA~\cite{Nvid10}
304 environment, threads have a local identifier called
305 \texttt{ThreadIdx}, which is relative to the block containing
306 them. Furthermore, in CUDA, parts of the code that are executed by the GPU are
307 called {\it kernels}.
310 \subsection{Naive version for GPU}
313 It is possible to deduce from the CPU version a quite similar version adapted to GPU.
314 The simple principle consists of making each thread of the GPU compute the CPU version of our PRNG.
315 Of course, the three xor-like
316 PRNGs used in these computations must have different parameters.
317 In a given thread, these parameters are
318 randomly picked from another PRNGs.
319 The initialization stage is performed by the CPU.
320 To do this, the Indirection, Shift, Accumulate, Add, and Count (ISAAC) PRNG~\cite{Jenkins96} is used to set all the
321 parameters embedded into each thread.
323 The implementation of the three
324 xor-like PRNGs is straightforward when their parameters have been
325 allocated in the GPU memory. Each xor-like works with an internal
326 number $x$ that saves the last generated pseudorandom number. Additionally, the
327 implementation of the {\it xor128}, the {\it xorshift}, and the {\it xorwow}, respectively, require
328 4, 5, and 6 unsigned long as internal variables.
333 \KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
334 PRNGs in global memory\;
335 NumThreads: number of threads\;}
336 \KwOut{NewNb: array containing random numbers in global memory}
337 \If{threadIdx is concerned by the computation} {
338 retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
340 compute a new PRNG as in Listing\ref{algo:seqCIPRNG}\;
341 store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
343 store internal variables in InternalVarXorLikeArray[threadIdx]\;
346 \caption{main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
347 \label{algo:gpu_kernel}
352 Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on
353 GPU. Due to the available memory in the GPU and the number of threads
354 used simultaneously, the number of random numbers that a thread can generate
355 inside a kernel is limited (i.e., the variable {\it n} in
356 Algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and
357 if $n=100$\footnote{In fact, we need to add the initial seed (a 32-bit number).},
358 then the memory required to store all of the internal variables of both the xor-like
359 PRNGs\footnote{We multiply this number by $2$ in order to count 32-bit numbers.}
360 and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times
361 2+(1+100))=1,310,000$ 32-bit numbers, that is, approximately $52$Mb.
363 This generator is able to pass the whole {\it BigCrush} battery of tests, for all
364 the versions that have been tested depend on their number of threads
365 (called NumThreads in our algorithm, tested up to $5$ million).
368 \subsection{Improved version for GPU}
370 As GPU cards using CUDA have shared memory between threads of the same block, it
371 is possible to use this feature in order to simplify the previous algorithm,
372 i.e., to use fewer than 3 xor-like PRNGs. The solution consists in computing only
373 one xor-like PRNG by thread, saving it into the shared memory, and then using the results
374 of some other threads in the same block of threads. In order to define which
375 thread uses the result of which other one, we can use a combination array that
376 contains the indexes of all threads and for which a combination has been
379 In Algorithm~\ref{algo:gpu_kernel2}, two combination arrays are used. The
380 variable offset is computed using the value of
381 combination\_size. Then we can compute o1 and o2
382 representing the indexes of the other threads whose results are used by the
383 current one. In this algorithm, we consider that a 32-bit xor-like PRNG has
384 been chosen. In practice, we use the xor128 proposed in~\cite{Marsaglia2003} in
385 which unsigned longs (64 bits) have been replaced by unsigned integers (32
388 This version can also pass the whole {\it BigCrush} battery of tests.
392 \KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs
394 NumThreads: Number of threads\;
395 array\_comb1, array\_comb2: Arrays containing combinations of size combination\_size\;}
397 \KwOut{NewNb: array containing random numbers in global memory}
398 \If{threadId is concerned} {
399 retrieve data from InternalVarXorLikeArray[threadIdx] in local variables including shared memory and x\;
400 offset = threadIdx\%combination\_size\;
401 o1 = threadIdx-offset+array\_comb1[offset]\;
402 o2 = threadIdx-offset+array\_comb2[offset]\;
405 t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
406 shared\_mem[threadId]=t\;
407 x = x\textasciicircum t\;
409 store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
411 store internal variables in InternalVarXorLikeArray[threadIdx]\;
414 \caption{Main kernel for the chaotic iterations based PRNG GPU efficient
416 \label{algo:gpu_kernel2}
419 \subsection{Chaos evaluation of the improved version}
421 A run of Algorithm~\ref{algo:gpu_kernel2} consists of an operation ($x=x\oplus t$) having
422 the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
423 system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
424 iterations is realized between the last stored value $x$ of the thread and a strategy $t$
425 (obtained by a bitwise exclusive or between a value provided by a xor-like() call
426 and two values previously obtained by two other threads).
427 To be certain that such iterations correspond to the chaotic one recalled at the
428 end of Section~\ref{sec:dev},
429 we must guarantee that this dynamical system iterates on the space
430 $\mathcal{X} =\mathds{B}^\mathsf{N} \times \mathcal{P}\left(\llbracket 1, 2^\mathsf{N} \rrbracket\right)^\mathds{N}$.
431 The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
432 To prevent any flaws of chaotic properties, we must check that the right
433 term (the last $t$ in Algorithm~\ref{algo:gpu_kernel2}), corresponding to the strategies, can possibly be equal to any
434 integer of $\llbracket 1, 2^\mathsf{N} \rrbracket$.
436 Such a result is obvious; as for the xor-like(), all the
437 integers belonging to its interval of definition can occur at each iteration, and thus the
438 last $t$ respects the requirement. Furthermore, it is possible to
439 prove by an immediate mathematical induction that, supposing that the initial $x$
440 is uniformly distributed, %(it is provided by a cryptographically secure PRNG),
441 the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too
442 (this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
444 Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
445 chaotic iterations presented previously, and for this reason, it satisfies the
446 Devaney's formulation of chaotic behavior.
448 \section{Experiments}
449 \label{sec:experiments}
451 Different experiments have been performed in order to measure the generation
452 speed. We have used one computer equipped with a Tesla C1060 NVIDIA GPU card
454 Intel Xeon E5530 cadenced at 2.40 GHz, and
455 a second computer equipped with a smaller CPU and a GeForce GTX 280.
457 cards have 240 cores.
461 \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_xorlike_gpu.pdf}
463 \caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG.}
464 \label{fig:time_xorlike_gpu}
467 In Figure~\ref{fig:time_xorlike_gpu} we compare the quantity of pseudorandom numbers
468 generated per second with various xor-like based PRNGs. In this figure, the optimized
469 versions use the xor64 described in~\cite{Marsaglia2003}, whereas the naive versions
470 embed the three xor-like PRNGs described in Listing~\ref{algo:seqCIPRNG}. In
471 order to obtain the optimal performances, the storage of pseudorandom numbers
472 to the GPU memory has been removed. This step is time-consuming and slows down the numbers
473 generation. Moreover this storage is completely
474 useless, in case of applications that consume the pseudorandom
475 numbers directly after they have been generated. We can see that when the number of threads is greater
476 than approximately 30,000 and less than 5 million, the number of pseudorandom numbers generated
477 per second is almost constant. With the naive version, this value ranges from 2.5 to
478 3GSamples/s. With the optimized version, it is approximately equal to
479 20GSamples/s. Finally we can remark that both GPU cards are quite similar, but in
480 practice, the Tesla C1060 has more memory than the GTX 280, and this memory
481 should be of better quality.
482 As a comparison, Listing~\ref{algo:seqCIPRNG} leads to the generation of about
483 138MSample/s when using one core of the Xeon E5530.
492 These experiments allow us to conclude that it is possible to
493 generate a very large quantity of pseudorandom numbers statistically perfect with the xor-like version.
499 In this chapter, a PRNG based on chaotic iterations is presented. It is proven to be
500 chaotic according to Devaney. Efficient implementations on GPU
501 using xor-like PRNGs as input generators have shown that a very large quantity
502 of pseudorandom numbers can be generated per second (about 20Gsamples/s on a Tesla C1060), and
503 that these proposed PRNGs succeed in passing the hardest battery in TestU01, namely, the {\it BigCrush}.
510 \putbib[Chapters/chapter18/biblio18]