1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
2 \chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comte}
5 \chapter{Pseudo Random Number Generator on GPU}
10 Randomness is of importance in many fields such as scientific
11 simulations or cryptography. ``Random numbers'' can mainly be
12 generated either by a deterministic and reproducible algorithm called
13 a pseudorandom number generator (PRNG). In this paper, we focus on
14 reproducible generators, useful for instance in Monte-Carlo based
15 simulators. These domains need PRNGs that are statistically
16 irreproachable. In some fields such as in numerical simulations,
17 speed is a strong requirement that is usually attained by using
18 parallel architectures. In that case, a recurrent problem is that a
19 deflation of the statistical qualities is often reported, when the
20 parallelization of a good PRNG is realized. In some fields such as in
21 numerical simulations, speed is a strong requirement that is usually
22 attained by using parallel architectures. In that case, a recurrent
23 problem is that a deflation of the statistical qualities is often
24 reported, when the parallelization of a good PRNG is realized. This
25 is why ad-hoc PRNGs for each possible architecture must be found to
26 achieve both speed and randomness. On the other side, speed is not
27 the main requirement in cryptography: the great need is to
28 define \emph{secure} generators able to withstand malicious
29 attacks. Roughly speaking, an attacker should not be able in practice
30 to make the distinction between numbers obtained with the secure
31 generator and a true random sequence. Or, in an equivalent
32 formulation, he or she should not be able (in practice) to predict the
33 next bit of the generator, having the knowledge of all the binary
34 digits that have been already released. ``Being able in practice''
35 refers here to the possibility to achieve this attack in polynomial
36 time, and to the exponential growth of the difficulty of this
37 challenge when the size of the parameters of the PRNG increases.
40 Finally, a small part of the community working in this domain focuses on a
41 third requirement, that is to define chaotic generators.
42 The main idea is to take benefits from a chaotic dynamical system to obtain a
43 generator that is unpredictable, disordered, sensible to its seed, or in other word chaotic.
44 Their desire is to map a given chaotic dynamics into a sequence that seems random
45 and unassailable due to chaos.
46 However, the chaotic maps used as a pattern are defined in the real line
47 whereas computers deal with finite precision numbers.
48 This distortion leads to a deflation of both chaotic properties and speed.
49 Furthermore, authors of such chaotic generators often claim their PRNG
50 as secure due to their chaos properties, but there is no obvious relation
51 between chaos and security as it is understood in cryptography.
52 This is why the use of chaos for PRNG still remains marginal and disputable.
53 Let us finish this paragraph by noticing that, in this paper,
54 statistical perfection refers to the ability to pass the whole
55 {\it BigCrush} battery of tests, which is widely considered as the most
56 stringent statistical evaluation of a sequence claimed as random.
57 This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}.
58 More precisely, each time we performed a test on a PRNG, we ran it
59 twice in order to observe if all $p-$values are inside [0.01, 0.99]. In
60 fact, we observed that few $p-$values (less than ten) are sometimes
61 outside this interval but inside [0.001, 0.999], so that is why a
62 second run allows us to confirm that the values outside are not for
63 the same test. With this approach all our PRNGs pass the {\it
64 BigCrush} successfully and all $p-$values are at least once inside
66 Chaos, for its part, refers to the well-established definition of a
67 chaotic dynamical system defined by Devaney~\cite{Devaney}.
69 The remainder of this paper is organized as follows.
73 \section{Basic Recalls}
74 \label{section:BASIC RECALLS}
76 This section is devoted to basic definitions and terminologies in the fields of
77 topological chaos and chaotic iterations. We assume the reader is familiar
78 with basic notions on topology (see for instance~\cite{Devaney}).
81 \subsection{Devaney's Chaotic Dynamical Systems}
88 \section{Toward Efficiency and Improvement for CI PRNG}
89 \label{sec:efficient PRNG}
91 \subsection{First Efficient Implementation of a PRNG based on Chaotic Iterations}
93 %Based on the proof presented in the previous section, it is now possible to
94 %improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}.
95 %The first idea is to consider
96 %that the provided strategy is a pseudorandom Boolean vector obtained by a
98 %An iteration of the system is simply the bitwise exclusive or between
99 %the last computed state and the current strategy.
100 %Topological properties of disorder exhibited by chaotic
101 %iterations can be inherited by the inputted generator, we hope by doing so to
102 %obtain some statistical improvements while preserving speed.
104 %%RAPH : j'ai viré tout ca
105 %% Let us give an example using 16-bits numbers, to clearly understand how the bitwise xor operations
108 %% Suppose that $x$ and the strategy $S^i$ are given as
110 %% Table~\ref{TableExemple} shows the result of $x \oplus S^i$.
113 %% \begin{scriptsize}
115 %% \begin{array}{|cc|cccccccccccccccc|}
117 %% x &=&1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&0\\
119 %% S^i &=&0&1&1&0&0&1&1&0&1&1&1&0&0&1&1&1\\
121 %% x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
128 %% \caption{Example of an arbitrary round of the proposed generator}
129 %% \label{TableExemple}
135 \lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label={algo:seqCIPRNG}}
139 unsigned int CIPRNG() {
140 static unsigned int x = 123123123;
141 unsigned long t1 = xorshift();
142 unsigned long t2 = xor128();
143 unsigned long t3 = xorwow();
144 x = x^(unsigned int)t1;
145 x = x^(unsigned int)(t2>>32);
146 x = x^(unsigned int)(t3>>32);
147 x = x^(unsigned int)t2;
148 x = x^(unsigned int)(t1>>32);
149 x = x^(unsigned int)t3;
157 In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based
158 on chaotic iterations is presented. The xor operator is represented by
159 \textasciicircum. This function uses three classical 64-bits PRNGs, namely the
160 \texttt{xorshift}, the \texttt{xor128}, and the
161 \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them ``xor-like
162 PRNGs''. As each xor-like PRNG uses 64-bits whereas our proposed generator
163 works with 32-bits, we use the command \texttt{(unsigned int)}, that selects the
164 32 least significant bits of a given integer, and the code \texttt{(unsigned
165 int)(t$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}.
167 Thus producing a pseudorandom number needs 6 xor operations with 6 32-bits numbers
168 that are provided by 3 64-bits PRNGs. This version successfully passes the
169 stringent BigCrush battery of tests~\cite{LEcuyerS07}.
170 At this point, we thus
171 have defined an efficient and statistically unbiased generator. Its speed is
172 directly related to the use of linear operations, but for the same reason,
173 this fast generator cannot be proven as secure.
177 \subsection{Efficient PRNGs based on Chaotic Iterations on GPU}
178 \label{sec:efficient PRNG gpu}
180 In order to take benefits from the computing power of GPU, a program
181 needs to have independent blocks of threads that can be computed
182 simultaneously. In general, the larger the number of threads is, the
183 more local memory is used, and the less branching instructions are
184 used (if, while, ...), the better the performances on GPU is.
185 Obviously, having these requirements in mind, it is possible to build
186 a program similar to the one presented in Listing
187 \ref{algo:seqCIPRNG}, which computes pseudorandom numbers on GPU. To
188 do so, we must firstly recall that in the CUDA~\cite{Nvid10}
189 environment, threads have a local identifier called
190 \texttt{ThreadIdx}, which is relative to the block containing
191 them. Furthermore, in CUDA, parts of the code that are executed by the GPU, are
192 called {\it kernels}.
195 \subsection{Naive Version for GPU}
198 It is possible to deduce from the CPU version a quite similar version adapted to GPU.
199 The simple principle consists in making each thread of the GPU computing the CPU version of our PRNG.
200 Of course, the three xor-like
201 PRNGs used in these computations must have different parameters.
202 In a given thread, these parameters are
203 randomly picked from another PRNGs.
204 The initialization stage is performed by the CPU.
205 To do it, the ISAAC PRNG~\cite{Jenkins96} is used to set all the
206 parameters embedded into each thread.
208 The implementation of the three
209 xor-like PRNGs is straightforward when their parameters have been
210 allocated in the GPU memory. Each xor-like works with an internal
211 number $x$ that saves the last generated pseudorandom number. Additionally, the
212 implementation of the xor128, the xorshift, and the xorwow respectively require
213 4, 5, and 6 unsigned long as internal variables.
218 \KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
219 PRNGs in global memory\;
220 NumThreads: number of threads\;}
221 \KwOut{NewNb: array containing random numbers in global memory}
222 \If{threadIdx is concerned by the computation} {
223 retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
225 compute a new PRNG as in Listing\ref{algo:seqCIPRNG}\;
226 store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
228 store internal variables in InternalVarXorLikeArray[threadIdx]\;
231 \caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
232 \label{algo:gpu_kernel}
237 Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on
238 GPU. Due to the available memory in the GPU and the number of threads
239 used simultaneously, the number of random numbers that a thread can generate
240 inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in
241 algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and
242 if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
243 then the memory required to store all of the internals variables of both the xor-like
244 PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
245 and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times
246 2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
248 This generator is able to pass the whole BigCrush battery of tests, for all
249 the versions that have been tested depending on their number of threads
250 (called \texttt{NumThreads} in our algorithm, tested up to $5$ million).
253 \subsection{Improved Version for GPU}
255 As GPU cards using CUDA have shared memory between threads of the same block, it
256 is possible to use this feature in order to simplify the previous algorithm,
257 i.e., to use less than 3 xor-like PRNGs. The solution consists in computing only
258 one xor-like PRNG by thread, saving it into the shared memory, and then to use the results
259 of some other threads in the same block of threads. In order to define which
260 thread uses the result of which other one, we can use a combination array that
261 contains the indexes of all threads and for which a combination has been
264 In Algorithm~\ref{algo:gpu_kernel2}, two combination arrays are used. The
265 variable \texttt{offset} is computed using the value of
266 \texttt{combination\_size}. Then we can compute \texttt{o1} and \texttt{o2}
267 representing the indexes of the other threads whose results are used by the
268 current one. In this algorithm, we consider that a 32-bits xor-like PRNG has
269 been chosen. In practice, we use the xor128 proposed in~\cite{Marsaglia2003} in
270 which unsigned longs (64 bits) have been replaced by unsigned integers (32
273 This version can also pass the whole {\it BigCrush} battery of tests.
277 \KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs
279 NumThreads: Number of threads\;
280 array\_comb1, array\_comb2: Arrays containing combinations of size combination\_size\;}
282 \KwOut{NewNb: array containing random numbers in global memory}
283 \If{threadId is concerned} {
284 retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory and x\;
285 offset = threadIdx\%combination\_size\;
286 o1 = threadIdx-offset+array\_comb1[offset]\;
287 o2 = threadIdx-offset+array\_comb2[offset]\;
290 t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
291 shared\_mem[threadId]=t\;
292 x = x\textasciicircum t\;
294 store the new PRNG in NewNb[NumThreads*threadId+i]\;
296 store internal variables in InternalVarXorLikeArray[threadId]\;
299 \caption{Main kernel for the chaotic iterations based PRNG GPU efficient
301 \label{algo:gpu_kernel2}
304 \subsection{Chaos Evaluation of the Improved Version}
306 A run of Algorithm~\ref{algo:gpu_kernel2} consists in an operation ($x=x\oplus t$) having
307 the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
308 system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
309 iterations is realized between the last stored value $x$ of the thread and a strategy $t$
310 (obtained by a bitwise exclusive or between a value provided by a xor-like() call
311 and two values previously obtained by two other threads).
312 To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
313 we must guarantee that this dynamical system iterates on the space
314 $\mathcal{X} = \mathcal{P}\left(\llbracket 1, \mathsf{N} \rrbracket\right)^\mathds{N}\times\mathds{B}^\mathsf{N}$.
315 The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
316 To prevent from any flaws of chaotic properties, we must check that the right
317 term (the last $t$), corresponding to the strategies, can possibly be equal to any
318 integer of $\llbracket 1, \mathsf{N} \rrbracket$.
320 Such a result is obvious, as for the xor-like(), all the
321 integers belonging into its interval of definition can occur at each iteration, and thus the
322 last $t$ respects the requirement. Furthermore, it is possible to
323 prove by an immediate mathematical induction that, as the initial $x$
324 is uniformly distributed (it is provided by a cryptographically secure PRNG),
325 the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
326 (this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
328 Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
329 chaotic iterations presented previously, and for this reason, it satisfies the
330 Devaney's formulation of a chaotic behavior.
332 \section{Experiments}
333 \label{sec:experiments}
335 Different experiments have been performed in order to measure the generation
336 speed. We have used a first computer equipped with a Tesla C1060 NVidia GPU card
338 Intel Xeon E5530 cadenced at 2.40 GHz, and
339 a second computer equipped with a smaller CPU and a GeForce GTX 280.
341 cards have 240 cores.
343 In Figure~\ref{fig:time_xorlike_gpu} we compare the quantity of pseudorandom numbers
344 generated per second with various xor-like based PRNGs. In this figure, the optimized
345 versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions
346 embed the three xor-like PRNGs described in Listing~\ref{algo:seqCIPRNG}. In
347 order to obtain the optimal performances, the storage of pseudorandom numbers
348 into the GPU memory has been removed. This step is time consuming and slows down the numbers
349 generation. Moreover this storage is completely
350 useless, in case of applications that consume the pseudorandom
351 numbers directly after generation. We can see that when the number of threads is greater
352 than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated
353 per second is almost constant. With the naive version, this value ranges from 2.5 to
354 3GSamples/s. With the optimized version, it is approximately equal to
355 20GSamples/s. Finally we can remark that both GPU cards are quite similar, but in
356 practice, the Tesla C1060 has more memory than the GTX 280, and this memory
357 should be of better quality.
358 As a comparison, Listing~\ref{algo:seqCIPRNG} leads to the generation of about
359 138MSample/s when using one core of the Xeon E5530.
363 \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_xorlike_gpu.pdf}
365 \caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG}
366 \label{fig:time_xorlike_gpu}
373 In Figure~\ref{fig:time_bbs_gpu} we highlight the performances of the optimized
374 BBS-based PRNG on GPU. On the Tesla C1060 we obtain approximately 700MSample/s
375 and on the GTX 280 about 670MSample/s, which is obviously slower than the
376 xorlike-based PRNG on GPU. However, we will show in the next sections that this
377 new PRNG has a strong level of security, which is necessarily paid by a speed
382 \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_bbs_gpu.pdf}
384 \caption{Quantity of pseudorandom numbers generated per second using the BBS-based PRNG}
385 \label{fig:time_bbs_gpu}
388 All these experiments allow us to conclude that it is possible to
389 generate a very large quantity of pseudorandom numbers statistically perfect with the xor-like version.
390 To a certain extend, it is also the case with the secure BBS-based version, the speed deflation being
391 explained by the fact that the former version has ``only''
392 chaotic properties and statistical perfection, whereas the latter is also cryptographically secure,
393 as it is shown in the next sections.
405 \putbib[Chapters/chapter18/biblio]