BookGPU/Chapters/chapter18/ch18.tex

   1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
   2 \chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comte}
   3
   4
   5 \chapter{Pseudo Random Number Generator on GPU}
   6 \label{chapter18}
   7
   8 \section{Introduction}
   9
  10 Randomness is of importance in many fields such as scientific
  11 simulations or cryptography.  ``Random numbers'' can mainly be
  12 generated either by a deterministic and reproducible algorithm called
  13 a pseudorandom number generator (PRNG). In this paper, we focus on
  14 reproducible generators, useful for instance in Monte-Carlo based
  15 simulators.  These domains need PRNGs that are statistically
  16 irreproachable.  In some fields such as in numerical simulations,
  17 speed is a strong requirement that is usually attained by using
  18 parallel architectures. In that case, a recurrent problem is that a
  19 deflation of the statistical qualities is often reported, when the
  20 parallelization of a good PRNG is realized. In some fields such as in
  21 numerical simulations, speed is a strong requirement that is usually
  22 attained by using parallel architectures. In that case, a recurrent
  23 problem is that a deflation of the statistical qualities is often
  24 reported, when the parallelization of a good PRNG is realized.  This
  25 is why ad-hoc PRNGs for each possible architecture must be found to
  26 achieve both speed and randomness.  On the other side, speed is not
  27 the main requirement in cryptography: the great need is to
  28 define \emph{secure} generators able to withstand malicious
  29 attacks. Roughly speaking, an attacker should not be able in practice
  30 to make the distinction between numbers obtained with the secure
  31 generator and a true random sequence.  Or, in an equivalent
  32 formulation, he or she should not be able (in practice) to predict the
  33 next bit of the generator, having the knowledge of all the binary
  34 digits that have been already released. ``Being able in practice''
  35 refers here to the possibility to achieve this attack in polynomial
  36 time, and to the exponential growth of the difficulty of this
  37 challenge when the size of the parameters of the PRNG increases.
  38
  39
  40 Finally, a small part of the community working in this domain focuses on a
  41 third requirement, that is to define chaotic generators.
  42 The main idea is to take benefits from a chaotic dynamical system to obtain a
  43 generator that is unpredictable, disordered, sensible to its seed, or in other word chaotic.
  44 Their desire is to map a given chaotic dynamics into a sequence that seems random
  45 and unassailable due to chaos.
  46 However, the chaotic maps used as a pattern are defined in the real line
  47 whereas computers deal with finite precision numbers.
  48 This distortion leads to a deflation of both chaotic properties and speed.
  49 Furthermore, authors of such chaotic generators often claim their PRNG
  50 as secure due to their chaos properties, but there is no obvious relation
  51 between chaos and security as it is understood in cryptography.
  52 This is why the use of chaos for PRNG still remains marginal and disputable.
  53 Let us finish this paragraph by noticing that, in this paper,
  54 statistical perfection refers to the ability to pass the whole
  55 {\it BigCrush} battery of tests, which is widely considered as the most
  56 stringent statistical evaluation of a sequence claimed as random.
  57 This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}.
  58 More precisely, each time we performed a test on a PRNG, we ran it
  59 twice in order to observe if all $p-$values are inside [0.01, 0.99]. In
  60 fact, we observed that few $p-$values (less than ten) are sometimes
  61 outside this interval but inside [0.001, 0.999], so that is why a
  62 second run allows us to confirm that the values outside are not for
  63 the same test. With this approach all our PRNGs pass the {\it
  64   BigCrush} successfully and all $p-$values are at least once inside
  65 [0.01, 0.99].
  66 Chaos, for its part, refers to the well-established definition of a
  67 chaotic dynamical system defined by Devaney~\cite{Devaney}.
  68
  69 The remainder of this paper is organized as follows.
  70 A COMPLETER
  71
  72
  73 \section{Basic Recalls}
  74 \label{section:BASIC RECALLS}
  75
  76 This section is devoted to basic definitions and terminologies in the fields of
  77 topological chaos and chaotic iterations. We assume the reader is familiar
  78 with basic notions on topology (see for instance~\cite{Devaney}).
  79
  80
  81 \subsection{Devaney's Chaotic Dynamical Systems}
  82
  83
  84 A COMPLETER
  85
  86
  87
  88 \section{Toward Efficiency and Improvement for CI PRNG}
  89 \label{sec:efficient PRNG}
  90
  91 \subsection{First Efficient Implementation of a PRNG based on Chaotic Iterations}
  92 %
  93 %Based on the proof presented in the previous section, it is now possible to
  94 %improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}.
  95 %The first idea is to consider
  96 %that the provided strategy is a pseudorandom Boolean vector obtained by a
  97 %given PRNG.
  98 %An iteration of the system is simply the bitwise exclusive or between
  99 %the last computed state and the current strategy.
 100 %Topological properties of disorder exhibited by chaotic
 101 %iterations can be inherited by the inputted generator, we hope by doing so to
 102 %obtain some statistical improvements while preserving speed.
 103 %
 104 %%RAPH : j'ai viré tout ca
 105 %% Let us give an example using 16-bits numbers, to clearly understand how the bitwise xor operations
 106 %% are
 107 %% done.
 108 %% Suppose  that $x$ and the  strategy $S^i$ are given as
 109 %% binary vectors.
 110 %% Table~\ref{TableExemple} shows the result of $x \oplus S^i$.
 111
 112 %% \begin{table}
 113 %% \begin{scriptsize}
 114 %% $$
 115 %% \begin{array}{|cc|cccccccccccccccc|}
 116 %% \hline
 117 %% x      &=&1&0&1&1&1&0&1&0&1&0&0&1&0&0&1&0\\
 118 %% \hline
 119 %% S^i      &=&0&1&1&0&0&1&1&0&1&1&1&0&0&1&1&1\\
 120 %% \hline
 121 %% x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
 122 %% \hline
 123
 124 %% \hline
 125 %%  \end{array}
 126 %% $$
 127 %% \end{scriptsize}
 128 %% \caption{Example of an arbitrary round of the proposed generator}
 129 %% \label{TableExemple}
 130 %% \end{table}
 131
 132
 133
 134
 135 \lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label={algo:seqCIPRNG}}
 136 \begin{small}
 137 \begin{lstlisting}
 138
 139 unsigned int CIPRNG() {
 140   static unsigned int x = 123123123;
 141   unsigned long t1 = xorshift();
 142   unsigned long t2 = xor128();
 143   unsigned long t3 = xorwow();
 144   x = x^(unsigned int)t1;
 145   x = x^(unsigned int)(t2>>32);
 146   x = x^(unsigned int)(t3>>32);
 147   x = x^(unsigned int)t2;
 148   x = x^(unsigned int)(t1>>32);
 149   x = x^(unsigned int)t3;
 150   return x;
 151 }
 152 \end{lstlisting}
 153 \end{small}
 154
 155
 156
 157 In Listing~\ref{algo:seqCIPRNG} a sequential  version of the proposed PRNG based
 158 on  chaotic  iterations  is  presented.   The xor  operator  is  represented  by
 159 \textasciicircum.  This function uses  three classical 64-bits PRNGs, namely the
 160 \texttt{xorshift},         the          \texttt{xor128},         and         the
 161 \texttt{xorwow}~\cite{Marsaglia2003}.  In the following, we call them ``xor-like
 162 PRNGs''.   As each  xor-like PRNG  uses 64-bits  whereas our  proposed generator
 163 works with 32-bits, we use the command \texttt{(unsigned int)}, that selects the
 164 32 least  significant bits  of a given  integer, and the  code \texttt{(unsigned
 165   int)(t$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}.
 166
 167 Thus producing a pseudorandom number needs 6 xor operations with 6 32-bits numbers
 168 that  are provided by  3 64-bits  PRNGs.  This  version successfully  passes the
 169 stringent BigCrush battery of tests~\cite{LEcuyerS07}.
 170 At this point, we thus
 171 have defined an efficient and statistically unbiased generator. Its speed is
 172 directly related to the use of linear operations, but for the same reason,
 173 this fast generator cannot be proven as secure.
 174
 175
 176
 177 \subsection{Efficient PRNGs based on Chaotic Iterations on GPU}
 178 \label{sec:efficient PRNG gpu}
 179
 180 In order to  take benefits from the computing power  of GPU, a program
 181 needs  to have  independent blocks  of  threads that  can be  computed
 182 simultaneously. In general,  the larger the number of  threads is, the
 183 more local  memory is  used, and the  less branching  instructions are
 184 used  (if,  while,  ...),  the  better the  performances  on  GPU  is.
 185 Obviously, having these requirements in  mind, it is possible to build
 186 a   program    similar   to    the   one   presented    in  Listing
 187 \ref{algo:seqCIPRNG}, which computes  pseudorandom numbers on GPU.  To
 188 do  so,  we  must   firstly  recall  that  in  the  CUDA~\cite{Nvid10}
 189 environment,    threads    have     a    local    identifier    called
 190 \texttt{ThreadIdx},  which   is  relative  to   the  block  containing
 191 them. Furthermore, in  CUDA, parts of  the code that are executed by the  GPU, are
 192 called {\it kernels}.
 193
 194
 195 \subsection{Naive Version for GPU}
 196
 197
 198 It is possible to deduce from the CPU version a quite similar version adapted to GPU.
 199 The simple principle consists in making each thread of the GPU computing the CPU version of our PRNG.
 200 Of course,  the  three xor-like
 201 PRNGs  used in these computations must have different  parameters.
 202 In a given thread, these parameters are
 203 randomly picked from another PRNGs.
 204 The  initialization stage is performed by  the CPU.
 205 To do it, the  ISAAC  PRNG~\cite{Jenkins96} is used to  set  all  the
 206 parameters embedded into each thread.
 207
 208 The implementation of  the three
 209 xor-like  PRNGs  is  straightforward  when  their  parameters  have  been
 210 allocated in  the GPU memory.  Each xor-like  works with  an internal
 211 number  $x$  that saves  the  last  generated  pseudorandom number. Additionally,  the
 212 implementation of the  xor128, the xorshift, and the  xorwow respectively require
 213 4, 5, and 6 unsigned long as internal variables.
 214
 215
 216 \begin{algorithm}
 217 \begin{small}
 218 \KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
 219 PRNGs in global memory\;
 220 NumThreads: number of threads\;}
 221 \KwOut{NewNb: array containing random numbers in global memory}
 222 \If{threadIdx is concerned by the computation} {
 223   retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
 224   \For{i=1 to n} {
 225     compute a new PRNG as in Listing\ref{algo:seqCIPRNG}\;
 226     store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
 227   }
 228   store internal variables in InternalVarXorLikeArray[threadIdx]\;
 229 }
 230 \end{small}
 231 \caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
 232 \label{algo:gpu_kernel}
 233 \end{algorithm}
 234
 235
 236
 237 Algorithm~\ref{algo:gpu_kernel}  presents a naive  implementation of the proposed  PRNG on
 238 GPU.  Due to the available  memory in the  GPU and the number  of threads
 239 used simultaneously,  the number  of random numbers  that a thread  can generate
 240 inside   a    kernel   is   limited  (\emph{i.e.},    the    variable   \texttt{n}   in
 241 algorithm~\ref{algo:gpu_kernel}). For instance, if  $100,000$ threads are used and
 242 if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
 243 then   the  memory   required   to  store all of the  internals   variables  of both the  xor-like
 244 PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
 245 and  the pseudorandom  numbers generated by  our  PRNG,  is  equal to  $100,000\times  ((4+5+6)\times
 246 2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
 247
 248 This generator is able to pass the whole BigCrush battery of tests, for all
 249 the versions that have been tested depending on their number of threads
 250 (called \texttt{NumThreads} in our algorithm, tested up to $5$ million).
 251
 252
 253 \subsection{Improved Version for GPU}
 254
 255 As GPU cards using CUDA have shared memory between threads of the same block, it
 256 is possible  to use this  feature in order  to simplify the  previous algorithm,
 257 i.e., to use less  than 3 xor-like PRNGs. The solution  consists in computing only
 258 one xor-like PRNG by thread, saving  it into the shared memory, and then to use the results
 259 of some  other threads in the  same block of  threads. In order to  define which
 260 thread uses the result of which other  one, we can use a combination array that
 261 contains  the indexes  of  all threads  and  for which  a combination has  been
 262 performed.
 263
 264 In  Algorithm~\ref{algo:gpu_kernel2},  two  combination  arrays are  used.   The
 265 variable     \texttt{offset}    is     computed    using     the     value    of
 266 \texttt{combination\_size}.   Then we  can compute  \texttt{o1}  and \texttt{o2}
 267 representing the  indexes of  the other  threads whose results  are used  by the
 268 current one.   In this algorithm, we  consider that a 32-bits  xor-like PRNG has
 269 been chosen. In practice, we  use the xor128 proposed in~\cite{Marsaglia2003} in
 270 which  unsigned longs  (64 bits)  have been  replaced by  unsigned  integers (32
 271 bits).
 272
 273 This version  can also pass the whole {\it BigCrush} battery of tests.
 274
 275 \begin{algorithm}
 276 \begin{small}
 277 \KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs
 278 in global memory\;
 279 NumThreads: Number of threads\;
 280 array\_comb1, array\_comb2: Arrays containing combinations of size combination\_size\;}
 281
 282 \KwOut{NewNb: array containing random numbers in global memory}
 283 \If{threadId is concerned} {
 284   retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory and x\;
 285   offset = threadIdx\%combination\_size\;
 286   o1 = threadIdx-offset+array\_comb1[offset]\;
 287   o2 = threadIdx-offset+array\_comb2[offset]\;
 288   \For{i=1 to n} {
 289     t=xor-like()\;
 290     t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
 291     shared\_mem[threadId]=t\;
 292     x = x\textasciicircum t\;
 293
 294     store the new PRNG in NewNb[NumThreads*threadId+i]\;
 295   }
 296   store internal variables in InternalVarXorLikeArray[threadId]\;
 297 }
 298 \end{small}
 299 \caption{Main kernel for the chaotic iterations based PRNG GPU efficient
 300 version\label{IR}}
 301 \label{algo:gpu_kernel2}
 302 \end{algorithm}
 303
 304 \subsection{Chaos Evaluation of the Improved Version}
 305
 306 A run of Algorithm~\ref{algo:gpu_kernel2} consists in an operation ($x=x\oplus t$) having
 307 the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
 308 system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
 309 iterations is realized between the last stored value $x$ of the thread and a strategy $t$
 310 (obtained by a bitwise exclusive or between a value provided by a xor-like() call
 311 and two values previously obtained by two other threads).
 312 To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
 313 we must guarantee that this dynamical system iterates on the space
 314 $\mathcal{X} = \mathcal{P}\left(\llbracket 1, \mathsf{N} \rrbracket\right)^\mathds{N}\times\mathds{B}^\mathsf{N}$.
 315 The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
 316 To prevent from any flaws of chaotic properties, we must check that the right
 317 term (the last $t$), corresponding to the strategies,  can possibly be equal to any
 318 integer of $\llbracket 1, \mathsf{N} \rrbracket$.
 319
 320 Such a result is obvious, as for the xor-like(), all the
 321 integers belonging into its interval of definition can occur at each iteration, and thus the
 322 last $t$ respects the requirement. Furthermore, it is possible to
 323 prove by an immediate mathematical induction that, as the initial $x$
 324 is uniformly distributed (it is provided by a cryptographically secure PRNG),
 325 the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
 326 (this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
 327
 328 Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
 329 chaotic iterations presented previously, and for this reason, it satisfies the
 330 Devaney's formulation of a chaotic behavior.
 331
 332 \section{Experiments}
 333 \label{sec:experiments}
 334
 335 Different experiments  have been  performed in order  to measure  the generation
 336 speed. We have used a first computer equipped with a Tesla C1060 NVidia  GPU card
 337 and an
 338 Intel  Xeon E5530 cadenced  at 2.40  GHz,  and
 339 a second computer  equipped with a smaller  CPU and  a GeForce GTX  280.
 340 All the
 341 cards have 240 cores.
 342
 343 In  Figure~\ref{fig:time_xorlike_gpu} we  compare the  quantity of  pseudorandom numbers
 344 generated per second with various xor-like based PRNGs. In this figure, the optimized
 345 versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions
 346 embed  the three  xor-like  PRNGs described  in Listing~\ref{algo:seqCIPRNG}.   In
 347 order to obtain the optimal performances, the storage of pseudorandom numbers
 348 into the GPU memory has been removed. This step is time consuming and slows down the numbers
 349 generation.  Moreover this   storage  is  completely
 350 useless, in case of applications that consume the pseudorandom
 351 numbers  directly   after generation. We can see  that when the number of  threads is greater
 352 than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated
 353 per second  is almost constant.  With the  naive version, this value ranges from 2.5 to
 354 3GSamples/s.   With  the  optimized   version,  it  is  approximately  equal to
 355 20GSamples/s. Finally  we can remark  that both GPU  cards are quite  similar, but in
 356 practice,  the Tesla C1060  has more  memory than  the GTX  280, and  this memory
 357 should be of better quality.
 358 As a  comparison,   Listing~\ref{algo:seqCIPRNG}  leads   to the  generation of  about
 359 138MSample/s when using one core of the Xeon E5530.
 360
 361 \begin{figure}[htbp]
 362 \begin{center}
 363   \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_xorlike_gpu.pdf}
 364 \end{center}
 365 \caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG}
 366 \label{fig:time_xorlike_gpu}
 367 \end{figure}
 368
 369
 370
 371
 372
 373 In Figure~\ref{fig:time_bbs_gpu} we highlight  the performances of the optimized
 374 BBS-based PRNG on GPU.  On  the Tesla C1060 we obtain approximately 700MSample/s
 375 and  on the  GTX 280  about  670MSample/s, which  is obviously  slower than  the
 376 xorlike-based PRNG on GPU. However, we  will show in the next sections that this
 377 new PRNG  has a strong  level of  security, which is  necessarily paid by  a speed
 378 reduction.
 379
 380 \begin{figure}[htbp]
 381 \begin{center}
 382   \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_bbs_gpu.pdf}
 383 \end{center}
 384 \caption{Quantity of pseudorandom numbers generated per second using the BBS-based PRNG}
 385 \label{fig:time_bbs_gpu}
 386 \end{figure}
 387
 388 All  these  experiments allow  us  to conclude  that  it  is possible  to
 389 generate a very large quantity of pseudorandom  numbers statistically perfect with the  xor-like version.
 390 To a certain extend, it is also the case with the secure BBS-based version, the speed deflation being
 391 explained by the fact that the former  version has ``only''
 392 chaotic properties and statistical perfection, whereas the latter is also cryptographically secure,
 393 as it is shown in the next sections.
 394
 395
 396
 397
 398
 399
 400
 401
 402
 403
 404
 405 \putbib[Chapters/chapter18/biblio]
 406