ch18

[book_gpu.git] / BookGPU / Chapters / chapter18 / ch18.tex
diff --git a/BookGPU/Chapters/chapter18/ch18.tex b/BookGPU/Chapters/chapter18/ch18.tex

index 9c9a85a4f057d9b63baca6308b6fa8145f399288..d72e9a352c8c1ab6a0080d03bbec43b3a5c4f143 100755 (executable)
--- a/BookGPU/Chapters/chapter18/ch18.tex
+++ b/BookGPU/Chapters/chapter18/ch18.tex
@@ -1,5 +1,5 @@
-\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comt\'{e}}
-\chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comt\'{e}}
+\chapterauthor{Raphaël Couturier and Christophe Guyeux}{Femto-ST Institute, University of Franche-Comte, France}
+%\chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comt\'{e}}
  
  
  \chapter{Pseudorandom Number Generator on GPU}
  
  
  \chapter{Pseudorandom Number Generator on GPU}
@@ -19,14 +19,10 @@ irreproachable.  In some fields such as in numerical simulations,
  speed is a strong requirement that is usually attained by using
  parallel architectures. In that case, a recurrent problem is that a
  deflation of the statistical qualities is often reported, when the
  speed is a strong requirement that is usually attained by using
  parallel architectures. In that case, a recurrent problem is that a
  deflation of the statistical qualities is often reported, when the
-parallelization of a good PRNG is realized. In some fields such as in
-numerical simulations, speed is a strong requirement that is usually
-attained by using parallel architectures. In that case, a recurrent
-problem is that a deflation of the statistical qualities is often
-reported, when the parallelization of a good PRNG is realized.  This
+parallelization of a good PRNG is realized.  This
  is why ad-hoc PRNGs for each possible architecture must be found to
  is why ad-hoc PRNGs for each possible architecture must be found to
-achieve both speed and randomness.  On the other side, speed is not
-the main requirement in cryptography: the great need is to
+achieve both speed and randomness.  On the other hand, speed is not
+the main requirement in cryptography: the most important point is to
  define \emph{secure} generators able to withstand malicious
  attacks. Roughly speaking, an attacker should not be able in practice
  to make the distinction between numbers obtained with the secure
  define \emph{secure} generators able to withstand malicious
  attacks. Roughly speaking, an attacker should not be able in practice
  to make the distinction between numbers obtained with the secure
@@ -41,9 +37,9 @@ challenge when the size of the parameters of the PRNG increases.
  
  Finally, a small part of the community working in this domain focuses on a
  third requirement, that is to define chaotic generators~\cite{kellert1994wake,  Wu20051195,gleick2011chaos}.
  
  Finally, a small part of the community working in this domain focuses on a
  third requirement, that is to define chaotic generators~\cite{kellert1994wake,  Wu20051195,gleick2011chaos}.
-The main idea is to take benefits from a chaotic dynamical system to obtain a
+The main idea is to  benefits from a chaotic dynamical system to obtain a
  generator that is unpredictable, disordered, sensible to its seed, or in other word chaotic.
  generator that is unpredictable, disordered, sensible to its seed, or in other word chaotic.
-Their desire is to map a given chaotic dynamics into a sequence that seems random 
+These scientists' desire is to map a given chaotic dynamics into a sequence that seems random 
  and unassailable due to chaos.
  However, the chaotic maps used as a pattern are defined in the real line 
  whereas computers deal with finite precision numbers.
  and unassailable due to chaos.
  However, the chaotic maps used as a pattern are defined in the real line 
  whereas computers deal with finite precision numbers.
@@ -66,10 +62,10 @@ statistical perfection refers to the ability to pass the whole
  stringent statistical evaluation of a sequence claimed as random.
  This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}\index{TestU01}.
  More precisely, each time we performed a test on a PRNG, we ran it
  stringent statistical evaluation of a sequence claimed as random.
  This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}\index{TestU01}.
  More precisely, each time we performed a test on a PRNG, we ran it
-twice in order to observe if all $p-$values are inside [0.01, 0.99]. In
+twice in order to observe if all $p-$values were inside [0.01, 0.99]. In
  fact, we observed that few $p-$values (less than ten) are sometimes
  outside this interval but inside [0.001, 0.999], so that is why a
  fact, we observed that few $p-$values (less than ten) are sometimes
  outside this interval but inside [0.001, 0.999], so that is why a
-second run allows us to confirm that the values outside are not for
+second run has allowed us to confirm that the values outside are not for
  the same test. With this approach all our PRNGs pass the {\it
    BigCrush} successfully and all $p-$values are at least once inside
  [0.01, 0.99].
  the same test. With this approach all our PRNGs pass the {\it
    BigCrush} successfully and all $p-$values are at least once inside
  [0.01, 0.99].
@@ -79,13 +75,13 @@ chaotic dynamical system defined by Devaney~\cite{Devaney}.
  The remainder of this chapter is organized as follows. 
  Basic definitions and terminologies about both topological chaos
  and chaotic iterations are provided in the next section. Various 
  The remainder of this chapter is organized as follows. 
  Basic definitions and terminologies about both topological chaos
  and chaotic iterations are provided in the next section. Various 
-chaotic iterations based pseudorandom number generators are then
+chaotic iterations based on pseudorandom number generators are then
  presented in Section~\ref{sec:efficient PRNG}. They encompass
  naive and improved efficient generators for CPU and for GPU. 
  These generators are finally experimented in Section~\ref{sec:experiments}.
  
  
  presented in Section~\ref{sec:efficient PRNG}. They encompass
  naive and improved efficient generators for CPU and for GPU. 
  These generators are finally experimented in Section~\ref{sec:experiments}.
  
  
-\section{Basic Recalls}
+\section{Basic Remindees}
  \label{section:BASIC RECALLS}
  
  This section is devoted to basic definitions and terminologies in the fields of
  \label{section:BASIC RECALLS}
  
  This section is devoted to basic definitions and terminologies in the fields of
@@ -96,13 +92,12 @@ with basic notions on topology (see for instance~\cite{Devaney}).
  \subsection{A Short Presentation of Chaos}
  
  
  \subsection{A Short Presentation of Chaos}
  
  
-Chaos theory studies the behavior of dynamical systems that are perfectly predictable, yet appear to be wildly amorphous and without meaningful. 
+Chaos theory studies the behavior of dynamical systems that are perfectly predictable, yet appear to be wildly amorphous and meaningless. 
  Chaotic systems\index{chaotic systems} are highly sensitive to initial conditions, 
  which is popularly referred to as the butterfly effect. 
  In other words, small differences in initial conditions (such as those due to rounding errors in numerical computation) yield widely diverging outcomes, 
  rendering long-term prediction impossible in general \cite{kellert1994wake}. This happens even though these systems are deterministic, meaning that their future behavior is fully determined by their initial conditions, with no random elements involved \cite{kellert1994wake}. That is, the deterministic nature of these systems does not make them predictable \cite{kellert1994wake,Werndl01032009}. This behavior is known as deterministic chaos, or simply chaos. It has been well-studied in mathematics and
  Chaotic systems\index{chaotic systems} are highly sensitive to initial conditions, 
  which is popularly referred to as the butterfly effect. 
  In other words, small differences in initial conditions (such as those due to rounding errors in numerical computation) yield widely diverging outcomes, 
  rendering long-term prediction impossible in general \cite{kellert1994wake}. This happens even though these systems are deterministic, meaning that their future behavior is fully determined by their initial conditions, with no random elements involved \cite{kellert1994wake}. That is, the deterministic nature of these systems does not make them predictable \cite{kellert1994wake,Werndl01032009}. This behavior is known as deterministic chaos, or simply chaos. It has been well-studied in mathematics and
-physics, leading among other things to the well-established definition of Devaney
-recalled thereafter.
+physics, leading among other things to the well-established definition of Devaney which can be found next.
  
  
  
  
  
  
@@ -135,12 +130,12 @@ Let $P=\{p\in \mathcal{X}|\exists n \in \mathds{N}^{\ast}:f^n(p)=p\}$ the set of
   \overline{P}=\mathcal{X} .
  \end{equation}
  
   \overline{P}=\mathcal{X} .
  \end{equation}
  
-Density of periodic orbits means that every point in the space is approached arbitrarily closely by periodic orbits. Topologically mixing systems failing this condition may not display sensitivity to initial conditions presented below, and hence may not be chaotic.
+The density of periodic orbits means that every point in  space is closely approached  by periodic orbits in an arbitrary way. Topologically mixing systems failing this condition may not display sensitivity to initial conditions presented below, and hence may not be chaotic.
  \item Sensitive dependence on initial conditions\index{sensitive dependence on initial conditions}:
  
  $\exists \varepsilon>0,$ $\forall x \in \mathcal{X},$ $\forall \delta >0,$ $\exists y \in \mathcal{X},$ $\exists n \in \mathbb{N},$ $d(x,y)<\delta$ and $d\left(f^n(x),f^n(y)\right) \geqslant \varepsilon.$
  
  \item Sensitive dependence on initial conditions\index{sensitive dependence on initial conditions}:
  
  $\exists \varepsilon>0,$ $\forall x \in \mathcal{X},$ $\forall \delta >0,$ $\exists y \in \mathcal{X},$ $\exists n \in \mathbb{N},$ $d(x,y)<\delta$ and $d\left(f^n(x),f^n(y)\right) \geqslant \varepsilon.$
  
-Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under iteration of $f$. Not all points near $x$ need eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit.
+Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under the iteration of $f$. Not all points near $x$ need eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit.
  \end{itemize}
  
  \end{definition}
  \end{itemize}
  
  \end{definition}
@@ -298,11 +293,11 @@ this fast generator cannot be proven as secure.
  \subsection{Efficient PRNGs based on Chaotic Iterations on GPU}
  \label{sec:efficient PRNG gpu}
  
  \subsection{Efficient PRNGs based on Chaotic Iterations on GPU}
  \label{sec:efficient PRNG gpu}
  
-In order to  take benefits from the computing power  of GPU, a program
+In order to  benefit from the computing power  of GPU, a program
  needs  to have  independent blocks  of  threads that  can be  computed
  simultaneously. In general,  the larger the number of  threads is, the
  more local  memory is  used, and the  less branching  instructions are
  needs  to have  independent blocks  of  threads that  can be  computed
  simultaneously. In general,  the larger the number of  threads is, the
  more local  memory is  used, and the  less branching  instructions are
-used  (if,  while,  ...),  the  better the  performances  on  GPU  is.
+used  (if,  while,  ...) and so,  the  better the  performances  on  GPU  are.
  Obviously, having these requirements in  mind, it is possible to build
  a   program    similar   to    the   one   presented    in  Listing 
  \ref{algo:seqCIPRNG}, which computes  pseudorandom numbers on GPU.  To
  Obviously, having these requirements in  mind, it is possible to build
  a   program    similar   to    the   one   presented    in  Listing 
  \ref{algo:seqCIPRNG}, which computes  pseudorandom numbers on GPU.  To
@@ -323,7 +318,7 @@ PRNGs  used in these computations must have different  parameters.
  In a given thread, these parameters are
  randomly picked from another PRNGs. 
  The  initialization stage is performed by  the CPU.
  In a given thread, these parameters are
  randomly picked from another PRNGs. 
  The  initialization stage is performed by  the CPU.
-To do it, the  ISAAC  PRNG~\cite{Jenkins96} is used to  set  all  the
+To do this, the  ISAAC  PRNG~\cite{Jenkins96} is used to  set  all  the
  parameters embedded into each thread.   
  
  The implementation of  the three
  parameters embedded into each thread.   
  
  The implementation of  the three
@@ -361,7 +356,7 @@ used simultaneously,  the number  of random numbers  that a thread  can generate
  inside   a    kernel   is   limited  (\emph{i.e.},    the    variable   \texttt{n}   in
  algorithm~\ref{algo:gpu_kernel}). For instance, if  $100,000$ threads are used and
  if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
  inside   a    kernel   is   limited  (\emph{i.e.},    the    variable   \texttt{n}   in
  algorithm~\ref{algo:gpu_kernel}). For instance, if  $100,000$ threads are used and
  if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
-then   the  memory   required   to  store all of the  internals   variables  of both the  xor-like
+then   the  memory   required   to  store all of the  internal   variables  of both the  xor-like
  PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
  and  the pseudorandom  numbers generated by  our  PRNG,  is  equal to  $100,000\times  ((4+5+6)\times
  2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
  PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
  and  the pseudorandom  numbers generated by  our  PRNG,  is  equal to  $100,000\times  ((4+5+6)\times
  2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
@@ -431,7 +426,7 @@ iterations is realized between the last stored value $x$ of the thread and a str
  (obtained by a bitwise exclusive or between a value provided by a xor-like() call
  and two values previously obtained by two other threads).
  To be certain that such iterations corresponds to the chaotic one recalled at the
  (obtained by a bitwise exclusive or between a value provided by a xor-like() call
  and two values previously obtained by two other threads).
  To be certain that such iterations corresponds to the chaotic one recalled at the
-end of the Section~\ref{sec:dev},
+end of  Section~\ref{sec:dev},
  we must guarantee that this dynamical system iterates on the space 
  $\mathcal{X} =\mathds{B}^\mathsf{N} \times \mathcal{P}\left(\llbracket 1, 2^\mathsf{N} \rrbracket\right)^\mathds{N}$.
  The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
  we must guarantee that this dynamical system iterates on the space 
  $\mathcal{X} =\mathds{B}^\mathsf{N} \times \mathcal{P}\left(\llbracket 1, 2^\mathsf{N} \rrbracket\right)^\mathds{N}$.
  The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
@@ -442,7 +437,7 @@ integer of $\llbracket 1, 2^\mathsf{N} \rrbracket$.
  Such a result is obvious, as for the xor-like(), all the
  integers belonging into its interval of definition can occur at each iteration, and thus the 
  last $t$ respects the requirement. Furthermore, it is possible to
  Such a result is obvious, as for the xor-like(), all the
  integers belonging into its interval of definition can occur at each iteration, and thus the 
  last $t$ respects the requirement. Furthermore, it is possible to
-prove by an immediate mathematical induction that, supposes that the initial $x$
+prove by an immediate mathematical induction that, supposing that the initial $x$
  is uniformly distributed, %(it is provided by a cryptographically secure PRNG),
  the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
  (this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
  is uniformly distributed, %(it is provided by a cryptographically secure PRNG),
  the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
  (this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
@@ -470,7 +465,7 @@ order to obtain the optimal performances, the storage of pseudorandom numbers
  into the GPU memory has been removed. This step is time consuming and slows down the numbers
  generation.  Moreover this   storage  is  completely
  useless, in case of applications that consume the pseudorandom
  into the GPU memory has been removed. This step is time consuming and slows down the numbers
  generation.  Moreover this   storage  is  completely
  useless, in case of applications that consume the pseudorandom
-numbers  directly   after generation. We can see  that when the number of  threads is greater
+numbers  directly   after they have been generated. We can see  that when the number of  threads is greater
  than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated
  per second  is almost constant.  With the  naive version, this value ranges from 2.5 to
  3GSamples/s.   With  the  optimized   version,  it  is  approximately  equal to
  than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated
  per second  is almost constant.  With the  naive version, this value ranges from 2.5 to
  3GSamples/s.   With  the  optimized   version,  it  is  approximately  equal to
@@ -497,10 +492,15 @@ These  experiments allow  us  to conclude  that  it  is possible  to
  generate a very large quantity of pseudorandom  numbers statistically perfect with the  xor-like version.
  
  
  generate a very large quantity of pseudorandom  numbers statistically perfect with the  xor-like version.
  
  
+\section{Summary}
  
  
  
  
-
-
+In this chapter,  a PRNG based on chaotic iterations is  presented.   It is proven to be
+chaotic according to Devaney.   Efficient implementations  on GPU
+using xor-like PRNGs  as input generators have shown that  a very large quantity
+of pseudorandom  numbers can be  generated per second (about  20Gsamples/s on a Tesla C1060), and
+that these proposed PRNGs succeed to pass the hardest battery in TestU01, namely
+the BigCrush.