Relecture

[prng_gpu.git] / prng_gpu.tex
diff --git a/prng_gpu.tex b/prng_gpu.tex

index 84efafa5443f23f4504a723320ec63e7c2f36b49..dc3d3cb8c84b23b9e2ec02267ca2598466104977 100644 (file)
--- a/prng_gpu.tex
+++ b/prng_gpu.tex
@@ -433,10 +433,10 @@ during implementations (due to the discrete nature of $f$). It is as if
  $\mathds{B}^\mathsf{N}$ represents the memory of the computer whereas $\llbracket 1 ;  \mathsf{N}
  \rrbracket^{\mathds{N}}$ is its input stream (the seeds, for instance, in PRNG, or a physical noise in TRNG).
  
-\section{Application to pseudorandomness}
+\section{Application to Pseudorandomness}
  \label{sec:pseudorandom}
  
-\subsection{A First pseudorandom Number Generator}
+\subsection{A First Pseudorandom Number Generator}
  
  We have proposed in~\cite{bgw09:ip} a new family of generators that receives 
  two PRNGs as inputs. These two generators are mixed with chaotic iterations, 
@@ -807,17 +807,26 @@ have $d((S,E),(\tilde S,E))<\epsilon$.
  \section{Efficient PRNG based on Chaotic Iterations}
  \label{sec:efficient prng}
  
-In  order to  implement efficiently  a PRNG  based on  chaotic iterations  it is
-possible to improve  previous works [ref]. One solution  consists in considering
-that the  strategy used contains all the  bits for which the  negation is
-achieved out. Then in order to apply  the negation on these bits we can simply
-apply the  xor operator between  the current number  and the strategy. In
-order to obtain the strategy we also use a classical PRNG.
+Based on the proof presented in the previous section, it is now possible to 
+improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}. 
+The first idea is to consider
+that the provided strategy is a pseudorandom Boolean vector obtained by a
+given PRNG.
+An iteration of the system is simply the bitwise exclusive or between
+the last computed state and the current strategy.
+Topological properties of disorder exhibited by chaotic 
+iterations can be inherited by the inputted generator, hoping by doing so to 
+obtain some statistical improvements while preserving speed.
  
-Here  is an  example with  16-bits numbers  showing how  the bitwise  operations
+
+Let us give an example using 16-bits numbers, to clearly understand how the bitwise xor operations
  are
-applied.  Suppose  that $x$ and the  strategy $S^i$ are defined  in binary mode.
-Then the following table shows the result of $x$ xor $S^i$.
+done.  
+Suppose  that $x$ and the  strategy $S^i$ are given as
+binary vectors.
+Table~\ref{TableExemple} shows the result of $x \oplus S^i$.
+
+\begin{table}
  $$
  \begin{array}{|cc|cccccccccccccccc|}
  \hline
@@ -831,13 +840,13 @@ x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
  \hline
   \end{array}
  $$
+\caption{Example of an arbitrary round of the proposed generator}
+\label{TableExemple}
+\end{table}
  
  
  
-
-
-\lstset{language=C,caption={C code of the sequential chaotic iterations based
-PRNG},label=algo:seqCIprng}
+\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIprng}
  \begin{lstlisting}
  unsigned int CIprng() {
    static unsigned int x = 123123123;
@@ -858,52 +867,60 @@ unsigned int CIprng() {
  
  
  
-In listing~\ref{algo:seqCIprng}  a sequential version of  our chaotic iterations
-based PRNG is  presented.  The xor operator is  represented by \textasciicircum.
-This  function uses  three classical  64-bits PRNG:  the  \texttt{xorshift}, the
-\texttt{xor128}  and  the  \texttt{xorwow}.   In  the following,  we  call  them
-xor-like PRNGSs.   These three PRNGs are  presented in~\cite{Marsaglia2003}.  As
-each xor-like PRNG  used works with 64-bits and as our  PRNG works with 32-bits,
-the use of \texttt{(unsigned int)} selects the 32 least significant bits whereas
-\texttt{(unsigned int)(t3$>>$32)}  selects the 32 most significants  bits of the
-variable \texttt{t}.   So to produce a  random number realizes  6 xor operations
-with 6 32-bits  numbers produced by 3 64-bits PRNG.   This version successes the
-BigCrush of the TestU01 battery~\cite{LEcuyerS07}.
+In Listing~\ref{algo:seqCIprng}  a sequential version of  the proposed PRNG based on chaotic iterations
+ is  presented.  The xor operator is  represented by \textasciicircum.
+This  function uses  three classical  64-bits PRNGs, namely the  \texttt{xorshift}, the
+\texttt{xor128},  and  the  \texttt{xorwow}~\cite{Marsaglia2003}.   In  the following,  we  call  them
+``xor-like PRNGs''. 
+As
+each xor-like PRNG  uses 64-bits whereas our proposed generator works with 32-bits,
+we use the command \texttt{(unsigned int)}, that selects the 32 least significant bits of a given integer, and the code
+\texttt{(unsigned int)(t3$>>$32)}  in order to obtain the 32 most significant  bits of \texttt{t}.   
  
-\section{Efficient PRNGs based on chaotic iterations on GPU}
+So producing a  pseudorandom number needs  6 xor operations
+with 6 32-bits  numbers that are provided by 3 64-bits PRNGs.   This version successfully passes the
+stringent BigCrush battery of tests~\cite{LEcuyerS07}.
+
+\section{Efficient PRNGs based on Chaotic Iterations on GPU}
  \label{sec:efficient prng gpu}
  
-In  order to benefit  from computing  power of  GPU, a  program needs  to define
-independent blocks of threads which  can be computed simultaneously. In general,
-the larger the number of threads is,  the more local memory is used and the less
-branching  instructions are  used (if,  while, ...),  the better  performance is
-obtained  on  GPU.  So  with  algorithm  \ref{algo:seqCIprng}  presented in  the
-previous section, it is possible to  build a similar program which computes PRNG
-on   GPU.  In  the   CUDA~\cite{Nvid10}  environment,   threads  have   a  local
-identificator, called \texttt{ThreadIdx} relative to the block containing them.
-
-
-\subsection{Naive version for GPU}
-
-From the CPU version, it is possible  to obtain a quite similar version for GPU.
-The principe consists in assigning the computation of a PRNG as in sequential to
-each thread  of the  GPU.  Of course,  it is  essential that the  three xor-like
-PRNGs  used for  our computation  have different  parameters. So  we  chose them
-randomly with  another PRNG. As the  initialisation is performed by  the CPU, we
-have  chosen  to  use  the  ISAAC  PRNG~\cite{Jenkins96}  to  initalize  all  the
-parameters for  the GPU version  of our PRNG.   The implementation of  the three
-xor-like  PRNGs  is  straightforward  as  soon as  their  parameters  have  been
-allocated in  the GPU memory.  Each xor-like PRNGs  used works with  an internal
-number  $x$  which keeps  the  last  generated  random numbers.  Other  internal
-variables  are   also  used   by  the  xor-like   PRNGs.  More   precisely,  the
-implementation of the  xor128, the xorshift and the  xorwow respectively require
-4, 5 and 6 unsigned long as internal variables.
+In  order to take benefits  from the computing  power of  GPU, a  program needs  to have
+independent blocks of threads that can be computed simultaneously. In general,
+the larger the number of threads is,  the more local memory is used, and the less
+branching  instructions are  used (if,  while, ...),  the better the performances on GPU is.  
+Obviously, having these requirements in mind, it is possible to  build a program similar to 
+the one presented in Algorithm  \ref{algo:seqCIprng}, which computes pseudorandom numbers
+on   GPU.  
+To do so, we must firstly recall that in
+ the   CUDA~\cite{Nvid10}  environment,   threads  have   a  local
+identifier called \texttt{ThreadIdx}, which is relative to the block containing them.
+
+
+\subsection{Naive Version for GPU}
+
+ 
+It is possible to deduce from the CPU version a quite similar version adapted to GPU.
+The simple principle consists to make each thread of the GPU computing the CPU version of our PRNG.  
+Of course,  the  three xor-like
+PRNGs  used in these computations must have different  parameters. 
+In a given thread, these lasts are
+randomly picked from another PRNGs. 
+The  initialization stage is performed by  the CPU.
+To do it, the  ISAAC  PRNG~\cite{Jenkins96} is used to  set  all  the
+parameters embedded into each thread.   
+
+The implementation of  the three
+xor-like  PRNGs  is  straightforward  when  their  parameters  have  been
+allocated in  the GPU memory.  Each xor-like  works with  an internal
+number  $x$  that saves  the  last  generated  pseudorandom number. Additionally,  the
+implementation of the  xor128, the xorshift, and the  xorwow respectively require
+4, 5, and 6 unsigned long as internal variables.
  
  \begin{algorithm}
  
  \KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
  PRNGs in global memory\;
-NumThreads: Number of threads\;}
+NumThreads: number of threads\;}
  \KwOut{NewNb: array containing random numbers in global memory}
  \If{threadIdx is concerned by the computation} {
    retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
@@ -914,37 +931,34 @@ NumThreads: Number of threads\;}
    store internal variables in InternalVarXorLikeArray[threadIdx]\;
  }
  
-\caption{main kernel for the chaotic iterations based PRNG GPU naive version}
+\caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
  \label{algo:gpu_kernel}
  \end{algorithm}
  
-Algorithm~\ref{algo:gpu_kernel}  presents a naive  implementation of  PRNG using
-GPU.  According  to the available  memory in the  GPU and the number  of threads
+Algorithm~\ref{algo:gpu_kernel}  presents a naive  implementation of the proposed  PRNG on
+GPU.  Due to the available  memory in the  GPU and the number  of threads
  used simultenaously,  the number  of random numbers  that a thread  can generate
-inside   a    kernel   is   limited,   i.e.    the    variable   \texttt{n}   in
-algorithm~\ref{algo:gpu_kernel}. For example, if  $100,000$ threads are used and
-if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)}
-then   the  memory   required   to  store   internals   variables  of   xor-like
+inside   a    kernel   is   limited  (\emph{i.e.},    the    variable   \texttt{n}   in
+algorithm~\ref{algo:gpu_kernel}). For instance, if  $100,000$ threads are used and
+if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
+then   the  memory   required   to  store all of the  internals   variables  of both the  xor-like
  PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
-and  random  number of  our  PRNG  is  equals to  $100,000\times  ((4+5+6)\times
-2+(1+100))=1,310,000$ 32-bits numbers, i.e. about $52$Mb.
+and  the pseudorandom  numbers generated by  our  PRNG,  is  equal to  $100,000\times  ((4+5+6)\times
+2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
  
-All the  tests performed  to pass the  BigCrush of TestU01  succeeded. Different
-number of threads, called \texttt{NumThreads} in our algorithm, have been tested
-upto $10$ millions.
-\newline
-\newline
-{\bf QUESTION : on laisse cette remarque, je suis mitigé !!!}
+This generator is able to pass the whole BigCrush battery of tests, for all
+the versions that have been tested depending on their number of threads 
+(called \texttt{NumThreads} in our algorithm, tested until $10$ millions).
  
  \begin{remark}
-Algorithm~\ref{algo:gpu_kernel}  has  the  advantage to  manipulate  independent
-PRNGs, so this version is easily usable on a cluster of computer. The only thing
-to ensure is to use a single ISAAC PRNG. For this, a simple solution consists in
-using a master node for the initialization which computes the initial parameters
+The proposed algorithm has  the  advantage to  manipulate  independent
+PRNGs, so this version is easily adaptable on a cluster of computers too. The only thing
+to ensure is to use a single ISAAC PRNG. To achieve this requirement, a simple solution consists in
+using a master node for the initialization. This master node computes the initial parameters
  for all the differents nodes involves in the computation.
  \end{remark}
  
-\subsection{Improved version for GPU}
+\subsection{Improved Version for GPU}
  
  As GPU cards using CUDA have shared memory between threads of the same block, it
  is possible  to use this  feature in order  to simplify the  previous algorithm,
@@ -1717,7 +1731,7 @@ proving that $H$ is not secure, a contradiction.
  
  
  
-\section{A cryptographically secure prng for GPU}
+\section{A Cryptographically Secure PRNG for GPU}
  \label{sec:CSGPU}
  It is  possible to build a  cryptographically secure prng based  on the previous
  algorithm (algorithm~\ref{algo:gpu_kernel2}).   It simply consists  in replacing