From 26aa4e0cf2939580eae07e8bbad140d527f47234 Mon Sep 17 00:00:00 2001
From: guyeux <guyeux@gmail.com>
Date: Wed, 30 Nov 2011 21:23:53 +0100
Subject: [PATCH 1/1] Almost finished with a proofreading pass

---
 prng_gpu.tex | 170 ++++++++++++++++++++++++++++++---------------------
 1 file changed, 99 insertions(+), 71 deletions(-)
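
Note (not part of the diff): the 32-bit BBS step described in the
"A Cryptographically Secure PRNG for GPU" hunk below can be sketched in C as
follows. This is only a minimal sketch under stated assumptions: the modulus
M = 239 * 251 = 59989 is a hypothetical choice (both primes are congruent to
3 modulo 4 and M < 2^16), the names bbs_step and bbs_next32 are illustrative
and do not come from the paper, and a single M is used for simplicity whereas
the text calls for different combinations of M.

```c
#include <stdint.h>

/* Hypothetical modulus: 239 and 251 are primes congruent to 3 mod 4,
   and 239 * 251 = 59989 < 2^16, so x * x fits in 32 bits when x < BBS_M. */
#define BBS_M 59989u

/* One BBS iteration: x_{n+1} = x_n^2 mod M. Assumes x < BBS_M. */
static uint32_t bbs_step(uint32_t x) {
    return (x * x) % BBS_M;
}

/* Builds a 32-bit number from 8 BBS iterations, keeping the
   4 least significant bits of each new state. Assumes *state < BBS_M. */
static uint32_t bbs_next32(uint32_t *state) {
    uint32_t out = 0;
    for (int i = 0; i < 8; i++) {
        *state = bbs_step(*state);
        out = (out << 4) | (*state & 0xFu);
    }
    return out;
}
```

Keeping M below 2^16 is what lets the squaring stay within 32-bit arithmetic,
which is the constraint the text derives from the cost of the modulo
operation on GPU.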

diff --git a/prng_gpu.tex b/prng_gpu.tex
index 1c7c9fe..0279f03 100644
--- a/prng_gpu.tex
+++ b/prng_gpu.tex
@@ -141,8 +141,8 @@ The remainder of this paper  is organized as follows. In Section~\ref{section:re
   and on an iteration process called ``chaotic
 iterations'' on which the post-treatment is based. 
 Proofs of chaos are given in  Section~\ref{sec:pseudorandom}.
-Section~\ref{sec:efficient    prng}   presents   an   efficient
-implementation of  this chaotic PRNG  on a CPU, whereas   Section~\ref{sec:efficient prng
+Section~\ref{sec:efficient    PRNG}   presents   an   efficient
+implementation of  this chaotic PRNG  on a CPU, whereas   Section~\ref{sec:efficient PRNG
   gpu}   describes   the  GPU   implementation. 
 Such generators are experimented in 
 Section~\ref{sec:experiments}.
@@ -805,7 +805,7 @@ have $d((S,E),(\tilde S,E))<\epsilon$.
 
 
 \section{Efficient PRNG based on Chaotic Iterations}
-\label{sec:efficient prng}
+\label{sec:efficient PRNG}
 
 Based on the proof presented in the previous section, it is now possible to 
 improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}. 
@@ -846,9 +846,9 @@ $$
 
 
 
-\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIprng}
+\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIPRNG}
 \begin{lstlisting}
-unsigned int CIprng() {
+unsigned int CIPRNG() {
   static unsigned int x = 123123123;
   unsigned long t1 = xorshift();
   unsigned long t2 = xor128();
@@ -867,7 +867,7 @@ unsigned int CIprng() {
 
 
 
-In Listing~\ref{algo:seqCIprng}  a sequential version of  the proposed PRNG based on chaotic iterations
+In Listing~\ref{algo:seqCIPRNG}  a sequential version of  the proposed PRNG based on chaotic iterations
  is  presented.  The xor operator is  represented by \textasciicircum.
 This  function uses  three classical  64-bits PRNGs, namely the  \texttt{xorshift}, the
 \texttt{xor128},  and  the  \texttt{xorwow}~\cite{Marsaglia2003}.   In  the following,  we  call  them
@@ -882,14 +882,14 @@ with 6 32-bits  numbers that are provided by 3 64-bits PRNGs.   This version suc
 stringent BigCrush battery of tests~\cite{LEcuyerS07}.
 
 \section{Efficient PRNGs based on Chaotic Iterations on GPU}
-\label{sec:efficient prng gpu}
+\label{sec:efficient PRNG gpu}
 
 In  order to take benefits  from the computing  power of  GPU, a  program needs  to have
 independent blocks of threads that can be computed simultaneously. In general,
 the larger the number of threads is,  the more local memory is used, and the less
 branching  instructions are  used (if,  while, ...),  the better the performances on GPU is.  
 Obviously, having these requirements in mind, it is possible to  build a program similar to 
-the one presented in Algorithm  \ref{algo:seqCIprng}, which computes pseudorandom numbers
+the one presented in Algorithm  \ref{algo:seqCIPRNG}, which computes pseudorandom numbers
 on   GPU.  
 To do so, we must firstly recall that in
  the   CUDA~\cite{Nvid10}  environment,   threads  have   a  local
@@ -925,7 +925,7 @@ NumThreads: number of threads\;}
 \If{threadIdx is concerned by the computation} {
   retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
   \For{i=1 to n} {
-    compute a new PRNG as in Listing\ref{algo:seqCIprng}\;
+    compute a new PRNG as in Listing~\ref{algo:seqCIPRNG}\;
     store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
   }
   store internal variables in InternalVarXorLikeArray[threadIdx]\;
@@ -962,19 +962,21 @@ for all the differents nodes involves in the computation.
 
 As GPU cards using CUDA have shared memory between threads of the same block, it
 is possible  to use this  feature in order  to simplify the  previous algorithm,
-i.e., using less  than 3 xor-like PRNGs. The solution  consists in computing only
-one xor-like PRNG by thread, saving  it into shared memory and using the results
+i.e., to use fewer than 3 xor-like PRNGs. The solution  consists in computing only
+one xor-like PRNG per thread, saving  it into the shared memory, and then using the results
 of some  other threads in the  same block of  threads. In order to  define which
-thread uses the result of which other  one, we can use a permutation array which
+thread uses the result of which other  one, we can use a permutation array that
 contains  the indexes  of  all threads  and  for which  a  permutation has  been
-performed.  In Algorithm~\ref{algo:gpu_kernel2}, 2 permutations arrays are used.
+performed. 
+
+In Algorithm~\ref{algo:gpu_kernel2}, two permutation arrays are used.
 The    variable   \texttt{offset}    is    computed   using    the   value    of
 \texttt{permutation\_size}.   Then we  can compute  \texttt{o1}  and \texttt{o2}
-which represent the indexes of the  other threads for which the results are used
-by the  current thread. In  the algorithm, we  consider that a  64-bits xor-like
-PRNG is used, that is why both 32-bits parts are used.
+representing the indexes of the  other threads whose results are used
+by the  current one. In  this algorithm, we  consider that a  64-bit xor-like
+PRNG has been chosen, and so its two 32-bit parts are used.
 
-This version also succeeds to the {\it BigCrush} batteries of tests.
+This version can also pass the whole {\it BigCrush} battery of tests.
 
 \begin{algorithm}
 
@@ -1007,22 +1009,28 @@ version}
 
 \subsection{Theoretical Evaluation of the Improved Version}
 
-A run of Algorithm~\ref{algo:gpu_kernel2} consists in three operations having 
+A run of Algorithm~\ref{algo:gpu_kernel2} consists in an operation ($x=x\oplus t$) having 
 the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
-system of Eq.~\ref{eq:generalIC}. That is, three iterations of the general chaotic
-iterations are realized between two stored values of the PRNG.
+system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
+iterations is realized between the last stored value $x$ of the thread and a strategy $t$
+(obtained by a bitwise exclusive or between a value provided by a xor-like() call
+and two values previously obtained by two other threads).
 To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
 we must guarantee that this dynamical system iterates on the space 
 $\mathcal{X} = \mathcal{P}\left(\llbracket 1, \mathsf{N} \rrbracket\right)^\mathds{N}\times\mathds{B}^\mathsf{N}$.
 The left term $x$ obviously belongs into $\mathds{B}^ \mathsf{N}$.
-To prevent from any flaws of chaotic properties, we must check that each right 
-term, corresponding to terms of the strategies,  can possibly be equal to any
+To prevent any flaw in the chaotic properties, we must check that the right 
+term (the last $t$), corresponding to the strategies,  can possibly be equal to any
 integer of $\llbracket 1, \mathsf{N} \rrbracket$. 
 
-Such a result is obvious for the two first lines, as for the xor-like(), all the
-integers belonging into its interval of definition can occur at each iteration.
-It can be easily stated for the two last lines by an immediate mathematical
-induction.
+Such a result is obvious: as for the xor-like(), all the
+integers belonging to its interval of definition can occur at each iteration, and thus the 
+last $t$ respects the requirement. Furthermore, it is possible to
+prove by an immediate mathematical induction that, as the initial $x$
+is uniformly distributed (it is provided by a cryptographically secure PRNG),
+the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
+and thus the next $x$ is finally
+uniformly distributed.
 
 Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
 chaotic iterations presented previously, and for this reason, it satisfies the 
@@ -1032,56 +1040,65 @@ Devaney's formulation of a chaotic behavior.
 \label{sec:experiments}
 
 Different experiments  have been  performed in order  to measure  the generation
-speed. We have used  a computer equiped with Tesla C1060 NVidia  GPU card and an
-Intel  Xeon E5530 cadenced  at 2.40  GHz for  our experiments  and we  have used
-another one  equipped with  a less performant  CPU and  a GeForce GTX  280. Both
+speed. We have used a first computer equipped with a Tesla C1060 NVidia  GPU card
+and an
+Intel  Xeon E5530 clocked  at 2.40  GHz,  and 
+a second computer  equipped with a less powerful  CPU and  a GeForce GTX  280. 
+Both
 cards have 240 cores.
 
-In  Figure~\ref{fig:time_xorlike_gpu} we  compare the  number of  random numbers
-generated per second with the xor-like based PRNG. In this figure, the optimized
-version use the {\it xor64} described in~\cite{Marsaglia2003}. The naive version
-use  the three  xor-like  PRNGs described  in Listing~\ref{algo:seqCIprng}.   In
-order to obtain the optimal performance we removed the storage of random numbers
-in the GPU memory. This step is time consuming and slows down the random numbers
-generation.  Moreover, if one is  interested by applications that consume random
-numbers  directly   when  they  are  generated,  their   storage  are  completely
-useless. In this  figure we can see  that when the number of  threads is greater
-than approximately 30,000 upto 5 millions the number of random numbers generated
-per second  is almost constant.  With the  naive version, it is  between 2.5 and
-3GSample/s.   With  the  optimized   version,  it  is  approximately  equals  to
-20GSample/s. Finally  we can remark  that both GPU  cards are quite  similar. In
-practice,  the Tesla C1060  has more  memory than  the GTX  280 and  this memory
+In  Figure~\ref{fig:time_xorlike_gpu} we  compare the  quantity of  pseudorandom numbers
+generated per second with various xor-like based PRNGs. In this figure, the optimized
+versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions
+embed  the three  xor-like  PRNGs described  in Listing~\ref{algo:seqCIPRNG}.   In
+order to obtain the optimal performance, the storage of pseudorandom numbers
+into the GPU memory has been removed. This step is time consuming and slows down the
+generation of numbers.  Moreover, this   storage  is  completely
+useless for applications that consume the pseudorandom
+numbers  directly   after generation. We can see  that when the number of  threads is greater
+than approximately 30,000 and lower than 5 million, the quantity of pseudorandom numbers generated
+per second  is almost constant.  With the  naive version, this value ranges from 2.5 to
+3GSamples/s.   With  the  optimized   version,  it  is  approximately  equal to
+20GSamples/s. Finally  we can remark  that both GPU  cards are quite  similar, but in
+practice,  the Tesla C1060  has more  memory than  the GTX  280, and  this memory
 should be of better quality.
+As a  comparison,   Listing~\ref{algo:seqCIPRNG}  leads   to the  generation of  about
+138MSamples/s when using one core of the Xeon E5530.
 
 \begin{figure}[htbp]
 \begin{center}
   \includegraphics[scale=.7]{curve_time_xorlike_gpu.pdf}
 \end{center}
-\caption{Number of random numbers generated per second with the xorlike based PRNG}
+\caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG}
 \label{fig:time_xorlike_gpu}
 \end{figure}
 
 
-In  comparison,   Listing~\ref{algo:seqCIprng}  allows  us   to  generate  about
-138MSample/s with only one core of the Xeon E5530.
 
 
-In Figure~\ref{fig:time_bbs_gpu}  we highlight the performance  of the optimized
-BBS based  PRNG on GPU. Performances are  less important. On the  Tesla C1060 we
-obtain approximately 1.8GSample/s and on the GTX 280 about 1.6GSample/s.
+
+In Figure~\ref{fig:time_bbs_gpu}  we highlight the performance  of the optimized
+BBS-based  PRNG on GPU. On the  Tesla C1060 we
+obtain approximately 1.8GSamples/s and on the GTX 280 about 1.6GSamples/s, which is
+obviously slower than the xorlike-based PRNG on GPU. However, we will show in the 
+next sections that 
+this new PRNG has a strong level of security, which is necessarily paid for by a speed
+reduction. 
 
 \begin{figure}[htbp]
 \begin{center}
   \includegraphics[scale=.7]{curve_time_bbs_gpu.pdf}
 \end{center}
-\caption{Number of random numbers generated per second with the BBS based PRNG}
+\caption{Quantity of pseudorandom numbers generated per second using the BBS-based PRNG}
 \label{fig:time_bbs_gpu}
 \end{figure}
 
-Both  these  experiments allows  us  to conclude  that  it  is possible  to
-generate a  huge number of pseudorandom  numbers with the  xor-like version and
-about tens  times less with the BBS  based version. The former  version has only
-chaotic properties whereas the latter also has cryptographically properties.
+All  these  experiments allow  us  to conclude  that  it  is possible  to
+generate a very large quantity of statistically perfect pseudorandom  numbers with the  xor-like version.
+To a certain extent, this is also the case with the secure BBS-based version, the speed reduction being
+explained by the fact that the former  version has ``only''
+chaotic properties and statistical perfection, whereas the latter is also cryptographically secure,
+as is shown in the next sections.
 
 
 
@@ -1106,7 +1123,7 @@ The notion of {\it secure} PRNGs can now be defined as follows.
 A cryptographic PRNG $G$ is secure if for any probabilistic polynomial time
 algorithm $D$, for any positive polynomial $p$, and for all sufficiently
 large $k$'s,
-$$| \mathrm{Pr}[D(G(U_k))=1]-Pr[D(U_{\ell_G(k)}=1]|< \frac{1}{p(N)},$$
+$$| \mathrm{Pr}[D(G(U_k))=1]-\mathrm{Pr}[D(U_{\ell_G(k)})=1]|< \frac{1}{p(k)},$$
 where $U_r$ is the uniform distribution over $\{0,1\}^r$ and the
 probabilities are taken over $U_N$, $U_{\ell_G(N)}$ as well as over the
 internal coin tosses of $D$. 
@@ -1133,6 +1150,7 @@ We claim now that if this PRNG is secure,
 then the new one is secure too.
 
 \begin{proposition}
+\label{cryptopreuve}
 If $H$ is a secure cryptographic PRNG, then $X$ is a secure cryptographic
 PRNG too.
 \end{proposition}
@@ -1199,40 +1217,50 @@ proving that $H$ is not secure, a contradiction.
 \end{proof}
 
 
+\section{Cryptographic Applications}
 
-
-\section{A Cryptographically Secure PRNG for GPU}
+\subsection{A Cryptographically Secure PRNG for GPU}
 \label{sec:CSGPU}
-It is  possible to build a  cryptographically secure prng based  on the previous
-algorithm (algorithm~\ref{algo:gpu_kernel2}).   It simply consists  in replacing
-the  {\it  xor-like} algorithm  by  another  cryptographically  secure prng.  In
-practice, we suggest  to use the BBS algorithm~\cite{BBS}  which takes the form:
-$$x_{n+1}=x_n^2~ mod~ M$$  where $M$ is the product of  two prime numbers. Those
-prime numbers  need to be congruent  to 3 modulus  4. In practice, this  PRNG is
-known to  be slow and  not efficient for  the generation of random  numbers. For
-current  GPU   cards,  the  modulus   operation  is  the  most   time  consuming
-operation. So in  order to obtain quite reasonable  performances, it is required
+
+It is  possible to build a  cryptographically secure PRNG based  on the previous
+algorithm (Algorithm~\ref{algo:gpu_kernel2}).   Due to Proposition~\ref{cryptopreuve},
+it simply consists  in replacing
+the  {\it  xor-like} PRNG  by  a  cryptographically  secure one.  
+We have chosen the Blum Blum Shub generator~\cite{BBS} (usually denoted by BBS) having the form:
+$$x_{n+1}=x_n^2 \bmod M$$  where $M$ is the product of  two prime numbers. These
+prime numbers  need to be congruent  to 3 modulo  4. BBS is
+very slow and only usable for cryptographic applications. 
+
+  
+The  modulo   operation  is  the  most   time-consuming operation for
+current  GPU   cards. 
+So in  order to obtain quite reasonable  performance, it is required
 to use only modulus on 32  bits integer numbers. Consequently $x_n^2$ need to be
 less than  $2^{32}$ and the  number $M$  need to be  less than $2^{16}$.   So in
-pratice we can  choose prime numbers around 256 that are  congruent to 3 modulus
+practice we can  choose prime numbers around 256 that are  congruent to 3 modulo
 4.  With  32 bits numbers,  only the  4 least significant  bits of $x_n$  can be
-chosen  (the   maximum  number  of   undistinguishing  is  less  or   equals  to
+chosen  (the   maximum  number  of   indistinguishable bits  is  less than  or   equal  to
 $log_2(log_2(x_n))$). So  to generate a 32 bits  number, we need to  use 8 times
-the BBS algorithm, with different combinations of $M$ is required.
+the BBS algorithm with different combinations of $M$.
 
 Currently this PRNG does not succeed to pass all the tests of TestU01.
 
 
+\subsection{A Secure Asymmetric Cryptosystem}
+
+
+
+
 \section{Conclusion}
 
 
 In  this  paper  we have  presented  a  new  class  of  PRNGs based  on  chaotic
-iterations. We have proven that these PRNGs are chaotic in the sense of Devenay.
+iterations. We have proven that these PRNGs are chaotic in the sense of Devaney.
 We also propose a PRNG cryptographically secure and its implementation on GPU.
 
 An  efficient implementation  on  GPU based  on  a xor-like  PRNG  allows us  to
 generate   a  huge   number   of  pseudorandom   numbers   per  second   (about
-20Gsample/s). This PRNG succeeds to pass the hardest batteries of TestU01.
+20GSamples/s). This PRNG succeeds in passing the hardest batteries of TestU01.
 
 In future  work we plan to  extend this work  for parallel PRNG for  clusters or
 grid computing. We also plan to improve  the BBS version in order to succeed all
-- 
2.39.5