modif

[prng_gpu.git] / prng_gpu.tex
diff --git a/prng_gpu.tex b/prng_gpu.tex

index 39536c619accd44ebe4ff4c130b2bcfb4a56ab2e..5b07118a5f04aaeae8bc521e7aa1998e4d324321 100644 (file)
--- a/prng_gpu.tex
+++ b/prng_gpu.tex
@@ -44,6 +44,14 @@ Guyeux\thanks{Authors in alphabetic order}}
  \maketitle
  
  \begin{abstract}
+In this paper we present a new produce pseudo-random numbers generator (PRNG) on
+graphics processing units  (GPU). This PRNG is based  on chaotic iterations.  it
+is proven  to be chaotic  in the Devany's  formulation. We propose  an efficient
+implementation  for  GPU which  succeeds  to  the  {\it BigCrush},  the  hardest
+batteries of test of TestU01.  Experimentations show that this PRNG can generate
+about 20 billions of random numbers  per second on Tesla C1060 and NVidia GTX280
+cards.
+
  
  \end{abstract}
  
@@ -56,44 +64,79 @@ generator (PRNG) needs  to verify some features to be used  by scientists. It is
  important  to  be  able  to  generate  pseudo-random  numbers  efficiently,  the
  generation  needs to  be reproducible  and a  PRNG needs  to satisfy  many usual
  statistical properties. Finally, from our point a view, it is essential to prove
-that a  PRNG is chaotic.  Devaney~\cite{Devaney} proposed  a common mathematical
-formulation  of chaotic  dynamical  systems. Concerning  the statistical  tests,
-TestU01the is  the best-known public-domain statistical testing  packages. So we
-use it for all our PRNGs, especially  the {\it BigCrush} which is based on the largest
-serie of tests.
+that  a PRNG  is  chaotic.  Concerning  the  statistical tests,  TestU01 is  the
+best-known public-domain statistical testing package.   So we use it for all our
+PRNGs, especially the {\it BigCrush}  which provides the largest serie of tests.
+Concerning  the  chaotic properties,  Devaney~\cite{Devaney}  proposed a  common
+mathematical formulation of chaotic dynamical systems.
  
  In a  previous work~\cite{bgw09:ip}  we have proposed  a new familly  of chaotic
-PRNG  based on  chaotic iterations  (IC).   In this  paper we  propose a  faster
-version which is also proven to be chaotic with the Devaney formulation.
+PRNG  based on  chaotic iterations. We  have proven  that these  PRNGs are
+chaotic in the Devaney's sense.  In this paper we propose a faster version which
+is also proven to be chaotic.
  
  Although graphics  processing units (GPU)  was initially designed  to accelerate
  the manipulation of  images, they are nowadays commonly  used in many scientific
  applications. Therefore,  it is important  to be able to  generate pseudo-random
  numbers inside a GPU when a scientific application runs in a GPU. That is why we
-also provide an efficient PRNG for GPU respecting based on IC.
+also provide  an efficient  PRNG for  GPU respecting based  on IC.  Such devices
+allows us to generated almost 20 billions of random numbers per second.
  
+In order  to establish  that our  PRNGs are chaotic  according to  the Devaney's
+formulation, we  extend what we  have proposed in~\cite{guyeux10}.  Moreover, we
+define a  new distance to measure  the disorder in  the chaos and we  prove some
+interesting properties with this distance.
  
+The rest of this paper  is organised as follows. In Section~\ref{section:related
+  works} we  review some GPU implementions  of PRNG.  Section~\ref{section:BASIC
+  RECALLS} gives some basic recalls  on Devanay's formation of chaos and chaotic
+iterations. In  Section~\ref{sec:pseudo-random} the proof of chaos  of our PRNGs
+is   studied.    Section~\ref{sec:efficient    prng}   presents   an   efficient
+implementation of  our chaotic PRNG  on a CPU.   Section~\ref{sec:efficient prng
+  gpu}   describes   the  GPU   implementation   of   our   chaotic  PRNG.    In
+Section~\ref{sec:experiments}     some    experimentations     are    presented.
+Section~\ref{sec:de  la  relativité du  désordre}  describes  the relativity  of
+disorder.   In Section~\ref{sec: chaos  order topology}  the proof  that chaotic
+iterations   can  be   described   by   iterations  on   a   real  interval   is
+established. Finally, we give a conclusion and some perspectives.
  
  
-Interet des itérations chaotiques pour générer des nombre alea\\
-Interet de générer des nombres alea sur GPU
  
  
  \section{Related works on GPU based PRNGs}
-
+\label{section:related works}
  In the litterature many authors have work on defining GPU based PRNGs. We do not
  want to be exhaustive and we just give the most significant works from our point
-of view.
+of view. When authors mention the  number of random numbers generated per second
+we mention  it. We  consider that  a million numbers  per second  corresponds to
+1MSample/s and than a billion numbers per second corresponds to 1GSample/s.
  
  In \cite{Pang:2008:cec},  the authors define  a PRNG based on  cellular automata
  which  does   not  require  high  precision  integer   arithmetics  nor  bitwise
  operations. There is no mention of statistical tests nor proof that this PRNG is
-chaotic. Concerning  the speed  of generation, they  can generate  about 3200000
-random numbers per seconds on a GeForce 7800 GTX GPU (which is quite old now).
+chaotic.  Concerning   the  speed  of   generation,  they  can   generate  about
+3.2MSample/s on a GeForce 7800 GTX GPU (which is quite old now).
  
  In \cite{ZRKB10}, the authors propose  different versions of efficient GPU PRNGs
-based on  Lagged Fibonacci,  Hybrid Taus  or Hybrid Taus.  They have  used these
-PRNGs for Langevin simulations of biomolecules fully implemented on GPU.
+based on  Lagged Fibonacci, Hybrid  Taus or Hybrid  Taus.  They have  used these
+PRNGs   for  Langevin   simulations   of  biomolecules   fully  implemented   on
+GPU. Performance of  the GPU versions are far better than  those obtained with a
+CPU and these PRNGs succeed to pass the {\it BigCrush} test of TestU01. There is
+no mention that their PRNGs have chaos mathematical properties.
+
+
+Authors of~\cite{conf/fpga/ThomasHL09}  have studied the  implementation of some
+PRNGs on  diferrent computing architectures: CPU,  field-programmable gate array
+(FPGA), GPU and massively parallel  processor. This study is interesting because
+it  shows the  performance  of the  same  PRNGs on  different architeture.   For
+example,  the FPGA  is globally  the  fastest architecture  and it  is also  the
+efficient one because it provides the fastest number of generated random numbers
+per joule. Concerning the GPU,  authors can generate betweend 11 and 16GSample/s
+with a GTX 280  GPU. The drawback of this work is  that those PRNGs only succeed
+the {\it Crush} test which is easier than the {\it Big Crush} test.
+\newline
+\newline
+To the best of our knowledge no GPU implementation have been proven to have chaotic properties.
  
  \section{Basic Recalls}
  \label{section:BASIC RECALLS}
@@ -320,7 +363,7 @@ $\mathds{B}^\mathsf{N}$ represents the memory of the computer whereas $\llbracke
  \rrbracket^{\mathds{N}}$ is its input stream (the seeds, for instance).
  
  \section{Application to Pseudo-Randomness}
-
+\label{sec:pseudo-random}
  \subsection{A First Pseudo-Random Number Generator}
  
  We have proposed in~\cite{bgw09:ip} a new family of generators that receives 
@@ -690,6 +733,7 @@ have $d((S,E),(\tilde S,E))<\epsilon$.
  
  
  \section{Efficient PRNG based on Chaotic Iterations}
+\label{sec:efficient prng}
  
  In  order to  implement efficiently  a PRNG  based on  chaotic iterations  it is
  possible to improve  previous works [ref]. One solution  consists in considering
@@ -716,29 +760,7 @@ x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
   \end{array}
  $$
  
-%% \begin{figure}[htbp]
-%% \begin{center}
-%% \fbox{
-%% \begin{minipage}{14cm}
-%% unsigned int CIprng() \{\\
-%%   static unsigned int x = 123123123;\\
-%%   unsigned long t1 = xorshift();\\
-%%   unsigned long t2 = xor128();\\
-%%   unsigned long t3 = xorwow();\\
-%%   x = x\textasciicircum (unsigned int)t1;\\
-%%   x = x\textasciicircum (unsigned int)(t2$>>$32);\\
-%%   x = x\textasciicircum (unsigned int)(t3$>>$32);\\
-%%   x = x\textasciicircum (unsigned int)t2;\\
-%%   x = x\textasciicircum (unsigned int)(t1$>>$32);\\
-%%   x = x\textasciicircum (unsigned int)t3;\\
-%%   return x;\\
-%% \}
-%% \end{minipage}
-%% }
-%% \end{center}
-%% \caption{sequential Chaotic Iteration PRNG}
-%% \label{algo:seqCIprng}
-%% \end{figure}
+
  
  
  
@@ -776,7 +798,8 @@ variable \texttt{t}.   So to produce a  random number realizes  6 xor operations
  with 6 32-bits  numbers produced by 3 64-bits PRNG.   This version successes the
  BigCrush of the TestU01 battery~\cite{LEcuyerS07}.
  
-\section{Efficient prng based on chaotic iterations on GPU}
+\section{Efficient PRNGs based on chaotic iterations on GPU}
+\label{sec:efficient prng gpu}
  
  In  order to benefit  from computing  power of  GPU, a  program needs  to define
  independent blocks of threads which  can be computed simultaneously. In general,
@@ -784,8 +807,8 @@ the larger the number of threads is,  the more local memory is used and the less
  branching  instructions are  used (if,  while, ...),  the better  performance is
  obtained  on  GPU.  So  with  algorithm  \ref{algo:seqCIprng}  presented in  the
  previous section, it is possible to  build a similar program which computes PRNG
-on  GPU. In  the CUDA  [ref] environment,  threads have  a  local identificator,
-called \texttt{ThreadIdx} relative to the block containing them.
+on   GPU.  In  the   CUDA~\cite{Nvid10}  environment,   threads  have   a  local
+identificator, called \texttt{ThreadIdx} relative to the block containing them.
  
  
  \subsection{Naive version for GPU}
@@ -795,14 +818,14 @@ The principe consists in assigning the computation of a PRNG as in sequential to
  each thread  of the  GPU.  Of course,  it is  essential that the  three xor-like
  PRNGs  used for  our computation  have different  parameters. So  we  chose them
  randomly with  another PRNG. As the  initialisation is performed by  the CPU, we
-have chosen to use the ISAAC PRNG  [ref] to initalize all the parameters for the
-GPU version  of our  PRNG.  The  implementation of the  three xor-like  PRNGs is
-straightforward  as soon  as their  parameters have  been allocated  in  the GPU
-memory. Each xor-like  PRNGs used works with an internal  number $x$ which keeps
-the last generated random numbers. Other internal variables are also used by the
-xor-like PRNGs. More  precisely, the implementation of the  xor128, the xorshift
-and  the xorwow  respectively  require 4,  5  and 6  unsigned  long as  internal
-variables.
+have  chosen  to  use  the  ISAAC  PRNG~\cite{Jenkins96}  to  initalize  all  the
+parameters for  the GPU version  of our PRNG.   The implementation of  the three
+xor-like  PRNGs  is  straightforward  as  soon as  their  parameters  have  been
+allocated in  the GPU memory.  Each xor-like PRNGs  used works with  an internal
+number  $x$  which keeps  the  last  generated  random numbers.  Other  internal
+variables  are   also  used   by  the  xor-like   PRNGs.  More   precisely,  the
+implementation of the  xor128, the xorshift and the  xorwow respectively require
+4, 5 and 6 unsigned long as internal variables.
  
  \begin{algorithm}
  
@@ -862,7 +885,7 @@ which represent the indexes of the  other threads for which the results are used
  by the  current thread. In  the algorithm, we  consider that a  64-bits xor-like
  PRNG is used, that is why both 32-bits parts are used.
  
-This version also succeed to the BigCrush batteries of tests.
+This version also succeeds to the {\it BigCrush} batteries of tests.
  
  \begin{algorithm}
  
@@ -873,17 +896,15 @@ tab1, tab2: Arrays containing permutations of size permutation\_size\;}
  
  \KwOut{NewNb: array containing random numbers in global memory}
  \If{threadId is concerned} {
-  retrieve data from InternalVarXorLikeArray[threadId] in local variables\;
+  retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory\;
    offset = threadIdx\%permutation\_size\;
    o1 = threadIdx-offset+tab1[offset]\;
    o2 = threadIdx-offset+tab2[offset]\;
    \For{i=1 to n} {
      t=xor-like()\;
-    shared\_mem[threadId]=(unsigned int)t\;
-    x = x $\oplus$ (unsigned int) t\;
-    x = x $\oplus$ (unsigned int) (t>>32)\;
-    x = x $\oplus$ shared[o1]\;
-    x = x $\oplus$ shared[o2]\;
+    t=t$\oplus$shmem[o1]$\oplus$shmem[o2]\;
+    shared\_mem[threadId]=t\;
+    x = x $\oplus$ t\;
  
      store the new PRNG in NewNb[NumThreads*threadId+i]\;
    }
@@ -897,9 +918,9 @@ version}
  
  \subsection{Theoretical Evaluation of the Improved Version}
  
-A run of Algorithm~\ref{algo:gpu_kernel2} consists in four operations having 
+A run of Algorithm~\ref{algo:gpu_kernel2} consists in three operations having 
  the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
-system of Eq.~\ref{eq:generalIC}. That is, four iterations of the general chaotic
+system of Eq.~\ref{eq:generalIC}. That is, three iterations of the general chaotic
  iterations are realized between two stored values of the PRNG.
  To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
  we must guarantee that this dynamical system iterates on the space 
@@ -919,23 +940,41 @@ chaotic iterations presented previously, and for this reason, it satisfies the
  Devaney's formulation of a chaotic behavior.
  
  \section{Experiments}
-
-Different experiments have been performed in order to measure the generation
-speed.
-\begin{figure}[t]
+\label{sec:experiments}
+
+Different experiments  have been  performed in order  to measure  the generation
+speed. We have used  a computer equiped with Tesla C1060 NVidia  GPU card and an
+Intel  Xeon E5530 cadenced  at 2.40  GHz for  our experiments  and we  have used
+another one  equipped with  a less performant  CPU and  a GeForce GTX  280. Both
+cards have 240 cores.
+
+In Figure~\ref{fig:time_gpu}  we compare the number of  random numbers generated
+per second. The xor-like prng  is a xor64 described in~\cite{Marsaglia2003}.  In
+order to obtain the optimal performance  we remove the storage of random numbers
+in the GPU memory. This step is time consumming and slows down the random number
+generation.  Moreover, if you are interested by applications that consome random
+numbers  directly   when  they  are  generated,  their   storage  is  completely
+useless. In this  figure we can see  that when the number of  threads is greater
+than approximately 30,000 upto 5 millions the number of random numbers generated
+per second  is almost constant.  With the  naive version, it is  between 2.5 and
+3GSample/s.   With  the  optimized   version,  it  is  approximately  equals  to
+20GSample/s. Finally  we can remark  that both GPU  cards are quite  similar. In
+practice,  the Tesla C1060  has more  memory than  the GTX  280 and  this memory
+should be of better quality.
+
+\begin{figure}[htbp]
  \begin{center}
    \includegraphics[scale=.7]{curve_time_gpu.pdf}
  \end{center}
  \caption{Number of random numbers generated per second}
-\label{fig:time_naive_gpu}
+\label{fig:time_gpu}
  \end{figure}
  
  
-First of all we have compared the time to generate X random numbers with both
-the CPU version and the GPU version. 
+In  comparison,   Listing~\ref{algo:seqCIprng}  allows  us   to  generate  about
+138MSample/s with only one core of the Xeon E5530.
+
  
-Faire une courbe du nombre de random en fonction du nombre de threads,
-éventuellement en fonction du nombres de threads par bloc.
  
  
  
@@ -1050,7 +1089,7 @@ sets $A$ and $B$, an integer $n$ must satisfy $f^{(n)}(A) \cap B \neq
  
  
  \section{Chaos on the order topology}
-
+\label{sec: chaos order topology}
  \subsection{The phase space is an interval of the real line}
  
  \subsubsection{Toward a topological semiconjugacy}