modif

[prng_gpu.git] / prng_gpu.tex
diff --git a/prng_gpu.tex b/prng_gpu.tex

index bc5b3e55f5d7896f4374abcec6c8f56b0b4f3ee8..5b07118a5f04aaeae8bc521e7aa1998e4d324321 100644 (file)
--- a/prng_gpu.tex
+++ b/prng_gpu.tex
@@ -44,6 +44,14 @@ Guyeux\thanks{Authors in alphabetic order}}
  \maketitle
  
  \begin{abstract}
+In this paper we present a new produce pseudo-random numbers generator (PRNG) on
+graphics processing units  (GPU). This PRNG is based  on chaotic iterations.  it
+is proven  to be chaotic  in the Devany's  formulation. We propose  an efficient
+implementation  for  GPU which  succeeds  to  the  {\it BigCrush},  the  hardest
+batteries of test of TestU01.  Experimentations show that this PRNG can generate
+about 20 billions of random numbers  per second on Tesla C1060 and NVidia GTX280
+cards.
+
  
  \end{abstract}
  
@@ -63,7 +71,7 @@ Concerning  the  chaotic properties,  Devaney~\cite{Devaney}  proposed a  common
  mathematical formulation of chaotic dynamical systems.
  
  In a  previous work~\cite{bgw09:ip}  we have proposed  a new familly  of chaotic
-PRNG  based on  chaotic iterations  (IC). We  have proven  that these  PRNGs are
+PRNG  based on  chaotic iterations. We  have proven  that these  PRNGs are
  chaotic in the Devaney's sense.  In this paper we propose a faster version which
  is also proven to be chaotic.
  
@@ -75,18 +83,22 @@ also provide  an efficient  PRNG for  GPU respecting based  on IC.  Such devices
  allows us to generated almost 20 billions of random numbers per second.
  
  In order  to establish  that our  PRNGs are chaotic  according to  the Devaney's
-formulation, we extend what we have proposed in~\cite{guyeux10}. Moreover,  we define a new distance to measure the disorder in the chaos and we prove some interesting properties with this distance.
+formulation, we  extend what we  have proposed in~\cite{guyeux10}.  Moreover, we
+define a  new distance to measure  the disorder in  the chaos and we  prove some
+interesting properties with this distance.
  
  The rest of this paper  is organised as follows. In Section~\ref{section:related
-  works}  we review  some GPU  implementions of  PRNG.  Section~\ref{section:BASIC RECALLS}  gives some  basic recalls  on  Devanay's formation  of chaos  and
-chaotic iterations. In Section~\ref{sec:pseudo-random} the proof of chaos of our
-PRNGs  is  studied.   Section~\ref{sec:efficient  prng}  presents  an  efficient
+  works} we  review some GPU implementions  of PRNG.  Section~\ref{section:BASIC
+  RECALLS} gives some basic recalls  on Devanay's formation of chaos and chaotic
+iterations. In  Section~\ref{sec:pseudo-random} the proof of chaos  of our PRNGs
+is   studied.    Section~\ref{sec:efficient    prng}   presents   an   efficient
  implementation of  our chaotic PRNG  on a CPU.   Section~\ref{sec:efficient prng
    gpu}   describes   the  GPU   implementation   of   our   chaotic  PRNG.    In
  Section~\ref{sec:experiments}     some    experimentations     are    presented.
  Section~\ref{sec:de  la  relativité du  désordre}  describes  the relativity  of
-disorder.  In Section~\ref{sec:  chaos order  topology} the  proof  that chaotic
-iterations can be described by iterations on a real interval is established. Finally, we give a conclusion and some perspectives.
+disorder.   In Section~\ref{sec: chaos  order topology}  the proof  that chaotic
+iterations   can  be   described   by   iterations  on   a   real  interval   is
+established. Finally, we give a conclusion and some perspectives.
  
  
  
@@ -748,29 +760,7 @@ x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
   \end{array}
  $$
  
-%% \begin{figure}[htbp]
-%% \begin{center}
-%% \fbox{
-%% \begin{minipage}{14cm}
-%% unsigned int CIprng() \{\\
-%%   static unsigned int x = 123123123;\\
-%%   unsigned long t1 = xorshift();\\
-%%   unsigned long t2 = xor128();\\
-%%   unsigned long t3 = xorwow();\\
-%%   x = x\textasciicircum (unsigned int)t1;\\
-%%   x = x\textasciicircum (unsigned int)(t2$>>$32);\\
-%%   x = x\textasciicircum (unsigned int)(t3$>>$32);\\
-%%   x = x\textasciicircum (unsigned int)t2;\\
-%%   x = x\textasciicircum (unsigned int)(t1$>>$32);\\
-%%   x = x\textasciicircum (unsigned int)t3;\\
-%%   return x;\\
-%% \}
-%% \end{minipage}
-%% }
-%% \end{center}
-%% \caption{sequential Chaotic Iteration PRNG}
-%% \label{algo:seqCIprng}
-%% \end{figure}
+
  
  
  
@@ -828,7 +818,7 @@ The principe consists in assigning the computation of a PRNG as in sequential to
  each thread  of the  GPU.  Of course,  it is  essential that the  three xor-like
  PRNGs  used for  our computation  have different  parameters. So  we  chose them
  randomly with  another PRNG. As the  initialisation is performed by  the CPU, we
-have  chosen  to  use  the  ISAAC  PRNG~\ref{Jenkins96}  to  initalize  all  the
+have  chosen  to  use  the  ISAAC  PRNG~\cite{Jenkins96}  to  initalize  all  the
  parameters for  the GPU version  of our PRNG.   The implementation of  the three
  xor-like  PRNGs  is  straightforward  as  soon as  their  parameters  have  been
  allocated in  the GPU memory.  Each xor-like PRNGs  used works with  an internal
@@ -895,7 +885,7 @@ which represent the indexes of the  other threads for which the results are used
  by the  current thread. In  the algorithm, we  consider that a  64-bits xor-like
  PRNG is used, that is why both 32-bits parts are used.
  
-This version also succeed to the BigCrush batteries of tests.
+This version also succeeds to the {\it BigCrush} batteries of tests.
  
  \begin{algorithm}
  
@@ -906,17 +896,15 @@ tab1, tab2: Arrays containing permutations of size permutation\_size\;}
  
  \KwOut{NewNb: array containing random numbers in global memory}
  \If{threadId is concerned} {
-  retrieve data from InternalVarXorLikeArray[threadId] in local variables\;
+  retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory\;
    offset = threadIdx\%permutation\_size\;
    o1 = threadIdx-offset+tab1[offset]\;
    o2 = threadIdx-offset+tab2[offset]\;
    \For{i=1 to n} {
      t=xor-like()\;
-    shared\_mem[threadId]=(unsigned int)t\;
-    x = x $\oplus$ (unsigned int) t\;
-    x = x $\oplus$ (unsigned int) (t>>32)\;
-    x = x $\oplus$ shared[o1]\;
-    x = x $\oplus$ shared[o2]\;
+    t=t$\oplus$shmem[o1]$\oplus$shmem[o2]\;
+    shared\_mem[threadId]=t\;
+    x = x $\oplus$ t\;
  
      store the new PRNG in NewNb[NumThreads*threadId+i]\;
    }
@@ -930,9 +918,9 @@ version}
  
  \subsection{Theoretical Evaluation of the Improved Version}
  
-A run of Algorithm~\ref{algo:gpu_kernel2} consists in four operations having 
+A run of Algorithm~\ref{algo:gpu_kernel2} consists in three operations having 
  the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
-system of Eq.~\ref{eq:generalIC}. That is, four iterations of the general chaotic
+system of Eq.~\ref{eq:generalIC}. That is, three iterations of the general chaotic
  iterations are realized between two stored values of the PRNG.
  To be certain that we are in the framework of Theorem~\ref{t:chaos des general},
  we must guarantee that this dynamical system iterates on the space 
@@ -956,18 +944,23 @@ Devaney's formulation of a chaotic behavior.
  
  Different experiments  have been  performed in order  to measure  the generation
  speed. We have used  a computer equiped with Tesla C1060 NVidia  GPU card and an
-Intel Xeon E5530 cadenced at 2.40 GHz for our experiments.
+Intel  Xeon E5530 cadenced  at 2.40  GHz for  our experiments  and we  have used
+another one  equipped with  a less performant  CPU and  a GeForce GTX  280. Both
+cards have 240 cores.
  
  In Figure~\ref{fig:time_gpu}  we compare the number of  random numbers generated
-per second.   In order  to obtain the  optimal number  we remove the  storage of
-random numbers  in the GPU memory. This  step is time consumming  and slows down
-the random number  generation.  Moreover, if you are  interested by applications
-that consome  random number directly when  they are generated,  their storage is
-completely useless. In this figure we can see that when the number of threads is
-greater than approximately  30,000 upto 5 millions the  number of random numbers
-generated per second is almost constant.   With the naive version, it is between
-2.5  and  3GSample/s.   With the  optimized  version,  it  is almost  equals  to
-20GSample/s.
+per second. The xor-like prng  is a xor64 described in~\cite{Marsaglia2003}.  In
+order to obtain the optimal performance  we remove the storage of random numbers
+in the GPU memory. This step is time consumming and slows down the random number
+generation.  Moreover, if you are interested by applications that consome random
+numbers  directly   when  they  are  generated,  their   storage  is  completely
+useless. In this  figure we can see  that when the number of  threads is greater
+than approximately 30,000 upto 5 millions the number of random numbers generated
+per second  is almost constant.  With the  naive version, it is  between 2.5 and
+3GSample/s.   With  the  optimized   version,  it  is  approximately  equals  to
+20GSample/s. Finally  we can remark  that both GPU  cards are quite  similar. In
+practice,  the Tesla C1060  has more  memory than  the GTX  280 and  this memory
+should be of better quality.
  
  \begin{figure}[htbp]
  \begin{center}
@@ -978,11 +971,10 @@ generated per second is almost constant.   With the naive version, it is between
  \end{figure}
  
  
-First of all we have compared the time to generate X random numbers with both
-the CPU version and the GPU version. 
+In  comparison,   Listing~\ref{algo:seqCIprng}  allows  us   to  generate  about
+138MSample/s with only one core of the Xeon E5530.
+
  
-Faire une courbe du nombre de random en fonction du nombre de threads,
-éventuellement en fonction du nombres de threads par bloc.