X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/prng_gpu.git/blobdiff_plain/0a8d52f6e872c9df6f150a008456bab4557b7d09..e443cbfb0433e0a7e1551c9841b0173698151f96:/prng_gpu.tex?ds=inline

diff --git a/prng_gpu.tex b/prng_gpu.tex
index 9de2d15..dc3d3cb 100644
--- a/prng_gpu.tex
+++ b/prng_gpu.tex
@@ -109,7 +109,7 @@ Let us finish this paragraph by noticing that, in this paper,
 statistical perfection refers to the ability to pass the whole
 {\it BigCrush} battery of tests, which is widely considered as the most
 stringent statistical evaluation of a sequence claimed as random.
-This battery can be found into the well-known TestU01 package.
+This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}.
 Chaos, for its part, refers to the well-established definition of a
 chaotic dynamical system proposed by Devaney~\cite{Devaney}.
@@ -118,7 +118,7 @@ In a previous work~\cite{bgw09:ip,guyeux10} we have proposed a post-treatment on
 as a chaotic dynamical system. Such a post-treatment leads
 to a new category of PRNGs. We have shown that proofs of Devaney's chaos can be
 established for this family, and that the sequence obtained after this post-treatment can pass the
-NIST, DieHARD, and TestU01 batteries of tests, even if the inputted generators
+NIST~\cite{Nist10}, DieHARD~\cite{Marsaglia1996}, and TestU01~\cite{LEcuyerS07} batteries of tests, even if the inputted generators
 cannot.
 The proposition of this paper is to improve widely the speed of the formerly
 proposed generator, without any lack of chaos or statistical properties.
@@ -160,47 +160,57 @@ summarized and intended future work is presented.
 \section{Related works on GPU based PRNGs}
 \label{section:related works}
-In the litterature many authors have work on defining GPU based PRNGs. We do not
-want to be exhaustive and we just give the most significant works from our point
-of view. When authors mention the number of random numbers generated per second
-we mention it. We consider that a million numbers per second corresponds to
-1MSample/s and than a billion numbers per second corresponds to 1GSample/s.
-
-In \cite{Pang:2008:cec}, the authors define a PRNG based on cellular automata
-which does not require high precision integer arithmetics nor bitwise
-operations. There is no mention of statistical tests nor proof that this PRNG is
-chaotic. Concerning the speed of generation, they can generate about
-3.2MSample/s on a GeForce 7800 GTX GPU (which is quite old now).
+
+Numerous research works on defining GPU based PRNGs have already been proposed in the
+literature, so that an exhaustive survey is out of reach.
+This is why the authors of this document only reference the most significant attempts
+in this domain, from their subjective point of view.
+The quantity of pseudorandom numbers generated per second is mentioned here
+only when this information is provided in the related work.
+A million numbers per second will simply be written as
+1MSample/s, whereas a billion numbers per second is written as 1GSample/s.
+
+In \cite{Pang:2008:cec}, a PRNG based on cellular automata is defined
+with no requirement for high precision integer arithmetic or bitwise
+operations. The authors can generate about
+3.2MSamples/s on a GeForce 7800 GTX GPU, which is quite an old card now.
+However, that work mentions neither statistical tests nor any proof of
+chaos or cryptographic security.

 In \cite{ZRKB10}, the authors propose different versions of efficient GPU PRNGs
-based on Lagged Fibonacci, Hybrid Taus or Hybrid Taus. They have used these
+based on Lagged Fibonacci or Hybrid Taus. They have used these
 PRNGs for Langevin simulations of biomolecules fully implemented on GPU.
 The performance of the GPU versions is far better than that obtained with a
-CPU and these PRNGs succeed to pass the {\it BigCrush} test of TestU01. There is
-no mention that their PRNGs have chaos mathematical properties.
+CPU, and these PRNGs succeed in passing the {\it BigCrush} battery of TestU01.
+However, the evaluations of the proposed PRNGs are only statistical ones.

 Authors of~\cite{conf/fpga/ThomasHL09} have studied the implementation of some
-PRNGs on diferrent computing architectures: CPU, field-programmable gate array
-(FPGA), GPU and massively parallel processor. This study is interesting because
-it shows the performance of the same PRNGs on different architeture. For
-example, the FPGA is globally the fastest architecture and it is also the
-efficient one because it provides the fastest number of generated random numbers
-per joule. Concerning the GPU, authors can generate betweend 11 and 16GSample/s
-with a GTX 280 GPU. The drawback of this work is that those PRNGs only succeed
-the {\it Crush} test which is easier than the {\it Big Crush} test.
-
-Cuda has developped a library for the generation of random numbers called
-Curand~\cite{curand11}. Several PRNGs are implemented:
-Xorwow~\cite{Marsaglia2003} and some variants of Sobol. Some tests report that
-the fastest version provides 15GSample/s on the new Fermi C2050 card. Their
-PRNGs fail to succeed the whole tests of TestU01 on only one test.
+PRNGs on different computing architectures: CPU, field-programmable gate array
+(FPGA), massively parallel processors, and GPU. This study is of interest, because
+the performance of the same PRNGs on different architectures is compared.
+The FPGA appears as the fastest and the most
+efficient architecture, providing the largest number of generated pseudorandom numbers
+per joule.
+However, we can notice that the authors can ``only'' generate between 11 and 16GSamples/s
+with a GTX 280 GPU, which should be compared with
+the results presented in this document.
+We can also remark that the PRNGs proposed in~\cite{conf/fpga/ThomasHL09} are only
+able to pass the {\it Crush} battery, which is far less stringent than the {\it Big Crush} one.
+
+Lastly, the CUDA toolkit includes a library for the generation of pseudorandom numbers,
+called Curand~\cite{curand11}. Several PRNGs are implemented, among
+others
+Xorwow~\cite{Marsaglia2003} and some variants of Sobol. The reported tests show that
+the fastest version provides 15GSamples/s on the new Fermi C2050 card.
+But these PRNGs cannot pass the whole TestU01 battery (only one of its tests is failed).
 \newline
 \newline
-To the best of our knowledge no GPU implementation have been proven to have chaotic properties.
+We can finally remark that, to the best of our knowledge, no GPU implementation has been proven to be chaotic, and the cryptographic security property is surprisingly never considered.

 \section{Basic Recalls}
 \label{section:BASIC RECALLS}
+
 This section is devoted to basic definitions and terminologies in the fields of
 topological chaos and chaotic iterations.
 \subsection{Devaney's Chaotic Dynamical Systems}
@@ -421,11 +431,12 @@ As $G_f$, defined on the domain   $\llbracket 1 ;  \mathsf{N} \rrbracket^{\mathd
 \rightarrow \mathds{B}^\mathsf{N}$, we can preserve the theoretical properties on $G_f$
 during implementations (due to the discrete nature of $f$). It is as if $\mathds{B}^\mathsf{N}$ represents the memory of the computer whereas $\llbracket 1 ; \mathsf{N}
-\rrbracket^{\mathds{N}}$ is its input stream (the seeds, for instance).
+\rrbracket^{\mathds{N}}$ is its input stream (the seeds, for instance, in PRNGs, or a physical noise in TRNGs).

-\section{Application to pseudorandomness}
+\section{Application to Pseudorandomness}
 \label{sec:pseudorandom}
-\subsection{A First pseudorandom Number Generator}
+
+\subsection{A First Pseudorandom Number Generator}

 We have proposed in~\cite{bgw09:ip} a new family of generators that receives
 two PRNGs as inputs. These two generators are mixed with chaotic iterations,
@@ -470,7 +481,7 @@ return $y$\;
 This generator is synthesized in Algorithm~\ref{CI Algorithm}.
-It takes as input: a function $f$;
+It takes as input: a Boolean function $f$ satisfying Theorem~\ref{Th:Caractérisation des IC chaotiques};
 an integer $b$, ensuring that the number of executed iterations is at least $b$ and at most $2b+1$; and an initial configuration $x^0$.
 It returns the new generated configuration $x$.  Internally, it embeds two
@@ -495,7 +506,7 @@ We have proven in \cite{bcgr11:ip} that,

 if and only if $M$ is a doubly stochastic matrix.
 \end{theorem}
-This former generator as successively passed various batteries of statistical tests, as the NIST tests~\cite{bcgr11:ip}.
+This former generator has successfully passed various batteries of statistical tests, such as the NIST~\cite{bcgr11:ip}, DieHARD~\cite{Marsaglia1996}, and TestU01~\cite{LEcuyerS07} ones.

 \subsection{Improving the Speed of the Former Generator}
@@ -796,17 +807,26 @@ have $d((S,E),(\tilde S,E))<\epsilon$.

 \section{Efficient PRNG based on Chaotic Iterations}
 \label{sec:efficient prng}

-In order to implement efficiently a PRNG based on chaotic iterations it is
-possible to improve previous works [ref]. One solution consists in considering
-that the strategy used contains all the bits for which the negation is
-achieved out. Then in order to apply the negation on these bits we can simply
-apply the xor operator between the current number and the strategy. In
-order to obtain the strategy we also use a classical PRNG.
+Based on the proof presented in the previous section, it is now possible to
+improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}.
+The first idea is to consider
+that the provided strategy is a pseudorandom Boolean vector obtained by a
+given PRNG.
+An iteration of the system is simply the bitwise exclusive or between
+the last computed state and the current strategy.
+The topological properties of disorder exhibited by chaotic
+iterations can thus be inherited by the inputted generator; we hope by doing so to
+obtain some statistical improvements while preserving speed.
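+More formally (this is only a restatement of the previous paragraph, using the
+notations of the example below), if $x$ denotes the last computed state and $S^i$
+the pseudorandom binary vector provided at iteration $i$ by the inputted generator
+(its bits equal to 1 mark the components to negate), then one iteration of the
+proposed generator simply computes
+$$x \leftarrow x \oplus S^i ,$$
+where $\oplus$ denotes the bitwise exclusive or.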
 
-Here is an example with 16-bits numbers showing how the bitwise operations
+
+Let us give an example using 16-bit numbers, to clearly understand how the bitwise xor operations
 are
-applied. Suppose that $x$ and the strategy $S^i$ are defined in binary mode.
-Then the following table shows the result of $x$ xor $S^i$.
+done.
+Suppose that $x$ and the strategy $S^i$ are given as
+binary vectors.
+Table~\ref{TableExemple} shows the result of $x \oplus S^i$.
+
+\begin{table}
 $$
 \begin{array}{|cc|cccccccccccccccc|}
 \hline
@@ -820,13 +840,13 @@ x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\
 \hline
 \end{array}
 $$
+\caption{Example of an arbitrary round of the proposed generator}
+\label{TableExemple}
+\end{table}
-
-
-\lstset{language=C,caption={C code of the sequential chaotic iterations based
-PRNG},label=algo:seqCIprng}
+\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIprng}
 \begin{lstlisting}
 unsigned int CIprng() {
   static unsigned int x = 123123123;
@@ -847,52 +867,60 @@ unsigned int CIprng() {
 
 
 
-In listing~\ref{algo:seqCIprng} a sequential version of our chaotic iterations
-based PRNG is presented. The xor operator is represented by \textasciicircum.
-This function uses three classical 64-bits PRNG: the \texttt{xorshift}, the
-\texttt{xor128} and the \texttt{xorwow}. In the following, we call them
-xor-like PRNGSs. These three PRNGs are presented in~\cite{Marsaglia2003}. As
-each xor-like PRNG used works with 64-bits and as our PRNG works with 32-bits,
-the use of \texttt{(unsigned int)} selects the 32 least significant bits whereas
-\texttt{(unsigned int)(t3$>>$32)} selects the 32 most significants bits of the
-variable \texttt{t}. So to produce a random number realizes 6 xor operations
-with 6 32-bits numbers produced by 3 64-bits PRNG. This version successes the
-BigCrush of the TestU01 battery~\cite{LEcuyerS07}.
+In Listing~\ref{algo:seqCIprng}, a sequential version of the proposed PRNG based on chaotic iterations
+is presented. The xor operator is represented by \textasciicircum.
+This function uses three classical 64-bit PRNGs, namely the \texttt{xorshift}, the
+\texttt{xor128}, and the \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them
+``xor-like PRNGs''.
+As
+each xor-like PRNG uses 64 bits whereas our proposed generator works with 32 bits,
+we use the cast \texttt{(unsigned int)}, which selects the 32 least significant bits of a given integer, and the expression
+\texttt{(unsigned int)(t3$>>$32)} in order to obtain the 32 most significant bits of \texttt{t3}.
+
+So producing a pseudorandom number needs 6 xor operations
+with 6 32-bit numbers that are provided by 3 64-bit PRNGs. This version successfully passes the
+stringent BigCrush battery of tests~\cite{LEcuyerS07}.
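+
+To make the role of these casts concrete, the following minimal and self-contained
+sketch (which is not the actual code of the generator) shows how one 64-bit value
+returned by a xor-like PRNG contributes two 32-bit words to the internal state.
+The function \texttt{xorlike64} below is a hypothetical stand-in for any of the three
+xor-like PRNGs.
+
+\lstset{language=C,caption={Illustrative sketch of the 64-bit to 32-bit splitting},label=algo:split64sketch}
+\begin{lstlisting}
+#include <stdio.h>
+
+/* Hypothetical stand-in for one of the three 64-bit xor-like PRNGs */
+unsigned long long xorlike64(void) {
+  static unsigned long long s = 88172645463325252ULL; /* arbitrary seed */
+  s ^= s << 13; s ^= s >> 7; s ^= s << 17;            /* xorshift round */
+  return s;
+}
+
+int main(void) {
+  unsigned int x = 123123123;          /* 32-bit internal state          */
+  unsigned long long t = xorlike64();  /* one 64-bit xor-like output     */
+  x ^= (unsigned int) t;               /* 32 least significant bits of t */
+  x ^= (unsigned int) (t >> 32);       /* 32 most significant bits of t  */
+  printf("%u\n", x);
+  return 0;
+}
+\end{lstlisting}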
 
-\section{Efficient PRNGs based on chaotic iterations on GPU}
+\section{Efficient PRNGs based on Chaotic Iterations on GPU}
 \label{sec:efficient prng gpu}
 
-In order to benefit from computing power of GPU, a program needs to define
-independent blocks of threads which can be computed simultaneously. In general,
-the larger the number of threads is, the more local memory is used and the less
-branching instructions are used (if, while, ...), the better performance is
-obtained on GPU. So with algorithm \ref{algo:seqCIprng} presented in the
-previous section, it is possible to build a similar program which computes PRNG
-on GPU. In the CUDA~\cite{Nvid10} environment, threads have a local
-identificator, called \texttt{ThreadIdx} relative to the block containing them.
-
-
-\subsection{Naive version for GPU}
-
-From the CPU version, it is possible to obtain a quite similar version for GPU.
-The principe consists in assigning the computation of a PRNG as in sequential to
-each thread of the GPU. Of course, it is essential that the three xor-like
-PRNGs used for our computation have different parameters. So we chose them
-randomly with another PRNG. As the initialisation is performed by the CPU, we
-have chosen to use the ISAAC PRNG~\cite{Jenkins96} to initalize all the
-parameters for the GPU version of our PRNG. The implementation of the three
-xor-like PRNGs is straightforward as soon as their parameters have been
-allocated in the GPU memory. Each xor-like PRNGs used works with an internal
-number $x$ which keeps the last generated random numbers. Other internal
-variables are also used by the xor-like PRNGs. More precisely, the
-implementation of the xor128, the xorshift and the xorwow respectively require
-4, 5 and 6 unsigned long as internal variables.
+In order to benefit from the computing power of GPUs, a program needs to have
+independent blocks of threads that can be computed simultaneously. In general,
+the larger the number of threads is, the more local memory is used, and the fewer
+branching instructions are used (if, while, ...), the better the performance on GPU is.
+Obviously, having these requirements in mind, it is possible to build a program similar to
+the one presented in Listing~\ref{algo:seqCIprng}, which computes pseudorandom numbers
+on GPU.
+To do so, we must first recall that in
+the CUDA~\cite{Nvid10} environment, threads have a local
+identifier called \texttt{ThreadIdx}, which is relative to the block containing them.
+
+
+\subsection{Naive Version for GPU}
+
+
+It is possible to deduce from the CPU version a quite similar version adapted to GPU.
+The principle is simply to make each thread of the GPU compute the CPU version of our PRNG.
+Of course, the three xor-like
+PRNGs used in these computations must have different parameters.
+In a given thread, these parameters are
+randomly picked using another PRNG.
+The initialization stage is performed by the CPU.
+To do so, the ISAAC PRNG~\cite{Jenkins96} is used to set all the
+parameters embedded into each thread.
+
+The implementation of the three
+xor-like PRNGs is straightforward when their parameters have been
+allocated in the GPU memory. Each xor-like PRNG works with an internal
+number $x$ that saves the last generated pseudorandom number. Additionally, the
+implementations of the xor128, the xorshift, and the xorwow respectively require
+4, 5, and 6 unsigned longs as internal variables.
 
 \begin{algorithm}
 \KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like
 PRNGs in global memory\;
-NumThreads: Number of threads\;}
+NumThreads: number of threads\;}
 \KwOut{NewNb: array containing random numbers in global memory}
 \If{threadIdx is concerned by the computation} {
 retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
@@ -903,37 +931,34 @@ NumThreads: Number of threads\;}
 store internal variables in InternalVarXorLikeArray[threadIdx]\;
 }
-\caption{main kernel for the chaotic iterations based PRNG GPU naive version}
+\caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
 \label{algo:gpu_kernel}
 \end{algorithm}
 
-Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of PRNG using
-GPU. According to the available memory in the GPU and the number of threads
+Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on
+GPU. Due to the available memory in the GPU and the number of threads
 used simultaneously, the number of random numbers that a thread can generate
-inside a kernel is limited, i.e. the variable \texttt{n} in
-algorithm~\ref{algo:gpu_kernel}. For example, if $100,000$ threads are used and
-if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)}
-then the memory required to store internals variables of xor-like
+inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in
+Algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and
+if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bit number)},
+then the memory required to store all of the internal variables of both the xor-like
 PRNGs\footnote{we multiply this number by $2$ in order to count 32-bit numbers}
-and random number of our PRNG is equals to $100,000\times ((4+5+6)\times
-2+(1+100))=1,310,000$ 32-bits numbers, i.e. about $52$Mb.
+and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times
+2+(1+100))=13,100,000$ 32-bit numbers (each stored on 4 bytes), that is, approximately $52$MB.
 
-All the tests performed to pass the BigCrush of TestU01 succeeded. Different
-number of threads, called \texttt{NumThreads} in our algorithm, have been tested
-upto $10$ millions.
-\newline
-\newline
-{\bf QUESTION : on laisse cette remarque, je suis mitigé !!!}
+This generator is able to pass the whole BigCrush battery of tests, for all
+the versions that have been tested, whatever their number of threads
+(called \texttt{NumThreads} in our algorithm, tested up to $10$ million).
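+
+To fix ideas, here is a minimal CUDA sketch of what such a kernel could look like.
+It is only an illustration, not the code that was benchmarked: for brevity, a single
+64-bit xorshift (the hypothetical device function \texttt{xorshift64} below) stands
+in for the three xor-like PRNGs of Algorithm~\ref{algo:gpu_kernel}, and the
+initialization of the states by ISAAC on the CPU side is not shown.
+
+\lstset{language=C,caption={Illustrative sketch of a naive GPU kernel},label=algo:gpu_kernel_sketch}
+\begin{lstlisting}
+__device__ unsigned long long xorshift64(unsigned long long *s) {
+  unsigned long long t = *s;               /* internal 64-bit state */
+  t ^= t << 13; t ^= t >> 7; t ^= t << 17; /* one xorshift round    */
+  *s = t;
+  return t;
+}
+
+__global__ void naive_ci_prng(unsigned long long *xorlikeState,
+                              unsigned int *ciState, unsigned int *newNb,
+                              int n, int numThreads) {
+  int id = blockIdx.x * blockDim.x + threadIdx.x;
+  if (id < numThreads) {
+    unsigned long long s = xorlikeState[id]; /* retrieve internal variables   */
+    unsigned int x = ciState[id];            /* last chaotic iterations state */
+    for (int i = 0; i < n; i++) {
+      unsigned long long t = xorshift64(&s); /* one 64-bit xor-like number    */
+      x ^= (unsigned int) t;                 /* xor with its 32 LSBs          */
+      x ^= (unsigned int) (t >> 32);         /* xor with its 32 MSBs          */
+      newNb[id * n + i] = x;                 /* store the produced number     */
+    }
+    xorlikeState[id] = s;                    /* save internal variables back  */
+    ciState[id] = x;
+  }
+}
+\end{lstlisting}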
 
 \begin{remark}
-Algorithm~\ref{algo:gpu_kernel} has the advantage to manipulate independent
-PRNGs, so this version is easily usable on a cluster of computer. The only thing
-to ensure is to use a single ISAAC PRNG. For this, a simple solution consists in
-using a master node for the initialization which computes the initial parameters
+The proposed algorithm has the advantage of manipulating independent
+PRNGs, so this version is easily adaptable to a cluster of computers too. The only thing
+to ensure is to use a single ISAAC PRNG. To achieve this requirement, a simple solution consists in
+using a master node for the initialization. This master node computes the initial parameters
 for all the different nodes involved in the computation.
 \end{remark}
 
-\subsection{Improved version for GPU}
+\subsection{Improved Version for GPU}
 
 As GPU cards using CUDA have shared memory between threads of the same block,
 it is possible to use this feature in order to simplify the previous algorithm,
@@ -1706,7 +1731,7 @@ proving that $H$ is not secure, a contradiction.
 
 
 
-\section{A cryptographically secure prng for GPU}
+\section{A Cryptographically Secure PRNG for GPU}
 \label{sec:CSGPU}
 
 It is possible to build a cryptographically secure prng based on the previous
 algorithm (algorithm~\ref{algo:gpu_kernel2}). It simply consists in replacing