X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/prng_gpu.git/blobdiff_plain/228143ea0da3dc28561a873ee227024d21bb1e57..e443cbfb0433e0a7e1551c9841b0173698151f96:/prng_gpu.tex diff --git a/prng_gpu.tex b/prng_gpu.tex index 84efafa..dc3d3cb 100644 --- a/prng_gpu.tex +++ b/prng_gpu.tex @@ -433,10 +433,10 @@ during implementations (due to the discrete nature of $f$). It is as if $\mathds{B}^\mathsf{N}$ represents the memory of the computer whereas $\llbracket 1 ; \mathsf{N} \rrbracket^{\mathds{N}}$ is its input stream (the seeds, for instance, in PRNG, or a physical noise in TRNG). -\section{Application to pseudorandomness} +\section{Application to Pseudorandomness} \label{sec:pseudorandom} -\subsection{A First pseudorandom Number Generator} +\subsection{A First Pseudorandom Number Generator} We have proposed in~\cite{bgw09:ip} a new family of generators that receives two PRNGs as inputs. These two generators are mixed with chaotic iterations, @@ -807,17 +807,26 @@ have $d((S,E),(\tilde S,E))<\epsilon$. \section{Efficient PRNG based on Chaotic Iterations} \label{sec:efficient prng} -In order to implement efficiently a PRNG based on chaotic iterations it is -possible to improve previous works [ref]. One solution consists in considering -that the strategy used contains all the bits for which the negation is -achieved out. Then in order to apply the negation on these bits we can simply -apply the xor operator between the current number and the strategy. In -order to obtain the strategy we also use a classical PRNG. +Based on the proof presented in the previous section, it is now possible to +improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}. +The first idea is to consider +that the provided strategy is a pseudorandom Boolean vector obtained by a +given PRNG. +An iteration of the system is simply the bitwise exclusive or between +the last computed state and the current strategy. +Topological properties of disorder exhibited by chaotic +iterations can be inherited by the inputted generator, hoping by doing so to +obtain some statistical improvements while preserving speed. -Here is an example with 16-bits numbers showing how the bitwise operations + +Let us give an example using 16-bits numbers, to clearly understand how the bitwise xor operations are -applied. Suppose that $x$ and the strategy $S^i$ are defined in binary mode. -Then the following table shows the result of $x$ xor $S^i$. +done. +Suppose that $x$ and the strategy $S^i$ are given as +binary vectors. +Table~\ref{TableExemple} shows the result of $x \oplus S^i$. + +\begin{table} $$ \begin{array}{|cc|cccccccccccccccc|} \hline @@ -831,13 +840,13 @@ x \oplus S^i&=&1&1&0&1&1&1&0&0&0&1&1&1&0&1&0&1\\ \hline \end{array} $$ +\caption{Example of an arbitrary round of the proposed generator} +\label{TableExemple} +\end{table} - - -\lstset{language=C,caption={C code of the sequential chaotic iterations based -PRNG},label=algo:seqCIprng} +\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIprng} \begin{lstlisting} unsigned int CIprng() { static unsigned int x = 123123123; @@ -858,52 +867,60 @@ unsigned int CIprng() { -In listing~\ref{algo:seqCIprng} a sequential version of our chaotic iterations -based PRNG is presented. The xor operator is represented by \textasciicircum. -This function uses three classical 64-bits PRNG: the \texttt{xorshift}, the -\texttt{xor128} and the \texttt{xorwow}. In the following, we call them -xor-like PRNGSs. These three PRNGs are presented in~\cite{Marsaglia2003}. As -each xor-like PRNG used works with 64-bits and as our PRNG works with 32-bits, -the use of \texttt{(unsigned int)} selects the 32 least significant bits whereas -\texttt{(unsigned int)(t3$>>$32)} selects the 32 most significants bits of the -variable \texttt{t}. So to produce a random number realizes 6 xor operations -with 6 32-bits numbers produced by 3 64-bits PRNG. This version successes the -BigCrush of the TestU01 battery~\cite{LEcuyerS07}. +In Listing~\ref{algo:seqCIprng} a sequential version of the proposed PRNG based on chaotic iterations + is presented. The xor operator is represented by \textasciicircum. +This function uses three classical 64-bits PRNGs, namely the \texttt{xorshift}, the +\texttt{xor128}, and the \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them +``xor-like PRNGs''. +As +each xor-like PRNG uses 64-bits whereas our proposed generator works with 32-bits, +we use the command \texttt{(unsigned int)}, that selects the 32 least significant bits of a given integer, and the code +\texttt{(unsigned int)(t3$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}. -\section{Efficient PRNGs based on chaotic iterations on GPU} +So producing a pseudorandom number needs 6 xor operations +with 6 32-bits numbers that are provided by 3 64-bits PRNGs. This version successfully passes the +stringent BigCrush battery of tests~\cite{LEcuyerS07}. + +\section{Efficient PRNGs based on Chaotic Iterations on GPU} \label{sec:efficient prng gpu} -In order to benefit from computing power of GPU, a program needs to define -independent blocks of threads which can be computed simultaneously. In general, -the larger the number of threads is, the more local memory is used and the less -branching instructions are used (if, while, ...), the better performance is -obtained on GPU. So with algorithm \ref{algo:seqCIprng} presented in the -previous section, it is possible to build a similar program which computes PRNG -on GPU. In the CUDA~\cite{Nvid10} environment, threads have a local -identificator, called \texttt{ThreadIdx} relative to the block containing them. - - -\subsection{Naive version for GPU} - -From the CPU version, it is possible to obtain a quite similar version for GPU. -The principe consists in assigning the computation of a PRNG as in sequential to -each thread of the GPU. Of course, it is essential that the three xor-like -PRNGs used for our computation have different parameters. So we chose them -randomly with another PRNG. As the initialisation is performed by the CPU, we -have chosen to use the ISAAC PRNG~\cite{Jenkins96} to initalize all the -parameters for the GPU version of our PRNG. The implementation of the three -xor-like PRNGs is straightforward as soon as their parameters have been -allocated in the GPU memory. Each xor-like PRNGs used works with an internal -number $x$ which keeps the last generated random numbers. Other internal -variables are also used by the xor-like PRNGs. More precisely, the -implementation of the xor128, the xorshift and the xorwow respectively require -4, 5 and 6 unsigned long as internal variables. +In order to take benefits from the computing power of GPU, a program needs to have +independent blocks of threads that can be computed simultaneously. In general, +the larger the number of threads is, the more local memory is used, and the less +branching instructions are used (if, while, ...), the better the performances on GPU is. +Obviously, having these requirements in mind, it is possible to build a program similar to +the one presented in Algorithm \ref{algo:seqCIprng}, which computes pseudorandom numbers +on GPU. +To do so, we must firstly recall that in + the CUDA~\cite{Nvid10} environment, threads have a local +identifier called \texttt{ThreadIdx}, which is relative to the block containing them. + + +\subsection{Naive Version for GPU} + + +It is possible to deduce from the CPU version a quite similar version adapted to GPU. +The simple principle consists to make each thread of the GPU computing the CPU version of our PRNG. +Of course, the three xor-like +PRNGs used in these computations must have different parameters. +In a given thread, these lasts are +randomly picked from another PRNGs. +The initialization stage is performed by the CPU. +To do it, the ISAAC PRNG~\cite{Jenkins96} is used to set all the +parameters embedded into each thread. + +The implementation of the three +xor-like PRNGs is straightforward when their parameters have been +allocated in the GPU memory. Each xor-like works with an internal +number $x$ that saves the last generated pseudorandom number. Additionally, the +implementation of the xor128, the xorshift, and the xorwow respectively require +4, 5, and 6 unsigned long as internal variables. \begin{algorithm} \KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like PRNGs in global memory\; -NumThreads: Number of threads\;} +NumThreads: number of threads\;} \KwOut{NewNb: array containing random numbers in global memory} \If{threadIdx is concerned by the computation} { retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\; @@ -914,37 +931,34 @@ NumThreads: Number of threads\;} store internal variables in InternalVarXorLikeArray[threadIdx]\; } -\caption{main kernel for the chaotic iterations based PRNG GPU naive version} +\caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations} \label{algo:gpu_kernel} \end{algorithm} -Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of PRNG using -GPU. According to the available memory in the GPU and the number of threads +Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on +GPU. Due to the available memory in the GPU and the number of threads used simultenaously, the number of random numbers that a thread can generate -inside a kernel is limited, i.e. the variable \texttt{n} in -algorithm~\ref{algo:gpu_kernel}. For example, if $100,000$ threads are used and -if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)} -then the memory required to store internals variables of xor-like +inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in +algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and +if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)}, +then the memory required to store all of the internals variables of both the xor-like PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers} -and random number of our PRNG is equals to $100,000\times ((4+5+6)\times -2+(1+100))=1,310,000$ 32-bits numbers, i.e. about $52$Mb. +and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times +2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb. -All the tests performed to pass the BigCrush of TestU01 succeeded. Different -number of threads, called \texttt{NumThreads} in our algorithm, have been tested -upto $10$ millions. -\newline -\newline -{\bf QUESTION : on laisse cette remarque, je suis mitigé !!!} +This generator is able to pass the whole BigCrush battery of tests, for all +the versions that have been tested depending on their number of threads +(called \texttt{NumThreads} in our algorithm, tested until $10$ millions). \begin{remark} -Algorithm~\ref{algo:gpu_kernel} has the advantage to manipulate independent -PRNGs, so this version is easily usable on a cluster of computer. The only thing -to ensure is to use a single ISAAC PRNG. For this, a simple solution consists in -using a master node for the initialization which computes the initial parameters +The proposed algorithm has the advantage to manipulate independent +PRNGs, so this version is easily adaptable on a cluster of computers too. The only thing +to ensure is to use a single ISAAC PRNG. To achieve this requirement, a simple solution consists in +using a master node for the initialization. This master node computes the initial parameters for all the differents nodes involves in the computation. \end{remark} -\subsection{Improved version for GPU} +\subsection{Improved Version for GPU} As GPU cards using CUDA have shared memory between threads of the same block, it is possible to use this feature in order to simplify the previous algorithm, @@ -1717,7 +1731,7 @@ proving that $H$ is not secure, a contradiction. -\section{A cryptographically secure prng for GPU} +\section{A Cryptographically Secure PRNG for GPU} \label{sec:CSGPU} It is possible to build a cryptographically secure prng based on the previous algorithm (algorithm~\ref{algo:gpu_kernel2}). It simply consists in replacing