X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/prng_gpu.git/blobdiff_plain/ec3fcbcdbf09ed6be33553cbb41c938cf74d3a74..136b4f47b57133edcd5627342c6c721c861d507e:/prng_gpu.tex?ds=sidebyside diff --git a/prng_gpu.tex b/prng_gpu.tex index f21f19e..4feac7c 100644 --- a/prng_gpu.tex +++ b/prng_gpu.tex @@ -7,7 +7,7 @@ \usepackage{amscd} \usepackage{moreverb} \usepackage{commath} -\usepackage{algorithm2e} +\usepackage[ruled,vlined]{algorithm2e} \usepackage{listings} \usepackage[standard]{ntheorem} @@ -856,7 +856,9 @@ $$ -\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIPRNG} + +\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iteration\ +s},label=algo:seqCIPRNG} \begin{lstlisting} unsigned int CIPRNG() { static unsigned int x = 123123123; @@ -876,19 +878,18 @@ unsigned int CIPRNG() { +In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based +on chaotic iterations is presented. The xor operator is represented by +\textasciicircum. This function uses three classical 64-bits PRNGs, namely the +\texttt{xorshift}, the \texttt{xor128}, and the +\texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them ``xor-like +PRNGs''. As each xor-like PRNG uses 64-bits whereas our proposed generator +works with 32-bits, we use the command \texttt{(unsigned int)}, that selects the +32 least significant bits of a given integer, and the code \texttt{(unsigned + int)(t$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}. -In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based on chaotic iterations - is presented. The xor operator is represented by \textasciicircum. -This function uses three classical 64-bits PRNGs, namely the \texttt{xorshift}, the -\texttt{xor128}, and the \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them -``xor-like PRNGs''. -As -each xor-like PRNG uses 64-bits whereas our proposed generator works with 32-bits, -we use the command \texttt{(unsigned int)}, that selects the 32 least significant bits of a given integer, and the code -\texttt{(unsigned int)(t3$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}. - -So producing a pseudorandom number needs 6 xor operations -with 6 32-bits numbers that are provided by 3 64-bits PRNGs. This version successfully passes the +So producing a pseudorandom number needs 6 xor operations with 6 32-bits numbers +that are provided by 3 64-bits PRNGs. This version successfully passes the stringent BigCrush battery of tests~\cite{LEcuyerS07}. \section{Efficient PRNGs based on Chaotic Iterations on GPU} @@ -982,12 +983,14 @@ thread uses the result of which other one, we can use a combination array that contains the indexes of all threads and for which a combination has been performed. -In Algorithm~\ref{algo:gpu_kernel2}, two combination arrays are used. -The variable \texttt{offset} is computed using the value of +In Algorithm~\ref{algo:gpu_kernel2}, two combination arrays are used. The +variable \texttt{offset} is computed using the value of \texttt{combination\_size}. Then we can compute \texttt{o1} and \texttt{o2} -representing the indexes of the other threads whose results are used -by the current one. In this algorithm, we consider that a 64-bits xor-like -PRNG has been chosen, and so its two 32-bits parts are used. +representing the indexes of the other threads whose results are used by the +current one. In this algorithm, we consider that a 32-bits xor-like PRNG has +been chosen. In practice, we use the xor128 proposed in~\cite{Marsaglia2003} in +which unsigned longs (64 bits) have been replaced by unsigned integers (32 +bits). This version also can pass the whole {\it BigCrush} battery of tests. @@ -996,28 +999,28 @@ This version also can pass the whole {\it BigCrush} battery of tests. \KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNGs in global memory\; NumThreads: Number of threads\; -tab1, tab2: Arrays containing combinations of size combination\_size\;} +array\_comb1, array\_comb2: Arrays containing combinations of size combination\_size\;} \KwOut{NewNb: array containing random numbers in global memory} \If{threadId is concerned} { retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory and x\; offset = threadIdx\%combination\_size\; - o1 = threadIdx-offset+tab1[offset]\; - o2 = threadIdx-offset+tab2[offset]\; + o1 = threadIdx-offset+array\_comb1[offset]\; + o2 = threadIdx-offset+array\_comb2[offset]\; \For{i=1 to n} { t=xor-like()\; - t=t $\wedge$ shmem[o1] $\wedge$ shmem[o2]\; + t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\; shared\_mem[threadId]=t\; - x = x $\wedge$ t\; + x = x\textasciicircum t\; store the new PRNG in NewNb[NumThreads*threadId+i]\; } store internal variables in InternalVarXorLikeArray[threadId]\; } -\caption{main kernel for the chaotic iterations based PRNG GPU efficient -version} -\label{algo:gpu_kernel2} +\caption{Main kernel for the chaotic iterations based PRNG GPU efficient +version\label{IR}} +\label{algo:gpu_kernel2} \end{algorithm} \subsection{Theoretical Evaluation of the Improved Version} @@ -1246,13 +1249,13 @@ very slow and only usable for cryptographic applications. The modulus operation is the most time consuming operation for current GPU cards. So in order to obtain quite reasonable performances, it is -required to use only modulus on 32 bits integer numbers. Consequently +required to use only modulus on 32-bits integer numbers. Consequently $x_n^2$ need to be lesser than $2^{32}$, and thus the number $M$ must be lesser than $2^{16}$. So in practice we can choose prime numbers around -256 that are congruent to 3 modulus 4. With 32 bits numbers, only the +256 that are congruent to 3 modulus 4. With 32-bits numbers, only the 4 least significant bits of $x_n$ can be chosen (the maximum number of indistinguishable bits is lesser than or equals to -$log_2(log_2(M))$). In other words, to generate a 32 bits number, we need to use +$log_2(log_2(M))$). In other words, to generate a 32-bits number, we need to use 8 times the BBS algorithm with possibly different combinations of $M$. This approach is not sufficient to be able to pass all the TestU01, as small values of $M$ for the BBS lead to @@ -1271,9 +1274,9 @@ character \& is for the bitwise AND. Thus using \&7 with a number gives the last 3 bits, providing so a number between 0 and 7. \item Secondly, after the generation of the 8 BBS numbers for each thread, we -have a 32 bits number whose period is possibly quite small. So +have a 32-bits number whose period is possibly quite small. So to add randomness, we generate 4 more BBS numbers to -shift the 32 bits numbers, and add up to 6 new bits. This improvement is +shift the 32-bits numbers, and add up to 6 new bits. This improvement is described in Algorithm~\ref{algo:bbs_gpu}. In practice, the last 2 bits of the first new BBS number are used to make a left shift of at most 3 bits. The last 3 bits of the second new BBS number are add to the @@ -1293,29 +1296,33 @@ variable for BBS number 8 is stored in place 1. \KwIn{InternalVarBBSArray: array with internal variables of the 8 BBS in global memory\; NumThreads: Number of threads\; -tab: 2D Arrays containing 16 combinations (in first dimension) of size combination\_size (in second dimension)\;} +array\_comb: 2D Arrays containing 16 combinations (in first dimension) of size combination\_size (in second dimension)\; +array\_shift[4]=\{0,1,3,7\}\; +} \KwOut{NewNb: array containing random numbers in global memory} \If{threadId is concerned} { retrieve data from InternalVarBBSArray[threadId] in local variables including shared memory and x\; we consider that bbs1 ... bbs8 represent the internal states of the 8 BBS numbers\; offset = threadIdx\%combination\_size\; - o1 = threadIdx-offset+tab[bbs1\&7][offset]\; - o2 = threadIdx-offset+tab[8+bbs2\&7][offset]\; + o1 = threadIdx-offset+array\_comb[bbs1\&7][offset]\; + o2 = threadIdx-offset+array\_comb[8+bbs2\&7][offset]\; \For{i=1 to n} { - t<<=4\; + t$<<$=4\; t|=BBS1(bbs1)\&15\; ...\; - t<<=4\; + t$<<$=4\; t|=BBS8(bbs8)\&15\; - //two new shifts\; - t<<=BBS3(bbs3)\&3\; - t|=BBS1(bbs1)\&7\; - t<<=BBS7(bbs7)\&3\; - t|=BBS2(bbs2)\&7\; - t=t \^{ } shmem[o1] \^{ } shmem[o2]\; + \tcp{two new shifts} + shift=BBS3(bbs3)\&3\; + t$<<$=shift\; + t|=BBS1(bbs1)\&array\_shift[shift]\; + shift=BBS7(bbs7)\&3\; + t$<<$=shift\; + t|=BBS2(bbs2)\&array\_shift[shift]\; + t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\; shared\_mem[threadId]=t\; - x = x \^{ } t\; + x = x\textasciicircum t\; store the new PRNG in NewNb[NumThreads*threadId+i]\; } @@ -1326,24 +1333,19 @@ tab: 2D Arrays containing 16 combinations (in first dimension) of size combinat \label{algo:bbs_gpu} \end{algorithm} -In Algorithm~\ref{algo:bbs_gpu}, $n$ is for the quantity -of random numbers that a thread has to generate. -The operation t<<=4 performs a left shift of 4 bits -on the variable $t$ and stores the result in $t$, and -$BBS1(bbs1)\&15$ selects -the last four bits of the result of $BBS1$. -Thus an operation of the form $t<<=4; t|=BBS1(bbs1)\&15\;$ -realizes in $t$ a left shift of 4 bits, and then puts -the 4 last bits of $BBS1(bbs1)$ in the four last -positions of $t$. -Let us remark that to initialize $t$ is not a necessity as we -fill it 4 bits by 4 bits, until having obtained 32 bits. -The two last new shifts are realized in order to enlarge -the small periods of the BBS used here, to introduce a kind of variability. -In these operations, we make twice a left shift of $t$ of \emph{at most} -3 bits and we put \emph{exactly} the 3 last bits from a BBS into -the 3 last bits of $t$, leading possibly to a loss of a few -bits of $t$. +In Algorithm~\ref{algo:bbs_gpu}, $n$ is for the quantity of random numbers that +a thread has to generate. The operation t<<=4 performs a left shift of 4 bits +on the variable $t$ and stores the result in $t$, and $BBS1(bbs1)\&15$ selects +the last four bits of the result of $BBS1$. Thus an operation of the form +$t<<=4; t|=BBS1(bbs1)\&15\;$ realizes in $t$ a left shift of 4 bits, and then +puts the 4 last bits of $BBS1(bbs1)$ in the four last positions of $t$. Let us +remark that to initialize $t$ is not a necessity as we fill it 4 bits by 4 bits, +until having obtained 32-bits. The two last new shifts are realized in order to +enlarge the small periods of the BBS used here, to introduce a kind of +variability. In these operations, we make twice a left shift of $t$ of \emph{at + most} 3 bits, represented by \texttt{shift} in the algorithm, and we put +\emph{exactly} the \texttt{shift} last bits from a BBS into the \texttt{shift} +last bits of $t$. It should be noticed that this generator has another time the form $x^{n+1} = x^n \oplus S^n$, where $S^n$ is referred in this algorithm as $t$: each iteration of this