\usepackage{amscd}
\usepackage{moreverb}
\usepackage{commath}
-\usepackage{algorithm2e}
+\usepackage[ruled,vlined]{algorithm2e}
\usepackage{listings}
\usepackage[standard]{ntheorem}
-\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIPRNG}
+
+\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iteration\
+s},label=algo:seqCIPRNG}
\begin{lstlisting}
unsigned int CIPRNG() {
static unsigned int x = 123123123;
-
In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based
on chaotic iterations is presented. The xor operator is represented by
\textasciicircum. This function uses three classical 64-bits PRNGs, namely the
o2 = threadIdx-offset+array\_comb2[offset]\;
\For{i=1 to n} {
t=xor-like()\;
- t=t $\wedge$ shmem[o1] $\wedge$ shmem[o2]\;
+ t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
shared\_mem[threadId]=t\;
- x = x $\wedge$ t\;
+ x = x\textasciicircum t\;
store the new PRNG in NewNb[NumThreads*threadId+i]\;
}
store internal variables in InternalVarXorLikeArray[threadId]\;
}
-\caption{main kernel for the chaotic iterations based PRNG GPU efficient
-version}
-\label{algo:gpu_kernel2}
+\caption{Main kernel for the chaotic iterations based PRNG GPU efficient
+version\label{IR}}
+\label{algo:gpu_kernel2}
\end{algorithm}
\subsection{Theoretical Evaluation of the Improved Version}
The modulus operation is the most time consuming operation for current
GPU cards. So in order to obtain quite reasonable performances, it is
-required to use only modulus on 32 bits integer numbers. Consequently
+required to use only modulus on 32-bits integer numbers. Consequently
$x_n^2$ need to be lesser than $2^{32}$, and thus the number $M$ must be
lesser than $2^{16}$. So in practice we can choose prime numbers around
-256 that are congruent to 3 modulus 4. With 32 bits numbers, only the
+256 that are congruent to 3 modulus 4. With 32-bits numbers, only the
4 least significant bits of $x_n$ can be chosen (the maximum number of
indistinguishable bits is lesser than or equals to
-$log_2(log_2(M))$). In other words, to generate a 32 bits number, we need to use
+$log_2(log_2(M))$). In other words, to generate a 32-bits number, we need to use
8 times the BBS algorithm with possibly different combinations of $M$. This
approach is not sufficient to be able to pass all the TestU01,
as small values of $M$ for the BBS lead to
gives the last 3 bits, providing so a number between 0 and 7.
\item
Secondly, after the generation of the 8 BBS numbers for each thread, we
-have a 32 bits number whose period is possibly quite small. So
+have a 32-bits number whose period is possibly quite small. So
to add randomness, we generate 4 more BBS numbers to
-shift the 32 bits numbers, and add up to 6 new bits. This improvement is
+shift the 32-bits numbers, and add up to 6 new bits. This improvement is
described in Algorithm~\ref{algo:bbs_gpu}. In practice, the last 2 bits
of the first new BBS number are used to make a left shift of at most
3 bits. The last 3 bits of the second new BBS number are add to the
in global memory\;
NumThreads: Number of threads\;
array\_comb: 2D Arrays containing 16 combinations (in first dimension) of size combination\_size (in second dimension)\;
-
+array\_shift[4]=\{0,1,3,7\}\;
}
\KwOut{NewNb: array containing random numbers in global memory}
o1 = threadIdx-offset+array\_comb[bbs1\&7][offset]\;
o2 = threadIdx-offset+array\_comb[8+bbs2\&7][offset]\;
\For{i=1 to n} {
- t<<=4\;
+ t$<<$=4\;
t|=BBS1(bbs1)\&15\;
...\;
- t<<=4\;
+ t$<<$=4\;
t|=BBS8(bbs8)\&15\;
- //two new shifts\;
- t<<=BBS3(bbs3)\&3\;
- t|=BBS1(bbs1)\&7\;
- t<<=BBS7(bbs7)\&3\;
- t|=BBS2(bbs2)\&7\;
- t=t \^{ } shmem[o1] \^{ } shmem[o2]\;
+ \tcp{two new shifts}
+ shift=BBS3(bbs3)\&3\;
+ t$<<$=shift\;
+ t|=BBS1(bbs1)\&array\_shift[shift]\;
+ shift=BBS7(bbs7)\&3\;
+ t$<<$=shift\;
+ t|=BBS2(bbs2)\&array\_shift[shift]\;
+ t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
shared\_mem[threadId]=t\;
- x = x \^{ } t\;
+ x = x\textasciicircum t\;
store the new PRNG in NewNb[NumThreads*threadId+i]\;
}
\label{algo:bbs_gpu}
\end{algorithm}
-In Algorithm~\ref{algo:bbs_gpu}, $n$ is for the quantity
-of random numbers that a thread has to generate.
-The operation t<<=4 performs a left shift of 4 bits
-on the variable $t$ and stores the result in $t$, and
-$BBS1(bbs1)\&15$ selects
-the last four bits of the result of $BBS1$.
-Thus an operation of the form $t<<=4; t|=BBS1(bbs1)\&15\;$
-realizes in $t$ a left shift of 4 bits, and then puts
-the 4 last bits of $BBS1(bbs1)$ in the four last
-positions of $t$.
-Let us remark that to initialize $t$ is not a necessity as we
-fill it 4 bits by 4 bits, until having obtained 32 bits.
-The two last new shifts are realized in order to enlarge
-the small periods of the BBS used here, to introduce a kind of variability.
-In these operations, we make twice a left shift of $t$ of \emph{at most}
-3 bits and we put \emph{exactly} the 3 last bits from a BBS into
-the 3 last bits of $t$, leading possibly to a loss of a few
-bits of $t$.
+In Algorithm~\ref{algo:bbs_gpu}, $n$ is for the quantity of random numbers that
+a thread has to generate. The operation t<<=4 performs a left shift of 4 bits
+on the variable $t$ and stores the result in $t$, and $BBS1(bbs1)\&15$ selects
+the last four bits of the result of $BBS1$. Thus an operation of the form
+$t<<=4; t|=BBS1(bbs1)\&15\;$ realizes in $t$ a left shift of 4 bits, and then
+puts the 4 last bits of $BBS1(bbs1)$ in the four last positions of $t$. Let us
+remark that to initialize $t$ is not a necessity as we fill it 4 bits by 4 bits,
+until having obtained 32-bits. The two last new shifts are realized in order to
+enlarge the small periods of the BBS used here, to introduce a kind of
+variability. In these operations, we make twice a left shift of $t$ of \emph{at
+ most} 3 bits, represented by \texttt{shift} in the algorithm, and we put
+\emph{exactly} the \texttt{shift} last bits from a BBS into the \texttt{shift}
+last bits of $t$.
It should be noticed that this generator has another time the form $x^{n+1} = x^n \oplus S^n$,
where $S^n$ is referred in this algorithm as $t$: each iteration of this