+The modulus operation is the most time consuming operation for current
+GPU cards. So in order to obtain quite reasonable performances, it is
+required to use only modulus on 32 bits integer numbers. Consequently
+$x_n^2$ need to be less than $2^{32}$ and the number $M$ need to be
+less than $2^{16}$. So in practice we can choose prime numbers around
+256 that are congruent to 3 modulus 4. With 32 bits numbers, only the
+4 least significant bits of $x_n$ can be chosen (the maximum number of
+indistinguishable bits is lesser than or equals to
+$log_2(log_2(x_n))$). So to generate a 32 bits number, we need to use
+8 times the BBS algorithm with different combinations of $M$. This
+approach is not sufficient to pass all the tests of TestU01 because
+the fact of having chosen small values of $M$ for the BBS leads to
+have a small period. So, in order to add randomness we proceed with
+the followings modifications.
+\begin{itemize}
+\item
+First we define 16 arrangement arrays instead of 2 (as described in
+algorithm \ref{algo:gpu_kernel2}) but only 2 are used at each call of
+the PRNG kernels. In practice, the selection of which combinations
+arrays will be used is different for all the threads and is determined
+by using the three last bits of two internal variables used by BBS.
+This approach adds more randomness. In algorithm~\ref{algo:bbs_gpu},
+character \& performs the AND bitwise. So using \&7 with a number
+gives the last 3 bits, so it provides a number between 0 and 7.
+\item
+Second, after the generation of the 8 BBS numbers for each thread we
+have a 32 bits number for which the period is possibly quite small. So
+to add randomness, we generate 4 more BBS numbers which allows us to
+shift the 32 bits numbers and add upto 6 new bits. This part is
+described in algorithm~\ref{algo:bbs_gpu}. In practice, if we call
+{\it strategy}, the number representing the strategy, the last 2 bits
+of the first new BBS number are used to make a left shift of at least
+3 bits. The last 3 bits of the second new BBS number are add to the
+strategy whatever the value of the first left shift. The third and the
+fourth new BBS numbers are used similarly to apply a new left shift
+and add 3 new bits.
+\item
+Finally, as we use 8 BBS numbers for each thread, the store of these
+numbers at the end of the kernel is performed using a rotation. So,
+internal variable for BBS number 1 is stored in place 2, internal
+variable for BBS number 2 is store ind place 3, ... and internal
+variable for BBS number 8 is stored in place 1.
+\end{itemize}
+
+
+\begin{algorithm}
+
+\KwIn{InternalVarBBSArray: array with internal variables of the 8 BBS
+in global memory\;
+NumThreads: Number of threads\;
+tab: 2D Arrays containing 16 combinations (in first dimension) of size combination\_size (in second dimension)\;}
+
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadId is concerned} {
+ retrieve data from InternalVarBBSArray[threadId] in local variables including shared memory and x\;
+ we consider that bbs1 ... bbs8 represent the internal states of the 8 BBS numbers\;
+ offset = threadIdx\%combination\_size\;
+ o1 = threadIdx-offset+tab[bbs1\&7][offset]\;
+ o2 = threadIdx-offset+tab[8+bbs2\&7][offset]\;
+ \For{i=1 to n} {
+ t<<=4\;
+ t|=BBS1(bbs1)\&15\;
+ ...\;
+ t<<=4\;
+ t|=BBS8(bbs8)\&15\;
+ //two new shifts\;
+ t<<=BBS3(bbs3)\&3\;
+ t|=BBS1(bbs1)\&7\;
+ t<<=BBS7(bbs7)\&3\;
+ t|=BBS2(bbs2)\&7\;
+ t=t$\oplus$shmem[o1]$\oplus$shmem[o2]\;
+ shared\_mem[threadId]=t\;
+ x = x $\oplus$ t\;
+
+ store the new PRNG in NewNb[NumThreads*threadId+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadId] using a rotation\;
+}
+
+\caption{main kernel for the BBS based PRNG GPU}
+\label{algo:bbs_gpu}
+\end{algorithm}
+
+In algorithm~\ref{algo:bbs_gpu}, t<<=4 performs a left shift of 4 bits
+on the variable t and stores the result in t. BBS1(bbs1)\&15 selects
+the last four bits of the result of BBS1. It should be noticed that
+for the two new shifts, we use arbitrarily 4 BBSs that have previously
+been used.