-The modulus operation is the most time consuming operation for
-current GPU cards.
-So in order to obtain quite reasonable performances, it is required
-to use only modulus on 32 bits integer numbers. Consequently $x_n^2$ need to be
-less than $2^{32}$ and the number $M$ need to be less than $2^{16}$. So in
-practice we can choose prime numbers around 256 that are congruent to 3 modulus
-4. With 32 bits numbers, only the 4 least significant bits of $x_n$ can be
-chosen (the maximum number of indistinguishable bits is lesser than or equals to
-$log_2(log_2(x_n))$). So to generate a 32 bits number, we need to use 8 times
-the BBS algorithm with different combinations of $M$.
+The modulus operation is the most time consuming operation for current
+GPU cards. So in order to obtain quite reasonable performances, it is
+required to use only modulus on 32-bits integer numbers. Consequently
+$x_n^2$ need to be lesser than $2^{32}$, and thus the number $M$ must be
+lesser than $2^{16}$. So in practice we can choose prime numbers around
+256 that are congruent to 3 modulus 4. With 32-bits numbers, only the
+4 least significant bits of $x_n$ can be chosen (the maximum number of
+indistinguishable bits is lesser than or equals to
+$log_2(log_2(M))$). In other words, to generate a 32-bits number, we need to use
+8 times the BBS algorithm with possibly different combinations of $M$. This
+approach is not sufficient to be able to pass all the TestU01,
+as small values of $M$ for the BBS lead to
+ small periods. So, in order to add randomness we proceed with
+the followings modifications.
+\begin{itemize}
+\item
+Firstly, we define 16 arrangement arrays instead of 2 (as described in
+Algorithm \ref{algo:gpu_kernel2}), but only 2 of them are used at each call of
+the PRNG kernels. In practice, the selection of combinations
+arrays to be used is different for all the threads. It is determined
+by using the three last bits of two internal variables used by BBS.
+%This approach adds more randomness.
+In Algorithm~\ref{algo:bbs_gpu},
+character \& is for the bitwise AND. Thus using \&7 with a number
+gives the last 3 bits, providing so a number between 0 and 7.
+\item
+Secondly, after the generation of the 8 BBS numbers for each thread, we
+have a 32-bits number whose period is possibly quite small. So
+to add randomness, we generate 4 more BBS numbers to
+shift the 32-bits numbers, and add up to 6 new bits. This improvement is
+described in Algorithm~\ref{algo:bbs_gpu}. In practice, the last 2 bits
+of the first new BBS number are used to make a left shift of at most
+3 bits. The last 3 bits of the second new BBS number are add to the
+strategy whatever the value of the first left shift. The third and the
+fourth new BBS numbers are used similarly to apply a new left shift
+and add 3 new bits.
+\item
+Finally, as we use 8 BBS numbers for each thread, the storage of these
+numbers at the end of the kernel is performed using a rotation. So,
+internal variable for BBS number 1 is stored in place 2, internal
+variable for BBS number 2 is stored in place 3, ..., and finally, internal
+variable for BBS number 8 is stored in place 1.
+\end{itemize}
+
+\begin{algorithm}
+
+\KwIn{InternalVarBBSArray: array with internal variables of the 8 BBS
+in global memory\;
+NumThreads: Number of threads\;
+array\_comb: 2D Arrays containing 16 combinations (in first dimension) of size combination\_size (in second dimension)\;
+array\_shift[4]=\{0,1,3,7\}\;
+}