+\section{Cryptographical Applications}
+
+\subsection{A Cryptographically Secure PRNG for GPU}
+\label{sec:CSGPU}
+
+It is possible to build a cryptographically secure PRNG based on the previous
+algorithm (Algorithm~\ref{algo:gpu_kernel2}). Due to Proposition~\ref{cryptopreuve},
+it simply consists in replacing
+the {\it xor-like} PRNG by a cryptographically secure one.
+We have chosen the Blum Blum Shub generator~\cite{BBS} (usually denoted by BBS) having the form:
+$$x_{n+1}=x_n^2~ mod~ M$$ where $M$ is the product of two prime numbers (these
+prime numbers need to be congruent to 3 modulus 4). BBS is known to be
+very slow and only usable for cryptographic applications.
+
+
+The modulus operation is the most time consuming operation for current
+GPU cards. So in order to obtain quite reasonable performances, it is
+required to use only modulus on 32-bits integer numbers. Consequently
+$x_n^2$ need to be lesser than $2^{32}$, and thus the number $M$ must be
+lesser than $2^{16}$. So in practice we can choose prime numbers around
+256 that are congruent to 3 modulus 4. With 32-bits numbers, only the
+4 least significant bits of $x_n$ can be chosen (the maximum number of
+indistinguishable bits is lesser than or equals to
+$log_2(log_2(M))$). In other words, to generate a 32-bits number, we need to use
+8 times the BBS algorithm with possibly different combinations of $M$. This
+approach is not sufficient to be able to pass all the tests of TestU01,
+as small values of $M$ for the BBS lead to
+ small periods. So, in order to add randomness we have proceeded with
+the followings modifications.
+\begin{itemize}
+\item
+Firstly, we define 16 arrangement arrays instead of 2 (as described in
+Algorithm \ref{algo:gpu_kernel2}), but only 2 of them are used at each call of
+the PRNG kernels. In practice, the selection of combination
+arrays to be used is different for all the threads. It is determined
+by using the three last bits of two internal variables used by BBS.
+%This approach adds more randomness.
+In Algorithm~\ref{algo:bbs_gpu},
+character \& is for the bitwise AND. Thus using \&7 with a number
+gives the last 3 bits, thus providing a number between 0 and 7.
+\item
+Secondly, after the generation of the 8 BBS numbers for each thread, we
+have a 32-bits number whose period is possibly quite small. So
+to add randomness, we generate 4 more BBS numbers to
+shift the 32-bits numbers, and add up to 6 new bits. This improvement is
+described in Algorithm~\ref{algo:bbs_gpu}. In practice, the last 2 bits
+of the first new BBS number are used to make a left shift of at most
+3 bits. The last 3 bits of the second new BBS number are added to the
+strategy whatever the value of the first left shift. The third and the
+fourth new BBS numbers are used similarly to apply a new left shift
+and add 3 new bits.
+\item
+Finally, as we use 8 BBS numbers for each thread, the storage of these
+numbers at the end of the kernel is performed using a rotation. So,
+internal variable for BBS number 1 is stored in place 2, internal
+variable for BBS number 2 is stored in place 3, ..., and finally, internal
+variable for BBS number 8 is stored in place 1.
+\end{itemize}
+
+\begin{algorithm}
+\begin{small}
+\KwIn{InternalVarBBSArray: array with internal variables of the 8 BBS
+in global memory\;
+NumThreads: Number of threads\;
+array\_comb: 2D Arrays containing 16 combinations (in first dimension) of size combination\_size (in second dimension)\;
+array\_shift[4]=\{0,1,3,7\}\;
+}
+
+\KwOut{NewNb: array containing random numbers in global memory}
+\If{threadId is concerned} {
+ retrieve data from InternalVarBBSArray[threadId] in local variables including shared memory and x\;
+ we consider that bbs1 ... bbs8 represent the internal states of the 8 BBS numbers\;
+ offset = threadIdx\%combination\_size\;
+ o1 = threadIdx-offset+array\_comb[bbs1\&7][offset]\;
+ o2 = threadIdx-offset+array\_comb[8+bbs2\&7][offset]\;
+ \For{i=1 to n} {
+ t$<<$=4\;
+ t|=BBS1(bbs1)\&15\;
+ ...\;
+ t$<<$=4\;
+ t|=BBS8(bbs8)\&15\;
+ \tcp{two new shifts}
+ shift=BBS3(bbs3)\&3\;
+ t$<<$=shift\;
+ t|=BBS1(bbs1)\&array\_shift[shift]\;
+ shift=BBS7(bbs7)\&3\;
+ t$<<$=shift\;
+ t|=BBS2(bbs2)\&array\_shift[shift]\;
+ t=t\textasciicircum shmem[o1]\textasciicircum shmem[o2]\;
+ shared\_mem[threadId]=t\;
+ x = x\textasciicircum t\;
+
+ store the new PRNG in NewNb[NumThreads*threadId+i]\;
+ }
+ store internal variables in InternalVarXorLikeArray[threadId] using a rotation\;
+}
+\end{small}
+\caption{main kernel for the BBS based PRNG GPU}
+\label{algo:bbs_gpu}
+\end{algorithm}
+
+In Algorithm~\ref{algo:bbs_gpu}, $n$ is for the quantity of random numbers that
+a thread has to generate. The operation t<<=4 performs a left shift of 4 bits
+on the variable $t$ and stores the result in $t$, and $BBS1(bbs1)\&15$ selects
+the last four bits of the result of $BBS1$. Thus an operation of the form
+$t<<=4; t|=BBS1(bbs1)\&15\;$ realizes in $t$ a left shift of 4 bits, and then
+puts the 4 last bits of $BBS1(bbs1)$ in the four last positions of $t$. Let us
+remark that the initialization $t$ is not a necessity as we fill it 4 bits by 4
+bits, until having obtained 32-bits. The two last new shifts are realized in
+order to enlarge the small periods of the BBS used here, to introduce a kind of
+variability. In these operations, we make twice a left shift of $t$ of \emph{at
+ most} 3 bits, represented by \texttt{shift} in the algorithm, and we put
+\emph{exactly} the \texttt{shift} last bits from a BBS into the \texttt{shift}
+last bits of $t$. For this, an array named \texttt{array\_shift}, containing the
+correspondence between the shift and the number obtained with \texttt{shift} 1
+to make the \texttt{and} operation is used. For example, with a left shift of 0,
+we make an and operation with 0, with a left shift of 3, we make an and
+operation with 7 (represented by 111 in binary mode).
+
+It should be noticed that this generator has once more the form $x^{n+1} = x^n \oplus S^n$,
+where $S^n$ is referred in this algorithm as $t$: each iteration of this
+PRNG ends with $x = x \wedge t$. This $S^n$ is only constituted
+by secure bits produced by the BBS generator, and thus, due to
+Proposition~\ref{cryptopreuve}, the resulted PRNG is cryptographically
+secure.
+
+
+
+\begin{color}{red}
+\subsection{Practical Security Evaluation}
+
+Suppose now that the PRNG will work during
+$M=100$ time units, and that during this period,
+an attacker can realize $10^{12}$ clock cycles.
+We thus wonder whether, during the PRNG's
+lifetime, the attacker can distinguish this
+sequence from truly random one, with a probability
+greater than $\varepsilon = 0.2$.
+We consider that $N$ has 900 bits.
+
+The random process is the BBS generator, which
+is cryptographically secure. More precisely, it
+is $(T,\varepsilon)-$secure: no
+$(T,\varepsilon)-$distinguishing attack can be
+successfully realized on this PRNG, if~\cite{Fischlin}
+$$
+T \leqslant \dfrac{L(N)}{6 N (log_2(N))\varepsilon^{-2}M^2}-2^7 N \varepsilon^{-2} M^2 log_2 (8 N \varepsilon^{-1}M)
+$$
+where $M$ is the length of the output ($M=100$ in
+our example), and $L(N)$ is equal to
+$$
+2.8\times 10^{-3} exp \left(1.9229 \times (N ~ln(2)^\frac{1}{3}) \times ln(N~ln 2)^\frac{2}{3}\right)
+$$
+is the number of clock cycles to factor a $N-$bit
+integer.
+
+A direct numerical application shows that this attacker
+cannot achieve its $(10^{12},0.2)$ distinguishing
+attack in that context.
+
+\end{color}