X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/prng_gpu.git/blobdiff_plain/1ef8a2c0156684f00384e536ee77170afcca38d4..28690929433ca34390a326790df02387bbae7c6e:/prng_gpu.tex diff --git a/prng_gpu.tex b/prng_gpu.tex index e6650ce..7629e10 100644 --- a/prng_gpu.tex +++ b/prng_gpu.tex @@ -1,5 +1,1414 @@ \documentclass{article} +\usepackage[utf8]{inputenc} +\usepackage[T1]{fontenc} +\usepackage{fullpage} +\usepackage{fancybox} +\usepackage{amsmath} +\usepackage{amscd} +\usepackage{moreverb} +\usepackage{commath} +\usepackage{algorithm2e} +\usepackage{listings} +\usepackage[standard]{ntheorem} + +% For mathds: the sets IR, IN, etc. +\usepackage{dsfont} + +% For integer intervals +\usepackage{stmaryrd} + +\usepackage{graphicx} +% For subfigures within figures +\usepackage{subfigure} + +\usepackage{color} + +\newtheorem{notation}{Notation} + +\newcommand{\X}{\mathcal{X}} +\newcommand{\Go}{G_{f_0}} +\newcommand{\B}{\mathds{B}} +\newcommand{\N}{\mathds{N}} +\newcommand{\BN}{\mathds{B}^\mathsf{N}} +\let\sur=\overline + +\newcommand{\alert}[1]{\begin{color}{blue}\textit{#1}\end{color}} + +\title{Efficient and Cryptographically Secure Generation of Chaotic Pseudorandom Numbers on GPU} \begin{document} -qdsqsd -qsdqsd + +\author{Jacques M. Bahi, Rapha\"{e}l Couturier, Christophe +Guyeux, and Pierre-Cyrille Heam\thanks{Authors in alphabetic order}} + +\maketitle + +\begin{abstract} +In this paper we present a new pseudorandom number generator (PRNG) on +graphics processing units (GPU). This PRNG is based on the so-called chaotic iterations. It +is first proven to be chaotic according to Devaney's formulation. We then propose an efficient +implementation for GPU that successfully passes the {\it BigCrush} tests, deemed to be the hardest +battery of tests in TestU01. Experiments show that this PRNG can generate +about 20 billion random numbers per second on Tesla C1060 and NVidia GTX280 +cards. 
+It is finally established that, under reasonable assumptions, the proposed PRNG can be cryptographically +secure. + + +\end{abstract} + +\section{Introduction} + +Randomness is of importance in many fields such as scientific simulations or cryptography. +``Random numbers'' can mainly be generated either by a deterministic and reproducible algorithm +called a pseudorandom number generator (PRNG), or by a physical non-deterministic +process having all the characteristics of a random noise, called a truly random number +generator (TRNG). +In this paper, we focus on reproducible generators, useful for instance in +Monte-Carlo based simulators or in several cryptographic schemes. +These domains need PRNGs that are statistically irreproachable. +In some fields, such as numerical simulations, speed is a strong requirement +that is usually attained by using parallel architectures. In that case, +a recurrent problem is that a degradation of the statistical qualities is often +reported when a good PRNG is parallelized. +This is why ad-hoc PRNGs for each possible architecture must be found to +achieve both speed and randomness. +On the other hand, speed is not the main requirement in cryptography: the great +need is to define \emph{secure} generators able to withstand malicious +attacks. Roughly speaking, an attacker should not be able in practice to +distinguish numbers obtained with the secure generator from a truly random +sequence. +Finally, a small part of the community working in this domain focuses on a +third requirement, that is to define chaotic generators. +The main idea is to take advantage of a chaotic dynamical system to obtain a +generator that is unpredictable, disordered, sensitive to its seed, or in other words chaotic. +Their desire is to map a given chaotic dynamics into a sequence that seems random +and unassailable due to chaos. 
+However, the chaotic maps used as a pattern are defined on the real line +whereas computers deal with finite precision numbers. +This distortion leads to a degradation of both chaotic properties and speed. +Furthermore, authors of such chaotic generators often claim their PRNG +to be secure due to its chaos properties, but there is no obvious relation +between chaos and security as it is understood in cryptography. +This is why the use of chaos for PRNG still remains marginal and disputable. + +The authors' opinion is that topological properties of disorder, as they are +properly defined in the mathematical theory of chaos, can reinforce the quality +of a PRNG. But they are not substitutes for security or statistical perfection. +Indeed, from the authors' point of view, such properties can be useful in the two following situations. On the +one hand, a post-treatment based on a chaotic dynamical system can be applied +to a statistically defective PRNG, in order to improve its statistical +properties. Such an improvement can be found, for instance, in~\cite{bgw09:ip,bcgr11:ip}. +On the other hand, chaos can be added to a fast, statistically perfect PRNG and/or a +cryptographically secure one, in cases where chaos can be of interest, +\emph{only if these last properties are not lost during +the proposed post-treatment}. Such an assumption is behind this research work. +It leads to the attempt to define a +family of PRNGs that are chaotic while being fast and statistically perfect, +or cryptographically secure. +Let us finish this paragraph by noticing that, in this paper, +statistical perfection refers to the ability to pass the whole +{\it BigCrush} battery of tests, which is widely considered as the most +stringent statistical evaluation of a sequence claimed as random. +This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}. 
+Chaos, for its part, refers to the well-established definition of a +chaotic dynamical system proposed by Devaney~\cite{Devaney}. + + +In a previous work~\cite{bgw09:ip,guyeux10} we have proposed a post-treatment on PRNGs making them behave +as a chaotic dynamical system. Such a post-treatment leads to a new category of +PRNGs. We have shown that proofs of Devaney's chaos can be established for this +family, and that the sequence obtained after this post-treatment can pass the +NIST~\cite{Nist10}, DieHARD~\cite{Marsaglia1996}, and TestU01~\cite{LEcuyerS07} batteries of tests, even if the inputted generators +cannot. +The proposition of this paper is to greatly improve the speed of the formerly +proposed generator, without any loss of chaos or statistical properties. +In particular, a version of this PRNG on graphics processing units (GPU) +is proposed. +Although GPUs were initially designed to accelerate +the manipulation of images, they are nowadays commonly used in many scientific +applications. Therefore, it is important to be able to generate pseudorandom +numbers inside a GPU when a scientific application runs on it. This remark +motivates our proposal of a chaotic and statistically perfect PRNG for GPU. +Such a device +allows us to generate almost 20 billion pseudorandom numbers per second. +Last, but not least, we show that the proposed post-treatment preserves the +cryptographic security of the inputted PRNG, when the latter has such a +property. + +The remainder of this paper is organized as follows. In Section~\ref{section:related + works} we review some GPU implementations of PRNGs. Section~\ref{section:BASIC + RECALLS} recalls the well-known Devaney's formulation of chaos, + and an iteration process called ``chaotic +iterations'' on which the post-treatment is based. +Proofs of chaos are given in Section~\ref{sec:pseudorandom}. 
+Section~\ref{sec:efficient PRNG} presents an efficient +implementation of this chaotic PRNG on a CPU, whereas Section~\ref{sec:efficient PRNG + gpu} describes the GPU implementation. +These generators are evaluated experimentally in +Section~\ref{sec:experiments}. +We show in Section~\ref{sec:security analysis} that, if the inputted +generator is cryptographically secure, then so is the +generator provided by the post-treatment. +Such a proof leads to the proposition of a cryptographically secure and +chaotic generator on GPU based on the famous Blum Blum Shub (BBS) generator +in Section~\ref{sec:CSGPU}. +This research work ends with a conclusion section, in which the contribution is +summarized and intended future work is presented. + + + + +\section{Related works on GPU based PRNGs} +\label{section:related works} + +Numerous research works on defining GPU based PRNGs have already been proposed in the +literature, so that completeness is impossible. +This is why the authors of this document only give references to the most significant attempts +in this domain, from their subjective point of view. +The quantity of pseudorandom numbers generated per second is mentioned here +only when the information is given in the related work. +A million numbers per second will be simply written as +1MSample/s whereas a billion numbers per second is 1GSample/s. + +In \cite{Pang:2008:cec} a PRNG based on cellular automata is defined +with no requirement for high precision integer arithmetic or for any bitwise +operations. The authors can generate about +3.2MSamples/s on a GeForce 7800 GTX GPU, which is quite an old card now. +However, there is neither a mention of statistical tests nor any proof of +chaos or cryptographic security in this document. + +In \cite{ZRKB10}, the authors propose different versions of efficient GPU PRNGs +based on Lagged Fibonacci or Hybrid Taus. They have used these +PRNGs for Langevin simulations of biomolecules fully implemented on +GPU. 
Performance of the GPU versions is far better than that obtained with a +CPU, and these PRNGs succeed in passing the {\it BigCrush} battery of TestU01. +However the evaluations of the proposed PRNGs are only statistical ones. + + +Authors of~\cite{conf/fpga/ThomasHL09} have studied the implementation of some +PRNGs on different computing architectures: CPU, field-programmable gate array +(FPGA), massively parallel processors, and GPU. This study is of interest, because +the performance of the same PRNGs on different architectures is compared. +FPGA appears as the fastest and the most +efficient architecture, providing the highest number of generated pseudorandom numbers +per joule. +However, we can notice that the authors can ``only'' generate between 11 and 16GSamples/s +with a GTX 280 GPU, which should be compared with +the results presented in this document. +We can remark too that the PRNGs proposed in~\cite{conf/fpga/ThomasHL09} are only +able to pass the {\it Crush} battery, which is very easy compared to the {\it BigCrush} one. + +Lastly, NVIDIA has developed a CUDA library for the generation of pseudorandom numbers called +Curand~\cite{curand11}. Several PRNGs are implemented, including +Xorwow~\cite{Marsaglia2003} and some variants of Sobol. The tests reported show that +their fastest version provides 15GSamples/s on the new Fermi C2050 card. +But their PRNGs cannot pass the whole TestU01 battery (only one test is failed). +\newline +\newline +We can finally remark that, to the best of our knowledge, no GPU implementation has been proven to be chaotic, and the cryptographically secure property is surprisingly never considered. + +\section{Basic Recalls} +\label{section:BASIC RECALLS} + +This section is devoted to basic definitions and terminologies in the fields of +topological chaos and chaotic iterations. 
+\subsection{Devaney's Chaotic Dynamical Systems} + +In the sequel $S^{n}$ denotes the $n^{th}$ term of a sequence $S$ and $V_{i}$ +denotes the $i^{th}$ component of a vector $V$. $f^{k}=f\circ ...\circ f$ +is for the $k^{th}$ composition of a function $f$. Finally, the following +notation is used: $\llbracket1;N\rrbracket=\{1,2,\hdots,N\}$. + + +Consider a topological space $(\mathcal{X},\tau)$ and a continuous function $f : +\mathcal{X} \rightarrow \mathcal{X}$. + +\begin{definition} +$f$ is said to be \emph{topologically transitive} if, for any pair of open sets +$U,V \subset \mathcal{X}$, there exists $k>0$ such that $f^k(U) \cap V \neq +\varnothing$. +\end{definition} + +\begin{definition} +An element $x$ is a \emph{periodic point} for $f$ of period $n\in \mathds{N}^*$ +if $f^{n}(x)=x$.% The set of periodic points of $f$ is denoted $Per(f).$ +\end{definition} + +\begin{definition} +$f$ is said to be \emph{regular} on $(\mathcal{X}, \tau)$ if the set of periodic +points for $f$ is dense in $\mathcal{X}$: for any point $x$ in $\mathcal{X}$, +any neighborhood of $x$ contains at least one periodic point (without +necessarily the same period). +\end{definition} + + +\begin{definition}[Devaney's formulation of chaos~\cite{Devaney}] +$f$ is said to be \emph{chaotic} on $(\mathcal{X},\tau)$ if $f$ is regular and +topologically transitive. +\end{definition} + +The chaos property is strongly linked to the notion of ``sensitivity'', defined +on a metric space $(\mathcal{X},d)$ by: + +\begin{definition} +\label{sensitivity} $f$ has \emph{sensitive dependence on initial conditions} +if there exists $\delta >0$ such that, for any $x\in \mathcal{X}$ and any +neighborhood $V$ of $x$, there exist $y\in V$ and $n > 0$ such that +$d\left(f^{n}(x), f^{n}(y)\right) >\delta $. + +$\delta$ is called the \emph{constant of sensitivity} of $f$. 
+\end{definition} + +Indeed, Banks \emph{et al.} have proven in~\cite{Banks92} that when $f$ is +chaotic and $(\mathcal{X}, d)$ is a metric space, then $f$ has the property of +sensitive dependence on initial conditions (this property was formerly an +element of the definition of chaos). To sum up, quoting Devaney +in~\cite{Devaney}, a chaotic dynamical system ``is unpredictable because of the +sensitive dependence on initial conditions. It cannot be broken down or +simplified into two subsystems which do not interact because of topological +transitivity. And in the midst of this random behavior, we nevertheless have an +element of regularity''. Fundamentally different behaviors are consequently +possible and occur in an unpredictable way. + + + +\subsection{Chaotic Iterations} +\label{sec:chaotic iterations} + + +Let us consider a \emph{system} with a finite number $\mathsf{N} \in +\mathds{N}^*$ of elements (or \emph{cells}), so that each cell has a +Boolean \emph{state}. Having $\mathsf{N}$ Boolean values for these + cells leads to the definition of a particular \emph{state of the +system}. A sequence whose elements belong to $\llbracket 1;\mathsf{N} +\rrbracket $ is called a \emph{strategy}. The set of all strategies is +denoted by $\llbracket 1, \mathsf{N} \rrbracket^\mathds{N}.$ + +\begin{definition} +\label{Def:chaotic iterations} +The set $\mathds{B}$ denoting $\{0,1\}$, let +$f:\mathds{B}^{\mathsf{N}}\longrightarrow \mathds{B}^{\mathsf{N}}$ be +a function and $S\in \llbracket 1, \mathsf{N} \rrbracket^\mathds{N}$ be a ``strategy''. The so-called +\emph{chaotic iterations} are defined by $x^0\in +\mathds{B}^{\mathsf{N}}$ and +\begin{equation} +\forall n\in \mathds{N}^{\ast }, \forall i\in +\llbracket1;\mathsf{N}\rrbracket ,x_i^n=\left\{ +\begin{array}{ll} + x_i^{n-1} & \text{ if }S^n\neq i \\ + \left(f(x^{n-1})\right)_{S^n} & \text{ if }S^n=i. +\end{array}\right. 
+\end{equation} +\end{definition} + +In other words, at the $n^{th}$ iteration, only the $S^{n}-$th cell is +\textquotedblleft iterated\textquotedblright . Note that in a more +general formulation, $S^n$ can be a subset of components and +$\left(f(x^{n-1})\right)_{S^{n}}$ can be replaced by +$\left(f(x^{k})\right)_{S^{n}}$, where $k0$. \medskip +\begin{itemize} +\item If $\varepsilon \geqslant 1$, we see that distance +between $\left( G_{f}(S^n,E^n)\right) $ and $\left( G_{f}(S,E)\right) $ is +strictly less than 1 after the $max(n_{0},n_{1})^{th}$ term (same state). +\medskip +\item If $\varepsilon <1$, then $\exists k\in \mathds{N},10^{-k}\geqslant +\varepsilon > 10^{-(k+1)}$. But $d_{s}(S^n,S)$ converges to 0, so +\begin{equation*} +\exists n_{2}\in \mathds{N},\forall n\geqslant +n_{2},d_{s}(S^n,S)<10^{-(k+2)}, +\end{equation*}% +thus after $n_{2}$, the $k+2$ first terms of $S^n$ and $S$ are equal. +\end{itemize} +\noindent As a consequence, the $k+1$ first entries of the strategies of $% +G_{f}(S^n,E^n)$ and $G_{f}(S,E)$ are the same ($G_{f}$ is a shift of strategies) and due to the definition of $d_{s}$, the floating part of +the distance between $(S^n,E^n)$ and $(S,E)$ is strictly less than $% +10^{-(k+1)}\leqslant \varepsilon $.\bigskip \newline +In conclusion, +$$ +\forall \varepsilon >0,\exists N_{0}=max(n_{0},n_{1},n_{2})\in \mathds{N}% +,\forall n\geqslant N_{0}, + d\left( G_{f}(S^n,E^n);G_{f}(S,E)\right) +\leqslant \varepsilon . +$$ +$G_{f}$ is consequently continuous. +\end{proof} + + +It is now possible to study the topological behavior of the general chaotic +iterations. We will prove that, + +\begin{theorem} +\label{t:chaos des general} + The general chaotic iterations defined on Equation~\ref{general CIs} satisfy +the Devaney's property of chaos. +\end{theorem} + +Let us firstly prove the following lemma. 
+ +\begin{lemma}[Strong transitivity] +\label{strongTrans} + For all pairs $X,Y \in \mathcal{X}$ and any neighborhood $V$ of $X$, we can +find $n \in \mathds{N}^*$ and $X' \in V$ such that $G^n(X')=Y$. +\end{lemma} + +\begin{proof} + Let $X=(S,E)$, $\varepsilon>0$, and $k_0 = \lfloor \log_{10}(\varepsilon)+1 \rfloor$. +Any point $X'=(S',E')$ such that $E'=E$ and $\forall k \leqslant k_0, S'^k=S^k$, +is in the open ball $\mathcal{B}\left(X,\varepsilon\right)$. Let us define +$\check{X} = \left(\check{S},\check{E}\right)$, where $\check{X}= G^{k_0}(X)$. +We denote by $s\subset \llbracket 1; \mathsf{N} \rrbracket$ the set of coordinates +that are different between $\check{E}$ and the state of $Y$. Thus each point $X'$ of +the form $(S',E')$ where $E'=E$ and $S'$ starts with +$(S^0, S^1, \hdots, S^{k_0},s,\hdots)$, verifies the following properties: +\begin{itemize} + \item $X'$ is in $\mathcal{B}\left(X,\varepsilon\right)$, + \item the state of $G_f^{k_0+1}(X')$ is the state of $Y$. +\end{itemize} +Finally the point $\left(\left(S^0, S^1, \hdots, S^{k_0},s,s^0, s^1, \hdots\right); E\right)$, +where $(s^0,s^1, \hdots)$ is the strategy of $Y$, satisfies the properties +claimed in the lemma. +\end{proof} + +We can now prove Theorem~\ref{t:chaos des general}. + +\begin{proof}[Theorem~\ref{t:chaos des general}] +Firstly, strong transitivity implies transitivity. + +Let $(S,E) \in\mathcal{X}$ and $\varepsilon >0$. To +prove that $G_f$ is regular, it is sufficient to prove that +there exists a strategy $\tilde S$ such that the distance between +$(\tilde S,E)$ and $(S,E)$ is less than $\varepsilon$, and such that +$(\tilde S,E)$ is a periodic point. + +Let $t_1=\lfloor-\log_{10}(\varepsilon)\rfloor$, and let $E'$ be the +configuration that we obtain from $(S,E)$ after $t_1$ iterations of +$G_f$. As $G_f$ is strongly transitive, there exists a strategy $S'$ +and $t_2\in\mathds{N}$ such +that $E$ is reached from $(S',E')$ after $t_2$ iterations of $G_f$. 
+ +Consider the strategy $\tilde S$ that alternates the first $t_1$ terms +of $S$ and the first $t_2$ terms of $S'$: $$\tilde +S=(S_0,\dots,S_{t_1-1},S'_0,\dots,S'_{t_2-1},S_0,\dots,S_{t_1-1},S'_0,\dots,S'_{t_2-1},S_0,\dots).$$ It +is clear that $(\tilde S,E)$ is obtained from $(\tilde S,E)$ after +$t_1+t_2$ iterations of $G_f$. So $(\tilde S,E)$ is a periodic +point. Since $\tilde S_t=S_t$ for $t>32); + x = x^(unsigned int)(t3>>32); + x = x^(unsigned int)t2; + x = x^(unsigned int)(t1>>32); + x = x^(unsigned int)t3; + return x; +} +\end{lstlisting} + + + + + +In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based on chaotic iterations + is presented. The xor operator is represented by \textasciicircum. +This function uses three classical 64-bits PRNGs, namely the \texttt{xorshift}, the +\texttt{xor128}, and the \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them +``xor-like PRNGs''. +As +each xor-like PRNG uses 64-bits whereas our proposed generator works with 32-bits, +we use the command \texttt{(unsigned int)}, that selects the 32 least significant bits of a given integer, and the code +\texttt{(unsigned int)(t3$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}. + +So producing a pseudorandom number needs 6 xor operations +with 6 32-bits numbers that are provided by 3 64-bits PRNGs. This version successfully passes the +stringent BigCrush battery of tests~\cite{LEcuyerS07}. + +\section{Efficient PRNGs based on Chaotic Iterations on GPU} +\label{sec:efficient PRNG gpu} + +In order to take benefits from the computing power of GPU, a program +needs to have independent blocks of threads that can be computed +simultaneously. In general, the larger the number of threads is, the +more local memory is used, and the less branching instructions are +used (if, while, ...), the better the performances on GPU is. 
+Obviously, having these requirements in mind, it is possible to build +a program similar to the one presented in Listing~\ref{algo:seqCIPRNG}, +which computes pseudorandom numbers on GPU. To +do so, we must firstly recall that in the CUDA~\cite{Nvid10} +environment, threads have a local identifier called +\texttt{threadIdx}, which is relative to the block containing +them. With CUDA, the parts of the code which are executed by the GPU are +called {\it kernels}. + + +\subsection{Naive Version for GPU} + + +It is possible to deduce from the CPU version a quite similar version adapted to GPU. +The simple principle consists in making each thread of the GPU compute the CPU version of our PRNG. +Of course, the three xor-like +PRNGs used in these computations must have different parameters. +In a given thread, these parameters are +randomly picked from another PRNG. +The initialization stage is performed by the CPU. +To do so, the ISAAC PRNG~\cite{Jenkins96} is used to set all the +parameters embedded into each thread. + +The implementation of the three +xor-like PRNGs is straightforward when their parameters have been +allocated in the GPU memory. Each xor-like works with an internal +number $x$ that saves the last generated pseudorandom number. Additionally, the +implementation of the xor128, the xorshift, and the xorwow respectively require +4, 5, and 6 unsigned long as internal variables. 
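The per-thread organisation described above can be emulated on a CPU. The following C sketch is only illustrative and rests on several assumptions that are not taken from the paper: Marsaglia's 64-bit xorshift (shift triple 13, 7, 17) stands in for all three xor-like generators, a SplitMix64-style mixer stands in for the ISAAC-based initialization, and the names \texttt{thread\_state}, \texttt{seed\_mix}, \texttt{init\_states}, and \texttt{naive\_next} are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical per-thread state: three 64-bit xorshift states stand in for
   the xorshift, xor128 and xorwow generators, plus the 32-bit value x. */
typedef struct {
    uint64_t s[3];
    uint32_t x;
} thread_state;

/* Marsaglia's 64-bit xorshift step (shift triple 13, 7, 17). */
static uint64_t xorshift64(uint64_t *s) {
    assert(*s != 0);                 /* the all-zero state is a fixed point */
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
}

/* SplitMix64-style mixer, used here only as a stand-in for ISAAC seeding. */
static uint64_t seed_mix(uint64_t *z) {
    uint64_t r = (*z += 0x9E3779B97F4A7C15ULL);
    r = (r ^ (r >> 30)) * 0xBF58476D1CE4E5B9ULL;
    r = (r ^ (r >> 27)) * 0x94D049BB133111EBULL;
    return r ^ (r >> 31);
}

/* Host-side initialization: every emulated thread gets its own parameters. */
static void init_states(thread_state *st, size_t num_threads, uint64_t seed) {
    for (size_t t = 0; t < num_threads; t++) {
        for (int j = 0; j < 3; j++) {
            do {
                st[t].s[j] = seed_mix(&seed);
            } while (st[t].s[j] == 0);   /* avoid the degenerate zero state */
        }
        st[t].x = (uint32_t)seed_mix(&seed);
    }
}

/* One output of one thread: x is xored with the six 32-bit halves
   produced by the three 64-bit generators. */
static uint32_t naive_next(thread_state *st) {
    for (int j = 0; j < 3; j++) {
        uint64_t t = xorshift64(&st->s[j]);
        st->x ^= (uint32_t)(t >> 32);    /* 32 most significant bits */
        st->x ^= (uint32_t)t;            /* 32 least significant bits */
    }
    return st->x;
}
```

Calling \texttt{init\_states} once on the host and then \texttt{naive\_next} repeatedly on each entry mimics what each GPU thread would do with its own slice of the internal-variable array; since the threads share nothing, the emulation order is irrelevant.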
+ +\begin{algorithm} + +\KwIn{InternalVarXorLikeArray: array with internal variables of the 3 xor-like +PRNGs in global memory\; +NumThreads: number of threads\;} +\KwOut{NewNb: array containing random numbers in global memory} +\If{threadIdx is concerned by the computation} { + retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\; + \For{i=1 to n} { + compute a new PRNG as in Listing~\ref{algo:seqCIPRNG}\; + store the new PRNG in NewNb[NumThreads*threadIdx+i]\; + } + store internal variables in InternalVarXorLikeArray[threadIdx]\; +} + +\caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations} +\label{algo:gpu_kernel} +\end{algorithm} + +Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on +GPU. Due to the available memory in the GPU and the number of threads +used simultaneously, the number of random numbers that a thread can generate +inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in +Algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and +if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bit number)}, +then the memory required to store all of the internal variables of the xor-like +PRNGs\footnote{we multiply this number by $2$ in order to count 32-bit numbers} +and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times +2+(1+100))=13,100,000$ 32-bit numbers, that is, approximately $52$MB. + +This generator is able to pass the whole BigCrush battery of tests, for all +the versions that have been tested depending on their number of threads +(called \texttt{NumThreads} in our algorithm, tested up to $10$ million). + +\begin{remark} +The proposed algorithm has the advantage of manipulating independent +PRNGs, so this version is easily adaptable to a cluster of computers too. The only thing +to ensure is to use a single ISAAC PRNG. 
To achieve this requirement, a simple solution consists in +using a master node for the initialization. This master node computes the initial parameters +for all the different nodes involved in the computation. +\end{remark} + +\subsection{Improved Version for GPU} + +As GPU cards using CUDA have shared memory between threads of the same block, it +is possible to use this feature in order to simplify the previous algorithm, +i.e., to use fewer than 3 xor-like PRNGs. The solution consists in computing only +one xor-like PRNG per thread, saving its result into the shared memory, and then using the results +of some other threads in the same block of threads. In order to define which +thread uses the result of which other one, we can use a combination array that +contains the indexes of all threads and for which a combination has been +performed. + +In Algorithm~\ref{algo:gpu_kernel2}, two combination arrays are used. +The variable \texttt{offset} is computed using the value of +\texttt{combination\_size}. Then we can compute \texttt{o1} and \texttt{o2} +representing the indexes of the other threads whose results are used +by the current one. In this algorithm, we consider that a 64-bit xor-like +PRNG has been chosen, and so its two 32-bit parts are used. + +This version can also pass the whole {\it BigCrush} battery of tests. 
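The index computation and the exchange through shared memory described above can be emulated sequentially. The C sketch below is an illustration under stated assumptions, not the authors' kernel: \texttt{BLOCK\_SIZE}, \texttt{COMBINATION\_SIZE}, \texttt{partner}, and \texttt{block\_round} are hypothetical names, Marsaglia's 64-bit xorshift stands in for the xor-like generator, only its 32 least significant bits are used for brevity, and each emulated thread is assumed to read the values written by its partners during the previous round (on a real GPU this would require explicit synchronization between threads of the block).

```c
#include <assert.h>
#include <stdint.h>

#define BLOCK_SIZE 8          /* hypothetical number of threads per block */
#define COMBINATION_SIZE 4    /* hypothetical size of a combination array */

/* Marsaglia's 64-bit xorshift step, a stand-in for the xor-like PRNG. */
static uint64_t xorshift64(uint64_t *s) {
    assert(*s != 0);
    *s ^= *s << 13;
    *s ^= *s >> 7;
    *s ^= *s << 17;
    return *s;
}

/* Index of the thread whose result is read by thread tid, following
   o = threadIdx - offset + tab[offset]. */
static unsigned partner(unsigned tid, const unsigned *tab) {
    unsigned offset = tid % COMBINATION_SIZE;
    assert(tab[offset] < COMBINATION_SIZE);  /* stay inside the same group */
    return tid - offset + tab[offset];
}

/* One round for a whole block, emulated sequentially on the CPU. */
static void block_round(uint64_t state[BLOCK_SIZE], uint32_t x[BLOCK_SIZE],
                        uint32_t shmem[BLOCK_SIZE],
                        const unsigned tab1[COMBINATION_SIZE],
                        const unsigned tab2[COMBINATION_SIZE]) {
    uint32_t next[BLOCK_SIZE];
    for (unsigned tid = 0; tid < BLOCK_SIZE; tid++) {
        uint32_t t = (uint32_t)xorshift64(&state[tid]);
        t ^= shmem[partner(tid, tab1)] ^ shmem[partner(tid, tab2)];
        next[tid] = t;        /* value published for the next round */
        x[tid] ^= t;          /* chaotic-iteration update of the output */
    }
    for (unsigned tid = 0; tid < BLOCK_SIZE; tid++) {
        shmem[tid] = next[tid];
    }
}
```

With combination arrays that are permutations of $\{0,\ldots,\texttt{COMBINATION\_SIZE}-1\}$, every partner index stays inside the group of \texttt{COMBINATION\_SIZE} consecutive threads containing the current one, which is why the arrays can be shared by the whole block.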
+ +\begin{algorithm} + +\KwIn{InternalVarXorLikeArray: array with internal variables of 1 xor-like PRNG +in global memory\; +NumThreads: Number of threads\; +tab1, tab2: Arrays containing combinations of size combination\_size\;} + +\KwOut{NewNb: array containing random numbers in global memory} +\If{threadId is concerned} { + retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory and x\; + offset = threadIdx\%combination\_size\; + o1 = threadIdx-offset+tab1[offset]\; + o2 = threadIdx-offset+tab2[offset]\; + \For{i=1 to n} { + t=xor-like()\; + t=t $\hat{ }$ shmem[o1] $\hat{ }$ shmem[o2]\; + shmem[threadIdx]=t\; + x = x $\hat{ }$ t\; + + store the new pseudorandom number in NewNb[NumThreads*threadId+i]\; + } + store internal variables in InternalVarXorLikeArray[threadId]\; +} + +\caption{Main kernel of the efficient GPU version of the PRNG based on chaotic +iterations} +\label{algo:gpu_kernel2} +\end{algorithm} + +\subsection{Theoretical Evaluation of the Improved Version} + +A run of Algorithm~\ref{algo:gpu_kernel2} consists of an operation ($x=x\oplus t$) having +the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative +system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic +iterations is realized between the last stored value $x$ of the thread and a strategy $t$ +(obtained by a bitwise exclusive or between a value provided by a xor-like() call +and two values previously obtained by two other threads). +To be certain that we are in the framework of Theorem~\ref{t:chaos des general}, +we must guarantee that this dynamical system iterates on the space +$\mathcal{X} = \mathcal{P}\left(\llbracket 1, \mathsf{N} \rrbracket\right)^\mathds{N}\times\mathds{B}^\mathsf{N}$. +The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$. 
+To prevent any flaw in the chaotic properties, we must check that the right +term (the last $t$), corresponding to the strategies, can possibly be equal to any +integer of $\llbracket 1, \mathsf{N} \rrbracket$. + +Such a result is obvious, as for the xor-like(), all the +integers belonging to its interval of definition can occur at each iteration, and thus the +last $t$ respects the requirement. Furthermore, it is possible to +prove by an immediate mathematical induction that, as the initial $x$ +is uniformly distributed (it is provided by a cryptographically secure PRNG), +the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too, +and thus the next $x$ is finally uniformly distributed. + +Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general +chaotic iterations presented previously, and for this reason, it satisfies +Devaney's formulation of a chaotic behavior. + +\section{Experiments} +\label{sec:experiments} + +Different experiments have been performed in order to measure the generation +speed. We have used a first computer equipped with a Tesla C1060 NVidia GPU card +and an +Intel Xeon E5530 clocked at 2.40 GHz, and +a second computer equipped with a smaller CPU and a GeForce GTX 280. +Both +cards have 240 cores. + +In Figure~\ref{fig:time_xorlike_gpu} we compare the quantity of pseudorandom numbers +generated per second with various xor-like based PRNGs. In this figure, the optimized +versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions +embed the three xor-like PRNGs described in Listing~\ref{algo:seqCIPRNG}. In +order to obtain the optimal performance, the storage of pseudorandom numbers +into the GPU memory has been removed. This step is time consuming and slows down the number +generation. 
Moreover this storage is completely +useless for applications that consume the pseudorandom +numbers directly after generation. We can see that when the number of threads is greater +than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated +per second is almost constant. With the naive version, this value ranges from 2.5 to +3GSamples/s. With the optimized version, it is approximately equal to +20GSamples/s. Finally we can remark that both GPU cards are quite similar, but in +practice, the Tesla C1060 has more memory than the GTX 280, and this memory +should be of better quality. +As a comparison, Listing~\ref{algo:seqCIPRNG} leads to the generation of about +138MSample/s when using one core of the Xeon E5530. + +\begin{figure}[htbp] +\begin{center} + \includegraphics[scale=.7]{curve_time_xorlike_gpu.pdf} +\end{center} +\caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG} +\label{fig:time_xorlike_gpu} +\end{figure} + + + + + +In Figure~\ref{fig:time_bbs_gpu} we highlight the performance of the optimized +BBS-based PRNG on GPU. On the Tesla C1060 we +obtain approximately 700MSample/s and on the GTX 280 about 670MSample/s, which is +obviously slower than the xorlike-based PRNG on GPU. However, we will show in the +next sections that +this new PRNG has a strong level of security, which is necessarily paid for by a speed +reduction. + +\begin{figure}[htbp] +\begin{center} + \includegraphics[scale=.7]{curve_time_bbs_gpu.pdf} +\end{center} +\caption{Quantity of pseudorandom numbers generated per second using the BBS-based PRNG} +\label{fig:time_bbs_gpu} +\end{figure} + +All these experiments allow us to conclude that it is possible to +generate a very large quantity of statistically perfect pseudorandom numbers with the xor-like version. 
+To a certain extent, this is also the case with the secure BBS-based version, the speed reduction being +explained by the fact that the former version has ``only'' +chaotic properties and statistical perfection, whereas the latter is also cryptographically secure, +as is shown in the next sections. + + + + + + + +\section{Security Analysis} +\label{sec:security analysis} + + + +In this section the concatenation of two strings $u$ and $v$ is classically +denoted by $uv$. +In a cryptographic context, a pseudorandom generator is a deterministic +algorithm $G$ transforming strings into strings and such that, for any +seed $w$ of length $N$, $G(w)$ (the output of $G$ on the input $w$) has size +$\ell_G(N)$ with $\ell_G(N)>N$. +The notion of {\it secure} PRNGs can now be defined as follows. + +\begin{definition} +A cryptographic PRNG $G$ is secure if for any probabilistic polynomial time +algorithm $D$, for any positive polynomial $p$, and for all sufficiently +large $N$'s, +$$| \mathrm{Pr}[D(G(U_N))=1]-\mathrm{Pr}[D(U_{\ell_G(N)})=1]|< \frac{1}{p(N)},$$ +where $U_r$ is the uniform distribution over $\{0,1\}^r$ and the +probabilities are taken over $U_N$, $U_{\ell_G(N)}$ as well as over the +internal coin tosses of $D$. +\end{definition} + +Intuitively, it means that there is no polynomial time algorithm that can +distinguish a perfect uniform random generator from $G$ with a non +negligible probability. The interested reader is referred +to~\cite[chapter~3]{Goldreich} for more information. Note that it is +quite easily possible to change the function $\ell$ into any polynomial +function $\ell^\prime$ satisfying $\ell^\prime(N)>N$~\cite[Chapter 3.3]{Goldreich}. + +The generation schema developed in (\ref{equation Oplus}) is based on a +pseudorandom generator. Let $H$ be a cryptographic PRNG. We may assume, +without loss of generality, that for any string $S_0$ of size $N$, the size +of $H(S_0)$ is $kN$, with $k>2$. It means that $\ell_H(N)=kN$. 
+Let $S_1,\ldots,S_k$ be the +strings of length $N$ such that $H(S_0)=S_1 \ldots S_k$ ($H(S_0)$ is the concatenation of +the $S_i$'s). The cryptographic PRNG $X$ defined in (\ref{equation Oplus}) +is the algorithm mapping any string $x_0S_0$ of length $2N$ into the string +$(x_0\oplus S_0 \oplus S_1)(x_0\oplus S_0 \oplus S_1\oplus S_2)\ldots +(x_0\bigoplus_{i=0}^{i=k}S_i)$. In particular, one has $\ell_{X}(2N)=kN=\ell_H(N)$. +We now claim that if $H$ is secure, +then $X$ is secure too. + +\begin{proposition} +\label{cryptopreuve} +If $H$ is a secure cryptographic PRNG, then $X$ is a secure cryptographic +PRNG too. +\end{proposition} + +\begin{proof} +The proposition is proved by contraposition. Assume that $X$ is not +secure. By definition, there exist a polynomial time probabilistic +algorithm $D$ and a positive polynomial $p$ such that for all $k_0$ there exists +$N\geq \frac{k_0}{2}$ satisfying +$$| \mathrm{Pr}[D(X(U_{2N}))=1]-\mathrm{Pr}[D(U_{kN})=1]|\geq \frac{1}{p(2N)}.$$ +We describe a new probabilistic algorithm $D^\prime$ on an input $w$ of size +$kN$: +\begin{enumerate} +\item Decompose $w$ into $w=w_1\ldots w_{k}$, where each $w_i$ has size $N$. +\item Pick a string $y$ of size $N$ uniformly at random. +\item Compute $z=(y\oplus w_1)(y\oplus w_1\oplus w_2)\ldots (y + \bigoplus_{i=1}^{i=k} w_i).$ +\item Return $D(z)$. +\end{enumerate} + + +Consider for each $y\in \mathbb{B}^{N}$ the function $\varphi_{y}$ +from $\mathbb{B}^{kN}$ into $\mathbb{B}^{kN}$ mapping $w=w_1\ldots w_k$ +(each $w_i$ has length $N$) to +$(y\oplus w_1)(y\oplus w_1\oplus w_2)\ldots (y + \bigoplus_{i=1}^{i=k} w_i).$ By construction, one has for every $w$, +\begin{equation}\label{PCH-1} +D^\prime(w)=D(\varphi_y(w)), +\end{equation} +where $y$ is randomly generated. 
+Moreover, for each $y$, $\varphi_{y}$ is injective: if +$(y\oplus w_1)(y\oplus w_1\oplus w_2)\ldots (y\bigoplus_{i=1}^{i=k} +w_i)=(y\oplus w_1^\prime)(y\oplus w_1^\prime\oplus w_2^\prime)\ldots +(y\bigoplus_{i=1}^{i=k} w_i^\prime)$, then for every $1\leq j\leq k$, +$y\bigoplus_{i=1}^{i=j} w_i^\prime=y\bigoplus_{i=1}^{i=j} w_i$. It follows, +by a direct induction, that $w_i=w_i^\prime$. Furthermore, since $\mathbb{B}^{kN}$ +is finite, each $\varphi_y$ is bijective. Therefore, and using (\ref{PCH-1}), +one has +\begin{equation}\label{PCH-2} +\mathrm{Pr}[D^\prime(U_{kN})=1]=\mathrm{Pr}[D(\varphi_y(U_{kN}))=1]=\mathrm{Pr}[D(U_{kN})=1]. +\end{equation} + +Now, using (\ref{PCH-1}) again, one has for every $x$, +\begin{equation}\label{PCH-3} +D^\prime(H(x))=D(\varphi_y(H(x))), +\end{equation} +where $y$ is randomly generated. By construction, $\varphi_y(H(x))=X(yx)$, +thus +\begin{equation}\label{PCH-3} +D^\prime(H(x))=D(X(yx)), +\end{equation} +where $y$ is randomly generated. +It follows that + +\begin{equation}\label{PCH-4} +\mathrm{Pr}[D^\prime(H(U_{N}))=1]=\mathrm{Pr}[D(X(U_{2N}))=1]. +\end{equation} + From (\ref{PCH-2}) and (\ref{PCH-4}), one can deduce that +there exist a polynomial time probabilistic +algorithm $D^\prime$ and a positive polynomial $p$ such that for all $k_0$ there exists +$N\geq \frac{k_0}{2}$ satisfying +$$| \mathrm{Pr}[D^\prime(H(U_{N}))=1]-\mathrm{Pr}[D^\prime(U_{kN})=1]|\geq \frac{1}{p(2N)},$$ +proving that $H$ is not secure, a contradiction. +\end{proof} + + +\section{Cryptographical Applications} + +\subsection{A Cryptographically Secure PRNG for GPU} +\label{sec:CSGPU} + +It is possible to build a cryptographically secure PRNG based on the previous +algorithm (Algorithm~\ref{algo:gpu_kernel2}). Thanks to Proposition~\ref{cryptopreuve}, +it simply consists in replacing +the {\it xor-like} PRNG by a cryptographically secure one. 
+We have chosen the Blum Blum Shub generator~\cite{BBS} (usually denoted by BBS) having the form: +$$x_{n+1}=x_n^2~ mod~ M$$ where $M$ is the product of two prime numbers. These +prime numbers need to be congruent to 3 modulo 4. BBS is +very slow and only usable for cryptographic applications. + + +The modulus operation is the most time consuming operation for current +GPU cards. So in order to obtain reasonable performance, it is +required to use only modulus operations on 32-bit integers. Consequently +$x_n^2$ needs to be less than $2^{32}$ and the number $M$ needs to be +less than $2^{16}$. So in practice we can choose prime numbers around +256 that are congruent to 3 modulo 4. With 32-bit numbers, only the +4 least significant bits of $x_n$ can be chosen (the maximum number of +indistinguishable bits is less than or equal to +$log_2(log_2(M))$). So to generate a 32-bit number, we need to use +the BBS algorithm 8 times with different combinations of $M$. This +approach is not sufficient to pass all the tests of TestU01, because +choosing small values of $M$ for the BBS leads to +a small period. So, in order to add randomness, we proceed with +the following modifications. +\begin{itemize} +\item +First we define 16 arrangement arrays instead of 2 (as described in +algorithm \ref{algo:gpu_kernel2}) but only 2 are used at each call of +the PRNG kernels. In practice, the selection of which combination +arrays will be used is different for all the threads and is determined +by using the three last bits of two internal variables used by BBS. +This approach adds more randomness. In algorithm~\ref{algo:bbs_gpu}, +the character \& performs a bitwise AND. So using \&7 with a number +gives its last 3 bits, which provides a number between 0 and 7. +\item +Second, after the generation of the 8 BBS numbers for each thread we +have a 32-bit number for which the period is possibly quite small. 
So, +to add randomness, we generate 4 more BBS numbers, which allows us to +shift the 32-bit numbers and add up to 6 new bits. This part is +described in algorithm~\ref{algo:bbs_gpu}. In practice, if we call +{\it strategy} the number representing the strategy, the last 2 bits +of the first new BBS number are used to make a left shift of at most +3 bits. The last 3 bits of the second new BBS number are added to the +strategy whatever the value of the first left shift. The third and the +fourth new BBS numbers are used similarly to apply a new left shift +and add 3 new bits. +\item +Finally, as we use 8 BBS numbers for each thread, these +numbers are stored at the end of the kernel using a rotation. So, the +internal variable for BBS number 1 is stored in place 2, the internal +variable for BBS number 2 is stored in place 3, \ldots, and the internal +variable for BBS number 8 is stored in place 1. +\end{itemize} + + +\begin{algorithm} + +\KwIn{InternalVarBBSArray: array with internal variables of the 8 BBS +in global memory\; +NumThreads: Number of threads\; +tab: 2D Arrays containing 16 combinations (in first dimension) of size combination\_size (in second dimension)\;} + +\KwOut{NewNb: array containing random numbers in global memory} +\If{threadId is concerned} { + retrieve data from InternalVarBBSArray[threadId] in local variables including shared memory and x\; + we consider that bbs1 ... 
 bbs8 represent the internal states of the 8 BBS numbers\; + offset = threadIdx\%combination\_size\; + o1 = threadIdx-offset+tab[bbs1\&7][offset]\; + o2 = threadIdx-offset+tab[8+(bbs2\&7)][offset]\; + \For{i=1 to n} { + t<<=4\; + t|=BBS1(bbs1)\&15\; + ...\; + t<<=4\; + t|=BBS8(bbs8)\&15\; + //two new shifts\; + t<<=BBS3(bbs3)\&3\; + t|=BBS1(bbs1)\&7\; + t<<=BBS7(bbs7)\&3\; + t|=BBS2(bbs2)\&7\; + t=t $\hat{ }$ shared\_mem[o1] $\hat{ }$ shared\_mem[o2]\; + shared\_mem[threadId]=t\; + x = x $\hat{ }$ t\; + + store the new PRNG in NewNb[NumThreads*threadId+i]\; + } + store internal variables in InternalVarBBSArray[threadId] using a rotation\; +} + +\caption{Main kernel for the BBS-based PRNG on GPU} +\label{algo:bbs_gpu} +\end{algorithm} + +In algorithm~\ref{algo:bbs_gpu}, t<<=4 performs a left shift of 4 bits +on the variable t and stores the result in t. BBS1(bbs1)\&15 selects +the last four bits of the result of BBS1. It should be noticed that +for the two new shifts, we arbitrarily use 4 BBSs that have previously +been used. + + + +\subsection{Toward a Cryptographically Secure and Chaotic Asymmetric Cryptosystem} + +We finish this research work by giving some thoughts about the use of +the proposed PRNG in an asymmetric cryptosystem. +This first approach will be further investigated in a future work. + +\subsubsection{Recall of the Blum-Goldwasser Probabilistic Cryptosystem} + +The Blum-Goldwasser cryptosystem is a cryptographically secure asymmetric key encryption algorithm +proposed in 1984~\cite{Blum:1985:EPP:19478.19501}. The encryption algorithm +implements an XOR-based stream cipher using the BBS PRNG, in order to generate +the keystream. Decryption is done by recovering the initial seed from +the final state of the BBS generator and the secret key, thus leading to the + reconstruction of the keystream. 
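As an illustration, the scheme summarized above can be sketched in a few lines of Python. This is only a toy sketch of the standard Blum-Goldwasser construction, not the implementation studied in this paper: the function names \verb|bbs_keystream|, \verb|encrypt| and \verb|decrypt|, the primes 499 and 547, and the value $r=399$ are assumptions chosen for readability, and such small parameters offer no security. One bit is extracted per BBS round, and Python 3.8 or later is assumed for the modular inverse \verb|pow(q, -1, p)|.

```python
# Toy sketch of Blum-Goldwasser (hypothetical helper names, insecure parameters).

def bbs_keystream(x0, n, length):
    """Return the least-significant bits of x_0..x_{L-1} and the state x_L."""
    bits, x = [], x0
    for _ in range(length):
        bits.append(x & 1)      # one keystream bit per BBS round
        x = (x * x) % n         # BBS step: x_{i+1} = x_i^2 mod n
    return bits, x              # x is now x_0^(2^L) mod n


def encrypt(message_bits, r, n):
    """XOR the message with the BBS keystream seeded by x_0 = r^2 mod n."""
    x0 = (r * r) % n
    bits, y = bbs_keystream(x0, n, len(message_bits))
    return [m ^ b for m, b in zip(message_bits, bits)], y


def decrypt(cipher_bits, y, p, q):
    """Recover x_0 from y with the factorization (p, q), then undo the XOR."""
    n, length = p * q, len(cipher_bits)
    # Exponents are reduced modulo p-1 and q-1 (Fermat's little theorem);
    # this assumes y is invertible modulo p and q.
    r_p = pow(y, pow((p + 1) // 4, length, p - 1), p)
    r_q = pow(y, pow((q + 1) // 4, length, q - 1), q)
    # Chinese remainder theorem: rebuild x_0 modulo n.
    x0 = (q * pow(q, -1, p) * r_p + p * pow(p, -1, q) * r_q) % n
    bits, _ = bbs_keystream(x0, n, length)
    return [c ^ b for c, b in zip(cipher_bits, bits)]


# Illustrative round trip with toy primes, both congruent to 3 modulo 4.
p, q = 499, 547
message = [1, 0, 0, 1, 1, 1, 0, 0]
cipher, y = encrypt(message, r=399, n=p * q)
recovered = decrypt(cipher, y, p, q)
```

In this sketch, \verb|recovered| is expected to equal \verb|message| whenever $r$ is coprime to the modulus, since each BBS state is then a quadratic residue whose principal square root is the previous state.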
+ +The key generation consists in generating two prime numbers $(p,q)$, +randomly and independently of each other, that are + congruent to 3 mod 4, and in computing the modulus $N=pq$. +The public key is $N$, whereas the secret key is the factorization $(p,q)$. + + +Suppose Bob wishes to send a string $m=(m_0, \dots, m_{L-1})$ of $L$ bits to Alice: +\begin{enumerate} +\item Bob picks an integer $r$ randomly in the interval $\llbracket 1,N\rrbracket$ and computes $x_0 = r^2~mod~N$. +\item He uses the BBS to generate the keystream of $L$ pseudorandom bits $(b_0, \dots, b_{L-1})$, as follows. +\begin{itemize} +\item $i=0$. +\item While $i \leqslant L-1$: +\begin{itemize} +\item Set $b_i$ equal to the least-significant\footnote{BBS can securely output up to $\mathsf{N} = \lfloor \log(\log(N)) \rfloor$ of the least-significant bits of $x_i$ during each round.} bit of $x_i$, +\item $i=i+1$, +\item $x_i = (x_{i-1})^2~mod~N.$ +\end{itemize} +\end{itemize} +\item The ciphertext is computed by XORing the plaintext bits $m$ with the keystream: $ c = (c_0, \dots, c_{L-1}) = m \oplus b$. The transmitted ciphertext is $[c, y]$, where $y=x_{0}^{2^{L}}~mod~N.$ +\end{enumerate} + + +When Alice receives $\left[(c_0, \dots, c_{L-1}), y\right]$, she can recover $m$ as follows: +\begin{enumerate} +\item Using the secret key $(p,q)$, she computes $r_p = y^{((p+1)/4)^{L}}~mod~p$ and $r_q = y^{((q+1)/4)^{L}}~mod~q$. +\item The initial seed can be obtained using the following procedure: $x_0=q(q^{-1}~{mod}~p)r_p + p(p^{-1}~{mod}~q)r_q~{mod}~N$. +\item She recomputes the bit-vector $b$ by using BBS and $x_0$. +\item Finally, Alice computes the plaintext by XORing the keystream with the ciphertext: $ m = c \oplus b$. +\end{enumerate} + + +\subsubsection{Proposal of a new Asymmetric Cryptosystem Adapted from Blum-Goldwasser} + +We propose to adapt the Blum-Goldwasser protocol as follows. 
+Let $\mathsf{N} = \lfloor \log(\log(N)) \rfloor$ be the number of bits that can +be obtained securely with the BBS generator using the public key $N$ of Alice. +Alice will also pick $S^0$ randomly in $\llbracket 0, 2^{\mathsf{N}-1}\rrbracket$, and +her new public key will be $(S^0, N)$. + +To encrypt his message, Bob will compute +\begin{equation} +c = \left(m_0 \oplus (b_0 \oplus S^0), m_1 \oplus (b_0 \oplus b_1 \oplus S^0), \hdots, m_{L-1} \oplus (b_0 \oplus b_1 \hdots \oplus b_{L-1} \oplus S^0) \right) +\end{equation} +instead of $\left(m_0 \oplus b_0, m_1 \oplus b_1, \hdots, m_{L-1} \oplus b_{L-1} \right)$. + +The same decryption stage as in Blum-Goldwasser leads to the sequence +$\left(m_0 \oplus S^0, m_1 \oplus S^0, \hdots, m_{L-1} \oplus S^0 \right)$. +Thus, with a simple use of $S^0$, Alice can obtain the plaintext. +By doing so, the proposed generator is used in place of BBS, leading to +the inheritance of all the properties presented in this paper. + +\section{Conclusion} + + +In this paper we have presented a new class of PRNGs based on chaotic +iterations. We have proven that these PRNGs are chaotic in the sense of Devaney. +We have also proposed a cryptographically secure PRNG and its implementation on GPU. + +An efficient implementation on GPU based on a xor-like PRNG allows us to +generate a huge number of pseudorandom numbers per second (about +20GSamples/s). This PRNG succeeds in passing the hardest batteries of TestU01. + +In future work we plan to extend this work to parallel PRNGs for clusters or +grid computing. + + + +\bibliographystyle{plain} +\bibliography{mabase} \end{document}