From 26aa4e0cf2939580eae07e8bbad140d527f47234 Mon Sep 17 00:00:00 2001 From: guyeux Date: Wed, 30 Nov 2011 21:23:53 +0100 Subject: [PATCH 1/1] Near the end of a proofreading pass --- prng_gpu.tex | 170 ++++++++++++++++++++++++++++++--------------------- 1 file changed, 99 insertions(+), 71 deletions(-) diff --git a/prng_gpu.tex b/prng_gpu.tex index 1c7c9fe..0279f03 100644 --- a/prng_gpu.tex +++ b/prng_gpu.tex @@ -141,8 +141,8 @@ The remainder of this paper is organized as follows. In Section~\ref{section:re and on an iteration process called ``chaotic iterations'' on which the post-treatment is based. Proofs of chaos are given in Section~\ref{sec:pseudorandom}. -Section~\ref{sec:efficient prng} presents an efficient -implementation of this chaotic PRNG on a CPU, whereas Section~\ref{sec:efficient prng +Section~\ref{sec:efficient PRNG} presents an efficient +implementation of this chaotic PRNG on a CPU, whereas Section~\ref{sec:efficient PRNG gpu} describes the GPU implementation. Such generators are experimented in Section~\ref{sec:experiments}. @@ -805,7 +805,7 @@ have $d((S,E),(\tilde S,E))<\epsilon$. \section{Efficient PRNG based on Chaotic Iterations} -\label{sec:efficient prng} +\label{sec:efficient PRNG} Based on the proof presented in the previous section, it is now possible to improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}. 
@@ -846,9 +846,9 @@ $$ -\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIprng} +\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label=algo:seqCIPRNG} \begin{lstlisting} -unsigned int CIprng() { +unsigned int CIPRNG() { static unsigned int x = 123123123; unsigned long t1 = xorshift(); unsigned long t2 = xor128(); @@ -867,7 +867,7 @@ unsigned int CIprng() { -In Listing~\ref{algo:seqCIprng} a sequential version of the proposed PRNG based on chaotic iterations +In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based on chaotic iterations is presented. The xor operator is represented by \textasciicircum. This function uses three classical 64-bits PRNGs, namely the \texttt{xorshift}, the \texttt{xor128}, and the \texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them @@ -882,14 +882,14 @@ with 6 32-bits numbers that are provided by 3 64-bits PRNGs. This version suc stringent BigCrush battery of tests~\cite{LEcuyerS07}. \section{Efficient PRNGs based on Chaotic Iterations on GPU} -\label{sec:efficient prng gpu} +\label{sec:efficient PRNG gpu} In order to take benefits from the computing power of GPU, a program needs to have independent blocks of threads that can be computed simultaneously. In general, the larger the number of threads is, the more local memory is used, and the less branching instructions are used (if, while, ...), the better the performances on GPU is. Obviously, having these requirements in mind, it is possible to build a program similar to -the one presented in Algorithm \ref{algo:seqCIprng}, which computes pseudorandom numbers +the one presented in Algorithm \ref{algo:seqCIPRNG}, which computes pseudorandom numbers on GPU. 
To do so, we must firstly recall that in the CUDA~\cite{Nvid10} environment, threads have a local @@ -925,7 +925,7 @@ NumThreads: number of threads\;} \If{threadIdx is concerned by the computation} { retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\; \For{i=1 to n} { - compute a new PRNG as in Listing\ref{algo:seqCIprng}\; + compute a new PRNG as in Listing~\ref{algo:seqCIPRNG}\; store the new PRNG in NewNb[NumThreads*threadIdx+i]\; } store internal variables in InternalVarXorLikeArray[threadIdx]\; @@ -962,19 +962,21 @@ for all the differents nodes involves in the computation. As GPU cards using CUDA have shared memory between threads of the same block, it is possible to use this feature in order to simplify the previous algorithm, -i.e., using less than 3 xor-like PRNGs. The solution consists in computing only -one xor-like PRNG by thread, saving it into shared memory and using the results +i.e., to use fewer than 3 xor-like PRNGs. The solution consists in computing only +one xor-like PRNG per thread, saving it into the shared memory, and then using the results of some other threads in the same block of threads. In order to define which -thread uses the result of which other one, we can use a permutation array which +thread uses the result of which other one, we can use a permutation array that contains the indexes of all threads and for which a permutation has been -performed. In Algorithm~\ref{algo:gpu_kernel2}, 2 permutations arrays are used. +performed. + +In Algorithm~\ref{algo:gpu_kernel2}, two permutation arrays are used. The variable \texttt{offset} is computed using the value of \texttt{permutation\_size}. Then we can compute \texttt{o1} and \texttt{o2} -which represent the indexes of the other threads for which the results are used -by the current thread. In the algorithm, we consider that a 64-bits xor-like -PRNG is used, that is why both 32-bits parts are used. 
+representing the indexes of the other threads whose results are used +by the current one. In this algorithm, we consider that a 64-bit xor-like +PRNG has been chosen, and so its two 32-bit parts are used. -This version also succeeds to the {\it BigCrush} batteries of tests. +This version can also pass the whole {\it BigCrush} battery of tests. \begin{algorithm} @@ -1007,22 +1009,28 @@ version} \subsection{Theoretical Evaluation of the Improved Version} -A run of Algorithm~\ref{algo:gpu_kernel2} consists in three operations having +A run of Algorithm~\ref{algo:gpu_kernel2} consists of an operation ($x=x\oplus t$) having the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative -system of Eq.~\ref{eq:generalIC}. That is, three iterations of the general chaotic -iterations are realized between two stored values of the PRNG. +system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic +iterations is realized between the last stored value $x$ of the thread and a strategy $t$ +(obtained by a bitwise exclusive or between a value provided by a xor-like() call +and two values previously obtained by two other threads). To be certain that we are in the framework of Theorem~\ref{t:chaos des general}, we must guarantee that this dynamical system iterates on the space $\mathcal{X} = \mathcal{P}\left(\llbracket 1, \mathsf{N} \rrbracket\right)^\mathds{N}\times\mathds{B}^\mathsf{N}$. The left term $x$ obviously belongs into $\mathds{B}^ \mathsf{N}$. -To prevent from any flaws of chaotic properties, we must check that each right -term, corresponding to terms of the strategies, can possibly be equal to any +To prevent any flaw in the chaotic properties, we must check that the right +term (the last $t$), corresponding to the strategies, can possibly be equal to any integer of $\llbracket 1, \mathsf{N} \rrbracket$. 
-Such a result is obvious for the two first lines, as for the xor-like(), all the -integers belonging into its interval of definition can occur at each iteration. -It can be easily stated for the two last lines by an immediate mathematical -induction. +Such a result is obvious, as for the xor-like(), all the +integers belonging to its interval of definition can occur at each iteration, and thus the +last $t$ respects the requirement. Furthermore, it is possible to +prove by an immediate mathematical induction that, as the initial $x$ +is uniformly distributed (it is provided by a cryptographically secure PRNG), +the two other stored values shmem[o1] and shmem[o2] are uniformly +distributed too, +and thus the next $x$ is finally uniformly distributed. Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general chaotic iterations presented previously, and for this reason, it satisfies the @@ -1032,56 +1040,65 @@ Devaney's formulation of a chaotic behavior. \label{sec:experiments} Different experiments have been performed in order to measure the generation -speed. We have used a computer equiped with Tesla C1060 NVidia GPU card and an -Intel Xeon E5530 cadenced at 2.40 GHz for our experiments and we have used -another one equipped with a less performant CPU and a GeForce GTX 280. Both +speed. We have used a first computer equipped with a Tesla C1060 NVidia GPU card +and an +Intel Xeon E5530 clocked at 2.40 GHz, and +a second computer equipped with a less powerful CPU and a GeForce GTX 280. +All the cards have 240 cores. -In Figure~\ref{fig:time_xorlike_gpu} we compare the number of random numbers -generated per second with the xor-like based PRNG. In this figure, the optimized -version use the {\it xor64} described in~\cite{Marsaglia2003}. The naive version -use the three xor-like PRNGs described in Listing~\ref{algo:seqCIprng}. 
In -order to obtain the optimal performance we removed the storage of random numbers -in the GPU memory. This step is time consuming and slows down the random numbers -generation. Moreover, if one is interested by applications that consume random -numbers directly when they are generated, their storage are completely -useless. In this figure we can see that when the number of threads is greater -than approximately 30,000 upto 5 millions the number of random numbers generated -per second is almost constant. With the naive version, it is between 2.5 and -3GSample/s. With the optimized version, it is approximately equals to -20GSample/s. Finally we can remark that both GPU cards are quite similar. In -practice, the Tesla C1060 has more memory than the GTX 280 and this memory +In Figure~\ref{fig:time_xorlike_gpu} we compare the quantity of pseudorandom numbers +generated per second with various xor-like based PRNGs. In this figure, the optimized +versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions +embed the three xor-like PRNGs described in Listing~\ref{algo:seqCIPRNG}. In +order to obtain optimal performance, the storage of pseudorandom numbers +in the GPU memory has been removed. This step is time consuming and slows down number +generation. Moreover, this storage is completely +useless for applications that consume the pseudorandom +numbers directly after generation. We can see that when the number of threads is greater +than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated +per second is almost constant. With the naive version, this value ranges from 2.5 to +3GSamples/s. With the optimized version, it is approximately equal to +20GSamples/s. Finally we can remark that both GPU cards are quite similar, but in +practice, the Tesla C1060 has more memory than the GTX 280, and this memory should be of better quality. 
+As a comparison, Listing~\ref{algo:seqCIPRNG} leads to the generation of about +138MSample/s when using one core of the Xeon E5530. \begin{figure}[htbp] \begin{center} \includegraphics[scale=.7]{curve_time_xorlike_gpu.pdf} \end{center} -\caption{Number of random numbers generated per second with the xorlike based PRNG} +\caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG} \label{fig:time_xorlike_gpu} \end{figure} -In comparison, Listing~\ref{algo:seqCIprng} allows us to generate about -138MSample/s with only one core of the Xeon E5530. -In Figure~\ref{fig:time_bbs_gpu} we highlight the performance of the optimized -BBS based PRNG on GPU. Performances are less important. On the Tesla C1060 we -obtain approximately 1.8GSample/s and on the GTX 280 about 1.6GSample/s. + +In Figure~\ref{fig:time_bbs_gpu} we highlight the performance of the optimized +BBS-based PRNG on GPU. On the Tesla C1060 we +obtain approximately 1.8GSample/s and on the GTX 280 about 1.6GSample/s, which is +clearly slower than the xorlike-based PRNG on GPU. However, we will show in the +next sections that +this new PRNG has a strong level of security, which is necessarily paid for by a speed +reduction. \begin{figure}[htbp] \begin{center} \includegraphics[scale=.7]{curve_time_bbs_gpu.pdf} \end{center} -\caption{Number of random numbers generated per second with the BBS based PRNG} +\caption{Quantity of pseudorandom numbers generated per second using the BBS-based PRNG} \label{fig:time_bbs_gpu} \end{figure} -Both these experiments allows us to conclude that it is possible to -generate a huge number of pseudorandom numbers with the xor-like version and -about tens times less with the BBS based version. The former version has only -chaotic properties whereas the latter also has cryptographically properties. 
+All these experiments allow us to conclude that it is possible to +generate a very large quantity of statistically perfect pseudorandom numbers with the xor-like version. +To a certain extent, it is the case too with the secure BBS-based version, the speed reduction being +explained by the fact that the former version has ``only'' +chaotic properties and statistical perfection, whereas the latter is also cryptographically secure, +as is shown in the next sections. @@ -1106,7 +1123,7 @@ The notion of {\it secure} PRNGs can now be defined as follows. A cryptographic PRNG $G$ is secure if for any probabilistic polynomial time algorithm $D$, for any positive polynomial $p$, and for all sufficiently large $k$'s, -$$| \mathrm{Pr}[D(G(U_k))=1]-Pr[D(U_{\ell_G(k)}=1]|< \frac{1}{p(N)},$$ +$$| \mathrm{Pr}[D(G(U_k))=1]-\mathrm{Pr}[D(U_{\ell_G(k)})=1]|< \frac{1}{p(N)},$$ where $U_r$ is the uniform distribution over $\{0,1\}^r$ and the probabilities are taken over $U_N$, $U_{\ell_G(N)}$ as well as over the internal coin tosses of $D$. @@ -1133,6 +1150,7 @@ We claim now that if this PRNG is secure, then the new one is secure too. \begin{proposition} +\label{cryptopreuve} If $H$ is a secure cryptographic PRNG, then $X$ is a secure cryptographic PRNG too. \end{proposition} @@ -1199,40 +1217,50 @@ proving that $H$ is not secure, a contradiction. \end{proof} +\section{Cryptographic Applications} - -\section{A Cryptographically Secure PRNG for GPU} +\subsection{A Cryptographically Secure PRNG for GPU} \label{sec:CSGPU} -It is possible to build a cryptographically secure prng based on the previous -algorithm (algorithm~\ref{algo:gpu_kernel2}). It simply consists in replacing -the {\it xor-like} algorithm by another cryptographically secure prng. In -practice, we suggest to use the BBS algorithm~\cite{BBS} which takes the form: -$$x_{n+1}=x_n^2~ mod~ M$$ where $M$ is the product of two prime numbers. Those -prime numbers need to be congruent to 3 modulus 4. 
In practice, this PRNG is -known to be slow and not efficient for the generation of random numbers. For -current GPU cards, the modulus operation is the most time consuming -operation. So in order to obtain quite reasonable performances, it is required + +It is possible to build a cryptographically secure PRNG based on the previous +algorithm (Algorithm~\ref{algo:gpu_kernel2}). Due to Proposition~\ref{cryptopreuve}, +it simply consists in replacing +the {\it xor-like} PRNG by a cryptographically secure one. +We have chosen the Blum Blum Shub generator~\cite{BBS} (usually denoted by BBS) having the form: +$$x_{n+1}=x_n^2 \bmod M$$ where $M$ is the product of two prime numbers. These +prime numbers need to be congruent to 3 modulo 4. BBS is +very slow and only usable for cryptographic applications. + + +The modulus operation is the most time consuming operation for +current GPU cards. +So in order to obtain quite reasonable performances, it is required to use only modulus on 32 bits integer numbers. Consequently $x_n^2$ need to be less than $2^{32}$ and the number $M$ need to be less than $2^{16}$. So in -pratice we can choose prime numbers around 256 that are congruent to 3 modulus +practice we can choose prime numbers around 256 that are congruent to 3 modulo 4. With 32 bits numbers, only the 4 least significant bits of $x_n$ can be -chosen (the maximum number of undistinguishing is less or equals to +chosen (the maximum number of indistinguishable bits is less than or equal to $log_2(log_2(x_n))$). So to generate a 32 bits number, we need to use 8 times -the BBS algorithm, with different combinations of $M$ is required. +the BBS algorithm with different combinations of $M$. Currently this PRNG does not succeed to pass all the tests of TestU01. +\subsection{A Secure Asymmetric Cryptosystem} + + + + \section{Conclusion} In this paper we have presented a new class of PRNGs based on chaotic -iterations. 
We have proven that these PRNGs are chaotic in the sense of Devenay. +iterations. We have proven that these PRNGs are chaotic in the sense of Devaney. We also propose a PRNG cryptographically secure and its implementation on GPU. An efficient implementation on GPU based on a xor-like PRNG allows us to generate a huge number of pseudorandom numbers per second (about -20Gsample/s). This PRNG succeeds to pass the hardest batteries of TestU01. +20Gsamples/s). This PRNG succeeds to pass the hardest batteries of TestU01. In future work we plan to extend this work for parallel PRNG for clusters or grid computing. We also plan to improve the BBS version in order to succeed all -- 2.39.5