From: couturie Date: Thu, 10 Nov 2011 21:23:29 +0000 (+0100) Subject: new X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/prng_gpu.git/commitdiff_plain/7165f1d084760cf67c1cf3a0881e1478eba90b55?ds=inline;hp=--cc new --- 7165f1d084760cf67c1cf3a0881e1478eba90b55 diff --git a/mabase.bib b/mabase.bib index 3bba177..ae0a8d5 100644 --- a/mabase.bib +++ b/mabase.bib @@ -4252,4 +4252,21 @@ booktitle = "Proceedings of the {ACM}/{SIGDA} 17th International ISBN = "978-1-60558-410-2", pages = "63--72", URL = "http://doi.acm.org/10.1145/1508128.1508139", -} \ No newline at end of file +} + + + +@InProceedings{Jenkins96, + author = "Jenkins", + title = "{ISAAC}", + booktitle = "IWFSE: International Workshop on Fast Software + Encryption, LNCS", + year = "1996", +} + +@manual{Nvid10, + author = {Nvidia}, + title = {Cuda cublas library}, + year = {2010}, + Note = {Version 3.1}, + } diff --git a/prng_gpu.tex b/prng_gpu.tex index 82a4927..bc5b3e5 100644 --- a/prng_gpu.tex +++ b/prng_gpu.tex @@ -817,8 +817,8 @@ the larger the number of threads is, the more local memory is used and the less branching instructions are used (if, while, ...), the better performance is obtained on GPU. So with algorithm \ref{algo:seqCIprng} presented in the previous section, it is possible to build a similar program which computes PRNG -on GPU. In the CUDA [ref] environment, threads have a local identificator, -called \texttt{ThreadIdx} relative to the block containing them. +on GPU. In the CUDA~\cite{Nvid10} environment, threads have a local +identifier, called \texttt{threadIdx}, relative to the block containing them. \subsection{Naive version for GPU} @@ -828,14 +828,14 @@ The principe consists in assigning the computation of a PRNG as in sequential to each thread of the GPU. Of course, it is essential that the three xor-like PRNGs used for our computation have different parameters. So we chose them randomly with another PRNG. 
As the initialisation is performed by the CPU, we -have chosen to use the ISAAC PRNG [ref] to initalize all the parameters for the -GPU version of our PRNG. The implementation of the three xor-like PRNGs is -straightforward as soon as their parameters have been allocated in the GPU -memory. Each xor-like PRNGs used works with an internal number $x$ which keeps -the last generated random numbers. Other internal variables are also used by the -xor-like PRNGs. More precisely, the implementation of the xor128, the xorshift -and the xorwow respectively require 4, 5 and 6 unsigned long as internal -variables. +have chosen to use the ISAAC PRNG~\cite{Jenkins96} to initialize all the +parameters for the GPU version of our PRNG. The implementation of the three +xor-like PRNGs is straightforward as soon as their parameters have been +allocated in the GPU memory. Each xor-like PRNG used works with an internal +number $x$ which keeps the last generated random number. Other internal +variables are also used by the xor-like PRNGs. More precisely, the +implementation of the xor128, the xorshift and the xorwow respectively require +4, 5 and 6 unsigned long as internal variables. \begin{algorithm} @@ -954,8 +954,20 @@ Devaney's formulation of a chaotic behavior. \section{Experiments} \label{sec:experiments} -Different experiments have been performed in order to measure the generation -speed. In Figure~\ref{fig:time_gpu} we compare the number of random numbers generated per second. +Different experiments have been performed in order to measure the generation +speed. We have used a computer equipped with a Tesla C1060 NVidia GPU card and an +Intel Xeon E5530 clocked at 2.40 GHz for our experiments. + +In Figure~\ref{fig:time_gpu} we compare the number of random numbers generated +per second. In order to obtain the optimal number we remove the storage of +random numbers in the GPU memory. This step is time consuming and slows down +the random number generation. 
Moreover, if you are interested in applications +that consume random numbers directly when they are generated, their storage is +completely useless. In this figure we can see that when the number of threads is +greater than approximately 30,000 and up to 5 million, the number of random numbers +generated per second is almost constant. With the naive version, it is between +2.5 and 3GSample/s. With the optimized version, it is almost equal to +20GSample/s. \begin{figure}[htbp] \begin{center}