-\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comt\'{e}}
-\chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comt\'{e}}
+\chapterauthor{Raphaël Couturier and Christophe Guyeux}{Femto-ST Institute, University of Franche-Comte, France}
+%\chapterauthor{Christophe Guyeux}{Femto-ST Institute, University of Franche-Comt\'{e}}
-\chapter{Pseudorandom Number Generator on GPU}
+\chapter{Pseudorandom number generator on GPU}
\label{chapter18}
\section{Introduction}
Randomness is of importance in many fields such as scientific
-simulations or cryptography. ``Random numbers'' can mainly be
-generated either by a deterministic and reproducible algorithm called
-a pseudorandom number generator (PRNG)\index{PRNG}, or by a physical non-deterministic
+simulations or cryptography. Random numbers can mainly be
+generated by either a deterministic and reproducible algorithm called
+a pseudorandom number generator (PRNG)\index{PRNG}, or by a physical nondeterministic
process having all the characteristics of a random noise, called a truly random number
generator (TRNG). In this chapter, we focus on
-reproducible generators, useful for instance in Monte-Carlo based
+reproducible generators, useful for instance in MonteCarlo-based
simulators. These domains need PRNGs that are statistically
irreproachable. In some fields such as in numerical simulations,
speed is a strong requirement that is usually attained by using
parallel architectures. In that case, a recurrent problem is that a
deflation of the statistical qualities is often reported, when the
parallelization of a good PRNG is realized. This
-is why ad-hoc PRNGs for each possible architecture must be found to
+is why adhoc PRNGs for each possible architecture must be found to
achieve both speed and randomness. On the other hand, speed is not
the main requirement in cryptography: the most important point is to
define \emph{secure} generators able to withstand malicious
next bit of the generator, having the knowledge of all the binary
digits that have been already released~\cite{Goldreich}. ``Being able in practice''
refers here to the possibility to achieve this attack in polynomial
-time, and to the exponential growth of the difficulty of this
+time and to the exponential growth of the difficulty of this
challenge when the size of the parameters of the PRNG increases.
Finally, a small part of the community working in this domain focuses on a
-third requirement, that is to define chaotic generators~\cite{kellert1994wake, Wu20051195,gleick2011chaos}.
-The main idea is to benefits from a chaotic dynamical system to obtain a
-generator that is unpredictable, disordered, sensible to its seed, or in other word chaotic.
+third requirement: to define chaotic generators~\cite{kellert1994wake, Wu20051195,gleick2011chaos}.
+The main idea is to benefit from a chaotic dynamical system to obtain a
+generator that is unpredictable, disordered, sensible to its seed, or in other words, chaotic.
These scientists' desire is to map a given chaotic dynamics into a sequence that seems random
and unassailable due to chaos.
However, the chaotic maps used as a pattern are defined in the real line
architectures in this chapter.
-Let us finish this introduction by noticing that, in this paper,
+Let us finish this introduction by noticing that, in this chapter,
statistical perfection refers to the ability to pass the whole
{\it BigCrush} battery of tests, which is widely considered as the most
stringent statistical evaluation of a sequence claimed as random.
This battery can be found in the well-known TestU01 package~\cite{LEcuyerS07}\index{TestU01}.
More precisely, each time we performed a test on a PRNG, we ran it
-twice in order to observe if all $p-$values were inside [0.01, 0.99]. In
-fact, we observed that few $p-$values (less than ten) are sometimes
+twice in order to observe if all $p$-values were inside [0.01, 0.99]. In
+fact, we observed that few $p$-values (fewer than 10 out of 160) are sometimes
outside this interval but inside [0.001, 0.999], so that is why a
second run has allowed us to confirm that the values outside are not for
the same test. With this approach all our PRNGs pass the {\it
- BigCrush} successfully and all $p-$values are at least once inside
+ BigCrush} successfully and all $p$-values are at least once inside
[0.01, 0.99].
Chaos, for its part, refers to the well-established definition of a
chaotic dynamical system defined by Devaney~\cite{Devaney}.
These generators are finally experimented in Section~\ref{sec:experiments}.
-\section{Basic Remindees}
+\section{Basic remindees}
\label{section:BASIC RECALLS}
This section is devoted to basic definitions and terminologies in the fields of
with basic notions on topology (see for instance~\cite{Devaney}).
-\subsection{A Short Presentation of Chaos}
+\subsection{A short presentation of chaos}
Chaos theory studies the behavior of dynamical systems that are perfectly predictable, yet appear to be wildly amorphous and meaningless.
-Chaotic systems\index{chaotic systems} are highly sensitive to initial conditions,
+Chaotic systems\index{chaotic!systems} are highly sensitive to initial conditions,
which is popularly referred to as the butterfly effect.
In other words, small differences in initial conditions (such as those due to rounding errors in numerical computation) yield widely diverging outcomes,
-rendering long-term prediction impossible in general \cite{kellert1994wake}. This happens even though these systems are deterministic, meaning that their future behavior is fully determined by their initial conditions, with no random elements involved \cite{kellert1994wake}. That is, the deterministic nature of these systems does not make them predictable \cite{kellert1994wake,Werndl01032009}. This behavior is known as deterministic chaos, or simply chaos. It has been well-studied in mathematics and
+in general rendering long-term prediction impossible \cite{kellert1994wake}. This happens even though these systems are deterministic, meaning that their future behavior is fully determined by their initial conditions, with no random elements involved \cite{kellert1994wake}. That is, the deterministic nature of these systems does not make them predictable \cite{kellert1994wake,Werndl01032009}. This behavior is known as deterministic chaos, or simply chaos. It has been well-studied in mathematics and
physics, leading among other things to the well-established definition of Devaney which can be found next.
-\subsection{On Devaney's Definition of Chaos}\index{chaos}
+\subsection{On Devaney's definition of chaos}\index{chaos}
\label{sec:dev}
Consider a metric space $(\mathcal{X},d)$ and a continuous function $f:\mathcal{X}\longrightarrow \mathcal{X}$, for one-dimensional dynamical systems of the form:
\begin{equation}
\end{equation}
Intuitively, a topologically transitive map has points that eventually move under iteration from one arbitrarily small neighborhood to any other. Consequently, the dynamical system cannot be decomposed into two disjoint open sets that are invariant under the map. Note that if a map possesses a dense orbit, then it is clearly topologically transitive.
-\item Density of periodic points in $\mathcal{X}$\index{density of periodic points}.
+\item Density of periodic points in $\mathcal{X}$\index{density of periodic points}:
Let $P=\{p\in \mathcal{X}|\exists n \in \mathds{N}^{\ast}:f^n(p)=p\}$ the set of periodic points of $f$. Then $P$ is dense in $\mathcal{X}$:
\overline{P}=\mathcal{X} .
\end{equation}
-The density of periodic orbits means that every point in space is closely approached by periodic orbits in an arbitrary way. Topologically mixing systems failing this condition may not display sensitivity to initial conditions presented below, and hence may not be chaotic.
+The density of periodic orbits means that every point in space is closely approached by periodic orbits in an arbitrary way. Topologically mixing systems failing this condition may not display sensitivity to initial conditions presented below and, hence,may not be chaotic.
\item Sensitive dependence on initial conditions\index{sensitive dependence on initial conditions}:
$\exists \varepsilon>0,$ $\forall x \in \mathcal{X},$ $\forall \delta >0,$ $\exists y \in \mathcal{X},$ $\exists n \in \mathbb{N},$ $d(x,y)<\delta$ and $d\left(f^n(x),f^n(y)\right) \geqslant \varepsilon.$
-Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under the iteration of $f$. Not all points near $x$ need eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit.
+Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under the iteration of $f$. Not all points near $x$ need to be eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit.
\end{itemize}
\end{definition}
-When $f$ is chaotic, then the system $(\mathcal{X}, f)$ is chaotic and quoting Devaney: ``it is unpredictable because of the sensitive dependence on initial conditions. It cannot be broken down or decomposed into two subsystems which do not interact because of topological transitivity. And, in the midst of this random behavior, we nevertheless have an element of regularity.'' Fundamentally different behaviors are consequently possible and occur in an unpredictable way.
+When $f$ is chaotic, then the system $(\mathcal{X}, f)$ is chaotic and quoting Devaney~\cite[p. 50]{Devaney}: ``it is unpredictable because of the sensitive dependence on initial conditions. It cannot be broken down or decomposed into two subsystems which do not interact because of topological transitivity. And, in the midst of this random behavior, we nevertheless have an element of regularity.'' Fundamentally different behaviors are consequently possible and occur in an unpredictable way.
-\subsection{Chaotic iterations}\index{chaotic iterations}
+\subsection{Chaotic iterations}\index{chaotic!iterations}
\label{subsection:Chaotic iterations}
Let us now introduce an example of a dynamical systems family that has
\end{equation}
\end{definition}
-In other words, at the $n^{th}$ iteration, only the cells of $S^{n}$ are
-``iterated''.
+In other words, at the $nth$ iteration, only the cells of $S^{n}$ are
+iterated.
Chaotic iterations generate a set of vectors;
they are defined by an initial state $x^{0}$, an iteration function $f$, and a chaotic strategy $S$~\cite{bg10:ij}.
These ``chaotic iterations'' can behave chaotically as defined by Devaney,
depending on the choice of $f$~\cite{bg10:ij}. For instance,
chaos is obtained when $f$ is the vectorial negation.
Note that, with this example of function, chaotic iterations
-defined above can be rewritten as:
+defined above can be rewritten as
\begin{equation}
\label{equation Oplus}
x^0 \in \llbracket 0,2^\mathsf{N}-1\rrbracket,~\mathcal{S}^n \in \mathcal{P}\left(\llbracket 1,2^\mathsf{N}-1\rrbracket\right)^\mathds{N},~~ x^{n+1}=x^n \oplus \mathcal{S}^n,
$a\oplus b$ is the bitwise exclusive or operation between the binary
representation of the integers $a$ and $b$. Note that the term
$\mathcal{S}^n$ is directly and obviously linked to the $S^n$ of
-Eq.\ref{eq:generalIC}. As recalled above, such an iterative sequence
+Eq.~\ref{eq:generalIC}. Such an iterative sequence
satisfies the Devaney's definition of chaos.
-\section{Toward Efficiency and Improvement for CI PRNG}
+\section{Toward efficiency and improvement for CI PRNG}
\label{sec:efficient PRNG}
-\subsection{First Efficient Implementation of a PRNG based on Chaotic Iterations}
+\subsection{First efficient implementation of a PRNG based on chaotic iterations}
%
%Based on the proof presented in the previous section, it is now possible to
%improve the speed of the generator formerly presented in~\cite{bgw09:ip,guyeux10}.
-\lstset{language=C,caption={C code of the sequential PRNG based on chaotic iterations},label={algo:seqCIPRNG}}
-\begin{small}
-\begin{lstlisting}
-
-unsigned int CIPRNG() {
- static unsigned int x = 123123123;
- unsigned long t1 = xorshift();
- unsigned long t2 = xor128();
- unsigned long t3 = xorwow();
- x = x^(unsigned int)t1;
- x = x^(unsigned int)(t2>>32);
- x = x^(unsigned int)(t3>>32);
- x = x^(unsigned int)t2;
- x = x^(unsigned int)(t1>>32);
- x = x^(unsigned int)t3;
- return x;
-}
-\end{lstlisting}
-\end{small}
-
+\lstinputlisting[label=algo:seqCIPRNG,caption={C code of the sequential PRNG based on chaotic iterations}]{Chapters/chapter18/code2.cu}
In Listing~\ref{algo:seqCIPRNG} a sequential version of the proposed PRNG based
on chaotic iterations is presented, which extends the generator family
formerly presented in~\cite{bgw09:ip,guyeux10}. The xor operator is represented by
-\textasciicircum. This function uses three classical 64-bits PRNGs, namely the
+\textasciicircum. This function uses three classical 64-bit PRNGs, namely the
\texttt{xorshift}, the \texttt{xor128}, and the
\texttt{xorwow}~\cite{Marsaglia2003}. In the following, we call them ``xor-like
PRNGs''. As each xor-like PRNG uses 64-bits whereas our proposed generator
-works with 32-bits, we use the command \texttt{(unsigned int)}, that selects the
+works with 32-bits, we use the command \texttt{(unsigned int)}, which selects the
32 least significant bits of a given integer, and the code \texttt{(unsigned
int)(t$>>$32)} in order to obtain the 32 most significant bits of \texttt{t}.
-Thus producing a pseudorandom number needs 6 xor operations with 6 32-bits numbers
-that are provided by 3 64-bits PRNGs. This version successfully passes the
-stringent BigCrush battery of tests~\cite{LEcuyerS07}.
-At this point, we thus
-have defined an efficient and statistically unbiased generator. Its speed is
+Thus producing a pseudorandom number needs 6 xor operations with 6 32-bit numbers
+that are provided by 3 64-bit PRNGs. This version successfully passes the
+stringent {\it BigCrush} battery of tests~\cite{LEcuyerS07}.
+At this point, we have defined an efficient and statistically unbiased generator. Its speed is
directly related to the use of linear operations, but for the same reason,
this fast generator cannot be proven as secure.
-\subsection{Efficient PRNGs based on Chaotic Iterations on GPU}
+\subsection{Efficient PRNGs based on chaotic iterations on GPU}
\label{sec:efficient PRNG gpu}
In order to benefit from the computing power of GPU, a program
needs to have independent blocks of threads that can be computed
simultaneously. In general, the larger the number of threads is, the
more local memory is used, and the less branching instructions are
-used (if, while, ...) and so, the better the performances on GPU are.
+used (if, while, etc.) and so, the better the performances on GPU are.
Obviously, having these requirements in mind, it is possible to build
-a program similar to the one presented in Listing
-\ref{algo:seqCIPRNG}, which computes pseudorandom numbers on GPU. To
-do so, we must firstly recall that in the CUDA~\cite{Nvid10}
+a program similar to the one presented in Listing~\ref{algo:seqCIPRNG}, which computes pseudorandom numbers on GPU. To
+do so, we must first recall that in the CUDA~\cite{Nvid10}
environment, threads have a local identifier called
\texttt{ThreadIdx}, which is relative to the block containing
-them. Furthermore, in CUDA, parts of the code that are executed by the GPU, are
+them. Furthermore, in CUDA, parts of the code that are executed by the GPU are
called {\it kernels}.
-\subsection{Naive Version for GPU}
+\subsection{Naive version for GPU}
It is possible to deduce from the CPU version a quite similar version adapted to GPU.
-The simple principle consists in making each thread of the GPU computing the CPU version of our PRNG.
+The simple principle consists of making each thread of the GPU compute the CPU version of our PRNG.
Of course, the three xor-like
PRNGs used in these computations must have different parameters.
In a given thread, these parameters are
randomly picked from another PRNGs.
The initialization stage is performed by the CPU.
-To do this, the ISAAC PRNG~\cite{Jenkins96} is used to set all the
+To do this, the Indirection, Shift, Accumulate, Add, and Count (ISAAC) PRNG~\cite{Jenkins96} is used to set all the
parameters embedded into each thread.
The implementation of the three
xor-like PRNGs is straightforward when their parameters have been
allocated in the GPU memory. Each xor-like works with an internal
number $x$ that saves the last generated pseudorandom number. Additionally, the
-implementation of the xor128, the xorshift, and the xorwow respectively require
+implementation of the {\it xor128}, the {\it xorshift}, and the {\it xorwow}, respectively, require
4, 5, and 6 unsigned long as internal variables.
\If{threadIdx is concerned by the computation} {
retrieve data from InternalVarXorLikeArray[threadIdx] in local variables\;
\For{i=1 to n} {
- compute a new PRNG as in Listing\ref{algo:seqCIPRNG}\;
+ compute a new PRNG as in Listing~\ref{algo:seqCIPRNG}\;
store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
}
store internal variables in InternalVarXorLikeArray[threadIdx]\;
}
\end{small}
-\caption{Main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
+\caption{main kernel of the GPU ``naive'' version of the PRNG based on chaotic iterations}
\label{algo:gpu_kernel}
\end{algorithm}
Algorithm~\ref{algo:gpu_kernel} presents a naive implementation of the proposed PRNG on
GPU. Due to the available memory in the GPU and the number of threads
used simultaneously, the number of random numbers that a thread can generate
-inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in
-algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and
-if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)},
+inside a kernel is limited (i.e., the variable {\it n} in
+Algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and
+if $n=100$\footnote{In fact, we need to add the initial seed (a 32-bit number).},
then the memory required to store all of the internal variables of both the xor-like
-PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers}
+PRNGs\footnote{We multiply this number by $2$ in order to count 32-bit numbers.}
and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times
-2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb.
+2+(1+100))=1,310,000$ 32-bit numbers, that is, approximately $52$Mb.
-This generator is able to pass the whole BigCrush battery of tests, for all
-the versions that have been tested depending on their number of threads
-(called \texttt{NumThreads} in our algorithm, tested up to $5$ million).
+This generator is able to pass the whole {\it BigCrush} battery of tests, for all
+the versions that have been tested depend on their number of threads
+(called NumThreads in our algorithm, tested up to $5$ million).
-\subsection{Improved Version for GPU}
+\subsection{Improved version for GPU}
As GPU cards using CUDA have shared memory between threads of the same block, it
is possible to use this feature in order to simplify the previous algorithm,
-i.e., to use less than 3 xor-like PRNGs. The solution consists in computing only
-one xor-like PRNG by thread, saving it into the shared memory, and then to use the results
+i.e., to use fewer than 3 xor-like PRNGs. The solution consists in computing only
+one xor-like PRNG by thread, saving it into the shared memory, and then using the results
of some other threads in the same block of threads. In order to define which
thread uses the result of which other one, we can use a combination array that
contains the indexes of all threads and for which a combination has been
performed.
In Algorithm~\ref{algo:gpu_kernel2}, two combination arrays are used. The
-variable \texttt{offset} is computed using the value of
-\texttt{combination\_size}. Then we can compute \texttt{o1} and \texttt{o2}
+variable offset is computed using the value of
+combination\_size. Then we can compute o1 and o2
representing the indexes of the other threads whose results are used by the
-current one. In this algorithm, we consider that a 32-bits xor-like PRNG has
+current one. In this algorithm, we consider that a 32-bit xor-like PRNG has
been chosen. In practice, we use the xor128 proposed in~\cite{Marsaglia2003} in
which unsigned longs (64 bits) have been replaced by unsigned integers (32
bits).
\KwOut{NewNb: array containing random numbers in global memory}
\If{threadId is concerned} {
- retrieve data from InternalVarXorLikeArray[threadId] in local variables including shared memory and x\;
+ retrieve data from InternalVarXorLikeArray[threadIdx] in local variables including shared memory and x\;
offset = threadIdx\%combination\_size\;
o1 = threadIdx-offset+array\_comb1[offset]\;
o2 = threadIdx-offset+array\_comb2[offset]\;
shared\_mem[threadId]=t\;
x = x\textasciicircum t\;
- store the new PRNG in NewNb[NumThreads*threadId+i]\;
+ store the new PRNG in NewNb[NumThreads*threadIdx+i]\;
}
- store internal variables in InternalVarXorLikeArray[threadId]\;
+ store internal variables in InternalVarXorLikeArray[threadIdx]\;
}
\end{small}
\caption{Main kernel for the chaotic iterations based PRNG GPU efficient
\label{algo:gpu_kernel2}
\end{algorithm}
-\subsection{Chaos Evaluation of the Improved Version}
+\subsection{Chaos evaluation of the improved version}
-A run of Algorithm~\ref{algo:gpu_kernel2} consists in an operation ($x=x\oplus t$) having
+A run of Algorithm~\ref{algo:gpu_kernel2} consists of an operation ($x=x\oplus t$) having
the form of Equation~\ref{equation Oplus}, which is equivalent to the iterative
system of Eq.~\ref{eq:generalIC}. That is, an iteration of the general chaotic
iterations is realized between the last stored value $x$ of the thread and a strategy $t$
(obtained by a bitwise exclusive or between a value provided by a xor-like() call
and two values previously obtained by two other threads).
-To be certain that such iterations corresponds to the chaotic one recalled at the
+To be certain that such iterations correspond to the chaotic one recalled at the
end of Section~\ref{sec:dev},
we must guarantee that this dynamical system iterates on the space
$\mathcal{X} =\mathds{B}^\mathsf{N} \times \mathcal{P}\left(\llbracket 1, 2^\mathsf{N} \rrbracket\right)^\mathds{N}$.
The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$.
-To prevent from any flaws of chaotic properties, we must check that the right
-term (the last $t$), corresponding to the strategies, can possibly be equal to any
+To prevent any flaws of chaotic properties, we must check that the right
+term (the last $t$ in Algorithm~\ref{algo:gpu_kernel2}), corresponding to the strategies, can possibly be equal to any
integer of $\llbracket 1, 2^\mathsf{N} \rrbracket$.
-Such a result is obvious, as for the xor-like(), all the
-integers belonging into its interval of definition can occur at each iteration, and thus the
+Such a result is obvious; as for the xor-like(), all the
+integers belonging to its interval of definition can occur at each iteration, and thus the
last $t$ respects the requirement. Furthermore, it is possible to
prove by an immediate mathematical induction that, supposing that the initial $x$
is uniformly distributed, %(it is provided by a cryptographically secure PRNG),
-the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too,
+the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too
(this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed.
Thus Algorithm~\ref{algo:gpu_kernel2} is a concrete realization of the general
chaotic iterations presented previously, and for this reason, it satisfies the
-Devaney's formulation of a chaotic behavior.
+Devaney's formulation of chaotic behavior.
\section{Experiments}
\label{sec:experiments}
Different experiments have been performed in order to measure the generation
-speed. We have used a first computer equipped with a Tesla C1060 NVidia GPU card
+speed. We have used one computer equipped with a Tesla C1060 NVIDIA GPU card
and an
Intel Xeon E5530 cadenced at 2.40 GHz, and
a second computer equipped with a smaller CPU and a GeForce GTX 280.
All the
cards have 240 cores.
+\begin{figure}[htpb]
+\begin{center}
+ \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_xorlike_gpu.pdf}
+\end{center}
+\caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG.}
+\label{fig:time_xorlike_gpu}
+\end{figure}
+
In Figure~\ref{fig:time_xorlike_gpu} we compare the quantity of pseudorandom numbers
generated per second with various xor-like based PRNGs. In this figure, the optimized
-versions use the {\it xor64} described in~\cite{Marsaglia2003}, whereas the naive versions
+versions use the xor64 described in~\cite{Marsaglia2003}, whereas the naive versions
embed the three xor-like PRNGs described in Listing~\ref{algo:seqCIPRNG}. In
order to obtain the optimal performances, the storage of pseudorandom numbers
-into the GPU memory has been removed. This step is time consuming and slows down the numbers
+to the GPU memory has been removed. This step is time-consuming and slows down the numbers
generation. Moreover this storage is completely
useless, in case of applications that consume the pseudorandom
numbers directly after they have been generated. We can see that when the number of threads is greater
-than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated
+than approximately 30,000 and less than 5 million, the number of pseudorandom numbers generated
per second is almost constant. With the naive version, this value ranges from 2.5 to
3GSamples/s. With the optimized version, it is approximately equal to
20GSamples/s. Finally we can remark that both GPU cards are quite similar, but in
As a comparison, Listing~\ref{algo:seqCIPRNG} leads to the generation of about
138MSample/s when using one core of the Xeon E5530.
-\begin{figure}[htbp]
-\begin{center}
- \includegraphics[scale=0.65]{Chapters/chapter18/figures/curve_time_xorlike_gpu.pdf}
-\end{center}
-\caption{Quantity of pseudorandom numbers generated per second with the xorlike-based PRNG}
-\label{fig:time_xorlike_gpu}
-\end{figure}
+
chaotic according to Devaney. Efficient implementations on GPU
using xor-like PRNGs as input generators have shown that a very large quantity
of pseudorandom numbers can be generated per second (about 20Gsamples/s on a Tesla C1060), and
-that these proposed PRNGs succeed to pass the hardest battery in TestU01, namely
-the BigCrush.
+that these proposed PRNGs succeed in passing the hardest battery in TestU01, namely, the {\it BigCrush}.