From 4c5e6c1725249ae02b156277ef750a43f5d6144b Mon Sep 17 00:00:00 2001 From: couturie Date: Tue, 19 Mar 2013 16:47:02 +0100 Subject: [PATCH 1/1] correct ch18 --- .../Chapters/chapter1/figures/scalability.pdf | Bin 19107 -> 22220 bytes BookGPU/Chapters/chapter18/ch18.tex | 45 ++++++++---------- 2 files changed, 20 insertions(+), 25 deletions(-) diff --git a/BookGPU/Chapters/chapter1/figures/scalability.pdf b/BookGPU/Chapters/chapter1/figures/scalability.pdf index 28398a07d1f5b4926d8e2c446d0ff4bae878c318..e484f598014cc7f21ee3fbda7de5f8db6347ce61 100644 GIT binary patch delta 3153 zcmb`JdpuNmAIE0~S)&+T7?&6_YZ%O&!I)W$Yr`nFAxcSuQ*O&j_o+D(0Bjb2}!8o+JIhwQ)|7?smr1-t{-#4M7&u zCT->H9W1@7mY(&_ZH_!k>F83J-mg2{@f(Nn?01K`*%!`pDuhMEs%f zUb8&cE5eM({O#9S{UgC@ofW9us1*uUItL0Y21^1i80`1^jjqjHYenAcf9$9+v@6~~ zRcD(uZ9MkChr?zGvdZ>|W!h1D*FZH~FUy)CEC&e(Hcva{L;Qxj>j-Vh+B>$l;|;k! z)mnKQv_dU36_UR5eaaYIsew3JeP+1>>iqMSlhl}mRR2V&LBbJXio@hOms0GT)7o_z zC7bjvR;<=AnJoyFd9*b;(Wc4nTZn1Sc%E)w1t9~0T`co$u2RjqA;?>0vtXj^kcxKv zxgCmaW)h10w9`eX%7L>*sGlxhxZS9t>x*-@r9~yBIGj2(^pBh#>R4K??*5b^P1ZR1 z%@u#s*RKK+ckx%Ov@$K0nr^a8Xg%P3_TC%=xG}D+-de^K*}4~N?LKneZ6!3O6n8bQ zMT2uL{orb4VtQoBOs_hz`=LaBa0VBHXwSt@BV%rfQd}hbj-6obE`Z-U9Y4x{qlW*{ zS-vB#MZHT=Tg@$?($U+#OZ7sP{rYSR^R?u++~ft8Nq89_A>)}i^AwJ!!oPc!JWN-bx zbS1FxetpKxo@7blGhb9eMcBnQiW5UqZ!>ni{_7Jap6wgMy01f+TdO7wyZv_XPM*tn z`=Pd46#s#u-#i~%~soE2-CFHW6xV({*-J0zt=e6Q6?S1?cnPr|c1~5*5S7Pzv zhBOU*1#RoX__Vl{K9*|E6@K((L;z2}OTEl_-RlD{aHJL8QR@k2!(^l8jq%#qOy7q$ z)PD(lUT{Z2>3j7om21jjX?6{2_;=1pCmqS#-gf6vlN9(t^}E69Ac#U%?YlHNTcGte z{pM|HgOFsz`*qSC=CQ-BeGPSM&QC3@u znz}(Sb@N4|wRBgu#kYMAoO6%gnc5qLo>ui^&Kgi)gP8OBm*ic_Lzk^qjty~(yVE$V z47FVXy5GQwIJQQi|IkR!VXd0AP>|z_N#}>ZPQ=KFP_A_uc3Tit^l&R&+A z8q1OMnC^O~5@>qxo5uD+mAgYCC}Qr)RDKUVjdCylHEIee$>oI!W(j3?FcuWHD6$$K zqB>gVDRE!<0bxunWL-4l{P9sx#u*yIqagSxssb6pR-%BA_zVvFj+`x$e zX0GgP9MY$hMWpM68arkq+_n4=;i_O;rYAP0|t` zSCd|h+hi%j_FK4qVV3?Q0WW=esqO;@X9h3^wwLj~7vjPU3l$8UG28atyaM%gA}HE~ zn!MTbs;3*So;riF$}7AWDJsVZ0?T`{1P#y-?q2b!I)w>zLgZora6!97^r(9};=&LfHs;z+KXT_BnS zxpD*`SBL{(12&rE1O^0gYXG5wgiJ9sqQe)?*vQC`N+y#5SO7`3JgyLS3Wu%4 z^2x>l#*2-~Vk7<gKT>nvEGu{U^q`C@CUTS-{8}(MvTU`q2(R3w=l6nLLVtJk7#OEsqh-3N* z@FFkef9Iuii+Q2{42)I5>S0y0$Hl}}X^N$Pk@o8;27iI{Uy^L;C`G4#ofP`>B=xn> zQhL80TC7So9NVk~QmkHP4slc^hp_q+U`@8t$}a%J2?~84Tk_Y5{~j1+<78-NCY6CW zU?XKKC4oy!?zm*dq+|BiTTi+&-5L@KtBxd89oTm=Yu`QyX&2;Ho{(x&X;YAZLOe-R z6n+6zyoEl?f&W9!Qtm@vk8CEEkeP{HO4>|s!l-wq?x%s3Wh;(;&KoHJccP7O02(eB zaLJ$paM@Vs;)-+l4pa6ilc!)2t$3lk!Q0Va9|FL)WV z*-?!?_ob2H3iYS2P{0Wa1qBNyetY+zXb=a$S<2?|;4-*S&(I_mcP9@AEY%uU0T_V; cXr{&sg(O_a5eOIlkOI(6>1aHDy$cil2l39`g8%>k delta 16 XcmX@JmT~b^#tn15*i=0,$ $\forall x \in \mathcal{X},$ $\forall \delta >0,$ $\exists y \in \mathcal{X},$ $\exists n \in \mathbb{N},$ $d(x,y)<\delta$ and $d\left(f^n(x),f^n(y)\right) \geqslant \varepsilon.$ -Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under iteration of $f$. Not all points near $x$ need eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit. +Intuitively, a map possesses sensitive dependence on initial conditions if there exist points arbitrarily close to $x$ that eventually separate from $x$ by at least $\varepsilon$ under the iteration of $f$. Not all points near $x$ need eventually separate from $x$ under iteration, but there must be at least one such point in every neighborhood of $x$. If a map possesses sensitive dependence on initial conditions, then for all practical purposes, the dynamics of the map defy numerical computation. Small errors in computation that are introduced by round-off may become magnified upon iteration. The results of numerical computation of an orbit, no matter how accurate, may bear no resemblance whatsoever with the real orbit. \end{itemize} \end{definition} @@ -298,11 +293,11 @@ this fast generator cannot be proven as secure. \subsection{Efficient PRNGs based on Chaotic Iterations on GPU} \label{sec:efficient PRNG gpu} -In order to take benefits from the computing power of GPU, a program +In order to benefit from the computing power of GPU, a program needs to have independent blocks of threads that can be computed simultaneously. In general, the larger the number of threads is, the more local memory is used, and the less branching instructions are -used (if, while, ...), the better the performances on GPU is. +used (if, while, ...) and so, the better the performances on GPU are. Obviously, having these requirements in mind, it is possible to build a program similar to the one presented in Listing \ref{algo:seqCIPRNG}, which computes pseudorandom numbers on GPU. To @@ -323,7 +318,7 @@ PRNGs used in these computations must have different parameters. In a given thread, these parameters are randomly picked from another PRNGs. The initialization stage is performed by the CPU. -To do it, the ISAAC PRNG~\cite{Jenkins96} is used to set all the +To do this, the ISAAC PRNG~\cite{Jenkins96} is used to set all the parameters embedded into each thread. The implementation of the three @@ -361,7 +356,7 @@ used simultaneously, the number of random numbers that a thread can generate inside a kernel is limited (\emph{i.e.}, the variable \texttt{n} in algorithm~\ref{algo:gpu_kernel}). For instance, if $100,000$ threads are used and if $n=100$\footnote{in fact, we need to add the initial seed (a 32-bits number)}, -then the memory required to store all of the internals variables of both the xor-like +then the memory required to store all of the internal variables of both the xor-like PRNGs\footnote{we multiply this number by $2$ in order to count 32-bits numbers} and the pseudorandom numbers generated by our PRNG, is equal to $100,000\times ((4+5+6)\times 2+(1+100))=1,310,000$ 32-bits numbers, that is, approximately $52$Mb. @@ -431,7 +426,7 @@ iterations is realized between the last stored value $x$ of the thread and a str (obtained by a bitwise exclusive or between a value provided by a xor-like() call and two values previously obtained by two other threads). To be certain that such iterations corresponds to the chaotic one recalled at the -end of the Section~\ref{sec:dev}, +end of Section~\ref{sec:dev}, we must guarantee that this dynamical system iterates on the space $\mathcal{X} =\mathds{B}^\mathsf{N} \times \mathcal{P}\left(\llbracket 1, 2^\mathsf{N} \rrbracket\right)^\mathds{N}$. The left term $x$ obviously belongs to $\mathds{B}^ \mathsf{N}$. @@ -442,7 +437,7 @@ integer of $\llbracket 1, 2^\mathsf{N} \rrbracket$. Such a result is obvious, as for the xor-like(), all the integers belonging into its interval of definition can occur at each iteration, and thus the last $t$ respects the requirement. Furthermore, it is possible to -prove by an immediate mathematical induction that, supposes that the initial $x$ +prove by an immediate mathematical induction that, supposing that the initial $x$ is uniformly distributed, %(it is provided by a cryptographically secure PRNG), the two other stored values shmem[o1] and shmem[o2] are uniformly distributed too, (this is the induction hypothesis), and thus the next $x$ is finally uniformly distributed. @@ -470,7 +465,7 @@ order to obtain the optimal performances, the storage of pseudorandom numbers into the GPU memory has been removed. This step is time consuming and slows down the numbers generation. Moreover this storage is completely useless, in case of applications that consume the pseudorandom -numbers directly after generation. We can see that when the number of threads is greater +numbers directly after they have been generated. We can see that when the number of threads is greater than approximately 30,000 and lower than 5 million, the number of pseudorandom numbers generated per second is almost constant. With the naive version, this value ranges from 2.5 to 3GSamples/s. With the optimized version, it is approximately equal to -- 2.39.5