suite

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex

index 501e34e1948ea29d293260a0de7bf69a09d4f7cf..e80b6709b80b829d346d30fd9ea2a618c68d732f 100755 (executable)
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -137,10 +137,35 @@ consider  we  have  a  squared  matrix  of size  \texttt{size}.  So  with  a  1D
  array, \texttt{A[i*size+j]} allows  us to access to the  element of the $i^{th}$
  row and of the $j^{th}$ column.
  
-On C2070M Tesla card, this code take 37.68ms to perform the multiplication. On a
-Intel Xeon E31245 at 3.30GHz, it takes 2465ms without any parallelization (using
-only one  core). Consequently the  speed up between  the CPU and GPU  version is
-about 65 which is very good regarding the difficulty of parallelizing this code.
+With sequential programming, the matrix multiplication is performed using three
+loops. Suppose that $A$ and $B$ are two square matrices and that $C$ is the
+result of the product $A \times B$. The element \texttt{C[i*size+j]} is computed
+as follows:
+\begin{equation}
+C[i \times \mathit{size} + j] = \sum_{k=0}^{\mathit{size}-1} A[i \times \mathit{size} + k] \times B[k \times \mathit{size} + j]
+\end{equation}
+
+In Listing~\ref{ch2:lst:ex3}, the CPU version performs this computation with
+three loops, one for $i$, one for $j$, and one for $k$. To perform the same
+computation on a GPU, a naive solution is to split the matrix $C$ into
+two-dimensional blocks. The size of each block must be chosen such that the
+number of threads per block does not exceed $1{,}024$. In
+Listing~\ref{ch2:lst:ex3}, a block contains 16 threads in each dimension. The
+variable \texttt{nbTh} represents the number of threads per block. To compute
+the matrix-matrix product on a GPU, each block of threads is assigned the
+elements of $C$ that it covers and computes the result of the product for these
+elements. The first step for each thread of a block is therefore to compute its
+row and column. With a two-dimensional decomposition,
+\texttt{int i= blockIdx.y*blockDim.y+ threadIdx.y;} gives the row and
+\texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} gives the column of the
+element handled by the thread.
+
+
+On a C2070M Tesla card, this code takes $37.68$~ms to perform the
+multiplication. On an Intel Xeon E31245 at $3.30$~GHz, it takes $2465$~ms
+without any parallelization (using only one core). Consequently, the speedup of
+the GPU version over the CPU version is about $65$ ($2465/37.68 \approx 65.4$),
+which is very good considering how little effort was needed to parallelize this code.
  
  
  \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
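
The paragraph added by this patch describes two things: the three-loop sequential product and the mapping from a two-dimensional block of threads to a row `i` and a column `j` of the result. The file `ex3.cu` is not shown in the diff, so the following is only a minimal host-side sketch in plain C, not the book's listing. It emulates the block and thread indices with ordinary loops and checks that the decomposition covers every element of `C` exactly as the sequential version does. The function names (`matmul_seq`, `matmul_blocks`, `check_decomposition`), the bounds guard for sizes that are not a multiple of the block size, and the use of `nbTh` as the number of threads per block dimension are all assumptions made for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Sequential reference: three nested loops over i, j and k (hypothetical name). */
void matmul_seq(const float *A, const float *B, float *C, int size)
{
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            float sum = 0.0f;
            for (int k = 0; k < size; k++)
                sum += A[i * size + k] * B[k * size + j];
            C[i * size + j] = sum;
        }
    }
}

/* Host-side emulation of the 2D decomposition (hypothetical name).
 * The loops over (by, bx) and (ty, tx) stand in for blockIdx and threadIdx;
 * nbTh plays the role of both blockDim.x and blockDim.y (assumed equal). */
void matmul_blocks(const float *A, const float *B, float *C, int size, int nbTh)
{
    int nbBlocks = (size + nbTh - 1) / nbTh; /* round up to cover the matrix */
    for (int by = 0; by < nbBlocks; by++)
    for (int bx = 0; bx < nbBlocks; bx++)
    for (int ty = 0; ty < nbTh; ty++)
    for (int tx = 0; tx < nbTh; tx++) {
        int i = by * nbTh + ty; /* blockIdx.y*blockDim.y + threadIdx.y */
        int j = bx * nbTh + tx; /* blockIdx.x*blockDim.x + threadIdx.x */
        if (i >= size || j >= size)
            continue;           /* this "thread" falls outside the matrix */
        float sum = 0.0f;
        for (int k = 0; k < size; k++)
            sum += A[i * size + k] * B[k * size + j];
        C[i * size + j] = sum;
    }
}

/* Returns the number of elements on which the two versions disagree,
 * or -1 if memory allocation fails. */
int check_decomposition(int size, int nbTh)
{
    assert(size > 0 && nbTh > 0);
    size_t n = (size_t)size * (size_t)size;
    float *A = malloc(n * sizeof *A);
    float *B = malloc(n * sizeof *B);
    float *C1 = malloc(n * sizeof *C1);
    float *C2 = malloc(n * sizeof *C2);
    int mismatches = -1;
    if (A && B && C1 && C2) {
        for (size_t idx = 0; idx < n; idx++) {
            A[idx] = (float)(idx % 7); /* small integers: sums stay exact */
            B[idx] = (float)(idx % 5);
            C1[idx] = 0.0f;
            C2[idx] = -1.0f;           /* sentinel: results are never negative */
        }
        matmul_seq(A, B, C1, size);
        matmul_blocks(A, B, C2, size, nbTh);
        mismatches = 0;
        for (size_t idx = 0; idx < n; idx++)
            if (C1[idx] != C2[idx])
                mismatches++;
    }
    free(A); free(B); free(C1); free(C2);
    return mismatches;
}
```

Calling `check_decomposition(40, 16)` exercises the case where the matrix size is not a multiple of the block size, so some emulated threads fall outside the matrix and are skipped by the guard. In a real kernel the four outer loops disappear: each thread computes its own `i` and `j` from `blockIdx`, `blockDim`, and `threadIdx`, then runs only the loop over `k`.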