\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
\chapter{Introduction to Cuda}
\label{chapter2}
\section{Introduction}
\label{ch2:intro}
In this chapter we give some simple examples of Cuda programming. The goal is
not to provide an exhaustive presentation of all the functionalities of Cuda but
rather to give some basic elements. Of course, readers who do not know Cuda are
invited to read other books specialized in Cuda programming (for
example: \cite{ch2:Sanders:2010:CEI}).
\section{First example}
\label{ch2:1ex}
This first example is intended to show how to build a very simple program with
Cuda. Its goal is to perform the sum of two arrays and put the result into a
third array. A Cuda program consists of a C code which calls Cuda kernels that
are executed on a GPU. The listing of this code is in Listing~\ref{ch2:lst:ex1}.
As GPUs have their own memory, the first step consists in allocating memory on
the GPU. A call to \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc}
allocates memory on the GPU. The first parameter of this function is a pointer
to a memory location on the device (i.e. the GPU). In this example, the prefix
\texttt{d\_} is added to each variable allocated on the GPU, meaning this
variable is on the GPU. The second parameter represents the size of the
allocated variable; this size is given in bytes.
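
As an illustration, a minimal allocation sketch could look as follows (the array
name and size are hypothetical and simply follow the \texttt{d\_} naming
convention described above; the actual declarations are those of
Listing~\ref{ch2:lst:ex1}):
\begin{lstlisting}[language=C]
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
  int size = 1000;                                     // number of elements (hypothetical)
  int *h_arrayA = (int*) malloc(size * sizeof(int));   // array on the host (CPU)
  int *d_arrayA;                                       // array on the device (GPU)
  cudaMalloc((void**) &d_arrayA, size * sizeof(int));  // size given in bytes
  cudaFree(d_arrayA);                                  // release the GPU memory
  free(h_arrayA);
  return 0;
}
\end{lstlisting}
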
In this example, we want to compare the execution time of the addition of two
arrays on the CPU and on the GPU. So for both of these operations, a timer is
created to measure the elapsed time. Cuda allows timers to be manipulated quite
easily. The first step is to create the timer\index{Cuda~functions!timer}, then
to start it, and at the end to stop it. For each of these operations a dedicated
function is used.
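
The listing may rely on dedicated timer utilities; Cuda events, shown in the
following fragment (to be placed inside a host function), are one common way of
obtaining the same create/start/stop pattern:
\begin{lstlisting}[language=C]
cudaEvent_t start, stop;                      // create the timer
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start, 0);                    // start the timer
// ... the computation to be measured goes here ...
cudaEventRecord(stop, 0);                     // stop the timer
cudaEventSynchronize(stop);                   // wait until the stop event is reached

float elapsed = 0.0f;
cudaEventElapsedTime(&elapsed, start, stop);  // elapsed time in milliseconds
\end{lstlisting}
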
In order to compute the same sum on a GPU, the first step consists in
transferring the data from the CPU (considered as the host with Cuda) to the GPU
(considered as the device with Cuda). A call to
\texttt{cudaMemcpy}\index{Cuda~functions!cudaMemcpy} copies the content of an
array allocated on the host to the device when its fourth parameter is set to
\texttt{cudaMemcpyHostToDevice}. The first parameter of the function is the
destination array, the second is the source array, and the third is the number
of bytes to copy.

Now the GPU contains the data needed to perform the addition. In a sequential
program, such an addition is achieved with a loop over all the elements. With a
GPU, it is possible to perform the addition of all the elements of the two
arrays in parallel (if the number of blocks and threads per block is
sufficient). At the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel,
called \texttt{addition}, is defined to compute the sum of the two arrays in
parallel. With Cuda, a kernel starts with the keyword
\texttt{\_\_global\_\_}\index{Cuda~keywords!\_\_global\_\_} which indicates that
this kernel can be called from the C code. The first instruction in this kernel
computes the variable \texttt{tid} which represents the thread index. This
thread index\index{thread index} is computed according to the values of the
block index (called \texttt{blockIdx}\index{Cuda~keywords!blockIdx} in Cuda) and
of the thread index inside the block (called
\texttt{threadIdx}\index{Cuda~keywords!threadIdx} in Cuda). Blocks of threads
and thread indices can have 1, 2, or 3 dimensions. The appropriate number of
dimensions can be chosen according to the dimensionality of the data being
manipulated. In our example, only one dimension is used. The notation
\texttt{.x} then gives access to the first dimension (\texttt{.y} and
\texttt{.z} give access to the second and third dimensions, respectively). The
variable \texttt{blockDim}\index{Cuda~keywords!blockDim} gives the size of each
block.
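
A kernel following this pattern, together with a possible launch, could be
sketched as follows (parameter order, names, and the number of threads per block
are illustrative; the actual code is the one of Listing~\ref{ch2:lst:ex1}):
\begin{lstlisting}[language=C]
// kernel executed on the GPU: each thread adds one pair of elements
__global__ void addition(int size, int *d_C, int *d_A, int *d_B) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
  if (tid < size)                                    // guard against extra threads
    d_C[tid] = d_A[tid] + d_B[tid];
}

// inside the host code: launch enough blocks to cover the whole array
int nbThreadsPerBlock = 256;                         // hypothetical value
int nbBlocks = (size + nbThreadsPerBlock - 1) / nbThreadsPerBlock;
addition<<<nbBlocks, nbThreadsPerBlock>>>(size, d_arrayC, d_arrayA, d_arrayB);
\end{lstlisting}
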
\section{Second example: using CUBLAS}
\label{ch2:2ex}
The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient
routines that are often required. These routines are heavily used in many
scientific applications and are highly optimized for vector operations,
matrix-vector operations, and matrix-matrix
operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of these operations seem
easy to implement with Cuda. Nevertheless, as soon as a reduction is needed,
implementing an efficient reduction routine with Cuda is far from being simple.
Roughly speaking, a reduction operation\index{reduction~operation} is an
operation which combines all the elements of an array and extracts a number
computed from all of them. For example, a sum, a maximum, or a dot product are
reduction operations.
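
For instance, a sequential dot product reduces two arrays to a single number
(a plain C sketch, not part of the listings):
\begin{lstlisting}[language=C]
// a dot product is a reduction: all the elements of the two arrays are
// combined into a single scalar
double dot_product(int n, double *A, double *B) {
  double result = 0.0;
  for (int i = 0; i < n; i++)
    result += A[i] * B[i];
  return result;
}
\end{lstlisting}
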
In this second example, we consider that we have two vectors $A$ and $B$. First
of all, we want to compute their sum and store it in a vector $C$. Then we want
to compute the scalar product between $1/C$ and $1/A$. This is just an example
which has no direct interest except to show how to program it with Cuda.
Listing~\ref{ch2:lst:ex2} shows this example with Cuda. The first kernel for the
addition of two arrays is exactly the same as the one described in the
previous example.
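
The scalar product between $1/A$ and $1/C$ can then be delegated to CUBLAS once
the inverted vectors have been computed on the GPU. A rough sketch with the
CUBLAS~v2 API is given below; the listing itself may use different names or a
different variant of the API:
\begin{lstlisting}[language=C]
#include <cublas_v2.h>

// element-wise inversion, computed on the GPU
__global__ void inverse(int size, double *d_out, double *d_in) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < size)
    d_out[tid] = 1.0 / d_in[tid];
}

// inside the host code: d_invA and d_invC are assumed to contain 1/A and 1/C,
// previously filled by the inverse kernel
cublasHandle_t handle;
cublasCreate(&handle);
double dot = 0.0;
cublasDdot(handle, size, d_invA, 1, d_invC, 1, &dot);  // dot product on the GPU
cublasDestroy(handle);
\end{lstlisting}
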
\section{Third example: matrix-matrix multiplication}
\label{ch2:3ex}

If we consider that a matrix is stored in a one dimensional
array, \texttt{A[i*size+j]} allows us to access the element of the $i^{th}$
row and the $j^{th}$ column.

With sequential programming, the matrix multiplication is performed using three
loops. Suppose that $A$ and $B$ represent two square matrices and that $C$ is
the result of the multiplication $A \times B$. The element \texttt{C[i*size+j]}
is computed as follows:
\begin{equation}
C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j].
\end{equation}
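
In plain C, with the matrices stored row by row in one dimensional arrays, this
formula translates into three nested loops (a sketch of the sequential part;
variable names are illustrative):
\begin{lstlisting}[language=C]
// sequential multiplication of two square matrices of order size: C = A * B
for (int i = 0; i < size; i++) {
  for (int j = 0; j < size; j++) {
    float sum = 0;
    for (int k = 0; k < size; k++)
      sum += A[i * size + k] * B[k * size + j];
    C[i * size + j] = sum;
  }
}
\end{lstlisting}
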

In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of the code is
performed using 3 loops, one for $i$, one for $j$, and one for $k$. In order to
perform the same computation on a GPU, a naive solution consists in considering
that the matrix $C$ is split into 2 dimensional blocks. The size of each block
must be chosen such that the number of threads per block does not exceed
$1,024$.

In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
each dimension; the variable \texttt{width} is used for that. The variable
\texttt{nbTh} represents the number of threads per block. So to be able to
compute the matrix-matrix product on a GPU, each block of threads is assigned
to compute the result of the product for the elements of this block. The main
part of the code is quite similar to the previous one. Arrays are allocated in
the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then the
arrays are transferred to the GPU memory with calls to \texttt{cudaMemcpy}.
So the first step for each thread of a block is to compute the corresponding row
and column. With a 2 dimensional decomposition, \texttt{int i=
blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding row
and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
column. Then each thread has to compute the sum of the products of one row of
$A$ by one column of $B$. In order to use a register, the kernel
\texttt{matmul} uses a variable called \texttt{sum} to accumulate this sum.
Then the result is stored in the matrix at the right place. The computation of
the CPU matrix-matrix multiplication is performed as described previously. A
timer measures the time. In order to use 2 dimensional blocks, \texttt{dim3
dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
create \texttt{width} threads in each dimension. After that, the kernel for the
matrix multiplication is called. At the end of the listing, the matrix $C$
computed by the GPU is transferred back to the CPU and we check that the
matrices $C$ computed by the CPU and by the GPU are identical with a precision
of $10^{-4}$.
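
To summarize this description, a kernel and a launch following the same scheme
could be sketched as follows (the exact code, with its own names and parameter
order, is the one of Listing~\ref{ch2:lst:ex3}):
\begin{lstlisting}[language=C]
// each thread computes one element of the result matrix C
__global__ void matmul(float *d_A, float *d_B, float *d_C, int size) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of the element
  int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of the element
  float sum = 0;                                   // accumulated in a register
  for (int k = 0; k < size; k++)
    sum += d_A[i * size + k] * d_B[k * size + j];
  d_C[i * size + j] = sum;                         // store the result at the right place
}

// inside the host code: a 2 dimensional grid of (size/width) x (size/width)
// blocks, each block containing width x width threads (here width = 16)
dim3 dimGrid(size / width, size / width);
dim3 dimBlock(width, width);
matmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, size);
\end{lstlisting}
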

On a C2070M Tesla card, this code takes $37.68$ms to perform the multiplication.
On an Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any
parallelization (using only one core). Consequently the speedup between the CPU
and GPU versions is about $65$, which is very good considering the difficulty of
parallelizing this code.
\lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with Cuda]{Chapters/chapter2/ex3.cu}
\section{Conclusion}
In this chapter, three simple Cuda examples have been presented. These examples
are quite simple and they cannot cover all the possibilities of Cuda
programming. Interested readers are invited to consult introductory books on
Cuda programming if some points remain unclear.

\putbib[Chapters/chapter2/biblio]