\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}

\chapter{Introduction to CUDA}
\label{chapter2}

\section{Introduction}\label{intro}
In this chapter we give some simple examples of CUDA programming. The goal is
not to provide an exhaustive presentation of all the functionalities of CUDA but
rather to give some basic elements. Readers who do not know CUDA are invited to
read other books that specialize in CUDA programming (for example,
\cite{ch2:Sanders:2010:CEI}).


\section{First example}

This first example is intended to show how to build a very simple program with
CUDA. Its goal is to compute the sum of two arrays and to put the result into a
third array. A CUDA program consists of C code that calls CUDA kernels, which
are executed on a GPU. The code is given in Listing~\ref{ch2:lst:ex1}.


As GPUs have their own memory, the first step consists of allocating memory on
the GPU, which is done with a call to \texttt{cudaMalloc}. The first parameter
of this function is the address of a pointer that will refer to memory on the
device (i.e., the GPU). In this example, the prefix \texttt{d\_} is added to the
name of each variable allocated on the GPU to indicate that it resides in device
memory. The second parameter is the size of the allocation, expressed in bytes.

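The allocation step described above can be sketched as follows. This is a
minimal illustration and not the code of the listing: the array
name \texttt{d\_A}, the element type, and the number of elements are assumptions
chosen for the example.

\begin{lstlisting}
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  const int size = 1024;                    // number of elements (assumed)
  const size_t bytes = size * sizeof(int);  // cudaMalloc expects bytes
  int *d_A = NULL;                          // d_ prefix: device memory

  // cudaMalloc receives the address of the pointer and fills it in
  cudaError_t err = cudaMalloc((void **)&d_A, bytes);
  if (err != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  cudaFree(d_A);  // release the device memory
  return 0;
}
\end{lstlisting}

The pointer \texttt{d\_A} holds a device address, so it should not be
dereferenced in host code; it is only passed to CUDA functions and kernels.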
In this example, we want to compare the execution times of the addition of two
arrays on the CPU and on the GPU. For both operations, a timer is used to
measure the time. CUDA makes it quite easy to manipulate timers: the first step
is to create the timer, then to start it, and finally to stop it. A dedicated
function is used for each of these operations.

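The timer functions used in the listing are not reproduced here. As a possible
alternative, the same create, start, and stop pattern can be sketched with the
events of the CUDA runtime, which are well suited to timing GPU work. The
variable names are assumptions, and the timed region is left empty on purpose.

\begin{lstlisting}
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  cudaEvent_t start, stop;
  float ms = 0.0f;

  // Create the two events that delimit the timed region
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  cudaEventRecord(start, 0);  // start of the timed region
  // ... GPU work to be timed (copies, kernel launches) goes here ...
  cudaEventRecord(stop, 0);   // end of the timed region
  cudaEventSynchronize(stop); // wait until the stop event is reached

  // Elapsed time between the two events, in milliseconds
  cudaEventElapsedTime(&ms, start, stop);
  printf("Elapsed time: %f ms\n", ms);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  return 0;
}
\end{lstlisting}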
In order to compute the same sum on the GPU, the first step consists of
transferring the data from the CPU (called the host in CUDA) to the GPU
(called the device in CUDA). A call to \texttt{cudaMemcpy} copies the content
of an array allocated on the host to the device when the fourth parameter is set
to \texttt{cudaMemcpyHostToDevice}. The first parameter of the function is the
destination array, the second is the source array, and the third is the amount
of data to copy (expressed in bytes).

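The transfer described above can be sketched as a round trip: an array is
copied to the device and then copied back
with \texttt{cudaMemcpyDeviceToHost}. The array names and the number of elements
are assumptions made for this illustration.

\begin{lstlisting}
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

int main(void) {
  const int size = 1024;                    // number of elements (assumed)
  const size_t bytes = size * sizeof(int);  // sizes are given in bytes

  int *h_A = (int *)malloc(bytes);     // source array on the host
  int *h_back = (int *)malloc(bytes);  // receives the copy coming back
  for (int i = 0; i < size; i++) h_A[i] = i;

  int *d_A = NULL;
  cudaMalloc((void **)&d_A, bytes);

  // Arguments: destination, source, size in bytes, direction
  cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(h_back, d_A, bytes, cudaMemcpyDeviceToHost);

  // Check that the data survived the round trip
  int ok = 1;
  for (int i = 0; i < size; i++)
    if (h_back[i] != h_A[i]) ok = 0;
  printf("%s\n", ok ? "round trip ok" : "mismatch");

  cudaFree(d_A);
  free(h_A);
  free(h_back);
  return ok ? 0 : 1;
}
\end{lstlisting}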
The GPU now contains the data needed to perform the addition. In a sequential
program, such an addition is carried out with a loop over all the elements. With
a GPU, it is possible to perform the addition of all the elements of the arrays
in parallel (if the number of blocks and of threads per block is sufficient). At
the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel
called \texttt{addition} is defined to compute the sum of the two arrays in
parallel. With CUDA, a kernel starts with the
keyword \texttt{\_\_global\_\_}\index{CUDA~keywords!\_\_global\_\_}, which
indicates that this kernel can be called from the C code. The first instruction
in this kernel computes the variable \texttt{tid}, which represents the thread
index. This thread index\index{thread index} is computed from the value of the
block index (a CUDA variable
called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
be organized in one, two, or three dimensions; the appropriate choice depends on
the dimensionality of the data being manipulated. In our example, only one
dimension is used. The notation \texttt{.x} gives access to the first dimension
(\texttt{.y} and \texttt{.z} give access to the second and third dimensions,
respectively). The variable \texttt{blockDim}\index{CUDA~keywords!blockDim}
gives the size of each block.



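The index computation described above can be sketched as follows. This is a
minimal, self-contained illustration and may differ from the code of the
listing: the kernel signature, the use of \texttt{threadIdx} for the position
of a thread inside its block, the block size of 256, and the test data are
assumptions.

\begin{lstlisting}
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

// Each thread adds one pair of elements.
__global__ void addition(int size, int *d_C,
                         const int *d_A, const int *d_B) {
  // Global index: block offset plus position inside the block
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < size) {  // the last block may contain surplus threads
    d_C[tid] = d_A[tid] + d_B[tid];
  }
}

int main(void) {
  const int size = 1000;  // not a multiple of the block size
  const size_t bytes = size * sizeof(int);

  int *h_A = (int *)malloc(bytes);
  int *h_B = (int *)malloc(bytes);
  int *h_C = (int *)malloc(bytes);
  for (int i = 0; i < size; i++) { h_A[i] = i; h_B[i] = 2 * i; }

  int *d_A, *d_B, *d_C;
  cudaMalloc((void **)&d_A, bytes);
  cudaMalloc((void **)&d_B, bytes);
  cudaMalloc((void **)&d_C, bytes);
  cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

  const int threads = 256;                            // threads per block
  const int blocks = (size + threads - 1) / threads;  // rounded up
  addition<<<blocks, threads>>>(size, d_C, d_A, d_B);

  // This blocking copy also waits for the kernel to finish
  cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

  // Compare with the sequential result computed on the host
  int errors = 0;
  for (int i = 0; i < size; i++)
    if (h_C[i] != h_A[i] + h_B[i]) errors++;
  printf("mismatches: %d\n", errors);

  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  free(h_A); free(h_B); free(h_C);
  return errors != 0;
}
\end{lstlisting}

The test \texttt{tid < size} matters because the number of launched threads is
rounded up to a whole number of blocks, so some threads of the last block have
no element to process.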
\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}

\section{Second example: using CUBLAS}

The Basic Linear Algebra Subprograms (BLAS) provide programmers with efficient
routines that are frequently needed. These routines are heavily used in many
scientific applications and are highly optimized for vector operations,
matrix-vector operations, and matrix-matrix
operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of these operations seem
easy to implement with CUDA. Nevertheless, as soon as a reduction is needed,
implementing an efficient reduction routine with CUDA is far from simple.

In this second example, we consider two vectors $A$ and $B$. First, we want to
compute the sum of both vectors in a vector $C$. Then we want to compute the
scalar product of $1/C$ and $1/A$. This example has no direct interest except to
show how to program such a computation with CUDA.

Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel, for
the addition of two arrays, is exactly the same as the one described in the
previous example.

The kernel that computes the inverse of the elements of an array is very simple:
for each thread index, the corresponding element of the array is replaced by its
inverse.

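Such an in-place inverse kernel can be sketched as follows. The kernel
name \texttt{inverse}, its signature, the use of double precision, and the test
data are assumptions made for this illustration and may differ from the
listing.

\begin{lstlisting}
#include <stdio.h>
#include <stdlib.h>
#include <math.h>
#include <cuda_runtime.h>

// Each thread replaces one element by its inverse.
__global__ void inverse(int size, double *d_x) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < size) {
    d_x[tid] = 1.0 / d_x[tid];
  }
}

int main(void) {
  const int size = 1000;
  const size_t bytes = size * sizeof(double);

  double *h_x = (double *)malloc(bytes);
  for (int i = 0; i < size; i++) h_x[i] = i + 1.0;  // nonzero values

  double *d_x = NULL;
  cudaMalloc((void **)&d_x, bytes);
  cudaMemcpy(d_x, h_x, bytes, cudaMemcpyHostToDevice);

  const int threads = 256;                            // threads per block
  const int blocks = (size + threads - 1) / threads;  // rounded up
  inverse<<<blocks, threads>>>(size, d_x);

  // Copy the result back; this also waits for the kernel
  cudaMemcpy(h_x, d_x, bytes, cudaMemcpyDeviceToHost);

  // Compare with the inverse computed on the host
  int errors = 0;
  for (int i = 0; i < size; i++)
    if (fabs(h_x[i] - 1.0 / (i + 1.0)) > 1e-12) errors++;
  printf("mismatches: %d\n", errors);

  cudaFree(d_x);
  free(h_x);
  return errors != 0;
}
\end{lstlisting}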
In the main function, the beginning is very similar to the one in the previous
example. First, the user is asked for the number of elements. Then a call
to \texttt{cublasCreate} initializes the CUBLAS library and creates a
handle. Then all the arrays are allocated on the host and on the device, as in
the previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
computation is performed and its execution time is measured. In order to compute
the same result on the GPU, the data must first be copied from the CPU into the
memory of the GPU. For that, it is possible to use the CUBLAS
function \texttt{cublasSetVector}. This function takes several arguments: the
first is the number of elements to transfer, the second is the size of each
element, the third is the source array (on the CPU), the fourth is the stride
between consecutive elements of the source (usually set to 1), the fifth is the
destination (on the GPU), and the last is the stride between consecutive
elements of the destination.

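The sequence of CUBLAS calls described above can be sketched as follows. This is
a minimal illustration and not the code of the listing: the vector contents and
names are assumptions, and the scalar product is computed here
with \texttt{cublasDdot}, which may not be the routine used in the listing.

\begin{lstlisting}
#include <stdio.h>
#include <cuda_runtime.h>
#include <cublas_v2.h>

int main(void) {
  const int n = 4;  // number of elements (assumed)
  double h_x[] = {1.0, 2.0, 3.0, 4.0};
  double h_y[] = {1.0, 1.0, 1.0, 1.0};
  double *d_x = NULL, *d_y = NULL;
  double result = 0.0;

  // Initialize the CUBLAS library and obtain a handle
  cublasHandle_t handle;
  if (cublasCreate(&handle) != CUBLAS_STATUS_SUCCESS) {
    fprintf(stderr, "cublasCreate failed\n");
    return 1;
  }

  cudaMalloc((void **)&d_x, n * sizeof(double));
  cudaMalloc((void **)&d_y, n * sizeof(double));

  // Arguments: number of elements, size of one element,
  // host source, source stride, device destination, destination stride
  cublasSetVector(n, sizeof(double), h_x, 1, d_x, 1);
  cublasSetVector(n, sizeof(double), h_y, 1, d_y, 1);

  // Scalar product computed on the device, result stored on the host
  cublasDdot(handle, n, d_x, 1, d_y, 1, &result);
  printf("dot = %f\n", result);

  cudaFree(d_x);
  cudaFree(d_y);
  cublasDestroy(handle);
  return 0;
}
\end{lstlisting}

A program of this kind is typically compiled with \texttt{nvcc} and linked
against the CUBLAS library with \texttt{-lcublas}.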
\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}


\putbib[Chapters/chapter2/biblio]
