BookGPU/Chapters/chapter2/ch2.tex

   1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
   2
   3 \chapter{Introduction to Cuda}
   4 \label{chapter2}
   5
   6 \section{Introduction}
   7 \label{ch2:intro}
   8
   9 In this chapter  we give some simple examples on Cuda  programming.  The goal is
  10 not to provide an exhaustive presentation of all the functionalities of Cuda but
  11 rather giving some basic elements. Of  course, readers that do not know Cuda are
  12 invited  to read  other  books that  are  specialized on  Cuda programming  (for
  13 example: \cite{ch2:Sanders:2010:CEI}).
  14
  15
  16 \section{First example}
  17 \label{ch2:1ex}
  18
  19 This first example is  intented to show how to build a  very simple example with
  20 Cuda.   The goal  of this  example is  to perform  the sum  of two  arrays and
  21 put the  result into a  third array.   A Cuda program  consists in a  C code
  22 which calls Cuda kernels that are executed on a GPU. The listing of this code is
  23 in Listing~\ref{ch2:lst:ex1}.
  24
  25
  26 As GPUs have  their own memory, the first step consists  in allocating memory on
  27 the GPU.  A call to  \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc} allows
  28 to allocate memory on the GPU. The first parameter of this function is a pointer
  29 on a memory on the device (i.e. the GPU). In this example, \texttt{d\_} is added
  30 on each variable allocated  on the GPU, meaning this variable is  on the GPU. The
  31 second parameter represents the size of the allocated variables, this size is in
  32 bits.
  33
  34 In this example, we  want to compare the execution time of  the additions of two
  35 arrays in  CPU and  GPU. So  for both these  operations, a  timer is  created to
  36 measure the  time. Cuda proposes to  manipulate timers quite  easily.  The first
  37 step is to create the timer\index{Cuda~functions!timer}, then to start it and at
  38 the end to stop it. For each of these operations a dedicated functions is used.
  39
  40 In  order to  compute  the same  sum  with a  GPU, the  first  step consists  in
  41 transferring the data from the CPU (considered as the host with Cuda) to the GPU
  42 (considered as the  device with Cuda).  A call  to \texttt{cudaMemcpy} allows to
  43 copy the content of an array allocated in the host to the device when the fourth
  44 parameter                                 is                                 set
  45 to  \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}.  The first
  46 parameter of the function is the  destination array, the second is the
  47 source  array and  the third  is the  number of  elements to  copy  (expressed in
  48 bytes).
  49
  50 Now the GPU contains the data needed to perform the addition. In sequential such
  51 addition is  achieved out with a  loop on all the  elements.  With a  GPU, it is
  52 possible to perform  the addition of all elements of the  two arrays in parallel
  53 (if  the  number   of  blocks  and  threads  per   blocks  is  sufficient).   In
  54 Listing\ref{ch2:lst:ex1}     at    the     beginning,    a     simple    kernel,
  55 called \texttt{addition} is defined to  compute in parallel the summation of the
  56 two     arrays.      With     Cuda,     a     kernel     starts     with     the
  57 keyword   \texttt{\_\_global\_\_}   \index{Cuda~keywords!\_\_shared\_\_}   which
  58 indicates that this kernel can be called from the C code.  The first instruction
  59 in this kernel is used to compute the variable \texttt{tid} which represents the
  60 thread index.   This thread index\index{thread  index} is computed  according to
  61 the           values            of           the           block           index
  62 (called  \texttt{blockIdx} \index{Cuda~keywords!blockIdx}  in Cuda)  and  of the
  63 thread   index   (called   \texttt{blockIdx}\index{Cuda~keywords!threadIdx}   in
  64 Cuda). Blocks of threads and thread  indexes can be decomposed into 1 dimension, 2
  65 dimensions or 3 dimensions.  According to the dimension of data manipulated, the
  66 appropriate dimension can be useful. In our example, only one dimension is used.
  67 Then  using  notation   \texttt{.x}  we  can  access  to   the  first  dimension
  68 (\texttt{.y}  and \texttt{.z}  allow respectively  to access  to the  second and
  69 third dimension).   The variable \texttt{blockDim}\index{Cuda~keywords!blockDim}
  70 gives the size of each block.
  71
  72
  73
  74 \lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
  75
  76 \section{Second example: using CUBLAS}
  77 \label{ch2:2ex}
  78
  79 The Basic Linear Algebra Subprograms  (BLAS) allows programmers to use efficient
  80 routines  that are  often  required. Those  routines  are heavily  used in  many
  81 scientific  applications   and  are   very  optimized  for   vector  operations,
  82 matrix-vector              operations              and             matrix-matrix
  83 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some  of those operations seem
  84 to be  easy to  implement with Cuda.   Nevertheless, as  soon as a  reduction is
  85 needed, implementing an efficient reduction routines with Cuda is far from being
  86 simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
  87 operation  which combines  all the  elements of  an array  and extract  a number
  88 computed with all the  elements. For example, a sum, a maximum  or a dot product
  89 are reduction operations.
  90
  91 In this second example, we consider that  we have two vectors $A$ and $B$. First
  92 of all, we want to compute the sum  of both vectors in a vector $C$. Then we want
  93 to compute the  scalar product between $1/C$ and $1/A$. This  is just an example
  94 which has no direct interest except to show how to program it with Cuda.
  95
  96 Listing~\ref{ch2:lst:ex2} shows this example with Cuda. The first kernel for the
  97 addition  of two  arrays  is exactly  the same  as  the one  described in  the
  98 previous example.
  99
 100 The  kernel  to  compute the  inverse  of  the  elements  of  an array  is  very
 101 simple. For  each thread index,  the inverse of  the array replaces  the initial
 102 array.
 103
 104 In the main function,  the beginning is very similar to the  one in the previous
 105 example.   First, the number  of elements  is asked  to the  user.  Then  a call
 106 to \texttt{cublasCreate} allows to initialize  the cublas library. It creates an
 107 handle. Then all the arrays are allocated  in the host and the device, as in the
 108 previous  example.  Both  arrays  $A$ and  $B$  are initialized.   The  CPU
 109 computation is performed  and the time for this CPU  computation is measured. In
 110 order to  compute the same result  on the GPU, first  of all, data  from the CPU
 111 need to be  copied into the memory of  the GPU. For that, it is  possible to use
 112 cublas function \texttt{cublasSetVector}.  This function has several arguments. More
 113 precisely, the first argument represents the number of elements to transfer, the
 114 second arguments is the size of  each element, the third element represents the
 115 source of the  array to transfer (in  the GPU), the fourth is  an offset between
 116 each element of  the source (usually this value  is set to 1), the  fifth is the
 117 destination (in the GPU)  and the last is an offset between  each element of the
 118 destination. Then we call the kernel \texttt{addition} which computes the sum of
 119 all elements of arrays $A$ and $B$. The \texttt{inverse} kernel is called twice,
 120 once to  inverse elements of array  $C$ and once  for $A$. Finally, we  call the
 121 function \texttt{cublasDdot} which  computes the dot product of  two vectors. To
 122 use this routine, we must specify  the handle initialized by Cuda, the number of
 123 elements to consider,  then each vector is followed by  the offset between every
 124 element.  After  the  GPU  computation,  it  is  possible  to  check  that  both
 125 computation produce the same result.
 126
 127 \lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}
 128
 129 \section{Third example: matrix-matrix multiplication}
 130 \label{ch2:3ex}
 131
 132
 133
 134 Matrix-matrix multiplication is an operation  which is quite easy to parallelize
 135 with a GPU. If we consider that  a matrix is represented using a two dimensional
 136 array,  $A[i][j]$ represents   the element  of  the $i^{th}$  row  and of  the
 137 $j^{th}$ column. In many cases, it is easier to manipulate 1D array instead of 2D
 138 array.   With Cuda,  even if  it is  possible to  manipulate 2D  arrays,  in the
 139 following we  present an example  based on 1D  array. For the sake of  simplicity, we
 140 consider  we  have  a  square  matrix  of size  \texttt{size}.  So  with  a  1D
 141 array, \texttt{A[i*size+j]} allows  us to access to the  element of the $i^{th}$
 142 row and of the $j^{th}$ column.
 143
 144 With  a sequential  programming, the  matrix multiplication  is  performed using
 145 three loops. We assume that $A$, $B$  represent two square matrices and the
 146 result   of    the   multiplication    of   $A   \times    B$   is    $C$.   The
 147 element \texttt{C[i*size+j]} is computed as follows:
 148 \begin{equation}
 149 C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j];
 150 \end{equation}
 151
 152 In Listing~\ref{ch2:lst:ex3},  the CPU computation  is performed using  3 loops,
 153 one  for $i$,  one for  $j$  and one  for $k$.   In  order to  perform the  same
 154 computation on a  GPU, a naive solution consists in  considering that the matrix
 155 $C$ is split into  2 dimensional blocks.  The size of each  block must be chosen
 156 such as the number of threads per block is inferior to $1,024$.
 157
 158
 159 In Listing~\ref{ch2:lst:ex3},  we consider that  a block contains 16  threads in
 160 each   dimension,  the   variable  \texttt{width}   is  used   for   that.   The
 161 variable \texttt{nbTh} represents the number of threads per block. So, to be able
 162 to compute the matrix-matrix product on a GPU, each block of threads is assigned
 163 to compute the result  of the product for the elements of  this block.  The main
 164 part of the code is quite similar to the previous code.  Arrays are allocated in
 165 the  CPU and  the GPU.   Matrices $A$  and $B$  are randomly  initialized.  Then
 166 arrays are  transferred inside the  GPU memory with call  to \texttt{cudaMemcpy}.
 167 So the first step for each thread of a block is to compute the corresponding row
 168 and   column.    With   a    2   dimensional   decomposition,   \texttt{int   i=
 169 blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
 170 and  \texttt{int  j=   blockIdx.x*blockDim.x+  threadIdx.x;}  the  corresponding
 171 column. Then each  thread has to compute the  sum of the product of  the line of
 172 $A$   per   the  column   of   $B$.    In  order   to   use   a  register,   the
 173 kernel  \texttt{matmul}  uses a  variable  called  \texttt{sum}  to compute  the
 174 sum. Then the result is set into  the matrix at the right place. The computation
 175 of  CPU matrix-matrix multiplication  is performed  as described  previously.  A
 176 timer measures  the time.   In order to  use 2 dimensional  blocks, \texttt{dim3
 177 dimGrid(size/width,size/width);} allows us  to create \texttt{size/width} blocks
 178 in each  dimension.  Likewise,  \texttt{dim3 dimBlock(width,width);} is  used to
 179 create \texttt{width} thread  in each dimension. After that,  the kernel for the
 180 matrix  multiplication is  called. At  the end  of the  listing, the  matrix $C$
 181 computed by the GPU is transferred back  into the CPU and we check if both matrices
 182 C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
 183
 184
 185 With $1,024  \times 1,024$ matrices,  on a C2070M  Tesla card, this  code takes
 186 $37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
 187 takes $2465$ms  without any parallelization (using only  one core). Consequently
 188 the speed up  between the CPU and GPU  version is about $65$ which  is very good
 189 regarding the difficulty of parallelizing this code.
 190
 191 \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
 192
 193 \section{Conclusion}
 194 In this chapter, three simple Cuda examples have been  presented. Those examples are
 195 quite  simple. As we  cannot  present  all the  possibilities  of  the  Cuda
 196 programming, interested  readers  are  invited  to  consult  Cuda  programming
 197 introduction books if some issues regarding the Cuda programming are not clear.
 198
 199 \putbib[Chapters/chapter2/biblio]
 200