BookGPU/Chapters/chapter2/ch2.tex

   1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
   2
   3 \chapter{Introduction to Cuda}
   4 \label{chapter2}
   5
   6 \section{Introduction}
   7 \label{ch2:intro}
   8
   9 In this chapter  we give some simple examples of Cuda  programming.  The goal is
  10 not to provide an exhaustive presentation of all the functionalities of Cuda but
  11 rather to give some basic elements. Of  course, readers that do not know Cuda are
  12 invited  to read  other  books that  are  specialized on  Cuda programming  (for
  13 example: \cite{ch2:Sanders:2010:CEI}).
  14
  15
  16 \section{First example}
  17 \label{ch2:1ex}
  18
  19 This first example is  intented to show how to build a  very simple program with
  20 Cuda.  Its goal  is to perform the sum  of two arrays and put the  result into a
  21 third array.  A Cuda program consists in  a C code which calls Cuda kernels that
  22 are executed on a GPU. The listing of this code is in Listing~\ref{ch2:lst:ex1}.
  23
  24
  25 As GPUs have  their own memory, the first step consists  in allocating memory on
  26 the GPU.  A call to  \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc} allows
  27 to allocate memory on the GPU. The first parameter of this function is a pointer
  28 on a memory on the device (i.e. the GPU). In this example, \texttt{d\_} is added
  29 on each variable allocated  on the GPU, meaning this variable is  on the GPU. The
  30 second parameter represents the size of the allocated variables, this size is expressed in
  31 bits.
  32
  33 In this example, we  want to compare the execution time of  the additions of two
  34 arrays in  CPU and  GPU. So  for both these  operations, a  timer is  created to
  35 measure the  time. Cuda proposes to  manipulate timers quite  easily.  The first
  36 step is to create the timer\index{Cuda~functions!timer}, then to start it and at
  37 the end to stop it. For each of these operations a dedicated function is used.
  38
  39 In  order to  compute  the same  sum  with a  GPU, the  first  step consists  in
  40 transferring the data from the CPU (considered as the host with Cuda) to the GPU
  41 (considered as the  device with Cuda).  A call  to \texttt{cudaMemcpy} allows to
  42 copy the content of an array allocated in the host to the device when the fourth
  43 parameter                                 is                                 set
  44 to  \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}.  The first
  45 parameter of the function is the  destination array, the second is the
  46 source  array and  the third  is the  number of  elements to  copy  (expressed in
  47 bytes).
  48
  49 Now the  GPU contains the  data needed to  perform the addition.   In sequential
  50 programming, such  addition is  achieved out  with a loop  on all  the elements.
  51 With a GPU,  it is possible to perform  the addition of all the  elements of the
  52 two  arrays in  parallel (if  the number  of blocks  and threads  per  blocks is
  53 sufficient).   In Listing\ref{ch2:lst:ex1}  at the  beginning, a  simple kernel,
  54 called \texttt{addition} is defined to  compute in parallel the summation of the
  55 two     arrays.      With     Cuda,     a     kernel     starts     with     the
  56 keyword   \texttt{\_\_global\_\_}   \index{Cuda~keywords!\_\_shared\_\_}   which
  57 indicates that this kernel can be called from the C code.  The first instruction
  58 in this kernel is used to compute the variable \texttt{tid} which represents the
  59 thread index.   This thread index\index{thread  index} is computed  according to
  60 the           values            of           the           block           index
  61 (called  \texttt{blockIdx} \index{Cuda~keywords!blockIdx}  in Cuda)  and  of the
  62 thread   index   (called   \texttt{blockIdx}\index{Cuda~keywords!threadIdx}   in
  63 Cuda). Blocks of threads and thread  indexes can be decomposed into 1 dimension,
  64 2 dimensions or  3 dimensions.  According to the  dimension of manipulated data,
  65 the appropriate dimension  can be useful. In our example,  only one dimension is
  66 used.   Then using notation  \texttt{.x} we  can access  to the  first dimension
  67 (\texttt{.y}  and \texttt{.z}  respectively allow to access  to the  second and
  68 third dimension).   The variable \texttt{blockDim}\index{Cuda~keywords!blockDim}
  69 gives the size of each block.
  70
  71
  72
  73 \lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
  74
  75 \section{Second example: using CUBLAS}
  76 \label{ch2:2ex}
  77
  78 The Basic Linear Algebra Subprograms  (BLAS) allows programmers to use efficient
  79 routines  that are  often  required. Those  routines  are heavily  used in  many
  80 scientific applications  and are optimized for  vector operations, matrix-vector
  81 operations                           and                           matrix-matrix
  82 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some  of those operations seem
  83 to be  easy to  implement with Cuda.   Nevertheless, as  soon as a  reduction is
  84 needed, implementing an efficient reduction routine with Cuda is far from being
  85 simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
  86 operation  which combines  all the  elements of  an array  and extracts  a number
  87 computed with all the  elements. For example, a sum, a maximum  or a dot product
  88 are reduction operations.
  89
  90 In this second example, we consider that  we have two vectors $A$ and $B$. First
  91 of all, we want to compute the sum  of both vectors in a vector $C$. Then we want
  92 to compute the  scalar product between $1/C$ and $1/A$. This  is just an example
  93 which has no direct interest except to show how to program it with Cuda.
  94
  95 Listing~\ref{ch2:lst:ex2} shows this example with Cuda. The first kernel for the
  96 addition  of two  arrays  is exactly  the same  as  the one  described in  the
  97 previous example.
  98
  99 The  kernel  to  compute the  opposite  of  the  elements  of  an array  is  very
 100 simple. For  each thread index,  the inverse of  the array replaces  the initial
 101 array.
 102
 103 In the main function,  the beginning is very similar to the  one in the previous
 104 example.  First,  the user is  askef to define  the number of elements.   Then a
 105 call  to \texttt{cublasCreate}  allows  to initialize  the  cublas library.   It
 106 creates a handle. Then all the arrays  are allocated in the host and the device,
 107 as in the  previous example.  Both arrays $A$ and $B$  are initialized.  The CPU
 108 computation is performed  and the time for this CPU  computation is measured. In
 109 order to  compute the same result  on the GPU, first  of all, data  from the CPU
 110 need to be  copied into the memory of  the GPU. For that, it is  possible to use
 111 cublas   function   \texttt{cublasSetVector}.    This   function   has   several
 112 arguments. More precisely, the first  argument represents the number of elements
 113 to transfer, the second arguments is the size of each element, the third element
 114 represents the source  of the array to  transfer (in the GPU), the  fourth is an
 115 offset between each element of the source  (usually this value is set to 1), the
 116 fifth is  the destination (in the  GPU) and the  last is an offset  between each
 117 element  of the  destination. Then  we call  the kernel  \texttt{addition} which
 118 computes the  sum of all elements  of arrays $A$ and  $B$.  The \texttt{inverse}
 119 kernel  is called twice,  once to  inverse elements  of array  $C$ and  once for
 120 $A$. Finally,  we call the  function \texttt{cublasDdot} which computes  the dot
 121 product  of two  vectors.   To use  this  routine, we  must  specify the  handle
 122 initialized by  Cuda, the number  of elements to  consider, then each  vector is
 123 followed by the offset between every  element.  After the GPU computation, it is
 124 possible to check that both computation produce the same result.
 125
 126 \lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}
 127
 128 \section{Third example: matrix-matrix multiplication}
 129 \label{ch2:3ex}
 130
 131
 132
 133 Matrix-matrix multiplication is an operation  which is quite easy to parallelize
 134 with a GPU. If we consider that  a matrix is represented using a two dimensional
 135 array, $A[i][j]$ represents the element of  the $i^{th}$ row and of the $j^{th}$
 136 column. In  many cases, it is  easier to manipulate a  1D array instead  of a 2D
 137 array.   With Cuda,  even if  it is  possible to  manipulate 2D  arrays,  in the
 138 following we present an example based on a 1D array. For the sake of simplicity,
 139 we  consider we  have  a square  matrix of  size  \texttt{size}.  So  with a  1D
 140 array,  \texttt{A[i*size+j]} allows  us to  have access  to the  element  of the
 141 $i^{th}$ row and of the $j^{th}$ column.
 142
 143 With  a sequential  programming, the  matrix multiplication  is  performed using
 144 three loops. We assume that $A$, $B$  represent two square matrices and the
 145 result   of    the   multiplication    of   $A   \times    B$   is    $C$.   The
 146 element \texttt{C[i*size+j]} is computed as follows:
 147 \begin{equation}
 148 C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j];
 149 \end{equation}
 150
 151 In Listing~\ref{ch2:lst:ex3},  the CPU computation  is performed using  3 loops,
 152 one  for $i$,  one for  $j$  and one  for $k$.   In  order to  perform the  same
 153 computation on a  GPU, a naive solution consists in  considering that the matrix
 154 $C$ is split into  2 dimensional blocks.  The size of each  block must be chosen
 155 such as the number of threads per block is inferior to $1,024$.
 156
 157
 158 In Listing~\ref{ch2:lst:ex3},  we consider that  a block contains 16  threads in
 159 each   dimension,  the   variable  \texttt{width}   is  used   for   that.   The
 160 variable \texttt{nbTh} represents the number of threads per block. So, to be able
 161 to compute the matrix-matrix product on a GPU, each block of threads is assigned
 162 to compute the result  of the product for the elements of  this block.  The main
 163 part of the code is quite similar to the previous code.  Arrays are allocated in
 164 the  CPU and  the GPU.   Matrices $A$  and $B$  are randomly  initialized.  Then
 165 arrays are  transferred inside the  GPU memory with call  to \texttt{cudaMemcpy}.
 166 So the first step for each thread of a block is to compute the corresponding row
 167 and   column.    With   a    2   dimensional   decomposition,   \texttt{int   i=
 168 blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
 169 and  \texttt{int  j=   blockIdx.x*blockDim.x+  threadIdx.x;}  the  corresponding
 170 column. Then each  thread has to compute the  sum of the product of  the line of
 171 $A$   by   the  column   of   $B$.    In  order   to   use   a  register,   the
 172 kernel  \texttt{matmul}  uses a  variable  called  \texttt{sum}  to compute  the
 173 sum. Then the result is set into  the matrix at the right place. The computation
 174 of  CPU matrix-matrix multiplication  is performed  as described  previously.  A
 175 timer measures  the time.   In order to  use 2 dimensional  blocks, \texttt{dim3
 176 dimGrid(size/width,size/width);} allows us  to create \texttt{size/width} blocks
 177 in each  dimension.  Likewise,  \texttt{dim3 dimBlock(width,width);} is  used to
 178 create \texttt{width} thread  in each dimension. After that,  the kernel for the
 179 matrix  multiplication is  called. At  the end  of the  listing, the  matrix $C$
 180 computed by the GPU is transferred back  into the CPU and we check if both matrices
 181 C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
 182
 183
 184 With $1,024  \times 1,024$ matrices,  on a C2070M  Tesla card, this  code takes
 185 $37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
 186 takes $2465$ms  without any parallelization (using only  one core). Consequently
 187 the speed up  between the CPU and GPU  version is about $65$ which  is very good
 188 regarding the difficulty of parallelizing this code.
 189
 190 \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
 191
 192 \section{Conclusion}
 193 In this chapter, three simple Cuda examples have been  presented. They are
 194 quite  simple. As we  cannot  present  all the  possibilities  of  the  Cuda
 195 programming, interested  readers  are  invited  to  consult  Cuda  programming
 196 introduction books if some issues regarding the Cuda programming are not clear.
 197
 198 \putbib[Chapters/chapter2/biblio]
 199