-\chapterauthor{Author Name1}{Affiliation text1}
-\chapterauthor{Author Name2}{Affiliation text2}
-
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
\chapter{Introduction to CUDA}
\label{chapter2}
-\section{Introduction}\label{intro}
-In this chapter we give some simple examples on CUDA programming. The goal is
+\section{Introduction}
+\label{ch2:intro}
+
+In this chapter we give some simple examples of CUDA programming. The goal is
not to provide an exhaustive presentation of all the functionalities of CUDA but
-rather giving some basic elements. Of course, readers that do not know CUDA are
-invited to read other books that are specialized on CUDA programming.
+rather to give some basic elements. Of course, readers who do not know CUDA are
+invited to read other books that are specialized on CUDA programming (for
+example, \cite{ch2:Sanders:2010:CEI}).
\section{First example}
+\label{ch2:1ex}
-This first example is intented to show how to build a very simple example with
-CUDA. The goal of this example is to performed the sum of two arrays and
-putting the result into a third array. A cuda program consists in a C code
-which calls CUDA kernels that are executed on a GPU.
+This first example is intended to show how to build a very simple program with
+CUDA. Its goal is to perform the sum of two arrays and put the result into a
+third array. A CUDA program consists of C code which calls CUDA kernels that
+are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
-As GPUs have their own memory, the first step consists in allocating memory on
-the GPU. A call to \texttt{cudaMalloc} allows to allocate memory on the GPU. The
-first parameter of this function is a pointer on a memory on the device
-(i.e. the GPU). In this example, \texttt{d\_} is added on each variable allocated
-on the GPU meaning this variable is on the GPU. The second parameter represents
-the size of the allocated variables, this size is in bits.
+As GPUs have their own memory, the first step consists of allocating memory on
+the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc}
+allocates memory on the GPU. The first parameter of this function is the address
+of a pointer which, after the call, points to the allocated memory on the
+device, i.e., the GPU. The second parameter represents the size of the
+allocated variables; this size is expressed in bytes.
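+
+For illustration, the allocation of an array of \texttt{num} doubles on the
+device could be sketched as follows (the names \texttt{d\_A} and \texttt{num}
+are chosen here for the explanation and are not necessarily those of
+Listing~\ref{ch2:lst:ex1}):
+\begin{lstlisting}
+double *d_A;
+// allocate num elements of type double on the device; the size is in bytes
+cudaMalloc((void**)&d_A, num*sizeof(double));
+// ... use d_A in kernels ...
+cudaFree(d_A);   // release the device memory when it is no longer needed
+\end{lstlisting}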
+\pagebreak
+\lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
+
In this example, we want to compare the execution time of the additions of two
arrays in CPU and GPU. So for both these operations, a timer is created to
-measure the time. CUDA proposes to manipulate timers quick easily. The first
-step is to create the timer, then to start it and at the end to stop it. For
-each of these operations a dedicated functions is used.
+measure the time. CUDA provides functions to manipulate timers quite easily. The first
+step is to create the timer\index{CUDA functions!timer}, then to start it, and at
+the end to stop it. For each of these operations a dedicated function is used.
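+
+One standard way of timing GPU code is to use CUDA events; the listing may rely
+on a different timer helper, so the following sketch only illustrates the
+create/start/stop pattern described above:
+\begin{lstlisting}
+cudaEvent_t start, stop;
+cudaEventCreate(&start);             // create the timer
+cudaEventCreate(&stop);
+cudaEventRecord(start, 0);           // start it
+// ... code to be timed ...
+cudaEventRecord(stop, 0);            // stop it
+cudaEventSynchronize(stop);          // wait until the stop event is reached
+float elapsed_ms;
+cudaEventElapsedTime(&elapsed_ms, start, stop);  // elapsed time in milliseconds
+\end{lstlisting}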
-In order to compute the same sum with a GPU, the first step consits in
+In order to compute the same sum with a GPU, the first step consists of
transferring the data from the CPU (considered as the host with CUDA) to the GPU
-(considered as the device with CUDA). A call to \texttt{cudaMalloc} allows to
-copy the content of an array allocated in the host to the device when the fourth
-parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the
-function is the destination array, the second is the source array and the third
-is the number of elements to copy (exprimed in bytes).
+(considered as the device with CUDA). A call to \texttt{cudaMemcpy} copies the content of an array allocated on the host to the device when the fourth
+parameter is set
+to \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}. The first
+parameter of the function is the destination array, the second is the
+source array, and the third is the size of the data to copy (expressed in
+bytes).
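+
+Assuming the arrays of the previous sketch, such a transfer could be written as
+(again, the names are ours):
+\begin{lstlisting}
+// copy num doubles from the host array A to the device array d_A
+cudaMemcpy(d_A, A, num*sizeof(double), cudaMemcpyHostToDevice);
+\end{lstlisting}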
+
+Now the GPU contains the data needed to perform the addition. In sequential
+programming, such addition is achieved with a loop on all the elements.
+With a GPU, it is possible to perform the addition of all the elements of the
+two arrays in parallel (if the number of blocks and threads per block is
+sufficient). At the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel,
+called \texttt{addition}, is defined to compute in parallel the summation of the
+two arrays. With CUDA, a kernel starts with the
+keyword \texttt{\_\_global\_\_}\index{CUDA keywords!\_\_global\_\_} which
+indicates that this kernel can be called from the C code. The first instruction
+in this kernel is used to compute the variable \texttt{tid} which represents the
+thread index. This thread index\index{CUDA keywords!thread index} is computed according to
+the values of the block index
+(called \texttt{blockIdx} \index{CUDA keywords!blockIdx} in CUDA) and of the
+thread index (called \texttt{threadIdx}\index{CUDA keywords!threadIdx} in
+CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension,
+2 dimensions, or 3 dimensions. The dimension of the blocks of threads must be
+chosen carefully, according to the dimension of the data being manipulated. In our example, only one dimension is
+used. Then using the notation \texttt{.x}, we can access the first dimension
+(\texttt{.y} and \texttt{.z}, respectively allow access to the second and
+third dimensions). The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
+gives the size of each block.
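+
+A kernel following this description could be sketched as below; the exact
+signature and bound check in Listing~\ref{ch2:lst:ex1} may differ slightly:
+\begin{lstlisting}
+__global__ void addition(int size, double *d_C, double *d_A, double *d_B) {
+  // global index of the thread in the 1D grid
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (tid < size)                 // guard against extra threads
+    d_C[tid] = d_A[tid] + d_B[tid];
+}
+\end{lstlisting}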
+
+
+
+
+
+\section{Second example: using CUBLAS \index{CUBLAS}}
+\label{ch2:2ex}
+
+The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient
+routines for basic linear operations. Those routines are heavily used in many
+scientific applications and are optimized for vector operations, matrix-vector
+operations, and matrix-matrix
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
+to be easy to implement with CUDA; however, as soon as a reduction is
+needed, implementing an efficient reduction routine with CUDA is far from being
+simple. Roughly speaking, a reduction operation\index{reduction operation} is an
+operation which combines all the elements of an array and extracts a number
+computed from all the elements. For example, a sum, a maximum, or a dot product
+are reduction operations.
+
+In this second example, we have two vectors $A$ and $B$. First
+of all, we want to compute the sum of both vectors and store the result in a vector $C$. Then we want
+to compute the scalar product between $1/C$ and $1/A$. This is just an example
+which has no direct interest except to show how to program it with CUDA.
+
+Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
+addition of two arrays is exactly the same as the one described in the
+previous example.
+
+The kernel to compute the inverse of the elements of an array is very
+simple. For each thread index, the corresponding element of the array is
+replaced by its inverse.
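+
+Such a kernel could be sketched as follows (the parameter names are chosen here
+for the explanation):
+\begin{lstlisting}
+__global__ void inverse(int size, double *d_x) {
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (tid < size)
+    d_x[tid] = 1.0 / d_x[tid];   // replace each element by its inverse
+}
+\end{lstlisting}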
+
+In the main function, the beginning is very similar to the one in the previous
+example. First, the user is asked to define the number of elements. Then a
+call to \texttt{cublasCreate} initializes the CUBLAS library. It
+creates a handle. Then all the arrays are allocated in the host and the device,
+as in the previous example. Both arrays $A$ and $B$ are initialized. The CPU
+computation is performed and the time for this is measured. In
+order to compute the same result for the GPU, first of all, data from the CPU
+need to be copied into the memory of the GPU. For that, it is possible to use
+the CUBLAS function \texttt{cublasSetVector}. This function has several
+arguments. More precisely, the first argument represents the number of elements
+to transfer, the second argument is the size of each element, the third argument
+represents the source array to transfer (in the CPU), the fourth is an
+offset between each element of the source (usually this value is set to 1), the
+fifth is the destination (in the GPU), and the last is an offset between each
+element of the destination. Then we call the kernel \texttt{addition} which
+computes the sum of all elements of arrays $A$ and $B$. The \texttt{inverse}
+kernel is called twice, once to inverse elements of array $C$ and once for
+$A$. Finally, we call the function \texttt{cublasDdot} which computes the dot
+product of two vectors. To use this routine, we must specify the handle
+initialized by \texttt{cublasCreate}, the number of elements to consider, then each vector is
+followed by the offset between consecutive elements. After the GPU computation, it is
+possible to check that both computations produce the same result.
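+
+The sequence of CUBLAS calls described above could be sketched as follows (the
+handle and array names are ours and error checking is omitted; the complete
+code is in Listing~\ref{ch2:lst:ex2}):
+\begin{lstlisting}
+// requires #include <cublas_v2.h>
+cublasHandle_t handle;
+cublasCreate(&handle);                        // initialize CUBLAS
+// copy size doubles from the host arrays A and B to the device arrays
+cublasSetVector(size, sizeof(double), A, 1, d_A, 1);
+cublasSetVector(size, sizeof(double), B, 1, d_B, 1);
+// ... launch the addition and inverse kernels on d_A, d_B, d_C ...
+double dot = 0.0;
+cublasDdot(handle, size, d_C, 1, d_A, 1, &dot);  // dot product of d_C and d_A
+cublasDestroy(handle);                        // release the CUBLAS context
+\end{lstlisting}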
+
+\lstinputlisting[label=ch2:lst:ex2,caption=simple example with CUBLAS]{Chapters/chapter2/ex2.cu}
+
+\section{Third example: matrix-matrix multiplication}
+\label{ch2:3ex}
+
+
+
+Matrix-matrix multiplication is an operation which is quite easy to parallelize
+with a GPU. If we consider that a matrix is represented using a two-dimensional
+array, $A[i][j]$ represents the element of the $i$th row and of the $j$th
+column. In many cases, it is easier to manipulate a one-dimensional (1D) array rather than a 2D
+array. With CUDA, even if it is possible to manipulate 2D arrays, in the
+following we present an example based on a 1D array. For the sake of simplicity,
+we consider we have a square matrix of size \texttt{size}. So with a 1D
+array, \texttt{A[i*size+j]} allows us to have access to the element of the
+$i$th row and of the $j$th column.
+
+With sequential programming, the matrix-matrix multiplication is performed using
+three loops. We assume that $A$ and $B$ represent two square matrices and the
+result of the multiplication of $A \times B$ is $C$. The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j].
+\end{equation}
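+
+The corresponding sequential code is simply a direct translation of this
+formula (a sketch; the loops of Listing~\ref{ch2:lst:ex3} follow the same
+pattern):
+\begin{lstlisting}
+for (int i = 0; i < size; i++)
+  for (int j = 0; j < size; j++) {
+    double sum = 0;
+    for (int k = 0; k < size; k++)
+      sum += A[size*i+k] * B[size*k+j];
+    C[size*i+j] = sum;
+  }
+\end{lstlisting}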
+
+In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops,
+one for $i$, one for $j$, and one for $k$. In order to perform the same
+computation on a GPU, a naive solution consists of considering that the matrix
+$C$ is split into 2-dimensional blocks. The size of each block must be chosen
+such that the number of threads per block does not exceed $1,024$.
+
+
+In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
+each dimension; the variable \texttt{width} is used for that. The
+variable \texttt{nbTh} represents the number of threads per block. So,
+to compute the matrix-matrix product on a GPU, each block of threads is assigned
+to compute the result of the product of the elements of that block. The main
+part of the code is quite similar to the previous code. Arrays are allocated in
+the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then
+arrays are transferred to the GPU memory with a call to \texttt{cudaMemcpy}.
+So the first step for each thread of a block is to compute the corresponding row
+and column. With a 2-dimensional decomposition, \texttt{int i=
+blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding row
+and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
+column. Then each thread has to compute the sum of the product of the row of
+$A$ by the column of $B$. In order to use a register, the
+kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
+sum. Then the result is set into the matrix at the right place. The computation
+of CPU matrix-matrix multiplication is performed as described previously. A
+timer measures the time. In order to use 2-dimensional blocks, \texttt{dim3
+dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
+in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
+create \texttt{width} threads in each dimension. After that, the kernel for the
+matrix multiplication is called. At the end of the listing, the matrix $C$
+computed by the GPU is transferred back to the CPU, and we check that the matrices
+$C$ computed by the CPU and by the GPU are identical with a precision of $10^{-4}$.
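+
+The kernel and its launch configuration could be sketched as below; this only
+illustrates the mechanism described above, the exact code being in
+Listing~\ref{ch2:lst:ex3}:
+\begin{lstlisting}
+__global__ void matmul(double *d_A, double *d_B, double *d_C, int size) {
+  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row computed by this thread
+  int j = blockIdx.x * blockDim.x + threadIdx.x;  // column computed by this thread
+  double sum = 0;                                 // register used for the accumulation
+  for (int k = 0; k < size; k++)
+    sum += d_A[size*i+k] * d_B[size*k+j];
+  d_C[size*i+j] = sum;
+}
+
+// launch configuration: size/width blocks and width threads in each dimension
+dim3 dimGrid(size/width, size/width);
+dim3 dimBlock(width, width);
+matmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, size);
+\end{lstlisting}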
+
+
+With $1,024 \times 1,024$ matrices, on a C2070M Tesla card, this code takes
+$37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
+takes $2465$ms without any parallelization (using only one core). Consequently
+the speedup of the GPU version over the CPU version is about $65$, which is very good
+considering the difficulty of parallelizing this code.
+
+\lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with CUDA]{Chapters/chapter2/ex3.cu}
+
+\section{Conclusion}
+In this chapter, three simple CUDA examples have been presented. As we cannot
+present all the possibilities of CUDA programming, interested readers are
+invited to consult introductory books on CUDA programming if some issues
+are not clear.
-\putbib[biblio]
+\putbib[Chapters/chapter2/biblio]