new

[book_gpu.git] / BookGPU / Chapters / chapter2 / ch2.tex
diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex

index ae0704b99da0385b8e934d872795f131b465290c..490d7530d01b572e6aecf949bf400886db5bf44c 100755 (executable)
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -1,84 +1,94 @@
-\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
  
  \chapter{Introduction to CUDA}
  \label{chapter2}
  
-\section{Introduction}\label{intro}
-In this chapter  we give some simple examples on CUDA  programming.  The goal is
+\section{Introduction}
+\label{ch2:intro}
+
+In this chapter  we give some simple examples of CUDA  programming.  The goal is
  not to provide an exhaustive presentation of all the functionalities of CUDA but
-rather giving some basic elements. Of  course, readers that do not know CUDA are
+rather to give some basic elements. Of  course, readers who do not know CUDA are
  invited  to read  other  books that  are  specialized on  CUDA programming  (for
-example: \cite{ch2:Sanders:2010:CEI}).
+example, \cite{ch2:Sanders:2010:CEI}).
  
  
  \section{First example}
+\label{ch2:1ex}
+
+This first example is  intented to show how to build a  very simple program with
+CUDA.  Its goal  is to perform the sum  of two arrays and put the  result into a
+third array.  A CUDA program consists in  a C code which calls CUDA kernels that
+are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
  
-This first example is  intented to show how to build a  very simple example with
-CUDA.   The goal  of this  example is  to performed  the sum  of two  arrays and
-putting the  result into a  third array.   A cuda program  consists in a  C code
-which calls CUDA kernels that are executed on a GPU. The listing of this code is
-in Listing~\ref{ch2:lst:ex1}.
  
+As GPUs have  their own memory, the first step consists  of allocating memory on
+the  GPU.    A  call  to   \texttt{cudaMalloc}\index{CUDA  functions!cudaMalloc}
+allocates memory on  the GPU.  The first parameter of this  function is a pointer
+on a  memory on the  device, i.e. the  GPU. The second parameter  represents the
+size of the allocated variables, this size is expressed in bits.
+\pagebreak
+\lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
  
-As GPUs have  their own memory, the first step consists  in allocating memory on
-the GPU. A call to \texttt{cudaMalloc} allows to allocate memory on the GPU. The
-first  parameter of  this  function  is a  pointer  on a  memory  on the  device
-(i.e. the GPU). In this example, \texttt{d\_} is added on each variable allocated
-on the GPU meaning this variable  is on the GPU. The second parameter represents
-the size of the allocated variables, this size is in bits.
  
  In this example, we  want to compare the execution time of  the additions of two
  arrays in  CPU and  GPU. So  for both these  operations, a  timer is  created to
-measure the  time. CUDA proposes to  manipulate timers quick  easily.  The first
-step is  to create the timer, then  to start it and  at the end to  stop it. For
-each of these operations a dedicated functions is used.
+measure the  time. CUDA  manipulates timers quite  easily.  The first
+step is to create the timer\index{CUDA functions!timer}, then to start it, and at
+the end to stop it. For each of these operations a dedicated function is used.
  
-In  order to  compute  the  same sum  with  a GPU,  the  first  step consits  in
+In  order to  compute  the same  sum  with a  GPU, the  first  step consists  of
  transferring the data from the CPU (considered as the host with CUDA) to the GPU
-(considered as the  device with CUDA).  A call  to \texttt{cudaMalloc} allows to
-copy the content of an array allocated in the host to the device when the fourth
-parameter is set to  \texttt{cudaMemcpyHostToDevice}. The first parameter of the
-function is the destination array, the  second is the source array and the third
-is the number of elements to copy (exprimed in bytes).
-
-Now that the GPU contains the data needed to perform the addition. In sequential
-such addition is achieved  out with a loop on all the  elements.  With a GPU, it
-is possible  to perform the addition of  all elements of the  arrays in parallel
-(if  the  number   of  blocks  and  threads  per   blocks  is  sufficient).   In
-Listing\ref{ch2:lst:ex1}     at    the     beginning,    a     simple    kernel,
+(considered as the  device with CUDA).  A call  to \texttt{cudaMemcpy} copies the content of an array allocated in the host to the device when the fourth
+parameter                                 is                                 set
+to  \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}.  The first
+parameter of the function is the  destination array, the second is the
+source  array, and  the third  is the  number of  elements to  copy  (expressed in
+bytes).
+
+Now the  GPU contains the  data needed to  perform the addition.   In sequential
+programming, such  addition is  achieved   with a loop  on all  the elements.
+With a GPU,  it is possible to perform  the addition of all the  elements of the
+two  arrays in  parallel (if  the number  of blocks  and threads  per  blocks is
+sufficient).   In Listing~\ref{ch2:lst:ex1}  at the  beginning, a  simple kernel,
  called \texttt{addition} is defined to  compute in parallel the summation of the
-two     arrays.     With     CUDA,      a     kernel     starts     with     the
-keyword   \texttt{\_\_global\_\_}   \index{CUDA~keywords!\_\_shared\_\_}   which
+two     arrays.      With     CUDA,     a     kernel     starts     with     the
+keyword   \texttt{\_\_global\_\_}   \index{CUDA keywords!\_\_shared\_\_}   which
  indicates that this kernel can be called from the C code.  The first instruction
  in this kernel is used to compute the variable \texttt{tid} which represents the
-thread index.   This thread index\index{thread  index} is computed  according to
-the   values    of   the   block   index    (it   is   a    variable   of   CUDA
-called  \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of  threads can
-be decomposed into  1 dimension, 2 dimensions or 3  dimensions. According to the
-dimension of data  manipulated, the appropriate dimension can  be useful. In our
-example, only  one dimension  is used.  Then  using notation \texttt{.x}  we can
-access to the first dimension (\texttt{.y} and \texttt{.z} allow respectively to
-access      to      the     second      and      third     dimension).       The
-variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives  the size of each
-block.
-
-
-
-\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
-
-\section{Second example: using CUBLAS}
-
-The Basic Linear Algebra Subprograms  (BLAS) allows programmer to use performant
-routines that are often used. Those routines are heavily used in many scientific
-applications  and  are  very   optimized  for  vector  operations,  matrix-vector
-operations                           and                           matrix-matrix
-operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
-to be  easy to  implement with CUDA.   Nevertheless, as  soon as a  reduction is
-needed, implementing an efficient reduction routines with CUDA is far from being
-simple.
-
-In this second example, we consider that  we have two vectors $A$ and $B$. First
-of all, we want to compute the sum  of both vectors in a vector $C$. Then we want
+thread index.   This thread index\index{CUDA keywords!thread  index} is computed  according to
+the           values            of           the           block           index
+(called  \texttt{blockIdx} \index{CUDA keywords!blockIdx}  in CUDA)  and  of the
+thread   index   (called   \texttt{threadIdx}\index{CUDA keywords!threadIdx}   in
+CUDA). Blocks of threads and thread  indexes can be decomposed into 1 dimension,
+2 dimensions, or  3 dimensions.  According to the  dimension of manipulated data,
+the dimension of blocks of threads  must be chosen carefully. In our example,  only one dimension is
+used.   Then using the notation  \texttt{.x}, we  can access  the  first dimension
+(\texttt{.y}  and \texttt{.z},  respectively allow access  to the  second and
+third dimensions).   The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
+gives the size of each block.
+
+
+
+
+\pagebreak
+\section{Second example: using CUBLAS \index{CUBLAS}}
+\label{ch2:2ex}
+
+The Basic Linear Algebra Subprograms  (BLAS) allow programmers to use efficient
+routines for basic linear operations. Those  routines  are heavily  used in  many
+scientific applications  and are optimized for  vector operations, matrix-vector
+operations,                           and                           matrix-matrix
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some  of those operations seem
+to be  easy to  implement with CUDA; however, as  soon as a  reduction is
+needed, implementing an efficient reduction routine with CUDA is far from being
+simple. Roughly speaking, a reduction operation\index{reduction operation} is an
+operation  which combines  all the  elements of  an array  and extracts  a number
+computed from all the  elements. For example, a sum, a maximum,  or a dot product
+are reduction operations.
+
+In this second example,   we have two vectors $A$ and $B$. First
+of all, we want to compute the sum  of both vectors and store the result in a vector $C$. Then we want
  to compute the  scalar product between $1/C$ and $1/A$. This  is just an example
  which has no direct interest except to show how to program it with CUDA.
  
@@ -91,23 +101,99 @@ simple. For  each thread index,  the inverse of  the array replaces  the initial
  array.
  
  In the main function,  the beginning is very similar to the  one in the previous
-example.   First, the number  of elements  is asked  to the  user.  Then  a call
-to \texttt{cublasCreate} allows to initialize  the cublas library. It creates an
-handle. Then all the arrays are allocated  in the host and the device, as in the
-previous  example.  Both  arrays  $A$ and  $B$  are initialized.   Then the  CPU
-computation is performed  and the time for this CPU  computation is measured. In
-order to  compute the same result  on the GPU, first  of all, data  from the CPU
+example.  First,  the user is  asked to define  the number of elements.   Then a
+call  to \texttt{cublasCreate}  initializes  the  CUBLAS library.   It
+creates a handle. Then all the arrays  are allocated in the host and the device,
+as in the  previous example.  Both arrays $A$ and $B$  are initialized.  The CPU
+computation is performed  and the time for this  is measured. In
+order to  compute the same result  for the GPU, first  of all, data  from the CPU
  need to be  copied into the memory of  the GPU. For that, it is  possible to use
-cublas function \texttt{cublasSetVector}.  This function several arguments. More
-precisely, the first argument represents the number of elements to transfer, the
-second arguments is the size of  each elements, the third element represents the
-source of the  array to transfer (in  the GPU), the fourth is  an offset between
-each element of  the source (usually this value  is set to 1), the  fifth is the
-destination (in the GPU)  and the last is an offset between  each element of the
-destination.
-
-\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
-
+CUBLAS   function   \texttt{cublasSetVector}.    This   function   has   several
+arguments. More precisely, the first  argument represents the number of elements
+to transfer, the second arguments is the size of each element, the third element
+represents the source  of the array to  transfer (in the GPU), the  fourth is an
+offset between each element of the source  (usually this value is set to 1), the
+fifth is  the destination (in the  GPU), and the  last is an offset  between each
+element  of the  destination. Then  we call  the kernel  \texttt{addition} which
+computes the  sum of all elements  of arrays $A$ and  $B$.  The \texttt{inverse}
+kernel  is called twice,  once to  inverse elements  of array  $C$ and  once for
+$A$. Finally,  we call the  function \texttt{cublasDdot} which computes  the dot
+product  of two  vectors.   To use  this  routine, we  must  specify the  handle
+initialized by  CUDA, the number  of elements to  consider, then each  vector is
+followed by the offset between every  element.  After the GPU computation, it is
+possible to check that both computations produce the same result.
+
+\lstinputlisting[label=ch2:lst:ex2,caption=simple example with CUBLAS]{Chapters/chapter2/ex2.cu}
+
+\section{Third example: matrix-matrix multiplication}
+\label{ch2:3ex}
+
+
+
+Matrix-matrix multiplication is an operation  which is quite easy to parallelize
+with a GPU. If we consider that  a matrix is represented using a two-dimensional
+array, $A[i][j]$ represents the element of  the $i$ row and of the $j$
+column. In  many cases, it is  easier to manipulate a  one-dimentional (1D) array rather than a 2D
+array.   With CUDA,  even if  it is  possible to  manipulate 2D  arrays,  in the
+following we present an example based on a 1D array. For the sake of simplicity,
+we  consider we  have  a square  matrix of  size  \texttt{size}.  So  with a  1D
+array,  \texttt{A[i*size+j]} allows  us to  have access  to the  element  of the
+$i$ row and of the $j$ column.
+
+With  sequential  programming, the  matrix-matrix multiplication  is  performed using
+three loops. We assume that $A$, $B$  represent two square matrices and the
+result   of    the   multiplication    of   $A   \times    B$   is    $C$.   The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j].
+\end{equation}
+
+In Listing~\ref{ch2:lst:ex3},  the CPU computation  is performed using  3 loops,
+one  for $i$,  one for  $j$,  and one  for $k$.   In  order to  perform the  same
+computation on a  GPU, a naive solution consists of  considering that the matrix
+$C$ is split into  2-dimensional blocks.  The size of each  block must be chosen
+such that the number of threads per block is less than $1,024$.
+
+
+In Listing~\ref{ch2:lst:ex3},  we consider that  a block contains 16  threads in
+each   dimension,  the   variable  \texttt{width}   is  used   for   that.   The
+variable \texttt{nbTh} represents the number of threads per block. So, 
+to compute the matrix-matrix product on a GPU, each block of threads is assigned
+to compute the result  of the product of the elements of  that block.  The main
+part of the code is quite similar to the previous code.  Arrays are allocated in
+the  CPU and  the GPU.   Matrices $A$  and $B$  are randomly  initialized.  Then
+arrays are  transferred to the  GPU memory with call  to \texttt{cudaMemcpy}.
+So the first step for each thread of a block is to compute the corresponding row
+and   column.    With   a   2-dimensional   decomposition,   \texttt{int   i=
+blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
+and  \texttt{int  j=   blockIdx.x*blockDim.x+  threadIdx.x;}  the  corresponding
+column. Then each  thread has to compute the  sum of the product of  the row of
+$A$   by   the  column   of   $B$.    In  order   to   use   a  register,   the
+kernel  \texttt{matmul}  uses a  variable  called  \texttt{sum}  to compute  the
+sum. Then the result is set into  the matrix at the right place. The computation
+of  CPU matrix-matrix multiplication  is performed  as described  previously.  A
+timer measures  the time.   In order to  use 2-dimensional  blocks, \texttt{dim3
+dimGrid(size/width,size/width);} allows us  to create \texttt{size/width} blocks
+in each  dimension.  Likewise,  \texttt{dim3 dimBlock(width,width);} is  used to
+create \texttt{width} thread  in each dimension. After that,  the kernel for the
+matrix  multiplication is  called. At  the end  of the  listing, the  matrix $C$
+computed by the GPU is transferred back  into the CPU and we check that both matrices
+C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
+
+
+With $1,024  \times 1,024$ matrices,  on a C2070M  Tesla card, this  code takes
+$37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
+takes $2465$ms  without any parallelization (using only  one core). Consequently
+the speed up  between the CPU and GPU  version is about $65$ which  is very good
+considering the difficulty of parallelizing this code.
+
+\lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
+
+\section{Conclusion}
+In this  chapter, three simple CUDA  examples have been presented.  As we cannot
+present all  the possibilities of  the CUDA programming, interested  readers are
+invited to consult CUDA programming  introduction books if some issues regarding
+the CUDA programming are not clear.
  
  \putbib[Chapters/chapter2/biblio]