\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}

\chapter{Introduction to Cuda}
\label{chapter2}

\section{Introduction}
\label{ch2:intro}

In this chapter we give some simple examples of Cuda programming. The goal is
not to provide an exhaustive presentation of all the functionalities of Cuda but
rather to give some basic elements. Of course, readers who do not know Cuda are
invited to read other books that are specialized in Cuda programming (for
example, \cite{ch2:Sanders:2010:CEI}).

\section{First example}
\label{ch2:1ex}

This first example is intended to show how to build a very simple program with
Cuda. Its goal is to compute the sum of two arrays and put the result into a
third array. A Cuda program consists of a C code which calls Cuda kernels that
are executed on a GPU. The listing of this code is given in
Listing~\ref{ch2:lst:ex1}.

As GPUs have their own memory, the first step consists in allocating memory on
the GPU. A call to \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc}
allocates memory on the GPU. The first parameter of this function is a pointer
to a memory area on the device (i.e. the GPU). In this example, the prefix
\texttt{d\_} is added to each variable allocated on the GPU, meaning that this
variable resides on the GPU. The second parameter represents the size of the
allocated variable; this size is expressed in bytes.

In this example, we want to compare the execution time of the addition of the
two arrays on the CPU and on the GPU. So for both operations, a timer is created
to measure the time. Cuda makes it quite easy to manipulate timers. The first
step is to create the timer\index{Cuda~functions!timer}, then to start it and at
the end to stop it. For each of these operations a dedicated function is used.
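As an illustration of these two steps, the following fragment is a minimal
sketch of device allocation and of a timer built with Cuda events. It is not
extracted from Listing~\ref{ch2:lst:ex1}: the variable names (\texttt{d\_A},
\texttt{d\_B}, \texttt{d\_C}, \texttt{nb}) and the choice of Cuda events as the
timing mechanism are ours; they simply show one possible way to implement the
operations described above.

\begin{lstlisting}[caption={Sketch of device allocation and event-based timing (illustrative only)}]
int *d_A, *d_B, *d_C;          /* device copies of the three arrays */
int nb = 1024;                 /* number of elements (arbitrary value for this sketch) */

/* allocate the three arrays on the GPU; the size is given in bytes */
cudaMalloc((void**)&d_A, nb*sizeof(int));
cudaMalloc((void**)&d_B, nb*sizeof(int));
cudaMalloc((void**)&d_C, nb*sizeof(int));

/* create a timer based on Cuda events, start it, stop it and read the elapsed time */
cudaEvent_t start, stop;
float elapsed;                 /* elapsed time in milliseconds */
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);
/* ... computation to be timed ... */
cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&elapsed, start, stop);
\end{lstlisting}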
In order to compute the same sum on a GPU, the first step consists in
transferring the data from the CPU (considered as the host with Cuda) to the GPU
(considered as the device with Cuda). A call to \texttt{cudaMemcpy} copies the
content of an array allocated on the host to the device when the fourth
parameter is set
to \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}. The first
parameter of the function is the destination array, the second is the source
array and the third is the number of bytes to copy.

Now the GPU contains the data needed to perform the addition. In a sequential
code, such an addition is achieved with a loop over all the elements. With a
GPU, it is possible to perform the addition of all the elements of the two
arrays in parallel (if the number of blocks and threads per block is
sufficient). At the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel,
called \texttt{addition}, is defined to compute in parallel the sum of the two
arrays. With Cuda, a kernel starts with the
keyword \texttt{\_\_global\_\_}\index{Cuda~keywords!\_\_global\_\_} which
indicates that this kernel can be called from the C code. The first instruction
in this kernel is used to compute the variable \texttt{tid} which represents the
thread index. This thread index\index{thread index} is computed according to
the values of the block index
(called \texttt{blockIdx}\index{Cuda~keywords!blockIdx} in Cuda) and of the
thread index (called \texttt{threadIdx}\index{Cuda~keywords!threadIdx} in
Cuda). Blocks of threads and thread indexes can be decomposed into 1, 2 or 3
dimensions. According to the dimension of the data manipulated, the appropriate
dimension can be chosen. In our example, only one dimension is used. Then, using
the notation \texttt{.x}, we can access the first dimension (\texttt{.y}
and \texttt{.z} respectively give access to the second and third
dimensions). The variable \texttt{blockDim}\index{Cuda~keywords!blockDim} gives
the size of each block.
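To make these notions concrete, a kernel following this scheme could be written
as in the short sketch below. This is our own minimal version (the parameter
order and the choice of 512 threads per block are arbitrary); the complete code
is the one given in Listing~\ref{ch2:lst:ex1}. The triple angle bracket
syntax \texttt{<<<nbBlocks,nbThreadsPerBlock>>>} used to launch the kernel
specifies the number of blocks and the number of threads per block.

\begin{lstlisting}[caption={Sketch of a simple addition kernel and its call (illustrative only)}]
__global__ void addition(int size, int *d_C, int *d_A, int *d_B) {
  /* global thread index: block index times block size plus local thread index */
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < size)              /* guard: the last block may contain extra threads */
    d_C[tid] = d_A[tid] + d_B[tid];
}

/* possible call from the host code: enough blocks of 512 threads to cover size elements */
addition<<<(size + 511) / 512, 512>>>(size, d_C, d_A, d_B);
\end{lstlisting}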
\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}

\section{Second example: using CUBLAS}
\label{ch2:2ex}

The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient
routines that are often required. Those routines are heavily used in many
scientific applications and are highly optimized for vector operations,
matrix-vector operations and matrix-matrix
operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
to be easy to implement with Cuda. Nevertheless, as soon as a reduction is
needed, implementing an efficient reduction routine with Cuda is far from being
simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
operation which combines all the elements of an array and extracts a single
number computed from all the elements. For example, a sum, a maximum or a dot
product are reduction operations.

In this second example, we consider that we have two vectors $A$ and $B$. First
of all, we want to compute the sum of both vectors and store it in a vector
$C$. Then we want to compute the scalar product between $1/C$ and $1/A$. This is
just an example which has no direct interest except to show how to program it
with Cuda.

Listing~\ref{ch2:lst:ex2} shows this example with Cuda. The first kernel, for
the addition of two arrays, is exactly the same as the one described in the
previous example.

The kernel to compute the inverse of the elements of an array is very
simple. For each thread index, the inverse of the array element replaces the
initial value.

In the main function, the beginning is very similar to the one in the previous
example. First, the user is asked for the number of elements. Then a call
to \texttt{cublasCreate} initializes the cublas library and creates a
handle. Then all the arrays are allocated on the host and the device, as in the
previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
computation is performed and the time of this CPU computation is measured. In
order to compute the same result on the GPU, first of all, the data from the CPU
need to be copied into the memory of the GPU. For that, it is possible to use
the cublas function \texttt{cublasSetVector}. This function takes several
arguments. More precisely, the first argument represents the number of elements
to transfer, the second argument is the size of each element, the third argument
represents the source array to transfer (on the CPU), the fourth is an offset
between each element of the source (usually this value is set to 1), the fifth
is the destination (on the GPU) and the last is an offset between each element
of the destination. Then we call the kernel \texttt{addition} which computes the
sum of all the elements of arrays $A$ and $B$. The \texttt{inverse} kernel is
called twice, once to invert the elements of array $C$ and once for
$A$. Finally, we call the function \texttt{cublasDdot} which computes the dot
product of two vectors. To use this routine, we must specify the handle
initialized by \texttt{cublasCreate}, the number of elements to consider, then
each vector followed by the offset between every element. After the GPU
computation, it is possible to check that both computations produce the same
result.
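The following fragment summarizes how these cublas calls fit together. It is a
simplified sketch with our own variable names (\texttt{h\_A} and \texttt{h\_B}
for the host arrays, \texttt{d\_A}, \texttt{d\_B} and \texttt{d\_C} for the
device arrays, \texttt{nbBlocks} and \texttt{nbThreadsPerBlock} for the launch
configuration); the complete program is given in Listing~\ref{ch2:lst:ex2}.

\begin{lstlisting}[caption={Sketch of the cublas calls used in the second example (illustrative only)}]
#include <cublas_v2.h>

cublasHandle_t handle;
cublasCreate(&handle);                      /* initialize the cublas library */

/* copy the nb elements of the host arrays h_A and h_B to the device arrays d_A and d_B */
cublasSetVector(nb, sizeof(double), h_A, 1, d_A, 1);
cublasSetVector(nb, sizeof(double), h_B, 1, d_B, 1);

/* d_C = d_A + d_B, then d_C and d_A are replaced by their element-wise inverses */
addition<<<nbBlocks, nbThreadsPerBlock>>>(nb, d_C, d_A, d_B);
inverse<<<nbBlocks, nbThreadsPerBlock>>>(nb, d_C);
inverse<<<nbBlocks, nbThreadsPerBlock>>>(nb, d_A);

/* dot product of the two device vectors; the result is returned in dot */
double dot;
cublasDdot(handle, nb, d_C, 1, d_A, 1, &dot);

cublasDestroy(handle);                      /* release the cublas handle */
\end{lstlisting}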
\lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}

\section{Third example: matrix-matrix multiplication}
\label{ch2:3ex}

Matrix-matrix multiplication is an operation which is quite easy to parallelize
with a GPU. If we consider that a matrix is represented using a two dimensional
array, $A[i][j]$ represents the element of the $i^{th}$ row and of the $j^{th}$
column. In many cases, it is easier to manipulate a 1D array than a 2D
array. With Cuda, even if it is possible to manipulate 2D arrays, in the
following we present an example based on a 1D array. For the sake of simplicity,
we consider that we have a square matrix of size \texttt{size}. So with a 1D
array, \texttt{A[i*size+j]} allows us to access the element of the $i^{th}$ row
and of the $j^{th}$ column.

With sequential programming, the matrix multiplication is performed using three
loops. Suppose that $A$ and $B$ are two square matrices and that $C$ is the
result of the multiplication $A \times B$. The element \texttt{C[i*size+j]} is
computed as follows:
\begin{equation}
C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j].
\end{equation}

In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of the code is
performed using 3 loops, one for $i$, one for $j$ and one for $k$. In order to
perform the same computation on a GPU, a naive solution consists in considering
that the matrix $C$ is split into 2 dimensional blocks. The size of each block
must be chosen such that the number of threads per block does not exceed
$1,024$.

In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
each dimension; the variable \texttt{width} is used for that. The
variable \texttt{nbTh} represents the number of threads per block. So to be able
to compute the matrix-matrix product on a GPU, each block of threads is assigned
to compute the result of the product for the elements of this block. The main
part of the code is quite similar to the previous code. Arrays are allocated in
the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then the
arrays are transferred inside the GPU memory with calls to \texttt{cudaMemcpy}.
The first step for each thread of a block is to compute the corresponding row
and column. With a 2 dimensional decomposition, \texttt{int i=
blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding row
and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
column. Then each thread has to compute the sum of the products of the elements
of the row of $A$ with the elements of the column of $B$. In order to use a
register, the kernel \texttt{matmul} uses a variable called \texttt{sum} to
compute this sum. Then the result is stored in the matrix at the right
place. The computation of the CPU matrix-matrix multiplication is performed as
described previously. A timer measures the time. In order to use 2 dimensional
blocks, \texttt{dim3 dimGrid(size/width,size/width);} allows us to
create \texttt{size/width} blocks in each dimension. Likewise, \texttt{dim3
dimBlock(width,width);} is used to create \texttt{width} threads in each
dimension. After that, the kernel for the matrix multiplication is called. At
the end of the listing, the matrix $C$ computed by the GPU is transferred back
to the CPU and we check that both matrices $C$ computed by the CPU and by the
GPU are identical with a precision of $10^{-4}$.
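The heart of this computation can be summarized by the following sketch of the
kernel and of its call. It is our own condensed version, assuming single
precision elements, that \texttt{size} is a multiple of \texttt{width}, and the
device array names \texttt{d\_A}, \texttt{d\_B} and \texttt{d\_C}; the complete
code, including initialization, timing and verification, is given in
Listing~\ref{ch2:lst:ex3}.

\begin{lstlisting}[caption={Sketch of the matrix-matrix multiplication kernel and its call (illustrative only)}]
__global__ void matmul(float *d_A, float *d_B, float *d_C, int size) {
  /* row and column of the element of C computed by this thread */
  int i = blockIdx.y * blockDim.y + threadIdx.y;
  int j = blockIdx.x * blockDim.x + threadIdx.x;
  float sum = 0;               /* accumulate the result in a register */
  for (int k = 0; k < size; k++)
    sum += d_A[i * size + k] * d_B[k * size + j];
  d_C[i * size + j] = sum;     /* store the result at the right place */
}

/* host side: size/width blocks in each dimension, width x width threads per block */
dim3 dimGrid(size / width, size / width);
dim3 dimBlock(width, width);
matmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, size);
\end{lstlisting}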
On a C2070M Tesla card, this code takes $37.68$ms to perform the
multiplication. On an Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without
any parallelization (using only one core). Consequently the speedup of the GPU
version over the CPU version is about $65$, which is very good given the
difficulty of parallelizing this code.

\lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}

\section{Conclusion}
In this chapter, 3 simple Cuda examples have been presented. These examples are
quite simple and cannot cover all the possibilities of Cuda
programming. Interested readers are invited to consult books introducing Cuda
programming if some issues regarding Cuda programming are not clear.

\putbib[Chapters/chapter2/biblio]