continuation

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex

index a9e9a870ba5357a716553856f4f93383c4c60f9a..e80b6709b80b829d346d30fd9ea2a618c68d732f 100755 (executable)
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -3,7 +3,9 @@
  \chapter{Introduction to CUDA}
  \label{chapter2}
  
-\section{Introduction}\label{intro}
+\section{Introduction}
+\label{ch2:intro}
+
  In this chapter we give some simple examples of CUDA programming. The goal is
  not to provide an exhaustive presentation of all the functionalities of CUDA but
  rather to give some basic elements. Of course, readers who do not know CUDA are
@@ -12,6 +14,7 @@ example: \cite{ch2:Sanders:2010:CEI}).
  
  
  \section{First example}
+\label{ch2:1ex}
  
  This first example is intended to show how to build a very simple program with
  CUDA. The goal of this example is to perform the sum of two arrays and
@@ -67,23 +70,27 @@ block.
  \lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
  
  \section{Second example: using CUBLAS}
+\label{ch2:2ex}
  
  The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient
  routines for common operations. Those routines are heavily used in many scientific
-applications  and  are  very   optimzed  for  vector  operations,  matrix-vector
+applications and are highly optimized for vector operations, matrix-vector
  operations and matrix-matrix
-operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seems
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some  of those operations seem
  to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
  needed, implementing an efficient reduction routine with CUDA is far from being
-simple.
+simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
+operation which combines all the elements of an array and extracts a single number
+computed from all the elements. For example, a sum, a maximum or a dot product
+are reduction operations.
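+
+As a small illustration, using notation introduced here rather than taken from
+the listings (two arrays $x$ and $y$ of $n$ elements), these three reductions can
+be written as:
+\[
+s = \sum_{i=0}^{n-1} x_i, \qquad m = \max_{0 \le i < n} x_i, \qquad d = \sum_{i=0}^{n-1} x_i \, y_i .
+\]
+In each case, $n$ input values (or $2n$ for the dot product) are combined into a
+single number.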
  
  In this second example, we consider that  we have two vectors $A$ and $B$. First
-of all we want to compute the sum  of both vectors in a vector $C$. Then we want
+of all, we want to compute the sum  of both vectors in a vector $C$. Then we want
  to compute the  scalar product between $1/C$ and $1/A$. This  is just an example
-which has not direct interest except to show how to program it with CUDA.
+which has no direct interest except to show how to program it with CUDA.
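+
+Written as a formula, and assuming here that both vectors have $n$ elements
+indexed from $0$ (the symbols $n$ and $r$ are our own notation), the computation
+is:
+\[
+C_i = A_i + B_i, \qquad r = \sum_{i=0}^{n-1} \frac{1}{C_i} \times \frac{1}{A_i} .
+\]
+For instance, with $A = (1, 2)$ and $B = (1, 2)$, we get $C = (2, 4)$ and
+$r = \frac{1}{2 \times 1} + \frac{1}{4 \times 2} = 0.625$.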
  
  Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
-addition  of two  arrays  is exactly  the same  that  the one  described in  the
+addition  of two  arrays  is exactly  the same  as  the one  described in  the
  previous example.
  
  The  kernel  to  compute the  inverse  of  the  elements  of  an array  is  very
@@ -91,17 +98,76 @@ simple. For  each thread index,  the inverse of  the array replaces  the initial
  array.
  
  In the main function,  the beginning is very similar to the  one in the previous
-example.   First the  number of  elements is  asked to  the user.   Then  a call
+example. First, the user is asked for the number of elements. Then a call
  to \texttt{cublasCreate} initializes the CUBLAS library and creates a
  handle. Then all the arrays are allocated on the host and the device, as in the
  previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
  computation is performed and the time for this CPU computation is measured. In
  order to compute the same result on the GPU, the data must first be copied from
  the CPU into the memory of the GPU. For that, it is possible to use
-cublas function \texttt{cublasSetVector}.
-
-\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
-
+the CUBLAS function \texttt{cublasSetVector}. This function takes several
+arguments. More precisely, the first argument is the number of elements to
+transfer, the second argument is the size of each element, the third argument is
+the source array to transfer (in the CPU memory), the fourth is the stride between
+consecutive elements of the source (usually this value is set to 1), the fifth is
+the destination (in the GPU memory), and the last is the stride between
+consecutive elements of the destination. Then we call the kernel
+\texttt{addition}, which computes the element-wise sum of arrays $A$ and $B$. The
+\texttt{inverse} kernel is called twice, once to invert the elements of array $C$
+and once for $A$. Finally, we call the function \texttt{cublasDdot}, which
+computes the dot product of two vectors. To use this routine, we must specify the
+handle initialized by \texttt{cublasCreate}, the number of elements to consider,
+and then each vector followed by the stride between its consecutive elements; the
+last argument receives the result. After the GPU computation, it is possible to
+check that both computations produce the same result.
+
+\lstinputlisting[label=ch2:lst:ex2,caption=A simple example with CUBLAS]{Chapters/chapter2/ex2.cu}
+
+\section{Third example: matrix-matrix multiplication}
+\label{ch2:3ex}
+
+
+
+Matrix-matrix multiplication is an operation which is quite easy to parallelize
+with a GPU. If we consider that a matrix is represented using a two-dimensional
+array, \texttt{A[i][j]} represents the element of the $i^{th}$ row and of the
+$j^{th}$ column. In many cases, it is easier to manipulate 1D arrays instead of 2D
+arrays. With CUDA, even if it is possible to manipulate 2D arrays, in the
+following we present an example based on 1D arrays. For the sake of simplicity, we
+consider a square matrix of size \texttt{size}. So with a 1D
+array, \texttt{A[i*size+j]} allows us to access the element of the $i^{th}$
+row and of the $j^{th}$ column.
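+
+As a small worked example (the value $\mathit{size}=3$ is chosen here only for
+illustration, and indices start at $0$ as in C), the elements are stored row
+after row:
+\[
+\begin{array}{ccc}
+A[0] & A[1] & A[2] \\
+A[3] & A[4] & A[5] \\
+A[6] & A[7] & A[8]
+\end{array}
+\]
+so the element of row $i=1$ and column $j=2$ is found at index
+$1 \times 3 + 2 = 5$.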
+
+With sequential programming, the matrix multiplication is performed using
+three loops. Suppose that $A$ and $B$ are two square matrices and that the
+result of the multiplication $A \times B$ is $C$. The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[i \times \mathit{size} + j] = \sum_{k=0}^{\mathit{size}-1} A[i \times \mathit{size} + k] \times B[k \times \mathit{size} + j]
+\end{equation}
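+
+For instance, with $\mathit{size}=2$ (a value chosen here only for
+illustration), the element of row $0$ and column $1$ expands to:
+\[
+C[1] = A[0] \times B[1] + A[1] \times B[3]
+\]
+that is, row $0$ of $A$ combined with column $1$ of $B$.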
+
+In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of the code is
+performed using 3 loops, one for $i$, one for $j$ and one for $k$. In order to
+perform the same computation on a GPU, a naive solution consists of considering
+that the matrix $C$ is split into 2-dimensional blocks. The size of each block
+must be chosen such that the number of threads per block does not exceed $1{,}024$.
+In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
+each dimension. The variable \texttt{nbTh} represents the number of threads per
+block. So to be able to compute the matrix-matrix product on a GPU, each block
+of threads is assigned to compute the result of the product for the elements of
+this block. So the first step for each thread of a block is to compute the
+corresponding row and column. With a 2-dimensional decomposition, \texttt{int i=
+blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding row
+and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
+column.
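+
+As a worked example, with blocks of $16 \times 16 = 256$ threads (below the
+limit of $1{,}024$) and with block and thread coordinates chosen here only for
+illustration, the thread with \texttt{threadIdx.x}$=3$ and
+\texttt{threadIdx.y}$=5$ in the block with \texttt{blockIdx.x}$=2$ and
+\texttt{blockIdx.y}$=1$ obtains:
+\[
+i = 1 \times 16 + 5 = 21, \qquad j = 2 \times 16 + 3 = 35 .
+\]
+Assuming that each thread computes a single element of $C$ and that the matrix
+is large enough to contain this position, this thread is in charge of the
+element \texttt{C[21*size+35]}.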
+
+
+On a C2070M Tesla card, this code takes $37.68$ms to perform the multiplication. On
+an Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any parallelization
+(using only one core). Consequently, the speedup between the CPU and GPU versions
+is about $65$, which is very good given how little effort is needed to parallelize
+this code.
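+
+The speedup follows directly from the two timings above:
+\[
+\frac{2465\ \mathrm{ms}}{37.68\ \mathrm{ms}} \approx 65.4 .
+\]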
+
+\lstinputlisting[label=ch2:lst:ex3,caption=Simple matrix-matrix multiplication with CUDA]{Chapters/chapter2/ex3.cu}
  
  \putbib[Chapters/chapter2/biblio]