suite

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex

index b06e9be3400ac32d4bc5ae254597feda71d39a40..804afc2933c1b9b924babaf32a2fff7c06d3ee41 100755 (executable)
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -9,7 +9,7 @@
  In this chapter  we give some simple examples of CUDA  programming.  The goal is
  not to provide an exhaustive presentation of all the functionalities of CUDA but
  rather to give some basic elements. Of  course, readers who do not know CUDA are
-invited to read other books that are specialized on CUDA programming.
+invited to read other books that specialize in CUDA programming (for example, \cite{Sanders:2010:CEI}).
  
  
  \section{First example}
@@ -17,7 +17,7 @@ invited to read other books that are specialized on CUDA programming.
  This first example is  intended to show how to build a  very simple program with
  CUDA.   The goal  of this  example is  to perform  the sum  of two  arrays and
  put the  result into a  third array.   A CUDA program  consists of  C code
-which calls CUDA kernels that are executed on a GPU.
+which calls CUDA kernels that are executed on a GPU. This code is given in Listing~\ref{ch2:lst:ex1}.
  
  
  As GPUs have  their own memory, the first step consists  in allocating memory on
@@ -41,5 +41,29 @@ parameter is set to  \texttt{cudaMemcpyHostToDevice}. The first parameter of the
  function is the destination array, the  second is the source array and the third
  is the amount of data to copy (expressed in bytes).
  
-\putbib[biblio]
+Now the GPU contains the data needed to perform the addition. In a sequential
+program, such an addition is carried out with a loop over all the elements. With
+a GPU, it is possible to perform the addition of all the elements of the arrays
+in parallel (if the number of blocks and of threads per block is sufficient). At
+the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel
+called \texttt{addition} is defined to compute in parallel the sum of the
+two arrays. With CUDA, a  kernel starts with the keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_global\_\_}
+which indicates that this kernel can be called from the C code. The first
+instruction in this kernel computes the variable \texttt{tid}, which
+represents the thread index. This thread index\index{thread index} is computed
+from the value of the block index (a CUDA variable
+called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
+be organized in 1, 2 or 3 dimensions. Depending on the
+dimension of the data manipulated, one of these layouts is more convenient. In our
+example, only one dimension is used. The notation \texttt{.x} then gives
+access to the first dimension (\texttt{.y} and \texttt{.z} give access to
+the second and third dimensions, respectively). The
+variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
+block.
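+
+The index computation described above can be sketched as follows.  This is only
+an illustrative sketch under stated assumptions, not the content of
+Listing~\ref{ch2:lst:ex1}: the names \texttt{size}, \texttt{nbThreads},
+\texttt{nbBlocks} and the \texttt{d\_} prefix for device pointers are
+hypothetical, error checking is omitted, and the built-in variable
+\texttt{threadIdx} (the index of a thread inside its block) is used together
+with \texttt{blockIdx} and \texttt{blockDim}.
+
+\begin{lstlisting}
+#include <stdio.h>
+#include <stdlib.h>
+#include <cuda_runtime.h>
+
+// Kernel: each thread adds one pair of elements.
+// (Parameter names are assumptions made for this sketch.)
+__global__ void addition(int size, const int *d_a, const int *d_b, int *d_c) {
+  // Global thread index in a 1D grid of 1D blocks.
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  // The last block may contain more threads than remaining elements.
+  if (tid < size)
+    d_c[tid] = d_a[tid] + d_b[tid];
+}
+
+int main(void) {
+  const int size = 1000;       // number of elements (arbitrary choice)
+  const int nbThreads = 256;   // threads per block (arbitrary choice)
+  // Enough blocks to cover all the elements (rounded up).
+  const int nbBlocks = (size + nbThreads - 1) / nbThreads;
+  const size_t bytes = size * sizeof(int);
+
+  // Host arrays.
+  int *a = (int *)malloc(bytes);
+  int *b = (int *)malloc(bytes);
+  int *c = (int *)malloc(bytes);
+  for (int i = 0; i < size; i++) {
+    a[i] = i;
+    b[i] = 2 * i;
+  }
+
+  // Device arrays.
+  int *d_a, *d_b, *d_c;
+  cudaMalloc((void **)&d_a, bytes);
+  cudaMalloc((void **)&d_b, bytes);
+  cudaMalloc((void **)&d_c, bytes);
+
+  // Copy the inputs to the GPU (destination, source, size in bytes).
+  cudaMemcpy(d_a, a, bytes, cudaMemcpyHostToDevice);
+  cudaMemcpy(d_b, b, bytes, cudaMemcpyHostToDevice);
+
+  // Launch the kernel with nbBlocks blocks of nbThreads threads.
+  addition<<<nbBlocks, nbThreads>>>(size, d_a, d_b, d_c);
+
+  // Copy the result back; this call waits for the kernel to finish.
+  cudaMemcpy(c, d_c, bytes, cudaMemcpyDeviceToHost);
+
+  // Compare with the sequential result a[i] + b[i] = 3 * i.
+  int errors = 0;
+  for (int i = 0; i < size; i++)
+    if (c[i] != 3 * i)
+      errors++;
+  printf("%d mismatching elements\n", errors);
+
+  cudaFree(d_a);
+  cudaFree(d_b);
+  cudaFree(d_c);
+  free(a);
+  free(b);
+  free(c);
+  return errors != 0;
+}
+\end{lstlisting}
+
+The test \texttt{tid < size} is a design choice of this sketch: since the number
+of blocks is rounded up, some threads of the last block have no element to
+process and must not write outside the arrays.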
+
+
+
+\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
+
+\putbib[Chapters/chapter2/biblio]