diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex

index b330d6b63630893b32e629b83004b5cd1a7cc84a..75be84bc7e690206142d347d57a3eacdca84f642 100755 (executable)
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -23,24 +23,24 @@ are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
  
  
  As GPUs have  their own memory, the first step consists  of allocating memory on
-the   GPU.   A   call   to  \texttt{cudaMalloc}\index{CUDA~functions!cudaMalloc}
+the   GPU.   A   call   to  \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc}
  allocates memory  on the GPU.  The  second parameter represents the  size of the
  allocated variables, expressed in bytes.
-
+\pagebreak
  \lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
  
  
  In this example, we  want to compare the execution time of  the additions of two
  arrays in  CPU and  GPU. So  for both these  operations, a  timer is  created to
  measure the  time. CUDA makes it quite easy to manipulate timers.  The first
-step is to create the timer\index{CUDA~functions!timer}, then to start it, and at
+step is to create the timer\index{CUDA functions!timer}, then to start it, and at
  the end to stop it. For each of these operations a dedicated function is used.
  
  In  order to  compute  the same  sum  with a  GPU, the  first  step consists  of
  transferring the data from the CPU (considered as the host with CUDA) to the GPU
  (considered as the  device with CUDA).  A call  to \texttt{cudaMemcpy} copies the content of an array allocated in the host to the device when the fourth
  parameter                                 is                                 set
-to  \texttt{cudaMemcpyHostToDevice}\index{CUDA~functions!cudaMemcpy}.  The first
+to  \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}.  The first
  parameter of the function is the  destination array, the second is the
  source  array, and  the third  is the  size of the data to  copy  (expressed in
  bytes).
@@ -52,26 +52,26 @@ two  arrays in  parallel (if  the number  of blocks  and threads  per  blocks is
  sufficient).   In Listing~\ref{ch2:lst:ex1}  at the  beginning, a  simple kernel,
  called \texttt{addition} is defined to  compute in parallel the summation of the
  two     arrays.      With     CUDA,     a     kernel     starts     with     the
-keyword   \texttt{\_\_global\_\_}   \index{CUDA~keywords!\_\_shared\_\_}   which
+keyword   \texttt{\_\_global\_\_}   \index{CUDA keywords!\_\_global\_\_}   which
  indicates that this kernel can be called from the C code.  The first instruction
  in this kernel is used to compute the variable \texttt{tid} which represents the
-thread index.   This thread index\index{thread  index} is computed  according to
+thread index.   This thread index\index{CUDA keywords!thread  index} is computed  according to
  the           values            of           the           block           index
-(called  \texttt{blockIdx} \index{CUDA~keywords!blockIdx}  in CUDA)  and  of the
-thread   index   (called   \texttt{threadIdx}\index{CUDA~keywords!threadIdx}   in
+(called  \texttt{blockIdx} \index{CUDA keywords!blockIdx}  in CUDA)  and  of the
+thread   index   (called   \texttt{threadIdx}\index{CUDA keywords!threadIdx}   in
  CUDA). Blocks of threads and thread  indexes can be decomposed into 1 dimension,
-2 dimensions, or  3 dimensions. {\bf A REGARDER} According to the  dimension of manipulated data,
-the appropriate dimension  can be useful. In our example,  only one dimension is
+2 dimensions, or  3 dimensions.  According to the  dimension of manipulated data,
+the dimension of blocks of threads  must be chosen carefully. In our example,  only one dimension is
  used.   Then using the notation  \texttt{.x}, we  can access  the  first dimension
  (\texttt{.y}  and \texttt{.z},  respectively allow access  to the  second and
-third dimension).   The variable \texttt{blockDim}\index{CUDA~keywords!blockDim}
+third dimension).   The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
  gives the size of each block.
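
The index computation described above can be emulated on the host to see which element each thread would handle. The following is only a sketch in plain C, not the kernel from Listing~\ref{ch2:lst:ex1}: the names \texttt{global\_index} and \texttt{emulate\_addition} are hypothetical, and the two nested loops stand in for the blocks and threads that a GPU would run in parallel. It assumes the usual one-dimensional formula, block index times block size plus thread index.

```c
#include <assert.h>

/* Host-side emulation of the one-dimensional thread index:
   tid = blockIdx.x * blockDim.x + threadIdx.x
   (hypothetical helper, for illustration only). */
int global_index(int block_idx, int block_dim, int thread_idx)
{
    return block_idx * block_dim + thread_idx;
}

/* Emulates the addition of two arrays of n elements using blocks of
   block_dim threads. The loops run sequentially here; on a GPU each
   (block, thread) pair would be an independent thread.
   Returns the number of elements actually written. */
int emulate_addition(const int *a, const int *b, int *res,
                     int n, int block_dim)
{
    assert(block_dim > 0);

    /* Enough blocks to cover n elements (rounded up). */
    int grid_dim = (n + block_dim - 1) / block_dim;
    int written = 0;

    for (int block = 0; block < grid_dim; block++) {
        for (int thread = 0; thread < block_dim; thread++) {
            int tid = global_index(block, block_dim, thread);
            /* The last block may hold more threads than remaining
               elements, so out-of-range indices are skipped. */
            if (tid < n) {
                res[tid] = a[tid] + b[tid];
                written++;
            }
        }
    }
    return written;
}
```

With 10 elements and blocks of 4 threads, three blocks are needed and the last two threads of the third block do nothing, which is why the bounds check on \texttt{tid} matters.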
  
  
  
  
  
-\section{Second example: using CUBLAS}
+\section{Second example: using CUBLAS \index{CUBLAS}}
  \label{ch2:2ex}
  
  The Basic Linear Algebra Subprograms  (BLAS) allow programmers to use efficient
@@ -81,7 +81,7 @@ operations,                           and                           matrix-matri
  operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some  of those operations seem
  to be  easy to  implement with CUDA.   Nevertheless, as  soon as a  reduction is
  needed, implementing an efficient reduction routine with CUDA is far from being
-simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
+simple. Roughly speaking, a reduction operation\index{reduction operation} is an
  operation  which combines  all the  elements of  an array  and extracts  a number
  computed from all the  elements. For example, a sum, a maximum,  or a dot product
  are reduction operations.
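
To make the notion of reduction concrete, the three examples mentioned above can be written as sequential loops. This is a minimal host-side sketch in C with hypothetical function names; it is neither the CUBLAS implementation nor an efficient GPU reduction, it only shows that each operation folds a whole array into a single number.

```c
#include <assert.h>
#include <stddef.h>

/* Sum of all the elements of x. */
double reduce_sum(const double *x, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i];
    return acc;
}

/* Maximum of all the elements of x (x must not be empty). */
double reduce_max(const double *x, size_t n)
{
    assert(n > 0);
    double best = x[0];
    for (size_t i = 1; i < n; i++)
        if (x[i] > best)
            best = x[i];
    return best;
}

/* Dot product of x and y: an element-wise product followed by a sum. */
double reduce_dot(const double *x, const double *y, size_t n)
{
    double acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += x[i] * y[i];
    return acc;
}
```

In each loop, every iteration depends on the accumulator updated by the previous one. This dependency is what makes a parallel version less straightforward than an element-wise operation such as the addition of two arrays.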