X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/f3c358c4bba6851fbf01121853d7e21866f01884..9378973df8e8a9aac4a7c212a7efb7d831bfae94:/BookGPU/Chapters/chapter2/ch2.tex?ds=inline

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index 9c6d0de..75be84b 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -23,24 +23,24 @@ are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
 
 
 As GPUs have  their own memory, the first step consists  of allocating memory on
-the   GPU.   A   call   to  \texttt{cudaMalloc}\index{CUDA~functions!cudaMalloc}
+the   GPU.   A   call   to  \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc}
 allocates memory  on the GPU.  The  second parameter represents the  size of the
 allocated variables, this size is expressed in bits.
-
+\pagebreak
 \lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
 
 
 In this example, we  want to compare the execution time of  the additions of two
 arrays in  CPU and  GPU. So  for both these  operations, a  timer is  created to
 measure the  time. CUDA proposes to  manipulate timers quite  easily.  The first
-step is to create the timer\index{CUDA~functions!timer}, then to start it, and at
+step is to create the timer\index{CUDA functions!timer}, then to start it, and at
 the end to stop it. For each of these operations a dedicated function is used.
 
 In  order to  compute  the same  sum  with a  GPU, the  first  step consists  of
 transferring the data from the CPU (considered as the host with CUDA) to the GPU
 (considered as the  device with CUDA).  A call  to \texttt{cudaMemcpy} copies the content of an array allocated in the host to the device when the fourth
 parameter                                 is                                 set
-to  \texttt{cudaMemcpyHostToDevice}\index{CUDA~functions!cudaMemcpy}.  The first
+to  \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}.  The first
 parameter of the function is the  destination array, the second is the
 source  array, and  the third  is the  number of  elements to  copy  (expressed in
 bytes).
@@ -52,26 +52,26 @@ two  arrays in  parallel (if  the number  of blocks  and threads  per  blocks is
 sufficient).   In Listing~\ref{ch2:lst:ex1}  at the  beginning, a  simple kernel,
 called \texttt{addition} is defined to  compute in parallel the summation of the
 two     arrays.      With     CUDA,     a     kernel     starts     with     the
-keyword   \texttt{\_\_global\_\_}   \index{CUDA~keywords!\_\_shared\_\_}   which
+keyword   \texttt{\_\_global\_\_}   \index{CUDA keywords!\_\_shared\_\_}   which
 indicates that this kernel can be called from the C code.  The first instruction
 in this kernel is used to compute the variable \texttt{tid} which represents the
-thread index.   This thread index\index{thread  index} is computed  according to
+thread index.   This thread index\index{CUDA keywords!thread  index} is computed  according to
 the           values            of           the           block           index
-(called  \texttt{blockIdx} \index{CUDA~keywords!blockIdx}  in CUDA)  and  of the
-thread   index   (called   \texttt{threadIdx}\index{CUDA~keywords!threadIdx}   in
+(called  \texttt{blockIdx} \index{CUDA keywords!blockIdx}  in CUDA)  and  of the
+thread   index   (called   \texttt{threadIdx}\index{CUDA keywords!threadIdx}   in
 CUDA). Blocks of threads and thread  indexes can be decomposed into 1 dimension,
-2 dimensions, or  3 dimensions. {\bf A REGARDER} According to the  dimension of manipulated data,
-the appropriate dimension  can be useful. In our example,  only one dimension is
+2 dimensions, or  3 dimensions.  According to the  dimension of manipulated data,
+the dimension of blocks of threads  must be chosen carefully. In our example,  only one dimension is
 used.   Then using the notation  \texttt{.x}, we  can access  the  first dimension
 (\texttt{.y}  and \texttt{.z},  respectively allow access  to the  second and
-third dimension).   The variable \texttt{blockDim}\index{CUDA~keywords!blockDim}
+third dimension).   The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
 gives the size of each block.
 
 
 
 
 
-\section{Second example: using CUBLAS}
+\section{Second example: using CUBLAS \index{CUBLAS}}
 \label{ch2:2ex}
 
 The Basic Linear Algebra Subprograms  (BLAS) allows programmers to use efficient
@@ -81,7 +81,7 @@ operations,                           and                           matrix-matri
 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some  of those operations seem
 to be  easy to  implement with CUDA.   Nevertheless, as  soon as a  reduction is
 needed, implementing an efficient reduction routine with CUDA is far from being
-simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
+simple. Roughly speaking, a reduction operation\index{reduction operation} is an
 operation  which combines  all the  elements of  an array  and extracts  a number
 computed from all the  elements. For example, a sum, a maximum,  or a dot product
 are reduction operations.
@@ -189,10 +189,10 @@ considering the difficulty of parallelizing this code.
 \lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
 
 \section{Conclusion}
-In this chapter, three simple CUDA examples have been  presented. They are
-quite  simple. As we  cannot  present  all the  possibilities  of  the  CUDA
-programming, interested  readers  are  invited  to  consult  CUDA  programming
-introduction books if some issues regarding the CUDA programming are not clear.
+In this  chapter, three simple CUDA  examples have been presented.  As we cannot
+present all  the possibilities of  the CUDA programming, interested  readers are
+invited to consult CUDA programming  introduction books if some issues regarding
+the CUDA programming are not clear.
 
 \putbib[Chapters/chapter2/biblio]