suite ch2 (continuation of chapter 2)

author Raphael Couturier <raphael.couturier@univ-fcomte.fr>

Tue, 23 Oct 2012 14:16:41 +0000 (16:16 +0200)

committer Raphael Couturier <raphael.couturier@univ-fcomte.fr>

Tue, 23 Oct 2012 14:16:41 +0000 (16:16 +0200)
diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex

index 8222660b03fb33730f512cf2c4c726539ab24fdb..155c8ca1432109847f58cd520fbea34a43a8f273 100755 (executable)
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -44,7 +44,7 @@ copy the content of an array allocated in the host to the device when the fourth
  parameter                                 is                                 set
  to  \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}.  The first
  parameter of the function is the  destination array, the second is the
-source  array and  the third  is the  number of  elements to  copy  (exprimed in
+source  array and  the third  is the  number of  elements to  copy  (expressed in
  bytes).
  
  Now the GPU contains the data needed to perform the addition. In sequential such
@@ -105,13 +105,13 @@ In the main function,  the beginning is very similar to the  one in the previous
  example.   First, the number  of elements  is asked  to the  user.  Then  a call
  to \texttt{cublasCreate} allows to initialize  the cublas library. It creates an
  handle. Then all the arrays are allocated  in the host and the device, as in the
-previous  example.  Both  arrays  $A$ and  $B$  are initialized.   Then the  CPU
+previous  example.  Both  arrays  $A$ and  $B$  are initialized.   The  CPU
  computation is performed  and the time for this CPU  computation is measured. In
  order to  compute the same result  on the GPU, first  of all, data  from the CPU
  need to be  copied into the memory of  the GPU. For that, it is  possible to use
-cublas function \texttt{cublasSetVector}.  This function several arguments. More
+cublas function \texttt{cublasSetVector}.  This function has several arguments. More
  precisely, the first argument represents the number of elements to transfer, the
-second arguments is the size of  each elements, the third element represents the
+second arguments is the size of  each element, the third element represents the
  source of the  array to transfer (in  the GPU), the fourth is  an offset between
  each element of  the source (usually this value  is set to 1), the  fifth is the
  destination (in the GPU)  and the last is an offset between  each element of the
@@ -133,37 +133,37 @@ computation produce the same result.
  
  Matrix-matrix multiplication is an operation  which is quite easy to parallelize
  with a GPU. If we consider that  a matrix is represented using a two dimensional
-array,  A[i][j] represents  the  the element  of  the $i^{th}$  row  and of  the
-$j^{th}$ column. In many case, it is easier to manipulate 1D array instead of 2D
+array,  $A[i][j]$ represents   the element  of  the $i^{th}$  row  and of  the
+$j^{th}$ column. In many cases, it is easier to manipulate 1D array instead of 2D
  array.   With Cuda,  even if  it is  possible to  manipulate 2D  arrays,  in the
-following we  present an example  based on 1D  array. For sake of  simplicity we
-consider  we  have  a  squared  matrix  of size  \texttt{size}.  So  with  a  1D
+following we  present an example  based on 1D  array. For the sake of  simplicity, we
+consider  we  have  a  square  matrix  of size  \texttt{size}.  So  with  a  1D
  array, \texttt{A[i*size+j]} allows  us to access to the  element of the $i^{th}$
  row and of the $j^{th}$ column.
  
  With  a sequential  programming, the  matrix multiplication  is  performed using
-three loops. Supposing that $A$, $B$  represent two square matrices and that the
+three loops. We assume that $A$, $B$  represent two square matrices and the
  result   of    the   multiplication    of   $A   \times    B$   is    $C$.   The
  element \texttt{C[i*size+j]} is computed as follows:
  \begin{equation}
  C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j];
  \end{equation}
  
-In  Listing~\ref{ch2:lst:ex3}, in  the CPU  computation,  this part  of code  is
-performed using 3 loops, one for $i$, one  for $j$ and one for $k$.  In order to
-perform the same computation on a  GPU, a naive solution consists in considering
-that the matrix $C$ is split into  2 dimensional blocks.  The size of each block
-must be chosen such  as the number of threads per block  is inferior to $1,024$.
+In Listing~\ref{ch2:lst:ex3},  the CPU computation  is performed using  3 loops,
+one  for $i$,  one for  $j$  and one  for $k$.   In  order to  perform the  same
+computation on a  GPU, a naive solution consists in  considering that the matrix
+$C$ is split into  2 dimensional blocks.  The size of each  block must be chosen
+such as the number of threads per block is inferior to $1,024$.
  
  
  In Listing~\ref{ch2:lst:ex3},  we consider that  a block contains 16  threads in
  each   dimension,  the   variable  \texttt{width}   is  used   for   that.   The
-variable \texttt{nbTh} represents the number of threads per block. So to be able
+variable \texttt{nbTh} represents the number of threads per block. So, to be able
  to compute the matrix-matrix product on a GPU, each block of threads is assigned
  to compute the result  of the product for the elements of  this block.  The main
  part of the code is quite similar to the previous code.  Arrays are allocated in
  the  CPU and  the GPU.   Matrices $A$  and $B$  are randomly  initialized.  Then
-arrays are  transfered inside the  GPU memory with call  to \texttt{cudaMemcpy}.
+arrays are  transferred inside the  GPU memory with call  to \texttt{cudaMemcpy}.
  So the first step for each thread of a block is to compute the corresponding row
  and   column.    With   a    2   dimensional   decomposition,   \texttt{int   i=
  blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
@@ -178,23 +178,23 @@ dimGrid(size/width,size/width);} allows us  to create \texttt{size/width} blocks
  in each  dimension.  Likewise,  \texttt{dim3 dimBlock(width,width);} is  used to
  create \texttt{width} thread  in each dimension. After that,  the kernel for the
  matrix  multiplication is  called. At  the end  of the  listing, the  matrix $C$
-computed by the GPU is transfered back  in the CPU and we check if both matrices
+computed by the GPU is transferred back  into the CPU and we check if both matrices
  C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
  
  
-On C2070M Tesla card, this code take $37.68$ms to perform the multiplication. On
-a Intel Xeon E31245 at  $3.30$GHz, it takes $2465$ms without any parallelization
-(using only one core). Consequently the speed up between the CPU and GPU version
-is about $65$ which is very  good regarding the difficulty of parallelizing this
-code.
+With $1,024  \times 1,024$ matrices,  on a C2070M  Tesla card, this  code takes
+$37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
+takes $2465$ms  without any parallelization (using only  one core). Consequently
+the speed up  between the CPU and GPU  version is about $65$ which  is very good
+regarding the difficulty of parallelizing this code.
  
  \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
  
  \section{Conclusion}
-In this chapter  3 simple Cuda examples have been  presented. Those examples are
-quite  simple  and  they  cannot  present  all the  possibilities  of  the  Cuda
-programming.   Interested  readers  are  invited  to  consult  Cuda  programming
-introduction books if some issues regarding the Cuda programming is not clear.
+In this chapter, three simple Cuda examples have been  presented. Those examples are
+quite  simple. As we  cannot  present  all the  possibilities  of  the  Cuda
+programming, interested  readers  are  invited  to  consult  Cuda  programming
+introduction books if some issues regarding the Cuda programming are not clear.
  
  \putbib[Chapters/chapter2/biblio]
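
The matrix-multiplication paragraphs touched by the hunks above describe a 1D row-major layout (`A[i*size+j]`), a three-loop CPU computation, and a 2D block decomposition in which each thread derives its row from `blockIdx.y*blockDim.y+threadIdx.y`. The following is a minimal plain-C sketch of those ideas, not the book's `ex3.cu`. The function names `matmul_cpu`, `matmul_emulated_grid` and `max_abs_diff` are hypothetical, the column formula is assumed to mirror the row formula with `x` in place of `y`, and the grid is emulated with ordinary loops on the CPU rather than launched on a GPU.

```c
#include <assert.h>

/* Sequential product C = A * B for square row-major matrices:
   element (i, j) lives at index i*size + j. */
void matmul_cpu(const float *A, const float *B, float *C, int size) {
    assert(size > 0);
    for (int i = 0; i < size; i++) {
        for (int j = 0; j < size; j++) {
            float sum = 0.0f;
            for (int k = 0; k < size; k++) {
                sum += A[i * size + k] * B[k * size + j];
            }
            C[i * size + j] = sum;
        }
    }
}

/* CPU emulation of the 2D decomposition: the four outer loops stand in
   for blockIdx.y, blockIdx.x, threadIdx.y and threadIdx.x. On a GPU,
   each (block, thread) pair would run the loop body concurrently.
   Assumption: size is a multiple of width, as size/width blocks are
   created in each dimension. */
void matmul_emulated_grid(const float *A, const float *B, float *C,
                          int size, int width) {
    assert(size > 0 && width > 0 && size % width == 0);
    int blocks = size / width;                 /* blocks per dimension */
    for (int by = 0; by < blocks; by++) {
        for (int bx = 0; bx < blocks; bx++) {
            for (int ty = 0; ty < width; ty++) {
                for (int tx = 0; tx < width; tx++) {
                    int i = by * width + ty;   /* row of this "thread" */
                    int j = bx * width + tx;   /* column (assumed form) */
                    float sum = 0.0f;
                    for (int k = 0; k < size; k++) {
                        sum += A[i * size + k] * B[k * size + j];
                    }
                    C[i * size + j] = sum;
                }
            }
        }
    }
}

/* Largest absolute element-wise difference between two arrays of n floats. */
float max_abs_diff(const float *X, const float *Y, int n) {
    float worst = 0.0f;
    for (int idx = 0; idx < n; idx++) {
        float d = X[idx] - Y[idx];
        if (d < 0.0f) {
            d = -d;
        }
        if (d > worst) {
            worst = d;
        }
    }
    return worst;
}
```

Calling both functions on the same inputs and checking that `max_abs_diff` stays below `1e-4` mirrors the CPU/GPU comparison described in the text. In this emulation both versions add the terms in the same order, so the difference should be exactly zero; on a real GPU small rounding differences are possible, which is presumably why a tolerance is used.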
  
  