X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/1874c46934f4ba7e8c2013d3829f65309456d292..063fd4437e9bfbefc2f6ed6c932744bb20514751:/BookGPU/Chapters/chapter2/ch2.tex diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex index bd48e2a..906c7b8 100755 --- a/BookGPU/Chapters/chapter2/ch2.tex +++ b/BookGPU/Chapters/chapter2/ch2.tex @@ -23,16 +23,17 @@ are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}. As GPUs have their own memory, the first step consists of allocating memory on -the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc} -allocates memory on the GPU. The second parameter represents the size of the -allocated variables, this size is expressed in bits. - +the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc} +allocates memory on the GPU. The first parameter of this function is a pointer +to memory on the device, i.e., the GPU. The second parameter represents the +size of the allocated variables; this size is expressed in bytes. +\pagebreak \lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu} In this example, we want to compare the execution time of the additions of two arrays in CPU and GPU. So for both these operations, a timer is created to -measure the time. CUDA proposes to manipulate timers quite easily. The first +measure the time. CUDA manipulates timers quite easily. The first step is to create the timer\index{CUDA functions!timer}, then to start it, and at the end to stop it. For each of these operations a dedicated function is used. @@ -60,26 +61,26 @@ the values of the block index (called \texttt{blockIdx} \index{CUDA keywords!blockIdx} in CUDA) and of the thread index (called \texttt{threadIdx}\index{CUDA keywords!threadIdx} in CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension, -2 dimensions, or 3 dimensions. 
{\bf A REGARDER} According to the dimension of manipulated data, -the appropriate dimension can be useful. In our example, only one dimension is +2 dimensions, or 3 dimensions. According to the dimension of manipulated data, +the dimension of blocks of threads must be chosen carefully. In our example, only one dimension is used. Then using the notation \texttt{.x}, we can access the first dimension (\texttt{.y} and \texttt{.z}, respectively allow access to the second and -third dimension). The variable \texttt{blockDim}\index{CUDA keywords!blockDim} +third dimensions). The variable \texttt{blockDim}\index{CUDA keywords!blockDim} gives the size of each block. - +\pagebreak \section{Second example: using CUBLAS \index{CUBLAS}} \label{ch2:2ex} -The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient +The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient routines for basic linear operations. Those routines are heavily used in many scientific applications and are optimized for vector operations, matrix-vector operations, and matrix-matrix operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem -to be easy to implement with CUDA. Nevertheless, as soon as a reduction is +to be easy to implement with CUDA; however, as soon as a reduction is needed, implementing an efficient reduction routine with CUDA is far from being simple. Roughly speaking, a reduction operation\index{reduction operation} is an operation which combines all the elements of an array and extracts a number @@ -144,7 +145,7 @@ three loops. We assume that $A$, $B$ represent two square matrices and the result of the multiplication of $A \times B$ is $C$. The element \texttt{C[i*size+j]} is computed as follows: \begin{equation} -C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j]; +C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j]. \end{equation} In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops,