X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/1eda0e6e69a2160009bfcb2b86237a544898530d..b79bc3cefac346cb7bf889751e4b6a58a18f5d24:/BookGPU/Chapters/chapter2/ch2.tex?ds=sidebyside

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index 8222660..155c8ca 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -44,7 +44,7 @@ copy the content of an array allocated in the host to the device when the fourth
 parameter is set to
 \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}. The first
 parameter of the function is the destination array, the second is the
-source array and the third is the number of elements to copy (exprimed in
+source array and the third is the number of elements to copy (expressed in
 bytes).

 Now the GPU contains the data needed to perform the addition. In sequential such
@@ -105,13 +105,13 @@ In the main function, the beginning is very similar to the one in the previous
 example. First, the number of elements is asked to the user. Then a call to
 \texttt{cublasCreate} allows to initialize the cublas library. It creates an
 handle. Then all the arrays are allocated in the host and the device, as in the
-previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
+previous example. Both arrays $A$ and $B$ are initialized. The CPU
 computation is performed and the time for this CPU computation is measured. In
 order to compute the same result on the GPU, first of all, data from the CPU
 need to be copied into the memory of the GPU. For that, it is possible to use
-cublas function \texttt{cublasSetVector}. This function several arguments. More
+cublas function \texttt{cublasSetVector}. This function has several arguments. More
 precisely, the first argument represents the number of elements to transfer, the
-second arguments is the size of each elements, the third element represents the
+second arguments is the size of each element, the third element represents the
 source of the array to transfer (in the GPU), the fourth is an offset between
 each element of the source (usually this value is set to 1), the fifth is the
 destination (in the GPU) and the last is an offset between each element of the
@@ -133,37 +133,37 @@ computation produce the same result.

 Matrix-matrix multiplication is an operation which is quite easy to parallelize
 with a GPU. If we consider that a matrix is represented using a two dimensional
-array, A[i][j] represents the the element of the $i^{th}$ row and of the
-$j^{th}$ column. In many case, it is easier to manipulate 1D array instead of 2D
+array, $A[i][j]$ represents the element of the $i^{th}$ row and of the
+$j^{th}$ column. In many cases, it is easier to manipulate 1D array instead of 2D
 array. With Cuda, even if it is possible to manipulate 2D arrays, in the
-following we present an example based on 1D array. For sake of simplicity we
-consider we have a squared matrix of size \texttt{size}. So with a 1D
+following we present an example based on 1D array. For the sake of simplicity, we
+consider we have a square matrix of size \texttt{size}. So with a 1D
 array, \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$
 row and of the $j^{th}$ column.

 With a sequential programming, the matrix multiplication is performed using
-three loops. Supposing that $A$, $B$ represent two square matrices and that the
+three loops. We assume that $A$, $B$ represent two square matrices and the
 result of the multiplication of $A \times B$ is $C$. The element
 \texttt{C[i*size+j]} is computed as follows:

 \begin{equation}
 C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j];
 \end{equation}

-In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of code is
-performed using 3 loops, one for $i$, one for $j$ and one for $k$. In order to
-perform the same computation on a GPU, a naive solution consists in considering
-that the matrix $C$ is split into 2 dimensional blocks. The size of each block
-must be chosen such as the number of threads per block is inferior to $1,024$.
+In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops,
+one for $i$, one for $j$ and one for $k$. In order to perform the same
+computation on a GPU, a naive solution consists in considering that the matrix
+$C$ is split into 2 dimensional blocks. The size of each block must be chosen
+such as the number of threads per block is inferior to $1,024$.

 In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in each
 dimension, the variable \texttt{width} is used for that. The
-variable \texttt{nbTh} represents the number of threads per block. So to be able
+variable \texttt{nbTh} represents the number of threads per block. So, to be able
 to compute the matrix-matrix product on a GPU, each block of threads is assigned
 to compute the result of the product for the elements of this block. The main
 part of the code is quite similar to the previous code. Arrays are allocated in
 the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then
-arrays are transfered inside the GPU memory with call to \texttt{cudaMemcpy}.
+arrays are transferred inside the GPU memory with call to \texttt{cudaMemcpy}.
 So the first step for each thread of a block is to compute the corresponding row
 and column. With a 2 dimensional decomposition, \texttt{int i=
 blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
@@ -178,23 +178,23 @@ dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
 in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
 create \texttt{width} thread in each dimension. After that, the kernel for the
 matrix multiplication is called. At the end of the listing, the matrix $C$
-computed by the GPU is transfered back in the CPU and we check if both matrices
+computed by the GPU is transferred back into the CPU and we check if both matrices
 C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.

-On C2070M Tesla card, this code take $37.68$ms to perform the multiplication. On
-a Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any parallelization
-(using only one core). Consequently the speed up between the CPU and GPU version
-is about $65$ which is very good regarding the difficulty of parallelizing this
-code.
+With $1,024 \times 1,024$ matrices, on a C2070M Tesla card, this code takes
+$37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
+takes $2465$ms without any parallelization (using only one core). Consequently
+the speed up between the CPU and GPU version is about $65$ which is very good
+regarding the difficulty of parallelizing this code.

 \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}

 \section{Conclusion}

-In this chapter 3 simple Cuda examples have been presented. Those examples are
-quite simple and they cannot present all the possibilities of the Cuda
-programming. Interested readers are invited to consult Cuda programming
-introduction books if some issues regarding the Cuda programming is not clear.
+In this chapter, three simple Cuda examples have been presented. Those examples are
+quite simple. As we cannot present all the possibilities of the Cuda
+programming, interested readers are invited to consult Cuda programming
+introduction books if some issues regarding the Cuda programming are not clear.

 \putbib[Chapters/chapter2/biblio]
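
The patched text describes the naive matrix-matrix product only in prose, since \texttt{ex3.cu} itself is not part of this diff. The following is a minimal sketch of that decomposition, not the book's actual listing. The kernel name \texttt{matmul\_naive}, the \texttt{h\_}/\texttt{d\_} variable prefixes, the use of \texttt{double}, the matrix size of 256 and the column index taken from the \texttt{x} dimension are all assumptions made for illustration. Only \texttt{size}, \texttt{width}, \texttt{dimGrid}, \texttt{dimBlock} and the row index expression come from the text.

```cuda
#include <cmath>
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical kernel: each thread computes one element C[i*size+j].
__global__ void matmul_naive(int size, const double *A, const double *B,
                             double *C) {
  int i = blockIdx.y * blockDim.y + threadIdx.y;  // row (as in the text)
  int j = blockIdx.x * blockDim.x + threadIdx.x;  // column (assumed)
  if (i < size && j < size) {
    double sum = 0.0;
    for (int k = 0; k < size; k++)
      sum += A[i * size + k] * B[k * size + j];
    C[i * size + j] = sum;
  }
}

int main() {
  const int size = 256;  // assumed; a multiple of width keeps the grid exact
  const int width = 16;  // 16 x 16 = 256 threads per block
  const size_t bytes = (size_t)size * size * sizeof(double);

  // Host arrays, stored as 1D arrays in row-major order.
  double *h_A = (double *)malloc(bytes);
  double *h_B = (double *)malloc(bytes);
  double *h_C = (double *)malloc(bytes);      // CPU result
  double *h_C_gpu = (double *)malloc(bytes);  // GPU result copied back
  for (int n = 0; n < size * size; n++) {
    h_A[n] = rand() / (double)RAND_MAX;
    h_B[n] = rand() / (double)RAND_MAX;
  }

  // Sequential reference: three nested loops over i, j and k.
  for (int i = 0; i < size; i++)
    for (int j = 0; j < size; j++) {
      double sum = 0.0;
      for (int k = 0; k < size; k++)
        sum += h_A[i * size + k] * h_B[k * size + j];
      h_C[i * size + j] = sum;
    }

  // Device arrays and host-to-device transfers (third argument in bytes).
  double *d_A, *d_B, *d_C;
  cudaMalloc((void **)&d_A, bytes);
  cudaMalloc((void **)&d_B, bytes);
  cudaMalloc((void **)&d_C, bytes);
  cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

  // size/width blocks and width threads in each dimension.
  dim3 dimGrid(size / width, size / width);
  dim3 dimBlock(width, width);
  matmul_naive<<<dimGrid, dimBlock>>>(size, d_A, d_B, d_C);
  cudaError_t err = cudaGetLastError();
  if (err != cudaSuccess) {
    fprintf(stderr, "kernel launch failed: %s\n", cudaGetErrorString(err));
    return 1;
  }

  // This blocking copy also waits for the kernel to finish.
  cudaMemcpy(h_C_gpu, d_C, bytes, cudaMemcpyDeviceToHost);

  // Compare both results with a precision of 1e-4, as in the text.
  int mismatches = 0;
  for (int n = 0; n < size * size; n++)
    if (fabs(h_C[n] - h_C_gpu[n]) > 1e-4) mismatches++;
  printf("mismatching elements: %d\n", mismatches);

  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
  free(h_A); free(h_B); free(h_C); free(h_C_gpu);
  return mismatches == 0 ? 0 : 1;
}
```

Such a file would typically be compiled with \texttt{nvcc}. The bounds test in the kernel is not strictly needed when \texttt{size} is a multiple of \texttt{width}, but it keeps the sketch safe if the grid is later rounded up for other sizes.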