X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/fc4670d0de6814f682df0ce247905cba40b9d547..c45fe27f00d63f76b7590e3077f0c2179d225bff:/BookGPU/Chapters/chapter2/ch2.tex?ds=inline diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex index a9e9a87..e80b670 100755 --- a/BookGPU/Chapters/chapter2/ch2.tex +++ b/BookGPU/Chapters/chapter2/ch2.tex @@ -3,7 +3,9 @@ \chapter{Introduction to CUDA} \label{chapter2} -\section{Introduction}\label{intro} +\section{Introduction} +\label{ch2:intro} + In this chapter we give some simple examples on CUDA programming. The goal is not to provide an exhaustive presentation of all the functionalities of CUDA but rather giving some basic elements. Of course, readers that do not know CUDA are @@ -12,6 +14,7 @@ example: \cite{ch2:Sanders:2010:CEI}). \section{First example} +\label{ch2:1ex} This first example is intented to show how to build a very simple example with CUDA. The goal of this example is to performed the sum of two arrays and @@ -67,23 +70,27 @@ block. \lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu} \section{Second example: using CUBLAS} +\label{ch2:2ex} The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant routines that are often used. Those routines are heavily used in many scientific -applications and are very optimzed for vector operations, matrix-vector +applications and are very optimized for vector operations, matrix-vector operations and matrix-matrix -operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seems +operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem to be easy to implement with CUDA. Nevertheless, as soon as a reduction is needed, implementing an efficient reduction routines with CUDA is far from being -simple. +simple. 
Roughly speaking, a reduction operation\index{reduction~operation} is an +operation which combines all the elements of an array and extracts a number +computed with all the elements. For example, a sum, a maximum or a dot product +are reduction operations. In this second example, we consider that we have two vectors $A$ and $B$. First -of all we want to compute the sum of both vectors in a vector $C$. Then we want +of all, we want to compute the sum of both vectors in a vector $C$. Then we want to compute the scalar product between $1/C$ and $1/A$. This is just an example -which has not direct interest except to show how to program it with CUDA. +which has no direct interest except to show how to program it with CUDA. Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the -addition of two arrays is exactly the same that the one described in the +addition of two arrays is exactly the same as the one described in the previous example. The kernel to compute the inverse of the elements of an array is very @@ -91,17 +98,76 @@ simple. For each thread index, the inverse of the array replaces the initial array. In the main function, the beginning is very similar to the one in the previous -example. First the number of elements is asked to the user. Then a call +example. First, the number of elements is asked to the user. Then a call to \texttt{cublasCreate} allows to initialize the cublas library. It creates an handle. Then all the arrays are allocated in the host and the device, as in the previous example. Both arrays $A$ and $B$ are initialized. Then the CPU computation is performed and the time for this CPU computation is measured. In order to compute the same result on the GPU, first of all, data from the CPU need to be copied into the memory of the GPU. For that, it is possible to use -cublas function \texttt{cublasSetVector}. 
- -\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu} - +cublas function \texttt{cublasSetVector}. This function takes several arguments. More +precisely, the first argument represents the number of elements to transfer, the +second argument is the size of each element, the third argument represents the +source of the array to transfer (in the CPU), the fourth is the stride between +consecutive elements of the source (usually this value is set to 1), the fifth is the +destination (in the GPU) and the last is the stride between consecutive elements of the +destination. Then we call the kernel \texttt{addition} which computes the +element-wise sum of arrays $A$ and $B$. The \texttt{inverse} kernel is called twice, +once to invert the elements of array $C$ and once for $A$. Finally, we call the +function \texttt{cublasDdot} which computes the dot product of two vectors. To +use this routine, we must specify the handle initialized by \texttt{cublasCreate}, the number of +elements to consider, and then each vector followed by the stride between its +elements. After the GPU computation, it is possible to check that both +computations produce the same result. + +\lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu} + +\section{Third example: matrix-matrix multiplication} +\label{ch2:3ex} + + + +Matrix-matrix multiplication is an operation which is quite easy to parallelize +with a GPU. If we consider that a matrix is represented using a two-dimensional +array, \texttt{A[i][j]} represents the element of the $i^{th}$ row and of the +$j^{th}$ column. In many cases, it is easier to manipulate a 1D array instead of a 2D +array. With CUDA, even if it is possible to manipulate 2D arrays, in the +following we present an example based on a 1D array. For the sake of simplicity, we +consider a square matrix of size \texttt{size}. 
So with a 1D +array, \texttt{A[i*size+j]} gives access to the element of the $i^{th}$ +row and of the $j^{th}$ column. + +With sequential programming, the matrix multiplication is performed using +three loops. Suppose that $A$ and $B$ represent two square matrices and that the +result of the multiplication $A \times B$ is $C$. The +element \texttt{C[i*size+j]} is computed as follows: +\begin{equation} +C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j] +\end{equation} + +In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of the code is +performed using 3 loops, one for $i$, one for $j$ and one for $k$. In order to +perform the same computation on a GPU, a naive solution consists in considering +that the matrix $C$ is split into two-dimensional blocks. The size of each block +must be chosen such that the number of threads per block does not exceed $1{,}024$. +In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in +each dimension. The variable \texttt{nbTh} represents the number of threads per +block. So to be able to compute the matrix-matrix product on a GPU, each block +of threads is assigned to compute the result of the product for the elements of +this block. So the first step for each thread of a block is to compute the +corresponding row and column. With a two-dimensional decomposition, \texttt{int i= +blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding row +and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding +column. + + +On a C2070M Tesla card, this code takes $37.68$ms to perform the multiplication. On +an Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any parallelization +(using only one core). Consequently, the speedup between the CPU and GPU versions +is about $65$, which is very good considering how little effort is needed to parallelize this +code. 
+ +\lstinputlisting[label=ch2:lst:ex3,caption=Simple matrix-matrix multiplication with CUDA]{Chapters/chapter2/ex3.cu} \putbib[Chapters/chapter2/biblio]
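
The sequential computation described in the third example (three nested loops over a square matrix stored as a 1D array) can be sketched in plain C as follows. This is only an illustrative sketch and not the content of \texttt{ex3.cu}: the function name \texttt{matmul\_cpu} and the choice of \texttt{double} elements are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential reference for C = A x B on square matrices stored as 1D
 * row-major arrays: element (i, j) is located at index i*size+j.
 * The name matmul_cpu and the double element type are hypothetical;
 * they are not taken from ex3.cu. */
void matmul_cpu(const double *A, const double *B, double *C, size_t size)
{
    assert(A != NULL && B != NULL && C != NULL);
    for (size_t i = 0; i < size; i++) {          /* row of C */
        for (size_t j = 0; j < size; j++) {      /* column of C */
            double sum = 0.0;
            for (size_t k = 0; k < size; k++) {  /* summation index */
                sum += A[i * size + k] * B[k * size + j];
            }
            C[i * size + j] = sum;
        }
    }
}
```

In a naive GPU version such as the one described in the text, the two outer loops would typically disappear: each thread derives its own \texttt{i} and \texttt{j} from \texttt{blockIdx}, \texttt{blockDim} and \texttt{threadIdx}, and only the loop over \texttt{k} remains inside the kernel.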