X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/e4346947aa62df9cab78e792050767d42c5f5ab3..1eda0e6e69a2160009bfcb2b86237a544898530d:/BookGPU/Chapters/chapter2/ch2.tex?ds=inline diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex index c33ac50..8222660 100755 --- a/BookGPU/Chapters/chapter2/ch2.tex +++ b/BookGPU/Chapters/chapter2/ch2.tex @@ -1,15 +1,15 @@ \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte} -\chapter{Introduction to CUDA} +\chapter{Introduction to Cuda} \label{chapter2} \section{Introduction} \label{ch2:intro} -In this chapter we give some simple examples on CUDA programming. The goal is -not to provide an exhaustive presentation of all the functionalities of CUDA but -rather giving some basic elements. Of course, readers that do not know CUDA are -invited to read other books that are specialized on CUDA programming (for +In this chapter we give some simple examples on Cuda programming. The goal is +not to provide an exhaustive presentation of all the functionalities of Cuda but +rather giving some basic elements. Of course, readers that do not know Cuda are +invited to read other books that are specialized on Cuda programming (for example: \cite{ch2:Sanders:2010:CEI}). @@ -17,53 +17,57 @@ example: \cite{ch2:Sanders:2010:CEI}). \label{ch2:1ex} This first example is intented to show how to build a very simple example with -CUDA. The goal of this example is to performed the sum of two arrays and -putting the result into a third array. A cuda program consists in a C code -which calls CUDA kernels that are executed on a GPU. The listing of this code is +Cuda. The goal of this example is to perform the sum of two arrays and +put the result into a third array. A Cuda program consists in a C code +which calls Cuda kernels that are executed on a GPU. The listing of this code is in Listing~\ref{ch2:lst:ex1}. As GPUs have their own memory, the first step consists in allocating memory on -the GPU. A call to \texttt{cudaMalloc} allows to allocate memory on the GPU. The -first parameter of this function is a pointer on a memory on the device -(i.e. the GPU). In this example, \texttt{d\_} is added on each variable allocated -on the GPU meaning this variable is on the GPU. The second parameter represents -the size of the allocated variables, this size is in bits. +the GPU. A call to \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc} allows +to allocate memory on the GPU. The first parameter of this function is a pointer +on a memory on the device (i.e. the GPU). In this example, \texttt{d\_} is added +on each variable allocated on the GPU, meaning this variable is on the GPU. The +second parameter represents the size of the allocated variables, this size is in +bits. In this example, we want to compare the execution time of the additions of two arrays in CPU and GPU. So for both these operations, a timer is created to -measure the time. CUDA proposes to manipulate timers quick easily. The first -step is to create the timer, then to start it and at the end to stop it. For -each of these operations a dedicated functions is used. +measure the time. Cuda proposes to manipulate timers quite easily. The first +step is to create the timer\index{Cuda~functions!timer}, then to start it and at +the end to stop it. For each of these operations a dedicated functions is used. -In order to compute the same sum with a GPU, the first step consits in -transferring the data from the CPU (considered as the host with CUDA) to the GPU -(considered as the device with CUDA). A call to \texttt{cudaMalloc} allows to +In order to compute the same sum with a GPU, the first step consists in +transferring the data from the CPU (considered as the host with Cuda) to the GPU +(considered as the device with Cuda). A call to \texttt{cudaMemcpy} allows to copy the content of an array allocated in the host to the device when the fourth -parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the -function is the destination array, the second is the source array and the third -is the number of elements to copy (exprimed in bytes). - -Now that the GPU contains the data needed to perform the addition. In sequential -such addition is achieved out with a loop on all the elements. With a GPU, it -is possible to perform the addition of all elements of the arrays in parallel +parameter is set +to \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}. The first +parameter of the function is the destination array, the second is the +source array and the third is the number of elements to copy (exprimed in +bytes). + +Now the GPU contains the data needed to perform the addition. In sequential such +addition is achieved out with a loop on all the elements. With a GPU, it is +possible to perform the addition of all elements of the two arrays in parallel (if the number of blocks and threads per blocks is sufficient). In Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel, called \texttt{addition} is defined to compute in parallel the summation of the -two arrays. With CUDA, a kernel starts with the -keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which +two arrays. With Cuda, a kernel starts with the +keyword \texttt{\_\_global\_\_} \index{Cuda~keywords!\_\_shared\_\_} which indicates that this kernel can be called from the C code. The first instruction in this kernel is used to compute the variable \texttt{tid} which represents the thread index. This thread index\index{thread index} is computed according to -the values of the block index (it is a variable of CUDA -called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can -be decomposed into 1 dimension, 2 dimensions or 3 dimensions. According to the -dimension of data manipulated, the appropriate dimension can be useful. In our -example, only one dimension is used. Then using notation \texttt{.x} we can -access to the first dimension (\texttt{.y} and \texttt{.z} allow respectively to -access to the second and third dimension). The -variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each -block. +the values of the block index +(called \texttt{blockIdx} \index{Cuda~keywords!blockIdx} in Cuda) and of the +thread index (called \texttt{blockIdx}\index{Cuda~keywords!threadIdx} in +Cuda). Blocks of threads and thread indexes can be decomposed into 1 dimension, 2 +dimensions or 3 dimensions. According to the dimension of data manipulated, the +appropriate dimension can be useful. In our example, only one dimension is used. +Then using notation \texttt{.x} we can access to the first dimension +(\texttt{.y} and \texttt{.z} allow respectively to access to the second and +third dimension). The variable \texttt{blockDim}\index{Cuda~keywords!blockDim} +gives the size of each block. @@ -72,24 +76,24 @@ block. \section{Second example: using CUBLAS} \label{ch2:2ex} -The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant -routines that are often used. Those routines are heavily used in many scientific -applications and are very optimized for vector operations, matrix-vector -operations and matrix-matrix +The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient +routines that are often required. Those routines are heavily used in many +scientific applications and are very optimized for vector operations, +matrix-vector operations and matrix-matrix operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem -to be easy to implement with CUDA. Nevertheless, as soon as a reduction is -needed, implementing an efficient reduction routines with CUDA is far from being +to be easy to implement with Cuda. Nevertheless, as soon as a reduction is +needed, implementing an efficient reduction routines with Cuda is far from being simple. Roughly speaking, a reduction operation\index{reduction~operation} is an operation which combines all the elements of an array and extract a number computed with all the elements. For example, a sum, a maximum or a dot product -are reduction operations. +are reduction operations. In this second example, we consider that we have two vectors $A$ and $B$. First of all, we want to compute the sum of both vectors in a vector $C$. Then we want to compute the scalar product between $1/C$ and $1/A$. This is just an example -which has no direct interest except to show how to program it with CUDA. +which has no direct interest except to show how to program it with Cuda. -Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the +Listing~\ref{ch2:lst:ex2} shows this example with Cuda. The first kernel for the addition of two arrays is exactly the same as the one described in the previous example. @@ -137,14 +141,60 @@ consider we have a squared matrix of size \texttt{size}. So with a 1D array, \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$ row and of the $j^{th}$ column. -In sequential the matrix multiplication is performed using three loops. Supposing that $A$, $B$ represent two square matrices, the result of the multiplication of $A \times B$ is - -On C2070M Tesla card, this code take 37.68ms to perform the multiplication. On a -Intel Xeon E31245 at 3.30GHz, it takes 2465ms without any parallelization (using -only one core). Consequently the speed up between the CPU and GPU version is -about 65 which is very good regarding the difficulty of parallelizing this code. +With a sequential programming, the matrix multiplication is performed using +three loops. Supposing that $A$, $B$ represent two square matrices and that the +result of the multiplication of $A \times B$ is $C$. The +element \texttt{C[i*size+j]} is computed as follows: +\begin{equation} +C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j]; +\end{equation} + +In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of code is +performed using 3 loops, one for $i$, one for $j$ and one for $k$. In order to +perform the same computation on a GPU, a naive solution consists in considering +that the matrix $C$ is split into 2 dimensional blocks. The size of each block +must be chosen such as the number of threads per block is inferior to $1,024$. + + +In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in +each dimension, the variable \texttt{width} is used for that. The +variable \texttt{nbTh} represents the number of threads per block. So to be able +to compute the matrix-matrix product on a GPU, each block of threads is assigned +to compute the result of the product for the elements of this block. The main +part of the code is quite similar to the previous code. Arrays are allocated in +the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then +arrays are transfered inside the GPU memory with call to \texttt{cudaMemcpy}. +So the first step for each thread of a block is to compute the corresponding row +and column. With a 2 dimensional decomposition, \texttt{int i= +blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line +and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding +column. Then each thread has to compute the sum of the product of the line of +$A$ per the column of $B$. In order to use a register, the +kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the +sum. Then the result is set into the matrix at the right place. The computation +of CPU matrix-matrix multiplication is performed as described previously. A +timer measures the time. In order to use 2 dimensional blocks, \texttt{dim3 +dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks +in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to +create \texttt{width} thread in each dimension. After that, the kernel for the +matrix multiplication is called. At the end of the listing, the matrix $C$ +computed by the GPU is transfered back in the CPU and we check if both matrices +C computed by the CPU and the GPU are identical with a precision of $10^{-4}$. + + +On C2070M Tesla card, this code take $37.68$ms to perform the multiplication. On +a Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any parallelization +(using only one core). Consequently the speed up between the CPU and GPU version +is about $65$ which is very good regarding the difficulty of parallelizing this +code. \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu} +\section{Conclusion} +In this chapter 3 simple Cuda examples have been presented. Those examples are +quite simple and they cannot present all the possibilities of the Cuda +programming. Interested readers are invited to consult Cuda programming +introduction books if some issues regarding the Cuda programming is not clear. + \putbib[Chapters/chapter2/biblio]