X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/f69f75dd5354fe34ad3b1360af061a8f1aebb9aa..a2aa3f0f91a668ee6e799bad0f4de90b7b2be452:/BookGPU/Chapters/chapter2/ch2.tex diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex index b06e9be..490d753 100755 --- a/BookGPU/Chapters/chapter2/ch2.tex +++ b/BookGPU/Chapters/chapter2/ch2.tex @@ -1,45 +1,199 @@ -\chapterauthor{Author Name1}{Affiliation text1} -\chapterauthor{Author Name2}{Affiliation text2} - +\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} \chapter{Introduction to CUDA} \label{chapter2} -\section{Introduction}\label{intro} -In this chapter we give some simple examples on CUDA programming. The goal is +\section{Introduction} +\label{ch2:intro} + +In this chapter we give some simple examples of CUDA programming. The goal is not to provide an exhaustive presentation of all the functionalities of CUDA but -rather giving some basic elements. Of course, readers that do not know CUDA are -invited to read other books that are specialized on CUDA programming. +rather to give some basic elements. Of course, readers who do not know CUDA are +invited to read other books that are specialized on CUDA programming (for +example, \cite{ch2:Sanders:2010:CEI}). \section{First example} +\label{ch2:1ex} -This first example is intented to show how to build a very simple example with -CUDA. The goal of this example is to performed the sum of two arrays and -putting the result into a third array. A cuda program consists in a C code -which calls CUDA kernels that are executed on a GPU. +This first example is intented to show how to build a very simple program with +CUDA. Its goal is to perform the sum of two arrays and put the result into a +third array. A CUDA program consists in a C code which calls CUDA kernels that +are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}. -As GPUs have their own memory, the first step consists in allocating memory on -the GPU. A call to \texttt{cudaMalloc} allows to allocate memory on the GPU. The -first parameter of this function is a pointer on a memory on the device -(i.e. the GPU). In this example, \texttt{d\_} is added on each variable allocated -on the GPU meaning this variable is on the GPU. The second parameter represents -the size of the allocated variables, this size is in bits. +As GPUs have their own memory, the first step consists of allocating memory on +the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc} +allocates memory on the GPU. The first parameter of this function is a pointer +on a memory on the device, i.e. the GPU. The second parameter represents the +size of the allocated variables, this size is expressed in bits. +\pagebreak +\lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu} + In this example, we want to compare the execution time of the additions of two arrays in CPU and GPU. So for both these operations, a timer is created to -measure the time. CUDA proposes to manipulate timers quick easily. The first -step is to create the timer, then to start it and at the end to stop it. For -each of these operations a dedicated functions is used. +measure the time. CUDA manipulates timers quite easily. The first +step is to create the timer\index{CUDA functions!timer}, then to start it, and at +the end to stop it. For each of these operations a dedicated function is used. -In order to compute the same sum with a GPU, the first step consits in +In order to compute the same sum with a GPU, the first step consists of transferring the data from the CPU (considered as the host with CUDA) to the GPU -(considered as the device with CUDA). A call to \texttt{cudaMalloc} allows to -copy the content of an array allocated in the host to the device when the fourth -parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the -function is the destination array, the second is the source array and the third -is the number of elements to copy (exprimed in bytes). +(considered as the device with CUDA). A call to \texttt{cudaMemcpy} copies the content of an array allocated in the host to the device when the fourth +parameter is set +to \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}. The first +parameter of the function is the destination array, the second is the +source array, and the third is the number of elements to copy (expressed in +bytes). + +Now the GPU contains the data needed to perform the addition. In sequential +programming, such addition is achieved with a loop on all the elements. +With a GPU, it is possible to perform the addition of all the elements of the +two arrays in parallel (if the number of blocks and threads per blocks is +sufficient). In Listing~\ref{ch2:lst:ex1} at the beginning, a simple kernel, +called \texttt{addition} is defined to compute in parallel the summation of the +two arrays. With CUDA, a kernel starts with the +keyword \texttt{\_\_global\_\_} \index{CUDA keywords!\_\_shared\_\_} which +indicates that this kernel can be called from the C code. The first instruction +in this kernel is used to compute the variable \texttt{tid} which represents the +thread index. This thread index\index{CUDA keywords!thread index} is computed according to +the values of the block index +(called \texttt{blockIdx} \index{CUDA keywords!blockIdx} in CUDA) and of the +thread index (called \texttt{threadIdx}\index{CUDA keywords!threadIdx} in +CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension, +2 dimensions, or 3 dimensions. According to the dimension of manipulated data, +the dimension of blocks of threads must be chosen carefully. In our example, only one dimension is +used. Then using the notation \texttt{.x}, we can access the first dimension +(\texttt{.y} and \texttt{.z}, respectively allow access to the second and +third dimensions). The variable \texttt{blockDim}\index{CUDA keywords!blockDim} +gives the size of each block. + + + + +\pagebreak +\section{Second example: using CUBLAS \index{CUBLAS}} +\label{ch2:2ex} + +The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient +routines for basic linear operations. Those routines are heavily used in many +scientific applications and are optimized for vector operations, matrix-vector +operations, and matrix-matrix +operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem +to be easy to implement with CUDA; however, as soon as a reduction is +needed, implementing an efficient reduction routine with CUDA is far from being +simple. Roughly speaking, a reduction operation\index{reduction operation} is an +operation which combines all the elements of an array and extracts a number +computed from all the elements. For example, a sum, a maximum, or a dot product +are reduction operations. + +In this second example, we have two vectors $A$ and $B$. First +of all, we want to compute the sum of both vectors and store the result in a vector $C$. Then we want +to compute the scalar product between $1/C$ and $1/A$. This is just an example +which has no direct interest except to show how to program it with CUDA. + +Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the +addition of two arrays is exactly the same as the one described in the +previous example. + +The kernel to compute the inverse of the elements of an array is very +simple. For each thread index, the inverse of the array replaces the initial +array. + +In the main function, the beginning is very similar to the one in the previous +example. First, the user is asked to define the number of elements. Then a +call to \texttt{cublasCreate} initializes the CUBLAS library. It +creates a handle. Then all the arrays are allocated in the host and the device, +as in the previous example. Both arrays $A$ and $B$ are initialized. The CPU +computation is performed and the time for this is measured. In +order to compute the same result for the GPU, first of all, data from the CPU +need to be copied into the memory of the GPU. For that, it is possible to use +CUBLAS function \texttt{cublasSetVector}. This function has several +arguments. More precisely, the first argument represents the number of elements +to transfer, the second arguments is the size of each element, the third element +represents the source of the array to transfer (in the GPU), the fourth is an +offset between each element of the source (usually this value is set to 1), the +fifth is the destination (in the GPU), and the last is an offset between each +element of the destination. Then we call the kernel \texttt{addition} which +computes the sum of all elements of arrays $A$ and $B$. The \texttt{inverse} +kernel is called twice, once to inverse elements of array $C$ and once for +$A$. Finally, we call the function \texttt{cublasDdot} which computes the dot +product of two vectors. To use this routine, we must specify the handle +initialized by CUDA, the number of elements to consider, then each vector is +followed by the offset between every element. After the GPU computation, it is +possible to check that both computations produce the same result. + +\lstinputlisting[label=ch2:lst:ex2,caption=simple example with CUBLAS]{Chapters/chapter2/ex2.cu} + +\section{Third example: matrix-matrix multiplication} +\label{ch2:3ex} + + + +Matrix-matrix multiplication is an operation which is quite easy to parallelize +with a GPU. If we consider that a matrix is represented using a two-dimensional +array, $A[i][j]$ represents the element of the $i$ row and of the $j$ +column. In many cases, it is easier to manipulate a one-dimentional (1D) array rather than a 2D +array. With CUDA, even if it is possible to manipulate 2D arrays, in the +following we present an example based on a 1D array. For the sake of simplicity, +we consider we have a square matrix of size \texttt{size}. So with a 1D +array, \texttt{A[i*size+j]} allows us to have access to the element of the +$i$ row and of the $j$ column. + +With sequential programming, the matrix-matrix multiplication is performed using +three loops. We assume that $A$, $B$ represent two square matrices and the +result of the multiplication of $A \times B$ is $C$. The +element \texttt{C[i*size+j]} is computed as follows: +\begin{equation} +C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j]. +\end{equation} + +In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops, +one for $i$, one for $j$, and one for $k$. In order to perform the same +computation on a GPU, a naive solution consists of considering that the matrix +$C$ is split into 2-dimensional blocks. The size of each block must be chosen +such that the number of threads per block is less than $1,024$. + + +In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in +each dimension, the variable \texttt{width} is used for that. The +variable \texttt{nbTh} represents the number of threads per block. So, +to compute the matrix-matrix product on a GPU, each block of threads is assigned +to compute the result of the product of the elements of that block. The main +part of the code is quite similar to the previous code. Arrays are allocated in +the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then +arrays are transferred to the GPU memory with call to \texttt{cudaMemcpy}. +So the first step for each thread of a block is to compute the corresponding row +and column. With a 2-dimensional decomposition, \texttt{int i= +blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line +and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding +column. Then each thread has to compute the sum of the product of the row of +$A$ by the column of $B$. In order to use a register, the +kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the +sum. Then the result is set into the matrix at the right place. The computation +of CPU matrix-matrix multiplication is performed as described previously. A +timer measures the time. In order to use 2-dimensional blocks, \texttt{dim3 +dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks +in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to +create \texttt{width} thread in each dimension. After that, the kernel for the +matrix multiplication is called. At the end of the listing, the matrix $C$ +computed by the GPU is transferred back into the CPU and we check that both matrices +C computed by the CPU and the GPU are identical with a precision of $10^{-4}$. + + +With $1,024 \times 1,024$ matrices, on a C2070M Tesla card, this code takes +$37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it +takes $2465$ms without any parallelization (using only one core). Consequently +the speed up between the CPU and GPU version is about $65$ which is very good +considering the difficulty of parallelizing this code. + +\lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu} + +\section{Conclusion} +In this chapter, three simple CUDA examples have been presented. As we cannot +present all the possibilities of the CUDA programming, interested readers are +invited to consult CUDA programming introduction books if some issues regarding +the CUDA programming are not clear. -\putbib[biblio] +\putbib[Chapters/chapter2/biblio]