From: couturie
Date: Wed, 24 Oct 2012 17:46:56 +0000 (+0200)
Subject: correct ch2
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/31c87768e1b18e90d982b335cadb326853c1c0ce?ds=sidebyside

correct ch2
---

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index 155c8ca..2eba230 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -6,9 +6,9 @@
 \section{Introduction}
 \label{ch2:intro}
 
-In this chapter we give some simple examples on Cuda programming.  The goal is
+In this chapter we give some simple examples of Cuda programming.  The goal is
 not to provide an exhaustive presentation of all the functionalities of Cuda but
-rather giving some basic elements. Of course, readers that do not know Cuda are
+rather to give some basic elements. Of course, readers that do not know Cuda are
 invited to read other books that are specialized on Cuda programming (for
 example: \cite{ch2:Sanders:2010:CEI}).
 
@@ -16,11 +16,10 @@ example: \cite{ch2:Sanders:2010:CEI}).
 \section{First example}
 \label{ch2:1ex}
 
-This first example is intented to show how to build a very simple example with
-Cuda. The goal of this example is to perform the sum of two arrays and
-put the result into a third array. A Cuda program consists in a C code
-which calls Cuda kernels that are executed on a GPU. The listing of this code is
-in Listing~\ref{ch2:lst:ex1}.
+This first example is intented to show how to build a very simple program with
+Cuda. Its goal is to perform the sum of two arrays and put the result into a
+third array. A Cuda program consists in a C code which calls Cuda kernels that
+are executed on a GPU. The listing of this code is in Listing~\ref{ch2:lst:ex1}.
 
 
 As GPUs have their own memory, the first step consists in allocating memory on
@@ -28,14 +27,14 @@ the GPU. A call to \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc} allows
 to allocate memory on the GPU.
 The first parameter of this function is a pointer
 on a memory on the device (i.e. the GPU). In this example, \texttt{d\_} is added
 on each variable allocated on the GPU, meaning this variable is on the GPU. The
-second parameter represents the size of the allocated variables, this size is in
+second parameter represents the size of the allocated variables, this size is expressed in
 bits.
 
 In this example, we want to compare the execution time of the additions of two
 arrays in CPU and GPU. So for both these operations, a timer is created to
 measure the time. Cuda proposes to manipulate timers quite easily. The first
 step is to create the timer\index{Cuda~functions!timer}, then to start it and at
-the end to stop it. For each of these operations a dedicated functions is used.
+the end to stop it. For each of these operations a dedicated function is used.
 
 In order to compute the same sum with a GPU, the first step consists in
 transferring the data from the CPU (considered as the host with Cuda) to the GPU
@@ -47,11 +46,11 @@ parameter of the function is the destination array, the second is the
 source array and the third is the number of elements to copy (expressed in
 bytes).
 
-Now the GPU contains the data needed to perform the addition. In sequential such
-addition is achieved out with a loop on all the elements. With a GPU, it is
-possible to perform the addition of all elements of the two arrays in parallel
-(if the number of blocks and threads per blocks is sufficient). In
-Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
+Now the GPU contains the data needed to perform the addition. In sequential
+programming, such addition is achieved out with a loop on all the elements.
+With a GPU, it is possible to perform the addition of all the elements of the
+two arrays in parallel (if the number of blocks and threads per blocks is
+sufficient).
+In Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
 called \texttt{addition} is defined to compute in parallel the summation of the
 two arrays. With Cuda, a kernel starts with the
 keyword \texttt{\_\_global\_\_} \index{Cuda~keywords!\_\_shared\_\_} which
@@ -61,11 +60,11 @@ thread index. This thread index\index{thread index} is computed according to
 the values of the block index
 (called \texttt{blockIdx} \index{Cuda~keywords!blockIdx} in Cuda) and of the
 thread index (called \texttt{blockIdx}\index{Cuda~keywords!threadIdx} in
-Cuda). Blocks of threads and thread indexes can be decomposed into 1 dimension, 2
-dimensions or 3 dimensions. According to the dimension of data manipulated, the
-appropriate dimension can be useful. In our example, only one dimension is used.
-Then using notation \texttt{.x} we can access to the first dimension
-(\texttt{.y} and \texttt{.z} allow respectively to access to the second and
+Cuda). Blocks of threads and thread indexes can be decomposed into 1 dimension,
+2 dimensions or 3 dimensions. According to the dimension of manipulated data,
+the appropriate dimension can be useful. In our example, only one dimension is
+used. Then using notation \texttt{.x} we can access to the first dimension
+(\texttt{.y} and \texttt{.z} respectively allow to access to the second and
 third dimension). The variable \texttt{blockDim}\index{Cuda~keywords!blockDim}
 gives the size of each block.
 
@@ -78,13 +77,13 @@ gives the size of each block.
 
 The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient
 routines that are often required. Those routines are heavily used in many
-scientific applications and are very optimized for vector operations,
-matrix-vector operations and matrix-matrix
+scientific applications and are optimized for vector operations, matrix-vector
+operations and matrix-matrix
 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
 to be easy to implement with Cuda.
 Nevertheless, as soon as a reduction is
-needed, implementing an efficient reduction routines with Cuda is far from being
+needed, implementing an efficient reduction routine with Cuda is far from being
 simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
-operation which combines all the elements of an array and extract a number
+operation which combines all the elements of an array and extracts a number
 computed with all the elements. For example, a sum, a maximum or a dot product
 are reduction operations.
 
@@ -97,32 +96,32 @@ Listing~\ref{ch2:lst:ex2} shows this example with Cuda.
 The first kernel for the addition of two arrays is exactly the same as the one
 described in the previous example.
 
-The kernel to compute the inverse of the elements of an array is very
+The kernel to compute the opposite of the elements of an array is very
 simple. For each thread index, the inverse of the array replaces the initial
 array.
 
 In the main function, the beginning is very similar to the one in the previous
-example. First, the number of elements is asked to the user. Then a call
-to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
-handle. Then all the arrays are allocated in the host and the device, as in the
-previous example. Both arrays $A$ and $B$ are initialized. The CPU
+example. First, the user is askef to define the number of elements. Then a
+call to \texttt{cublasCreate} allows to initialize the cublas library. It
+creates a handle. Then all the arrays are allocated in the host and the device,
+as in the previous example. Both arrays $A$ and $B$ are initialized. The CPU
 computation is performed and the time for this CPU computation is measured. In
 order to compute the same result on the GPU, first of all, data from the CPU
 need to be copied into the memory of the GPU. For that, it is possible to use
-cublas function \texttt{cublasSetVector}. This function has several arguments.
-More
-precisely, the first argument represents the number of elements to transfer, the
-second arguments is the size of each element, the third element represents the
-source of the array to transfer (in the GPU), the fourth is an offset between
-each element of the source (usually this value is set to 1), the fifth is the
-destination (in the GPU) and the last is an offset between each element of the
-destination. Then we call the kernel \texttt{addition} which computes the sum of
-all elements of arrays $A$ and $B$. The \texttt{inverse} kernel is called twice,
-once to inverse elements of array $C$ and once for $A$. Finally, we call the
-function \texttt{cublasDdot} which computes the dot product of two vectors. To
-use this routine, we must specify the handle initialized by Cuda, the number of
-elements to consider, then each vector is followed by the offset between every
-element. After the GPU computation, it is possible to check that both
-computation produce the same result.
+cublas function \texttt{cublasSetVector}. This function has several
+arguments. More precisely, the first argument represents the number of elements
+to transfer, the second arguments is the size of each element, the third element
+represents the source of the array to transfer (in the GPU), the fourth is an
+offset between each element of the source (usually this value is set to 1), the
+fifth is the destination (in the GPU) and the last is an offset between each
+element of the destination. Then we call the kernel \texttt{addition} which
+computes the sum of all elements of arrays $A$ and $B$. The \texttt{inverse}
+kernel is called twice, once to inverse elements of array $C$ and once for
+$A$. Finally, we call the function \texttt{cublasDdot} which computes the dot
+product of two vectors. To use this routine, we must specify the handle
+initialized by Cuda, the number of elements to consider, then each vector is
+followed by the offset between every element.
+After the GPU computation, it is
+possible to check that both computation produce the same result.
 
 \lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}
 
@@ -133,13 +132,13 @@ computation produce the same result.
 
 Matrix-matrix multiplication is an operation which is quite easy to parallelize
 with a GPU. If we consider that a matrix is represented using a two dimensional
-array, $A[i][j]$ represents the element of the $i^{th}$ row and of the
-$j^{th}$ column. In many cases, it is easier to manipulate 1D array instead of 2D
+array, $A[i][j]$ represents the element of the $i^{th}$ row and of the $j^{th}$
+column. In many cases, it is easier to manipulate a 1D array instead of a 2D
 array. With Cuda, even if it is possible to manipulate 2D arrays, in the
-following we present an example based on 1D array. For the sake of simplicity, we
-consider we have a square matrix of size \texttt{size}. So with a 1D
-array, \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$
-row and of the $j^{th}$ column.
+following we present an example based on a 1D array. For the sake of simplicity,
+we consider we have a square matrix of size \texttt{size}. So with a 1D
+array, \texttt{A[i*size+j]} allows us to have access to the element of the
+$i^{th}$ row and of the $j^{th}$ column.
 
 With a sequential programming, the matrix multiplication is performed using
 three loops. We assume that $A$, $B$ represent two square matrices and the
@@ -169,7 +168,7 @@ and column. With a 2 dimensional decomposition, \texttt{int i=
 blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
 and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
 column. Then each thread has to compute the sum of the product of the line of
-$A$ per the column of $B$. In order to use a register, the
+$A$ by the column of $B$.
+In order to use a register, the
 kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the sum.
 Then the result is set into the matrix at the right place. The computation of
 CPU matrix-matrix multiplication is performed as described previously. A
@@ -191,7 +190,7 @@ regarding the difficulty of parallelizing this code.
 \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
 
 \section{Conclusion}
-In this chapter, three simple Cuda examples have been presented. Those examples are
+In this chapter, three simple Cuda examples have been presented. They are
 quite simple. As we cannot present all the possibilities of the Cuda
 programming, interested readers are invited to consult Cuda programming
 introduction books if some issues regarding the Cuda programming are not clear.
diff --git a/BookGPU/Chapters/chapter2/ex1.cu b/BookGPU/Chapters/chapter2/ex1.cu
index e182349..cae3c9c 100644
--- a/BookGPU/Chapters/chapter2/ex1.cu
+++ b/BookGPU/Chapters/chapter2/ex1.cu
@@ -18,16 +18,12 @@ void addition(int size, int *d_C, int *d_A, int *d_B) {
 
 int main( int argc, char** argv)
 {
-
   if(argc!=2) {
     printf("usage: ex1 nb_components\n");
     exit(0);
   }
 
-
-
   int size=atoi(argv[1]);
-
   int i;
   int *h_arrayA=(int*)malloc(size*sizeof(int));
   int *h_arrayB=(int*)malloc(size*sizeof(int));
@@ -35,7 +31,6 @@ int main( int argc, char** argv)
   int *h_arrayCgpu=(int*)malloc(size*sizeof(int));
 
   int *d_arrayA, *d_arrayB, *d_arrayC;
-
   cudaMalloc((void**)&d_arrayA,size*sizeof(int));
   cudaMalloc((void**)&d_arrayB,size*sizeof(int));
   cudaMalloc((void**)&d_arrayC,size*sizeof(int));
@@ -45,7 +40,6 @@ int main( int argc, char** argv)
     h_arrayB[i]=2*i;
   }
 
-
   unsigned int timer_cpu = 0;
   cutilCheckError(cutCreateTimer(&timer_cpu));
   cutilCheckError(cutStartTimer(timer_cpu));
@@ -56,20 +50,14 @@ int main( int argc, char** argv)
   printf("CPU processing time : %f (ms) \n", cutGetTimerValue(timer_cpu));
   cutDeleteTimer(timer_cpu);
 
-
   unsigned int timer_gpu =
     0;
   cutilCheckError(cutCreateTimer(&timer_gpu));
   cutilCheckError(cutStartTimer(timer_gpu));
 
   cudaMemcpy(d_arrayA,h_arrayA, size * sizeof(int), cudaMemcpyHostToDevice);
   cudaMemcpy(d_arrayB,h_arrayB, size * sizeof(int), cudaMemcpyHostToDevice);
-
-
-
   int nbBlocs=(size+nbThreadsPerBloc-1)/nbThreadsPerBloc;
-
   addition<<<nbBlocs,nbThreadsPerBloc>>>(size,d_arrayC,d_arrayA,d_arrayB);
-
   cudaMemcpy(h_arrayCgpu,d_arrayC, size * sizeof(int), cudaMemcpyDeviceToHost);
   cutilCheckError(cutStopTimer(timer_gpu));
 
@@ -85,8 +73,5 @@ int main( int argc, char** argv)
   free(h_arrayA);
   free(h_arrayB);
   free(h_arrayC);
-
-
   return 0;
-
 }
diff --git a/BookGPU/Chapters/chapter2/ex2.cu b/BookGPU/Chapters/chapter2/ex2.cu
index b27d619..156764d 100644
--- a/BookGPU/Chapters/chapter2/ex2.cu
+++ b/BookGPU/Chapters/chapter2/ex2.cu
@@ -6,7 +6,6 @@
 #include "cutil_inline.h"
 #include
 
-
 const int nbThreadsPerBloc=256;
 
 __global__
@@ -28,19 +27,15 @@ void inverse(int size, double *d_x) {
 
 int main( int argc, char** argv)
 {
-
   if(argc!=2) {
     printf("usage: ex2 nb_components\n");
     exit(0);
   }
 
   int size=atoi(argv[1]);
-
   cublasStatus_t stat;
   cublasHandle_t handle;
   stat=cublasCreate(&handle);
-
-
   int i;
   double *h_arrayA=(double*)malloc(size*sizeof(double));
   double *h_arrayB=(double*)malloc(size*sizeof(double));
@@ -48,7 +43,6 @@ int main( int argc, char** argv)
   double *h_arrayCgpu=(double*)malloc(size*sizeof(double));
 
   double *d_arrayA, *d_arrayB, *d_arrayC;
-
   cudaMalloc((void**)&d_arrayA,size*sizeof(double));
   cudaMalloc((void**)&d_arrayB,size*sizeof(double));
   cudaMalloc((void**)&d_arrayC,size*sizeof(double));
@@ -58,7 +52,6 @@ int main( int argc, char** argv)
     h_arrayB[i]=2*(i+1);
   }
 
-
   unsigned int timer_cpu = 0;
   cutilCheckError(cutCreateTimer(&timer_cpu));
   cutilCheckError(cutStartTimer(timer_cpu));
@@ -71,7 +64,6 @@ int main( int argc, char** argv)
   printf("CPU processing time : %f (ms) \n", cutGetTimerValue(timer_cpu));
   cutDeleteTimer(timer_cpu);
 
-
   unsigned int timer_gpu = 0;
   cutilCheckError(cutCreateTimer(&timer_gpu));
   cutilCheckError(cutStartTimer(timer_gpu));
@@ -85,7 +77,6 @@ int main( int argc, char** argv)
   double dot_gpu=0;
   stat = cublasDdot(handle,size,d_arrayC,1,d_arrayA,1,&dot_gpu);
 
-
   cutilCheckError(cutStopTimer(timer_gpu));
   printf("GPU processing time : %f (ms) \n", cutGetTimerValue(timer_gpu));
   cutDeleteTimer(timer_gpu);
@@ -98,8 +89,6 @@ int main( int argc, char** argv)
   free(h_arrayB);
   free(h_arrayC);
   free(h_arrayCgpu);
-
   cublasDestroy(handle);
   return 0;
-
 }
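
Note on the indexing scheme described in ch2.tex: the text explains that each thread derives its index from the block index, the block size and the thread index, and ex1.cu computes the number of blocks as (size+nbThreadsPerBloc-1)/nbThreadsPerBloc. The following is a minimal host-side sketch in plain C (not CUDA) that emulates this 1D decomposition sequentially. The function names nb_blocks_for and emulate_addition are hypothetical, and the bounds test on tid is an assumption, since the kernel body is not part of this diff.

```c
#include <assert.h>

/* Hypothetical helper: ceiling division, mirroring the nbBlocs
   expression of ex1.cu, so that nb_blocks * threads_per_block >= size. */
int nb_blocks_for(int size, int threads_per_block) {
    assert(threads_per_block > 0);
    return (size + threads_per_block - 1) / threads_per_block;
}

/* Sequential emulation of a 1D launch: the two loops stand in for
   blockIdx.x and threadIdx.x, and threads_per_block plays the role
   of blockDim.x. Returns the number of emulated threads whose index
   falls past the end of the arrays and which therefore do nothing. */
int emulate_addition(int size, int threads_per_block,
                     int *c, const int *a, const int *b) {
    int nb_blocks = nb_blocks_for(size, threads_per_block);
    int idle = 0;
    for (int block_idx = 0; block_idx < nb_blocks; block_idx++) {
        for (int thread_idx = 0; thread_idx < threads_per_block; thread_idx++) {
            /* Global index, as described in the chapter text. */
            int tid = block_idx * threads_per_block + thread_idx;
            if (tid < size) {
                c[tid] = a[tid] + b[tid];  /* assumed guard: stay in bounds */
            } else {
                idle++;
            }
        }
    }
    return idle;
}
```

With size = 1000 and 256 threads per block, nb_blocks_for gives 4 blocks, that is 1024 emulated threads, of which 24 fall past the end of the arrays. This is why a bounds test of this kind is commonly placed in such kernels when size is not a multiple of the block size.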