1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
3 \chapter{Introduction to CUDA}
9 In this chapter we give some simple examples on CUDA programming. The goal is
10 not to provide an exhaustive presentation of all the functionalities of CUDA but
11 rather giving some basic elements. Of course, readers that do not know CUDA are
12 invited to read other books that are specialized on CUDA programming (for
13 example: \cite{ch2:Sanders:2010:CEI}).
16 \section{First example}
19 This first example is intented to show how to build a very simple example with
20 CUDA. The goal of this example is to performed the sum of two arrays and
21 putting the result into a third array. A cuda program consists in a C code
22 which calls CUDA kernels that are executed on a GPU. The listing of this code is
23 in Listing~\ref{ch2:lst:ex1}.
26 As GPUs have their own memory, the first step consists in allocating memory on
27 the GPU. A call to \texttt{cudaMalloc} allows to allocate memory on the GPU. The
28 first parameter of this function is a pointer on a memory on the device
29 (i.e. the GPU). In this example, \texttt{d\_} is added on each variable allocated
30 on the GPU meaning this variable is on the GPU. The second parameter represents
31 the size of the allocated variables, this size is in bits.
33 In this example, we want to compare the execution time of the additions of two
34 arrays in CPU and GPU. So for both these operations, a timer is created to
35 measure the time. CUDA proposes to manipulate timers quick easily. The first
36 step is to create the timer, then to start it and at the end to stop it. For
37 each of these operations a dedicated functions is used.
39 In order to compute the same sum with a GPU, the first step consits in
40 transferring the data from the CPU (considered as the host with CUDA) to the GPU
41 (considered as the device with CUDA). A call to \texttt{cudaMalloc} allows to
42 copy the content of an array allocated in the host to the device when the fourth
43 parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the
44 function is the destination array, the second is the source array and the third
45 is the number of elements to copy (exprimed in bytes).
47 Now that the GPU contains the data needed to perform the addition. In sequential
48 such addition is achieved out with a loop on all the elements. With a GPU, it
49 is possible to perform the addition of all elements of the arrays in parallel
50 (if the number of blocks and threads per blocks is sufficient). In
51 Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
52 called \texttt{addition} is defined to compute in parallel the summation of the
53 two arrays. With CUDA, a kernel starts with the
54 keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which
55 indicates that this kernel can be called from the C code. The first instruction
56 in this kernel is used to compute the variable \texttt{tid} which represents the
57 thread index. This thread index\index{thread index} is computed according to
58 the values of the block index (it is a variable of CUDA
59 called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
60 be decomposed into 1 dimension, 2 dimensions or 3 dimensions. According to the
61 dimension of data manipulated, the appropriate dimension can be useful. In our
62 example, only one dimension is used. Then using notation \texttt{.x} we can
63 access to the first dimension (\texttt{.y} and \texttt{.z} allow respectively to
64 access to the second and third dimension). The
65 variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
70 \lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
72 \section{Second example: using CUBLAS}
75 The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant
76 routines that are often used. Those routines are heavily used in many scientific
77 applications and are very optimized for vector operations, matrix-vector
78 operations and matrix-matrix
79 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
80 to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
81 needed, implementing an efficient reduction routines with CUDA is far from being
82 simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
83 operation which combines all the elements of an array and extract a number
84 computed with all the elements. For example, a sum, a maximum or a dot product
85 are reduction operations.
87 In this second example, we consider that we have two vectors $A$ and $B$. First
88 of all, we want to compute the sum of both vectors in a vector $C$. Then we want
89 to compute the scalar product between $1/C$ and $1/A$. This is just an example
90 which has no direct interest except to show how to program it with CUDA.
92 Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
93 addition of two arrays is exactly the same as the one described in the
96 The kernel to compute the inverse of the elements of an array is very
97 simple. For each thread index, the inverse of the array replaces the initial
100 In the main function, the beginning is very similar to the one in the previous
101 example. First, the number of elements is asked to the user. Then a call
102 to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
103 handle. Then all the arrays are allocated in the host and the device, as in the
104 previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
105 computation is performed and the time for this CPU computation is measured. In
106 order to compute the same result on the GPU, first of all, data from the CPU
107 need to be copied into the memory of the GPU. For that, it is possible to use
108 cublas function \texttt{cublasSetVector}. This function several arguments. More
109 precisely, the first argument represents the number of elements to transfer, the
110 second arguments is the size of each elements, the third element represents the
111 source of the array to transfer (in the GPU), the fourth is an offset between
112 each element of the source (usually this value is set to 1), the fifth is the
113 destination (in the GPU) and the last is an offset between each element of the
114 destination. Then we call the kernel \texttt{addition} which computes the sum of
115 all elements of arrays $A$ and $B$. The \texttt{inverse} kernel is called twice,
116 once to inverse elements of array $C$ and once for $A$. Finally, we call the
117 function \texttt{cublasDdot} which computes the dot product of two vectors. To
118 use this routine, we must specify the handle initialized by Cuda, the number of
119 elements to consider, then each vector is followed by the offset between every
120 element. After the GPU computation, it is possible to check that both
121 computation produce the same result.
123 \lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}
125 \section{Third example: matrix-matrix multiplication}
130 Matrix-matrix multiplication is an operation which is quite easy to parallelize
131 with a GPU. If we consider that a matrix is represented using a two dimensional
132 array, A[i][j] represents the the element of the $i^{th}$ row and of the
133 $j^{th}$ column. In many case, it is easier to manipulate 1D array instead of 2D
134 array. With Cuda, even if it is possible to manipulate 2D arrays, in the
135 following we present an example based on 1D array. For sake of simplicity we
136 consider we have a squared matrix of size \texttt{size}. So with a 1D
137 array, \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$
138 row and of the $j^{th}$ column.
140 In sequential the matrix multiplication is performed using three loops. Supposing that $A$, $B$ represent two square matrices, the result of the multiplication of $A \times B$ is
142 On C2070M Tesla card, this code take 37.68ms to perform the multiplication. On a
143 Intel Xeon E31245 at 3.30GHz, it takes 2465ms without any parallelization (using
144 only one core). Consequently the speed up between the CPU and GPU version is
145 about 65 which is very good regarding the difficulty of parallelizing this code.
147 \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
149 \putbib[Chapters/chapter2/biblio]