1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
3 \chapter{Introduction to CUDA}
9 In this chapter we give some simple examples of CUDA programming. The goal is
10 not to provide an exhaustive presentation of all the functionalities of CUDA but
11 rather to give some basic elements. Of course, readers who do not know CUDA are
12 invited to read other books that are specialized on CUDA programming (for
13 example, \cite{ch2:Sanders:2010:CEI}).
16 \section{First example}
19 This first example is intented to show how to build a very simple program with
20 CUDA. Its goal is to perform the sum of two arrays and put the result into a
21 third array. A CUDA program consists in a C code which calls CUDA kernels that
22 are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
25 As GPUs have their own memory, the first step consists of allocating memory on
26 the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc}
27 allocates memory on the GPU. The first parameter of this function is a pointer
28 on a memory on the device, i.e. the GPU. The second parameter represents the
29 size of the allocated variables, this size is expressed in bits.
31 \lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
34 In this example, we want to compare the execution time of the additions of two
35 arrays in CPU and GPU. So for both these operations, a timer is created to
36 measure the time. CUDA manipulates timers quite easily. The first
37 step is to create the timer\index{CUDA functions!timer}, then to start it, and at
38 the end to stop it. For each of these operations a dedicated function is used.
40 In order to compute the same sum with a GPU, the first step consists of
41 transferring the data from the CPU (considered as the host with CUDA) to the GPU
42 (considered as the device with CUDA). A call to \texttt{cudaMemcpy} copies the content of an array allocated in the host to the device when the fourth
44 to \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}. The first
45 parameter of the function is the destination array, the second is the
46 source array, and the third is the number of elements to copy (expressed in
49 Now the GPU contains the data needed to perform the addition. In sequential
50 programming, such addition is achieved with a loop on all the elements.
51 With a GPU, it is possible to perform the addition of all the elements of the
52 two arrays in parallel (if the number of blocks and threads per blocks is
53 sufficient). In Listing~\ref{ch2:lst:ex1} at the beginning, a simple kernel,
54 called \texttt{addition} is defined to compute in parallel the summation of the
55 two arrays. With CUDA, a kernel starts with the
56 keyword \texttt{\_\_global\_\_} \index{CUDA keywords!\_\_shared\_\_} which
57 indicates that this kernel can be called from the C code. The first instruction
58 in this kernel is used to compute the variable \texttt{tid} which represents the
59 thread index. This thread index\index{CUDA keywords!thread index} is computed according to
60 the values of the block index
61 (called \texttt{blockIdx} \index{CUDA keywords!blockIdx} in CUDA) and of the
62 thread index (called \texttt{threadIdx}\index{CUDA keywords!threadIdx} in
63 CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension,
64 2 dimensions, or 3 dimensions. According to the dimension of manipulated data,
65 the dimension of blocks of threads must be chosen carefully. In our example, only one dimension is
66 used. Then using the notation \texttt{.x}, we can access the first dimension
67 (\texttt{.y} and \texttt{.z}, respectively allow access to the second and
68 third dimensions). The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
69 gives the size of each block.
75 \section{Second example: using CUBLAS \index{CUBLAS}}
78 The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient
79 routines for basic linear operations. Those routines are heavily used in many
80 scientific applications and are optimized for vector operations, matrix-vector
81 operations, and matrix-matrix
82 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
83 to be easy to implement with CUDA; however, as soon as a reduction is
84 needed, implementing an efficient reduction routine with CUDA is far from being
85 simple. Roughly speaking, a reduction operation\index{reduction operation} is an
86 operation which combines all the elements of an array and extracts a number
87 computed from all the elements. For example, a sum, a maximum, or a dot product
88 are reduction operations.
90 In this second example, we have two vectors $A$ and $B$. First
91 of all, we want to compute the sum of both vectors and store the result in a vector $C$. Then we want
92 to compute the scalar product between $1/C$ and $1/A$. This is just an example
93 which has no direct interest except to show how to program it with CUDA.
95 Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
96 addition of two arrays is exactly the same as the one described in the
99 The kernel to compute the inverse of the elements of an array is very
100 simple. For each thread index, the inverse of the array replaces the initial
103 In the main function, the beginning is very similar to the one in the previous
104 example. First, the user is asked to define the number of elements. Then a
105 call to \texttt{cublasCreate} initializes the CUBLAS library. It
106 creates a handle. Then all the arrays are allocated in the host and the device,
107 as in the previous example. Both arrays $A$ and $B$ are initialized. The CPU
108 computation is performed and the time for this is measured. In
109 order to compute the same result for the GPU, first of all, data from the CPU
110 need to be copied into the memory of the GPU. For that, it is possible to use
111 CUBLAS function \texttt{cublasSetVector}. This function has several
112 arguments. More precisely, the first argument represents the number of elements
113 to transfer, the second arguments is the size of each element, the third element
114 represents the source of the array to transfer (in the GPU), the fourth is an
115 offset between each element of the source (usually this value is set to 1), the
116 fifth is the destination (in the GPU), and the last is an offset between each
117 element of the destination. Then we call the kernel \texttt{addition} which
118 computes the sum of all elements of arrays $A$ and $B$. The \texttt{inverse}
119 kernel is called twice, once to inverse elements of array $C$ and once for
120 $A$. Finally, we call the function \texttt{cublasDdot} which computes the dot
121 product of two vectors. To use this routine, we must specify the handle
122 initialized by CUDA, the number of elements to consider, then each vector is
123 followed by the offset between every element. After the GPU computation, it is
124 possible to check that both computations produce the same result.
126 \lstinputlisting[label=ch2:lst:ex2,caption=simple example with CUBLAS]{Chapters/chapter2/ex2.cu}
128 \section{Third example: matrix-matrix multiplication}
133 Matrix-matrix multiplication is an operation which is quite easy to parallelize
134 with a GPU. If we consider that a matrix is represented using a two-dimensional
135 array, $A[i][j]$ represents the element of the $i$ row and of the $j$
136 column. In many cases, it is easier to manipulate a one-dimentional (1D) array rather than a 2D
137 array. With CUDA, even if it is possible to manipulate 2D arrays, in the
138 following we present an example based on a 1D array. For the sake of simplicity,
139 we consider we have a square matrix of size \texttt{size}. So with a 1D
140 array, \texttt{A[i*size+j]} allows us to have access to the element of the
141 $i$ row and of the $j$ column.
143 With sequential programming, the matrix-matrix multiplication is performed using
144 three loops. We assume that $A$, $B$ represent two square matrices and the
145 result of the multiplication of $A \times B$ is $C$. The
146 element \texttt{C[i*size+j]} is computed as follows:
148 C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j].
151 In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops,
152 one for $i$, one for $j$, and one for $k$. In order to perform the same
153 computation on a GPU, a naive solution consists of considering that the matrix
154 $C$ is split into 2-dimensional blocks. The size of each block must be chosen
155 such that the number of threads per block is less than $1,024$.
158 In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
159 each dimension, the variable \texttt{width} is used for that. The
160 variable \texttt{nbTh} represents the number of threads per block. So,
161 to compute the matrix-matrix product on a GPU, each block of threads is assigned
162 to compute the result of the product of the elements of that block. The main
163 part of the code is quite similar to the previous code. Arrays are allocated in
164 the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then
165 arrays are transferred to the GPU memory with call to \texttt{cudaMemcpy}.
166 So the first step for each thread of a block is to compute the corresponding row
167 and column. With a 2-dimensional decomposition, \texttt{int i=
168 blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
169 and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
170 column. Then each thread has to compute the sum of the product of the row of
171 $A$ by the column of $B$. In order to use a register, the
172 kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
173 sum. Then the result is set into the matrix at the right place. The computation
174 of CPU matrix-matrix multiplication is performed as described previously. A
175 timer measures the time. In order to use 2-dimensional blocks, \texttt{dim3
176 dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
177 in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
178 create \texttt{width} thread in each dimension. After that, the kernel for the
179 matrix multiplication is called. At the end of the listing, the matrix $C$
180 computed by the GPU is transferred back into the CPU and we check that both matrices
181 C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
184 With $1,024 \times 1,024$ matrices, on a C2070M Tesla card, this code takes
185 $37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
186 takes $2465$ms without any parallelization (using only one core). Consequently
187 the speed up between the CPU and GPU version is about $65$ which is very good
188 considering the difficulty of parallelizing this code.
190 \lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
193 In this chapter, three simple CUDA examples have been presented. As we cannot
194 present all the possibilities of the CUDA programming, interested readers are
195 invited to consult CUDA programming introduction books if some issues regarding
196 the CUDA programming are not clear.
198 \putbib[Chapters/chapter2/biblio]