1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
3 \chapter{Introduction to Cuda}
9 In this chapter we give some simple examples on Cuda programming. The goal is
10 not to provide an exhaustive presentation of all the functionalities of Cuda but
11 rather giving some basic elements. Of course, readers that do not know Cuda are
12 invited to read other books that are specialized on Cuda programming (for
13 example: \cite{ch2:Sanders:2010:CEI}).
16 \section{First example}
19 This first example is intented to show how to build a very simple example with
20 Cuda. The goal of this example is to perform the sum of two arrays and
21 put the result into a third array. A Cuda program consists in a C code
22 which calls Cuda kernels that are executed on a GPU. The listing of this code is
23 in Listing~\ref{ch2:lst:ex1}.
26 As GPUs have their own memory, the first step consists in allocating memory on
27 the GPU. A call to \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc} allows
28 to allocate memory on the GPU. The first parameter of this function is a pointer
29 on a memory on the device (i.e. the GPU). In this example, \texttt{d\_} is added
30 on each variable allocated on the GPU, meaning this variable is on the GPU. The
31 second parameter represents the size of the allocated variables, this size is in
34 In this example, we want to compare the execution time of the additions of two
35 arrays in CPU and GPU. So for both these operations, a timer is created to
36 measure the time. Cuda proposes to manipulate timers quite easily. The first
37 step is to create the timer\index{Cuda~functions!timer}, then to start it and at
38 the end to stop it. For each of these operations a dedicated functions is used.
40 In order to compute the same sum with a GPU, the first step consists in
41 transferring the data from the CPU (considered as the host with Cuda) to the GPU
42 (considered as the device with Cuda). A call to \texttt{cudaMemcpy} allows to
43 copy the content of an array allocated in the host to the device when the fourth
45 to \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}. The first
46 parameter of the function is the destination array, the second is the
47 source array and the third is the number of elements to copy (expressed in
50 Now the GPU contains the data needed to perform the addition. In sequential such
51 addition is achieved out with a loop on all the elements. With a GPU, it is
52 possible to perform the addition of all elements of the two arrays in parallel
53 (if the number of blocks and threads per blocks is sufficient). In
54 Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
55 called \texttt{addition} is defined to compute in parallel the summation of the
56 two arrays. With Cuda, a kernel starts with the
57 keyword \texttt{\_\_global\_\_} \index{Cuda~keywords!\_\_shared\_\_} which
58 indicates that this kernel can be called from the C code. The first instruction
59 in this kernel is used to compute the variable \texttt{tid} which represents the
60 thread index. This thread index\index{thread index} is computed according to
61 the values of the block index
62 (called \texttt{blockIdx} \index{Cuda~keywords!blockIdx} in Cuda) and of the
63 thread index (called \texttt{blockIdx}\index{Cuda~keywords!threadIdx} in
64 Cuda). Blocks of threads and thread indexes can be decomposed into 1 dimension, 2
65 dimensions or 3 dimensions. According to the dimension of data manipulated, the
66 appropriate dimension can be useful. In our example, only one dimension is used.
67 Then using notation \texttt{.x} we can access to the first dimension
68 (\texttt{.y} and \texttt{.z} allow respectively to access to the second and
69 third dimension). The variable \texttt{blockDim}\index{Cuda~keywords!blockDim}
70 gives the size of each block.
74 \lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
76 \section{Second example: using CUBLAS}
79 The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient
80 routines that are often required. Those routines are heavily used in many
81 scientific applications and are very optimized for vector operations,
82 matrix-vector operations and matrix-matrix
83 operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
84 to be easy to implement with Cuda. Nevertheless, as soon as a reduction is
85 needed, implementing an efficient reduction routines with Cuda is far from being
86 simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
87 operation which combines all the elements of an array and extract a number
88 computed with all the elements. For example, a sum, a maximum or a dot product
89 are reduction operations.
91 In this second example, we consider that we have two vectors $A$ and $B$. First
92 of all, we want to compute the sum of both vectors in a vector $C$. Then we want
93 to compute the scalar product between $1/C$ and $1/A$. This is just an example
94 which has no direct interest except to show how to program it with Cuda.
96 Listing~\ref{ch2:lst:ex2} shows this example with Cuda. The first kernel for the
97 addition of two arrays is exactly the same as the one described in the
100 The kernel to compute the inverse of the elements of an array is very
101 simple. For each thread index, the inverse of the array replaces the initial
104 In the main function, the beginning is very similar to the one in the previous
105 example. First, the number of elements is asked to the user. Then a call
106 to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
107 handle. Then all the arrays are allocated in the host and the device, as in the
108 previous example. Both arrays $A$ and $B$ are initialized. The CPU
109 computation is performed and the time for this CPU computation is measured. In
110 order to compute the same result on the GPU, first of all, data from the CPU
111 need to be copied into the memory of the GPU. For that, it is possible to use
112 cublas function \texttt{cublasSetVector}. This function has several arguments. More
113 precisely, the first argument represents the number of elements to transfer, the
114 second arguments is the size of each element, the third element represents the
115 source of the array to transfer (in the GPU), the fourth is an offset between
116 each element of the source (usually this value is set to 1), the fifth is the
117 destination (in the GPU) and the last is an offset between each element of the
118 destination. Then we call the kernel \texttt{addition} which computes the sum of
119 all elements of arrays $A$ and $B$. The \texttt{inverse} kernel is called twice,
120 once to inverse elements of array $C$ and once for $A$. Finally, we call the
121 function \texttt{cublasDdot} which computes the dot product of two vectors. To
122 use this routine, we must specify the handle initialized by Cuda, the number of
123 elements to consider, then each vector is followed by the offset between every
124 element. After the GPU computation, it is possible to check that both
125 computation produce the same result.
127 \lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}
129 \section{Third example: matrix-matrix multiplication}
134 Matrix-matrix multiplication is an operation which is quite easy to parallelize
135 with a GPU. If we consider that a matrix is represented using a two dimensional
136 array, $A[i][j]$ represents the element of the $i^{th}$ row and of the
137 $j^{th}$ column. In many cases, it is easier to manipulate 1D array instead of 2D
138 array. With Cuda, even if it is possible to manipulate 2D arrays, in the
139 following we present an example based on 1D array. For the sake of simplicity, we
140 consider we have a square matrix of size \texttt{size}. So with a 1D
141 array, \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$
142 row and of the $j^{th}$ column.
144 With a sequential programming, the matrix multiplication is performed using
145 three loops. We assume that $A$, $B$ represent two square matrices and the
146 result of the multiplication of $A \times B$ is $C$. The
147 element \texttt{C[i*size+j]} is computed as follows:
149 C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j];
152 In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops,
153 one for $i$, one for $j$ and one for $k$. In order to perform the same
154 computation on a GPU, a naive solution consists in considering that the matrix
155 $C$ is split into 2 dimensional blocks. The size of each block must be chosen
156 such as the number of threads per block is inferior to $1,024$.
159 In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
160 each dimension, the variable \texttt{width} is used for that. The
161 variable \texttt{nbTh} represents the number of threads per block. So, to be able
162 to compute the matrix-matrix product on a GPU, each block of threads is assigned
163 to compute the result of the product for the elements of this block. The main
164 part of the code is quite similar to the previous code. Arrays are allocated in
165 the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then
166 arrays are transferred inside the GPU memory with call to \texttt{cudaMemcpy}.
167 So the first step for each thread of a block is to compute the corresponding row
168 and column. With a 2 dimensional decomposition, \texttt{int i=
169 blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
170 and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
171 column. Then each thread has to compute the sum of the product of the line of
172 $A$ per the column of $B$. In order to use a register, the
173 kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
174 sum. Then the result is set into the matrix at the right place. The computation
175 of CPU matrix-matrix multiplication is performed as described previously. A
176 timer measures the time. In order to use 2 dimensional blocks, \texttt{dim3
177 dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
178 in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
179 create \texttt{width} thread in each dimension. After that, the kernel for the
180 matrix multiplication is called. At the end of the listing, the matrix $C$
181 computed by the GPU is transferred back into the CPU and we check if both matrices
182 C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
185 With $1,024 \times 1,024$ matrices, on a C2070M Tesla card, this code takes
186 $37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
187 takes $2465$ms without any parallelization (using only one core). Consequently
188 the speed up between the CPU and GPU version is about $65$ which is very good
189 regarding the difficulty of parallelizing this code.
191 \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
194 In this chapter, three simple Cuda examples have been presented. Those examples are
195 quite simple. As we cannot present all the possibilities of the Cuda
196 programming, interested readers are invited to consult Cuda programming
197 introduction books if some issues regarding the Cuda programming are not clear.
199 \putbib[Chapters/chapter2/biblio]