\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}

\chapter{Introduction to CUDA}

\section{Introduction}\label{intro}
In this chapter we give some simple examples of CUDA programming. The goal is
not to provide an exhaustive presentation of all the functionalities of CUDA but
rather to give some basic elements. Readers who do not know CUDA are
invited to read other books that are specialized in CUDA programming (for
example, \cite{ch2:Sanders:2010:CEI}).

\section{First example}

This first example is intended to show how to build a very simple program with
CUDA. Its goal is to compute the sum of two arrays and to put the result into
a third array. A CUDA program consists of C code
which calls CUDA kernels that are executed on a GPU. The code is given
in Listing~\ref{ch2:lst:ex1}.

As GPUs have their own memory, the first step consists of allocating memory on
the GPU. A call to \texttt{cudaMalloc} allocates memory on the GPU. The
first parameter of this function is the address of a pointer, which receives
the location of the memory allocated on the device
(i.e. the GPU). In this example, the prefix \texttt{d\_} is added to each
variable allocated on the GPU, to indicate that it resides on the GPU. The
second parameter is the size of the allocation, expressed in bytes.

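
As an illustration of the allocation step, here is a minimal sketch. It is not
taken from Listing~\ref{ch2:lst:ex1}: the helper name
\texttt{alloc\_device\_ints}, the element type \texttt{int} and the array
length are assumptions chosen for the example. Only \texttt{cudaMalloc},
\texttt{cudaFree} and \texttt{cudaGetErrorString} come from the CUDA runtime.

\begin{lstlisting}
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not from the chapter's listing): allocates n ints
// on the device and returns the device pointer, or NULL on failure.
int* alloc_device_ints(size_t n) {
    int* d_ptr = NULL;
    // First argument: address of the device pointer.
    // Second argument: size of the allocation in bytes, not in elements.
    cudaError_t err = cudaMalloc((void**)&d_ptr, n * sizeof(int));
    if (err != cudaSuccess) {
        fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(err));
        return NULL;
    }
    return d_ptr;
}

int main(void) {
    const size_t n = 1000;           // assumed array length
    int* d_A = alloc_device_ints(n); // d_ prefix: the memory lives on the GPU
    if (d_A == NULL) return 1;
    printf("allocated %zu bytes on the device\n", n * sizeof(int));
    cudaFree(d_A);                   // release the device memory
    return 0;
}
\end{lstlisting}

Passing \texttt{n * sizeof(int)} rather than \texttt{n} is the point to
remember: the size is a byte count.
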
In this example, we want to compare the execution times of the addition of two
arrays on the CPU and on the GPU. For both operations, a timer is used to
measure the time. CUDA makes it quite easy to manipulate timers: the first
step is to create the timer, then to start it, and at the end to stop it. A
dedicated function is used for each of these operations.

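
The timer functions used by the listing are not named in the text above, so
the following sketch relies on a different mechanism that is known to exist in
the CUDA runtime, namely events (\texttt{cudaEventCreate},
\texttt{cudaEventRecord}, \texttt{cudaEventSynchronize} and
\texttt{cudaEventElapsedTime}). It follows the same create, start, stop pattern.
The helper name \texttt{time\_memset\_ms} and the timed operation
(\texttt{cudaMemset}) are assumptions made for the illustration.

\begin{lstlisting}
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns the time in milliseconds taken by a
// cudaMemset of `bytes` bytes, or -1 if the allocation fails.
float time_memset_ms(size_t bytes) {
    void* d_buf = NULL;
    if (cudaMalloc(&d_buf, bytes) != cudaSuccess) return -1.0f;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);        // create the two time stamps
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);      // "start the timer"
    cudaMemset(d_buf, 0, bytes);    // operation being timed
    cudaEventRecord(stop, 0);       // "stop the timer"
    cudaEventSynchronize(stop);     // wait until the stop event is reached

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_buf);
    return ms;
}

int main(void) {
    float ms = time_memset_ms(1 << 20);  // 1 MiB, an arbitrary size
    if (ms < 0.0f) return 1;
    printf("cudaMemset took %f ms\n", ms);
    return 0;
}
\end{lstlisting}
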
In order to compute the same sum with a GPU, the first step consists of
transferring the data from the CPU (considered as the host with CUDA) to the GPU
(considered as the device with CUDA). A call to \texttt{cudaMemcpy} copies
the content of an array allocated on the host to the device when the fourth
parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the
function is the destination array, the second is the source array, and the third
is the amount of data to copy (expressed in bytes).

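
The transfer step can be sketched as follows. This is an illustration written
for this section, not an excerpt of the listing: the helper name
\texttt{round\_trip\_mismatches}, the \texttt{int} element type and the copy
back to the host with \texttt{cudaMemcpyDeviceToHost} are assumptions added so
that the result can be checked.

\begin{lstlisting}
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper: copies n ints to the device and back, and returns
// the number of elements that differ after the round trip (expected: 0).
int round_trip_mismatches(int n) {
    size_t bytes = n * sizeof(int);
    int* h_A = (int*)malloc(bytes);     // source array on the host
    int* h_back = (int*)malloc(bytes);  // destination of the copy back
    for (int i = 0; i < n; i++) h_A[i] = i;

    int* d_A = NULL;
    cudaMalloc((void**)&d_A, bytes);

    // Arguments: destination, source, size in bytes, direction.
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(h_back, d_A, bytes, cudaMemcpyDeviceToHost);

    int mismatches = 0;
    for (int i = 0; i < n; i++)
        if (h_back[i] != h_A[i]) mismatches++;

    cudaFree(d_A);
    free(h_A);
    free(h_back);
    return mismatches;
}

int main(void) {
    int bad = round_trip_mismatches(1000);
    printf("%d mismatching elements\n", bad);
    return bad == 0 ? 0 : 1;
}
\end{lstlisting}
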
Now the GPU contains the data needed to perform the addition. In sequential
code, such an addition is carried out with a loop over all the elements. With a
GPU, it is possible to perform the addition of all the elements of the arrays in
parallel (if the number of blocks and threads per block is sufficient). At the
beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel
called \texttt{addition} is defined to compute the sum of the
two arrays in parallel. With CUDA, a kernel starts with the
keyword \texttt{\_\_global\_\_}\index{CUDA~keywords!\_\_global\_\_}, which
indicates that this kernel can be called from the C code. The first instruction
in this kernel computes the variable \texttt{tid}, which represents the
thread index. This thread index\index{thread index} is computed from
the block index (a CUDA variable
called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
be organized in 1, 2 or 3 dimensions; the appropriate choice depends on the
dimension of the data being manipulated. In our
example, only one dimension is used. The notation \texttt{.x} gives
access to the first dimension (\texttt{.y} and \texttt{.z} give
access to the second and third dimensions, respectively). The
variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
block.

\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
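
For readers who want a compact view of the indexing scheme described in this
section, here is a minimal sketch of an addition kernel and its launch. It is
not the content of \texttt{ex1.cu}: the kernel signature, the host function
\texttt{count\_addition\_errors}, the block size of 256 threads and the test
data are assumptions. The global index is formed from \texttt{blockIdx.x},
\texttt{blockDim.x} and \texttt{threadIdx.x}, which is the usual
one-dimensional pattern.

\begin{lstlisting}
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// One thread per element; the signature is an assumption for this sketch.
__global__ void addition(int size, const int* d_A, const int* d_B, int* d_C) {
    // Global thread index in a 1D grid of 1D blocks.
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size)                 // guard: the last block may be partial
        d_C[tid] = d_A[tid] + d_B[tid];
}

// Hypothetical host driver: returns the number of wrong results (expected: 0).
int count_addition_errors(int n) {
    size_t bytes = n * sizeof(int);
    int* h_A = (int*)malloc(bytes);
    int* h_B = (int*)malloc(bytes);
    int* h_C = (int*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_A[i] = i; h_B[i] = 2 * i; }

    int *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);
    cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // assumed block size
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
    addition<<<blocks, threads>>>(n, d_A, d_B, d_C);

    // This copy on the default stream waits for the kernel to finish.
    cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);

    int errors = 0;
    for (int i = 0; i < n; i++)
        if (h_C[i] != h_A[i] + h_B[i]) errors++;

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    free(h_A); free(h_B); free(h_C);
    return errors;
}

int main(void) {
    int errors = count_addition_errors(1000);
    printf("%d wrong elements\n", errors);
    return errors == 0 ? 0 : 1;
}
\end{lstlisting}

The bounds test on \texttt{tid} matters because the number of launched threads
is rounded up to a multiple of the block size.
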

\section{Second example: using CUBLAS}

The Basic Linear Algebra Subprograms (BLAS) give programmers access to efficient
implementations of commonly used routines. Those routines are heavily used in
many scientific
applications and are highly optimized for vector operations, matrix-vector
operations and matrix-matrix
operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
easy to implement with CUDA. Nevertheless, as soon as a reduction is
needed, implementing an efficient reduction routine with CUDA is far from being
simple.

In this second example, we consider two vectors $A$ and $B$. First
we want to compute the sum of both vectors in a vector $C$. Then we want
to compute the scalar product between $1/C$ and $1/A$, where the inverse is
taken element by element. This is just an example
which has no direct interest except to show how to program it with CUDA.

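
Written out, and assuming that the inverse of a vector is taken element by
element, the computation is the following. The symbols $n$ (number of
elements) and $s$ (final scalar) are introduced here for convenience and do
not appear in the listing:
\[
  C_i = A_i + B_i, \qquad
  s = \sum_{i=0}^{n-1} \frac{1}{C_i} \cdot \frac{1}{A_i}.
\]
This only makes sense when every $A_i$ and every $C_i$ is nonzero. The sum
over $i$ is the reduction mentioned above.
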
Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel, for the
addition of two arrays, is exactly the same as the one described in the
previous example.

The kernel that computes the inverse of the elements of an array is very
simple. For each thread index, the inverse of the element replaces the initial
value.

In the main function, the beginning is very similar to the one in the previous
example. First, the user is asked for the number of elements. Then a call
to \texttt{cublasCreate} initializes the CUBLAS library and creates a
handle. Then all the arrays are allocated on the host and on the device, as in
the previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
computation is performed and its execution time is measured. In
order to compute the same result on the GPU, the data must first be copied from
the CPU into the memory of the GPU. For that, it is possible to use
the CUBLAS function \texttt{cublasSetVector}.

\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
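
To summarize the steps of this second example in one place, here is a
condensed sketch. It is not the content of \texttt{ex2.cu}: the \texttt{double}
element type, the function names \texttt{gpu\_inverse\_dot} and
\texttt{cpu\_inverse\_dot}, the test data and the use of \texttt{cublasDdot}
for the scalar product are assumptions. Error checking is omitted for brevity.
The program must be linked with the CUBLAS library (for instance with
\texttt{-lcublas}).

\begin{lstlisting}
#include <cstdio>
#include <cstdlib>
#include <cmath>
#include <cuda_runtime.h>
#include <cublas_v2.h>

// Element-wise sum, same pattern as in the first example (double version).
__global__ void addition(int size, const double* d_A, const double* d_B,
                         double* d_C) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) d_C[tid] = d_A[tid] + d_B[tid];
}

// Each thread replaces one element by its inverse.
__global__ void inverse(int size, double* d_x) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid < size) d_x[tid] = 1.0 / d_x[tid];
}

// CPU reference for the assumed data A[i] = i + 1 and B[i] = 1.
double cpu_inverse_dot(int n) {
    double sum = 0.0;
    for (int i = 0; i < n; i++) {
        double a = i + 1.0, b = 1.0;
        sum += (1.0 / (a + b)) * (1.0 / a);
    }
    return sum;
}

double gpu_inverse_dot(int n) {
    size_t bytes = n * sizeof(double);
    double* h_A = (double*)malloc(bytes);
    double* h_B = (double*)malloc(bytes);
    for (int i = 0; i < n; i++) { h_A[i] = i + 1.0; h_B[i] = 1.0; }

    cublasHandle_t handle;
    cublasCreate(&handle);              // initialize CUBLAS, get a handle

    double *d_A = NULL, *d_B = NULL, *d_C = NULL;
    cudaMalloc((void**)&d_A, bytes);
    cudaMalloc((void**)&d_B, bytes);
    cudaMalloc((void**)&d_C, bytes);

    // Host-to-device copies: count, element size, source, stride,
    // destination, stride.
    cublasSetVector(n, sizeof(double), h_A, 1, d_A, 1);
    cublasSetVector(n, sizeof(double), h_B, 1, d_B, 1);

    int threads = 256;                  // assumed block size
    int blocks = (n + threads - 1) / threads;
    addition<<<blocks, threads>>>(n, d_A, d_B, d_C);  // C = A + B
    inverse<<<blocks, threads>>>(n, d_C);             // C = 1/C
    inverse<<<blocks, threads>>>(n, d_A);             // A = 1/A

    // The reduction is delegated to CUBLAS; the result is written to a
    // host variable (default pointer mode).
    double result = 0.0;
    cublasDdot(handle, n, d_C, 1, d_A, 1, &result);

    cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
    cublasDestroy(handle);
    free(h_A); free(h_B);
    return result;
}

int main(void) {
    int n = 1000;
    double gpu = gpu_inverse_dot(n);
    double cpu = cpu_inverse_dot(n);
    printf("GPU: %.12f  CPU: %.12f\n", gpu, cpu);
    return fabs(gpu - cpu) < 1e-9 ? 0 : 1;
}
\end{lstlisting}

Delegating the scalar product to the library avoids writing a reduction kernel
by hand, which is the difficult part mentioned at the beginning of this
section.
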

\putbib[Chapters/chapter2/biblio]