\chapterauthor{Author Name1}{Affiliation text1}
\chapterauthor{Author Name2}{Affiliation text2}
\chapter{Introduction to CUDA}
\section{Introduction}\label{intro}
In this chapter we give some simple examples of CUDA programming. The goal is
not to provide an exhaustive presentation of all the functionalities of CUDA,
but rather to give some basic elements. Readers who do not know CUDA are
invited to consult books specialized in CUDA programming (for example: \cite{Sanders:2010:CEI}).
\section{First example}
This first example is intended to show how to build a very simple program with
CUDA. Its goal is to perform the sum of two arrays and
to put the result into a third array. A CUDA program consists of C code
which calls CUDA kernels that are executed on a GPU. The listing of this code is given in Listing~\ref{ch2:lst:ex1}.
As GPUs have their own memory, the first step consists in allocating memory on
the GPU. A call to \texttt{cudaMalloc} allocates memory on the GPU. The
first parameter of this function is a pointer to the memory on the device
(i.e. the GPU). In this example, the prefix \texttt{d\_} is added to each variable allocated
on the GPU, indicating that the variable resides on the GPU. The second parameter is
the size of the allocated variable; this size is expressed in bytes.
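As a sketch, allocating an array of \texttt{nb} integers on the GPU might look as follows (the names \texttt{d\_arrayA} and \texttt{nb} are chosen here for illustration and are not necessarily those used in Listing~\ref{ch2:lst:ex1}):

\begin{lstlisting}
int *d_arrayA;                        // pointer to device memory
int size = nb * sizeof(int);          // size in bytes, not in elements
cudaMalloc((void**)&d_arrayA, size);  // allocate the array on the GPU
\end{lstlisting}

Note that \texttt{cudaMalloc} receives the address of the pointer, so that it can write the device address into it.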
In this example, we want to compare the execution time of the addition of two
arrays on the CPU and on the GPU. So for both of these operations, a timer is created to
measure the time. CUDA makes it quite easy to manipulate timers. The first
step is to create the timer, then to start it, and at the end to stop it. For
each of these operations a dedicated function is used.
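One standard way to measure time on the GPU uses CUDA events; the sketch below illustrates the create/start/stop pattern (this is an illustration and not necessarily the exact timer API used in Listing~\ref{ch2:lst:ex1}):

\begin{lstlisting}
cudaEvent_t start, stop;
float elapsed;                        // elapsed time in milliseconds
cudaEventCreate(&start);              // create the timer
cudaEventCreate(&stop);
cudaEventRecord(start, 0);            // start it
/* ... operation to measure ... */
cudaEventRecord(stop, 0);             // stop it
cudaEventSynchronize(stop);           // wait until the stop event completes
cudaEventElapsedTime(&elapsed, start, stop);
\end{lstlisting}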
In order to compute the same sum on the GPU, the first step consists in
transferring the data from the CPU (considered as the host in CUDA) to the GPU
(considered as the device in CUDA). A call to \texttt{cudaMemcpy}
copies the content of an array allocated on the host to the device when the fourth
parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the
function is the destination array, the second is the source array, and the third
is the size to copy (expressed in bytes).
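Such a transfer might be written as follows (again, the variable names \texttt{d\_arrayA}, \texttt{h\_arrayA}, and \texttt{size} are illustrative):

\begin{lstlisting}
// copy size bytes from the host array h_arrayA
// to the device array d_arrayA
cudaMemcpy(d_arrayA, h_arrayA, size, cudaMemcpyHostToDevice);
\end{lstlisting}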
Now the GPU contains the data needed to perform the addition. In a sequential
code, such an addition is achieved with a loop over all the elements. With a GPU, it is
possible to perform the addition of all the elements of the arrays in parallel (if
the number of blocks and threads per block is sufficient). At the beginning of
Listing~\ref{ch2:lst:ex1}, a simple kernel
called \texttt{addition} is defined to compute in parallel the sum of the
two arrays. With CUDA, a kernel starts with the keyword \texttt{\_\_global\_\_}
which indicates that this kernel can be called from the C code. The first
instruction in this kernel computes \texttt{tid}, which
represents the thread index. This thread index is computed from the
value of the block index (a CUDA built-in variable
called \texttt{blockIdx\index{CUDA~keywords!blockIdx}}). Blocks of threads can
be organized in one, two, or three dimensions. Depending on the
dimensionality of the data manipulated, the appropriate dimension can be chosen. In our
example, only one dimension is used. The notation \texttt{.x} gives
access to the first dimension (\texttt{.y} and \texttt{.z} give access to
the second and third dimensions, respectively). The
variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
block, i.e. the number of threads per block in each dimension.
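Putting these elements together, the kernel may look like the following sketch (the exact parameter names and order are assumptions here and may differ from those in Listing~\ref{ch2:lst:ex1}):

\begin{lstlisting}
// Kernel computing d_C[i] = d_A[i] + d_B[i] in parallel:
// each thread handles the element whose index equals its
// global thread index tid.
__global__ void addition(int size, int *d_C, int *d_A, int *d_B) {
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < size)                 // guard against extra threads
    d_C[tid] = d_A[tid] + d_B[tid];
}
\end{lstlisting}

The test \texttt{if (tid < size)} is needed because the total number of threads launched may exceed the number of elements in the arrays.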
\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}

\putbib[Chapters/chapter2/biblio]