-\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
\chapter{Introduction to CUDA}
\label{chapter2}
\section{Introduction}
\label{ch2:intro}
-In this chapter we give some simple examples on CUDA programming. The goal is
+In this chapter we give some simple examples of CUDA programming. The goal is
not to provide an exhaustive presentation of all the functionalities of CUDA but
-rather giving some basic elements. Of course, readers that do not know CUDA are
+rather to give some basic elements. Of course, readers who do not know CUDA are
invited to read other books that are specialized on CUDA programming (for
-example: \cite{ch2:Sanders:2010:CEI}).
+example, \cite{ch2:Sanders:2010:CEI}).
\section{First example}
\label{ch2:1ex}
-This first example is intented to show how to build a very simple example with
-CUDA. The goal of this example is to performed the sum of two arrays and
-putting the result into a third array. A cuda program consists in a C code
-which calls CUDA kernels that are executed on a GPU. The listing of this code is
-in Listing~\ref{ch2:lst:ex1}.
+This first example is intented to show how to build a very simple program with
+CUDA. Its goal is to perform the sum of two arrays and put the result into a
+third array. A CUDA program consists in a C code which calls CUDA kernels that
+are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}.
-As GPUs have their own memory, the first step consists in allocating memory on
-the GPU. A call to \texttt{cudaMalloc} allows to allocate memory on the GPU. The
-first parameter of this function is a pointer on a memory on the device
-(i.e. the GPU). In this example, \texttt{d\_} is added on each variable allocated
-on the GPU meaning this variable is on the GPU. The second parameter represents
-the size of the allocated variables, this size is in bits.
+As GPUs have their own memory, the first step consists of allocating memory on
+the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc}
+allocates memory on the GPU. The second parameter represents the size of the
+allocated variables, this size is expressed in bits.
+
+\lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu}
+
In this example, we want to compare the execution time of the additions of two
arrays in CPU and GPU. So for both these operations, a timer is created to
-measure the time. CUDA proposes to manipulate timers quick easily. The first
-step is to create the timer, then to start it and at the end to stop it. For
-each of these operations a dedicated functions is used.
+measure the time. CUDA proposes to manipulate timers quite easily. The first
+step is to create the timer\index{CUDA functions!timer}, then to start it, and at
+the end to stop it. For each of these operations a dedicated function is used.
-In order to compute the same sum with a GPU, the first step consits in
+In order to compute the same sum with a GPU, the first step consists of
transferring the data from the CPU (considered as the host with CUDA) to the GPU
-(considered as the device with CUDA). A call to \texttt{cudaMalloc} allows to
-copy the content of an array allocated in the host to the device when the fourth
-parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the
-function is the destination array, the second is the source array and the third
-is the number of elements to copy (exprimed in bytes).
-
-Now that the GPU contains the data needed to perform the addition. In sequential
-such addition is achieved out with a loop on all the elements. With a GPU, it
-is possible to perform the addition of all elements of the arrays in parallel
-(if the number of blocks and threads per blocks is sufficient). In
-Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
+(considered as the device with CUDA). A call to \texttt{cudaMemcpy} copies the content of an array allocated in the host to the device when the fourth
+parameter is set
+to \texttt{cudaMemcpyHostToDevice}\index{CUDA functions!cudaMemcpy}. The first
+parameter of the function is the destination array, the second is the
+source array, and the third is the number of elements to copy (expressed in
+bytes).
+
+Now the GPU contains the data needed to perform the addition. In sequential
+programming, such addition is achieved with a loop on all the elements.
+With a GPU, it is possible to perform the addition of all the elements of the
+two arrays in parallel (if the number of blocks and threads per blocks is
+sufficient). In Listing~\ref{ch2:lst:ex1} at the beginning, a simple kernel,
called \texttt{addition} is defined to compute in parallel the summation of the
-two arrays. With CUDA, a kernel starts with the
-keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which
+two arrays. With CUDA, a kernel starts with the
+keyword \texttt{\_\_global\_\_} \index{CUDA keywords!\_\_shared\_\_} which
indicates that this kernel can be called from the C code. The first instruction
in this kernel is used to compute the variable \texttt{tid} which represents the
-thread index. This thread index\index{thread index} is computed according to
-the values of the block index (it is a variable of CUDA
-called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
-be decomposed into 1 dimension, 2 dimensions or 3 dimensions. According to the
-dimension of data manipulated, the appropriate dimension can be useful. In our
-example, only one dimension is used. Then using notation \texttt{.x} we can
-access to the first dimension (\texttt{.y} and \texttt{.z} allow respectively to
-access to the second and third dimension). The
-variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
-block.
+thread index. This thread index\index{CUDA keywords!thread index} is computed according to
+the values of the block index
+(called \texttt{blockIdx} \index{CUDA keywords!blockIdx} in CUDA) and of the
+thread index (called \texttt{threadIdx}\index{CUDA keywords!threadIdx} in
+CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension,
+2 dimensions, or 3 dimensions. According to the dimension of manipulated data,
+the dimension of blocks of threads must be chosen carefully. In our example, only one dimension is
+used. Then using the notation \texttt{.x}, we can access the first dimension
+(\texttt{.y} and \texttt{.z}, respectively allow access to the second and
+third dimension). The variable \texttt{blockDim}\index{CUDA keywords!blockDim}
+gives the size of each block.
+
-\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
-\section{Second example: using CUBLAS}
+\section{Second example: using CUBLAS \index{CUBLAS}}
\label{ch2:2ex}
-The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant
-routines that are often used. Those routines are heavily used in many scientific
-applications and are very optimized for vector operations, matrix-vector
-operations and matrix-matrix
+The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient
+routines for basic linear operations. Those routines are heavily used in many
+scientific applications and are optimized for vector operations, matrix-vector
+operations, and matrix-matrix
operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
-needed, implementing an efficient reduction routines with CUDA is far from being
-simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
-operation which combines all the elements of an array and extract a number
-computed with all the elements. For example, a sum, a maximum or a dot product
-are reduction operations.
-
-In this second example, we consider that we have two vectors $A$ and $B$. First
-of all, we want to compute the sum of both vectors in a vector $C$. Then we want
+needed, implementing an efficient reduction routine with CUDA is far from being
+simple. Roughly speaking, a reduction operation\index{reduction operation} is an
+operation which combines all the elements of an array and extracts a number
+computed from all the elements. For example, a sum, a maximum, or a dot product
+are reduction operations.
+
+In this second example, we have two vectors $A$ and $B$. First
+of all, we want to compute the sum of both vectors and store the result in a vector $C$. Then we want
to compute the scalar product between $1/C$ and $1/A$. This is just an example
which has no direct interest except to show how to program it with CUDA.
array.
In the main function, the beginning is very similar to the one in the previous
-example. First, the number of elements is asked to the user. Then a call
-to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
-handle. Then all the arrays are allocated in the host and the device, as in the
-previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
-computation is performed and the time for this CPU computation is measured. In
-order to compute the same result on the GPU, first of all, data from the CPU
+example. First, the user is asked to define the number of elements. Then a
+call to \texttt{cublasCreate} initializes the CUBLAS library. It
+creates a handle. Then all the arrays are allocated in the host and the device,
+as in the previous example. Both arrays $A$ and $B$ are initialized. The CPU
+computation is performed and the time for this is measured. In
+order to compute the same result for the GPU, first of all, data from the CPU
need to be copied into the memory of the GPU. For that, it is possible to use
-cublas function \texttt{cublasSetVector}. This function several arguments. More
-precisely, the first argument represents the number of elements to transfer, the
-second arguments is the size of each elements, the third element represents the
-source of the array to transfer (in the GPU), the fourth is an offset between
-each element of the source (usually this value is set to 1), the fifth is the
-destination (in the GPU) and the last is an offset between each element of the
-destination. Then we call the kernel \texttt{addition} which computes the sum of
-all elements of arrays $A$ and $B$. The \texttt{inverse} kernel is called twice,
-once to inverse elements of array $C$ and once for $A$. Finally, we call the
-function \texttt{cublasDdot} which computes the dot product of two vectors. To
-use this routine, we must specify the handle initialized by Cuda, the number of
-elements to consider, then each vector is followed by the offset between every
-element. After the GPU computation, it is possible to check that both
-computation produce the same result.
-
-\lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}
+CUBLAS function \texttt{cublasSetVector}. This function has several
+arguments. More precisely, the first argument represents the number of elements
+to transfer, the second arguments is the size of each element, the third element
+represents the source of the array to transfer (in the GPU), the fourth is an
+offset between each element of the source (usually this value is set to 1), the
+fifth is the destination (in the GPU), and the last is an offset between each
+element of the destination. Then we call the kernel \texttt{addition} which
+computes the sum of all elements of arrays $A$ and $B$. The \texttt{inverse}
+kernel is called twice, once to inverse elements of array $C$ and once for
+$A$. Finally, we call the function \texttt{cublasDdot} which computes the dot
+product of two vectors. To use this routine, we must specify the handle
+initialized by CUDA, the number of elements to consider, then each vector is
+followed by the offset between every element. After the GPU computation, it is
+possible to check that both computations produce the same result.
+
+\lstinputlisting[label=ch2:lst:ex2,caption=simple example with CUBLAS]{Chapters/chapter2/ex2.cu}
\section{Third example: matrix-matrix multiplication}
\label{ch2:3ex}
Matrix-matrix multiplication is an operation which is quite easy to parallelize
-with a GPU. If we consider that a matrix is represented using a two dimensional
-array, A[i][j] represents the the element of the $i^{th}$ row and of the
-$j^{th}$ column. In many case, it is easier to manipulate 1D array instead of 2D
-array. With Cuda, even if it is possible to manipulate 2D arrays, in the
-following we present an example based on 1D array. For sake of simplicity we
-consider we have a squared matrix of size \texttt{size}. So with a 1D
-array, \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$
-row and of the $j^{th}$ column.
-
-On C2070M Tesla card, this code take 37.68ms to perform the multiplication. On a
-Intel Xeon E31245 at 3.30GHz, it takes 2465ms without any parallelization (using
-only one core). Consequently the speed up between the CPU and GPU version is
-about 65 which is very good regarding the difficulty of parallelizing this code.
-
-\lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
+with a GPU. If we consider that a matrix is represented using a two-dimensional
+array, $A[i][j]$ represents the element of the $i$ row and of the $j$
+column. In many cases, it is easier to manipulate a one-dimentional (1D) array rather than a 2D
+array. With CUDA, even if it is possible to manipulate 2D arrays, in the
+following we present an example based on a 1D array. For the sake of simplicity,
+we consider we have a square matrix of size \texttt{size}. So with a 1D
+array, \texttt{A[i*size+j]} allows us to have access to the element of the
+$i$ row and of the $j$ column.
+
+With sequential programming, the matrix-matrix multiplication is performed using
+three loops. We assume that $A$, $B$ represent two square matrices and the
+result of the multiplication of $A \times B$ is $C$. The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j];
+\end{equation}
+
+In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops,
+one for $i$, one for $j$, and one for $k$. In order to perform the same
+computation on a GPU, a naive solution consists of considering that the matrix
+$C$ is split into 2-dimensional blocks. The size of each block must be chosen
+such that the number of threads per block is less than $1,024$.
+
+
+In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
+each dimension, the variable \texttt{width} is used for that. The
+variable \texttt{nbTh} represents the number of threads per block. So,
+to compute the matrix-matrix product on a GPU, each block of threads is assigned
+to compute the result of the product of the elements of that block. The main
+part of the code is quite similar to the previous code. Arrays are allocated in
+the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then
+arrays are transferred to the GPU memory with call to \texttt{cudaMemcpy}.
+So the first step for each thread of a block is to compute the corresponding row
+and column. With a 2-dimensional decomposition, \texttt{int i=
+blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
+and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
+column. Then each thread has to compute the sum of the product of the row of
+$A$ by the column of $B$. In order to use a register, the
+kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
+sum. Then the result is set into the matrix at the right place. The computation
+of CPU matrix-matrix multiplication is performed as described previously. A
+timer measures the time. In order to use 2-dimensional blocks, \texttt{dim3
+dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
+in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
+create \texttt{width} thread in each dimension. After that, the kernel for the
+matrix multiplication is called. At the end of the listing, the matrix $C$
+computed by the GPU is transferred back into the CPU and we check that both matrices
+C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
+
+
+With $1,024 \times 1,024$ matrices, on a C2070M Tesla card, this code takes
+$37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
+takes $2465$ms without any parallelization (using only one core). Consequently
+the speed up between the CPU and GPU version is about $65$ which is very good
+considering the difficulty of parallelizing this code.
+
+\lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
+
+\section{Conclusion}
+In this chapter, three simple CUDA examples have been presented. As we cannot
+present all the possibilities of the CUDA programming, interested readers are
+invited to consult CUDA programming introduction books if some issues regarding
+the CUDA programming are not clear.
\putbib[Chapters/chapter2/biblio]