-\chapterauthor{Author Name1}{Affiliation text1}
-\chapterauthor{Author Name2}{Affiliation text2}
-
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
\chapter{Introduction to CUDA}
\label{chapter2}
In this chapter we give some simple examples on CUDA programming. The goal is
not to provide an exhaustive presentation of all the functionalities of CUDA but
rather giving some basic elements. Of course, readers that do not know CUDA are
-invited to read other books that are specialized on CUDA programming.
+invited to read other books that are specialized on CUDA programming (for
+example: \cite{ch2:Sanders:2010:CEI}).
\section{First example}
This first example is intented to show how to build a very simple example with
CUDA. The goal of this example is to performed the sum of two arrays and
putting the result into a third array. A cuda program consists in a C code
-which calls CUDA kernels that are executed on a GPU.
+which calls CUDA kernels that are executed on a GPU. The listing of this code is
+in Listing~\ref{ch2:lst:ex1}.
As GPUs have their own memory, the first step consists in allocating memory on
function is the destination array, the second is the source array and the third
is the number of elements to copy (exprimed in bytes).
-\putbib[biblio]
+Now that the GPU contains the data needed to perform the addition. In sequential
+such addition is achieved out with a loop on all the elements. With a GPU, it
+is possible to perform the addition of all elements of the arrays in parallel
+(if the number of blocks and threads per blocks is sufficient). In
+Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
+called \texttt{addition} is defined to compute in parallel the summation of the
+two arrays. With CUDA, a kernel starts with the
+keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which
+indicates that this kernel can be called from the C code. The first instruction
+in this kernel is used to compute the variable \texttt{tid} which represents the
+thread index. This thread index\index{thread index} is computed according to
+the values of the block index (it is a variable of CUDA
+called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
+be decomposed into 1 dimension, 2 dimensions or 3 dimensions. According to the
+dimension of data manipulated, the appropriate dimension can be useful. In our
+example, only one dimension is used. Then using notation \texttt{.x} we can
+access to the first dimension (\texttt{.y} and \texttt{.z} allow respectively to
+access to the second and third dimension). The
+variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
+block.
+
+
+
+\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
+
+\section{Second example: using CUBLAS}
+
+The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant
+routines that are often used. Those routines are heavily used in many scientific
+applications and are very optimized for vector operations, matrix-vector
+operations and matrix-matrix
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
+to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
+needed, implementing an efficient reduction routines with CUDA is far from being
+simple.
+
+In this second example, we consider that we have two vectors $A$ and $B$. First
+of all, we want to compute the sum of both vectors in a vector $C$. Then we want
+to compute the scalar product between $1/C$ and $1/A$. This is just an example
+which has no direct interest except to show how to program it with CUDA.
+
+Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
+addition of two arrays is exactly the same as the one described in the
+previous example.
+
+The kernel to compute the inverse of the elements of an array is very
+simple. For each thread index, the inverse of the array replaces the initial
+array.
+
+In the main function, the beginning is very similar to the one in the previous
+example. First, the number of elements is asked to the user. Then a call
+to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
+handle. Then all the arrays are allocated in the host and the device, as in the
+previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
+computation is performed and the time for this CPU computation is measured. In
+order to compute the same result on the GPU, first of all, data from the CPU
+need to be copied into the memory of the GPU. For that, it is possible to use
+cublas function \texttt{cublasSetVector}. This function several arguments. More
+precisely, the first argument represents the number of elements to transfer, the
+second arguments is the size of each elements, the third element represents the
+source of the array to transfer (in the GPU), the fourth is an offset between
+each element of the source (usually this value is set to 1), the fifth is the
+destination (in the GPU) and the last is an offset between each element of the
+destination.
+
+\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
+
+
+\putbib[Chapters/chapter2/biblio]