-\chapterauthor{Author Name1}{Affiliation text1}
-\chapterauthor{Author Name2}{Affiliation text2}
-
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
\chapter{Introduction to CUDA}
\label{chapter2}
In this chapter we give some simple examples on CUDA programming. The goal is
not to provide an exhaustive presentation of all the functionalities of CUDA but
rather giving some basic elements. Of course, readers that do not know CUDA are
-invited to read other books that are specialized on CUDA programming (for example: \cite{Sanders:2010:CEI}).
+invited to read other books that are specialized on CUDA programming (for
+example: \cite{ch2:Sanders:2010:CEI}).
\section{First example}
This first example is intented to show how to build a very simple example with
CUDA. The goal of this example is to performed the sum of two arrays and
putting the result into a third array. A cuda program consists in a C code
-which calls CUDA kernels that are executed on a GPU. The listing of this code is in Listing~\ref{ch2:lst:ex1}
+which calls CUDA kernels that are executed on a GPU. The listing of this code is
+in Listing~\ref{ch2:lst:ex1}.
As GPUs have their own memory, the first step consists in allocating memory on
(if the number of blocks and threads per blocks is sufficient). In
Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
called \texttt{addition} is defined to compute in parallel the summation of the
-two arrays. With CUDA, a kernel starts with the keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_}
-which indicates that this kernel can be called from the C code. The first
-instruction in this kernel is used to compute the variable \texttt{tid} which
-represents the thread index. This thread index\index{thread index} is computed
-according to the values of the block index (it is a variable of CUDA
+two arrays. With CUDA, a kernel starts with the
+keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which
+indicates that this kernel can be called from the C code. The first instruction
+in this kernel is used to compute the variable \texttt{tid} which represents the
+thread index. This thread index\index{thread index} is computed according to
+the values of the block index (it is a variable of CUDA
called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
be decomposed into 1 dimension, 2 dimensions or 3 dimensions. According to the
dimension of data manipulated, the appropriate dimension can be useful. In our
\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
+\section{Second example: using CUBLAS}
+
+The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant
+routines that are often used. Those routines are heavily used in many scientific
+applications and are very optimized for vector operations, matrix-vector
+operations and matrix-matrix
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
+to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
+needed, implementing an efficient reduction routines with CUDA is far from being
+simple.
+
+In this second example, we consider that we have two vectors $A$ and $B$. First
+of all, we want to compute the sum of both vectors in a vector $C$. Then we want
+to compute the scalar product between $1/C$ and $1/A$. This is just an example
+which has no direct interest except to show how to program it with CUDA.
+
+Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
+addition of two arrays is exactly the same as the one described in the
+previous example.
+
+The kernel to compute the inverse of the elements of an array is very
+simple. For each thread index, the inverse of the array replaces the initial
+array.
+
+In the main function, the beginning is very similar to the one in the previous
+example. First, the number of elements is asked to the user. Then a call
+to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
+handle. Then all the arrays are allocated in the host and the device, as in the
+previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
+computation is performed and the time for this CPU computation is measured. In
+order to compute the same result on the GPU, first of all, data from the CPU
+need to be copied into the memory of the GPU. For that, it is possible to use
+cublas function \texttt{cublasSetVector}. This function several arguments. More
+precisely, the first argument represents the number of elements to transfer, the
+second arguments is the size of each elements, the third element represents the
+source of the array to transfer (in the GPU), the fourth is an offset between
+each element of the source (usually this value is set to 1), the fifth is the
+destination (in the GPU) and the last is an offset between each element of the
+destination.
+
+\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
+
+
\putbib[Chapters/chapter2/biblio]