+\section{Second example: using CUBLAS}
+
+The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant
+routines that are often used. Those routines are heavily used in many scientific
+applications and are very optimized for vector operations, matrix-vector
+operations and matrix-matrix
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
+to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
+needed, implementing an efficient reduction routines with CUDA is far from being
+simple.
+
+In this second example, we consider that we have two vectors $A$ and $B$. First
+of all, we want to compute the sum of both vectors in a vector $C$. Then we want
+to compute the scalar product between $1/C$ and $1/A$. This is just an example
+which has no direct interest except to show how to program it with CUDA.
+
+Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
+addition of two arrays is exactly the same as the one described in the
+previous example.
+
+The kernel to compute the inverse of the elements of an array is very
+simple. For each thread index, the inverse of the array replaces the initial
+array.
+
+In the main function, the beginning is very similar to the one in the previous
+example. First, the number of elements is asked to the user. Then a call
+to \texttt{cublasCreate} allows to initialize the cublas library. It creates an
+handle. Then all the arrays are allocated in the host and the device, as in the
+previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
+computation is performed and the time for this CPU computation is measured. In
+order to compute the same result on the GPU, first of all, data from the CPU
+need to be copied into the memory of the GPU. For that, it is possible to use
+cublas function \texttt{cublasSetVector}. This function several arguments. More
+precisely, the first argument represents the number of elements to transfer, the
+second arguments is the size of each elements, the third element represents the
+source of the array to transfer (in the GPU), the fourth is an offset between
+each element of the source (usually this value is set to 1), the fifth is the
+destination (in the GPU) and the last is an offset between each element of the
+destination.
+
+\lstinputlisting[label=ch2:lst:ex2,caption=A simple example]{Chapters/chapter2/ex2.cu}
+
+