+Now the GPU contains the data needed to perform the addition. In a sequential
+program, such an addition is carried out with a loop over all the elements. With
+a GPU, it is possible to perform the addition of all the elements of the arrays
+in parallel (provided the number of blocks and of threads per block is
+sufficient). At the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel,
+called \texttt{addition}, is defined to compute in parallel the summation of the
+two arrays. With CUDA, a kernel starts with the
+keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_global\_\_} which
+indicates that this kernel can be called from the C code. The first instruction
+in this kernel computes the variable \texttt{tid}, which represents the
+thread index. This thread index\index{thread index} is computed from
+the value of the block index (a CUDA built-in variable
+called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}), the block size, and the
+thread index within the block. Blocks of threads can be organized in 1, 2, or 3
+dimensions; depending on the dimensionality of the data being manipulated, the
+appropriate layout can be chosen. In our example, only one dimension is used.
+The notation \texttt{.x} gives access to the first dimension (\texttt{.y}
+and \texttt{.z} respectively give access to the second and third dimensions).
+The variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of
+each block.
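+
+The actual kernel is given in Listing~\ref{ch2:lst:ex1}. The fragment below is
+only a minimal sketch of such an addition kernel and of the thread index
+computation; the parameter names are illustrative assumptions, not necessarily
+those used in the listing.
+
+\begin{lstlisting}
+// Sketch of a kernel adding two arrays element by element (illustrative names).
+__global__ void addition(int size, double *d_C, double *d_A, double *d_B) {
+  // Global thread index: block index times block size plus the
+  // thread index inside the block (1D decomposition).
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (tid < size)
+    d_C[tid] = d_A[tid] + d_B[tid];
+}
+\end{lstlisting}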
+
+
+
+\lstinputlisting[label=ch2:lst:ex1,caption=A simple example]{Chapters/chapter2/ex1.cu}
+
+\section{Second example: using CUBLAS}
+\label{ch2:2ex}
+
+The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient,
+frequently needed routines. These routines are heavily used in many scientific
+applications and are highly optimized for vector operations, matrix-vector
+operations, and matrix-matrix
+operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of these operations seem
+easy to implement with CUDA. Nevertheless, as soon as a reduction is
+needed, implementing an efficient reduction routine with CUDA is far from being
+simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
+operation which combines all the elements of an array and extracts a single
+value computed from all of them. For example, a sum, a maximum, or a dot product
+are reduction operations.
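+
+To give an idea of what such a routine involves, the following fragment sketches
+a naive block-level sum reduction. It is only a sketch under the assumption of a
+power-of-two block size; an efficient implementation requires considerably more
+care.
+
+\begin{lstlisting}
+// Naive sum reduction sketch: each block reduces its chunk of d_in into
+// d_partial[blockIdx.x]. Assumes blockDim.x is a power of two and that the
+// kernel is launched with blockDim.x*sizeof(double) bytes of shared memory.
+__global__ void sumReduction(int size, double *d_in, double *d_partial) {
+  extern __shared__ double cache[];
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  cache[threadIdx.x] = (tid < size) ? d_in[tid] : 0.0;
+  __syncthreads();
+  // Halve the number of active threads at each step.
+  for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
+    if (threadIdx.x < stride)
+      cache[threadIdx.x] += cache[threadIdx.x + stride];
+    __syncthreads();
+  }
+  if (threadIdx.x == 0)
+    d_partial[blockIdx.x] = cache[0];
+  // The per-block partial sums still have to be combined on the host
+  // or in a second kernel.
+}
+\end{lstlisting}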
+
+In this second example, we consider two vectors $A$ and $B$. First
+of all, we want to compute the sum of both vectors and store it in a vector $C$.
+Then we want to compute the scalar product of $1/C$ and $1/A$, i.e., of the
+vectors whose elements are the inverses of those of $C$ and $A$. This is just an
+example which has no direct interest except to show how to program it with CUDA.
+
+Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
+addition of two arrays is exactly the same as the one described in the
+previous example.
+
+The kernel that computes the inverse of the elements of an array is very
+simple. For each thread index, the corresponding element of the array is
+replaced by its inverse.
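+
+A minimal sketch of such an inverse kernel could look as follows; the parameter
+names are illustrative, and the actual kernel is given in
+Listing~\ref{ch2:lst:ex2}.
+
+\begin{lstlisting}
+// Sketch of a kernel replacing each element of an array by its inverse.
+__global__ void inverse(int size, double *d_x) {
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (tid < size)
+    d_x[tid] = 1.0 / d_x[tid];
+}
+\end{lstlisting}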
+
+In the main function, the beginning is very similar to the one in the previous
+example. First, the user is asked for the number of elements. Then a call
+to \texttt{cublasCreate} initializes the CUBLAS library and creates a
+handle. Then all the arrays are allocated on the host and the device, as in the
+previous example. Both arrays $A$ and $B$ are initialized. Then the CPU
+computation is performed and the time for this CPU computation is measured. In
+order to compute the same result on the GPU, first of all, the data from the CPU
+need to be copied into the memory of the GPU. For that, it is possible to use
+the CUBLAS function \texttt{cublasSetVector}. This function has several
+arguments. More precisely, the first argument is the number of elements to
+transfer, the second argument is the size of each element, the third argument is
+the source array to transfer (on the CPU), the fourth is the stride between
+consecutive elements of the source (usually this value is set to 1), the fifth
+is the destination (on the GPU), and the last is the stride between consecutive
+elements of the destination. Then we call the kernel \texttt{addition}, which
+computes the sum of all elements of arrays $A$ and $B$. The \texttt{inverse}
+kernel is called twice, once to invert the elements of array $C$ and once for
+those of $A$. Finally, we call the function \texttt{cublasDdot}, which computes
+the dot product of two vectors. To use this routine, we must specify the handle
+created by \texttt{cublasCreate}, the number of elements to consider, and then
+each vector followed by the stride between its elements. After the GPU
+computation, it is possible to check that both computations produce the same
+result.
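+
+The following fragment sketches the host-side sequence of CUBLAS calls described
+above. The variable names (\texttt{h\_A}, \texttt{d\_A}, \texttt{nbBlocks},
+etc.) are illustrative assumptions; the complete program is given in
+Listing~\ref{ch2:lst:ex2}.
+
+\begin{lstlisting}
+// Host-side sketch of the CUBLAS calls (illustrative variable names).
+cublasHandle_t handle;
+cublasCreate(&handle);                 // initialize CUBLAS and create a handle
+
+// Copy 'size' doubles from host array h_A to device array d_A,
+// with a stride of 1 between consecutive elements on both sides.
+cublasSetVector(size, sizeof(double), h_A, 1, d_A, 1);
+cublasSetVector(size, sizeof(double), h_B, 1, d_B, 1);
+
+addition<<<nbBlocks, nbThreadsPerBlock>>>(size, d_C, d_A, d_B);
+inverse<<<nbBlocks, nbThreadsPerBlock>>>(size, d_C);
+inverse<<<nbBlocks, nbThreadsPerBlock>>>(size, d_A);
+
+// Dot product of d_C and d_A (stride 1); the result is written into dot_gpu.
+double dot_gpu;
+cublasDdot(handle, size, d_C, 1, d_A, 1, &dot_gpu);
+cublasDestroy(handle);
+\end{lstlisting}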
+
+\lstinputlisting[label=ch2:lst:ex2,caption=A simple example with CUBLAS]{Chapters/chapter2/ex2.cu}
+
+\section{Third example: matrix-matrix multiplication}
+\label{ch2:3ex}
+
+
+
+Matrix-matrix multiplication is an operation which is quite easy to parallelize
+on a GPU. If a matrix is represented using a two-dimensional
+array, \texttt{A[i][j]} represents the element of the $i^{th}$ row and of the
+$j^{th}$ column. In many cases, it is easier to manipulate a 1D array instead of
+a 2D one. With CUDA, even if it is possible to manipulate 2D arrays, in the
+following we present an example based on a 1D array. For the sake of simplicity,
+we consider a square matrix of size \texttt{size}. So with a 1D
+array, \texttt{A[i*size+j]} allows us to access the element of the $i^{th}$
+row and of the $j^{th}$ column.
+
+With sequential programming, the matrix multiplication is performed using
+three loops. Suppose that $A$ and $B$ are two square matrices and that $C$ is
+the result of the product $A \times B$. The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j]
+\end{equation}
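+
+In C, this corresponds to the following triple loop, sketched here with the 1D
+storage described above; the actual CPU code is given in
+Listing~\ref{ch2:lst:ex3}.
+
+\begin{lstlisting}
+// Sequential matrix-matrix product with 1D storage (illustrative sketch).
+for (int i = 0; i < size; i++)
+  for (int j = 0; j < size; j++) {
+    double sum = 0.0;
+    for (int k = 0; k < size; k++)
+      sum += A[i * size + k] * B[k * size + j];
+    C[i * size + j] = sum;
+  }
+\end{lstlisting}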
+
+In Listing~\ref{ch2:lst:ex3}, the CPU computation performs this part of the code
+with 3 loops, one for $i$, one for $j$, and one for $k$. In order to
+perform the same computation on a GPU, a naive solution consists in splitting
+the matrix $C$ into 2-dimensional blocks. The size of each block
+must be chosen such that the number of threads per block does not exceed $1,024$.
+In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
+each dimension. The variable \texttt{nbTh} represents the number of threads per
+block. So, to compute the matrix-matrix product on a GPU, each block
+of threads is assigned to compute the result of the product for the elements of
+this block. The first step for each thread of a block is therefore to compute
+its corresponding row and column. With a 2-dimensional decomposition, \texttt{int i=
+blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding row
+and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
+column.
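+
+A kernel following this decomposition could be sketched as follows. This is a
+naive version without shared-memory blocking, with illustrative parameter names;
+the actual kernel is given in Listing~\ref{ch2:lst:ex3}.
+
+\begin{lstlisting}
+// Naive matrix-matrix multiplication sketch: one thread per element of C.
+__global__ void matmul(int size, double *d_C, double *d_A, double *d_B) {
+  int i = blockIdx.y * blockDim.y + threadIdx.y;   // row of C
+  int j = blockIdx.x * blockDim.x + threadIdx.x;   // column of C
+  if (i < size && j < size) {
+    double sum = 0.0;
+    for (int k = 0; k < size; k++)
+      sum += d_A[i * size + k] * d_B[k * size + j];
+    d_C[i * size + j] = sum;
+  }
+}
+\end{lstlisting}
+
+With \texttt{nbTh} threads in each dimension of a block, such a kernel would
+typically be launched with a 2D grid of \texttt{size/nbTh} $\times$
+\texttt{size/nbTh} blocks (assuming \texttt{size} is a multiple
+of \texttt{nbTh}).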
+
+
+On a C2070M Tesla card, this code takes $37.68$ms to perform the multiplication.
+On an Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any
+parallelization (using only one core). Consequently the speedup between the CPU
+and GPU versions is about $65$, which is very good considering the difficulty of
+parallelizing this code.
+
+\lstinputlisting[label=ch2:lst:ex3,caption=A simple matrix-matrix multiplication with CUDA]{Chapters/chapter2/ex3.cu}
+
+\putbib[Chapters/chapter2/biblio]