+cublas function \texttt{cublasSetVector}. This function several arguments. More
+precisely, the first argument represents the number of elements to transfer, the
+second arguments is the size of each elements, the third element represents the
+source of the array to transfer (in the GPU), the fourth is an offset between
+each element of the source (usually this value is set to 1), the fifth is the
+destination (in the GPU) and the last is an offset between each element of the
+destination. Then we call the kernel \texttt{addition} which computes the sum of
+all elements of arrays $A$ and $B$. The \texttt{inverse} kernel is called twice,
+once to inverse elements of array $C$ and once for $A$. Finally, we call the
+function \texttt{cublasDdot} which computes the dot product of two vectors. To
+use this routine, we must specify the handle initialized by Cuda, the number of
+elements to consider, then each vector is followed by the offset between every
+element. After the GPU computation, it is possible to check that both
+computation produce the same result.
+
+\lstinputlisting[label=ch2:lst:ex2,caption=A simple example with cublas]{Chapters/chapter2/ex2.cu}
+
+\section{Third example: matrix-matrix multiplication}
+\label{ch2:3ex}
+
+
+
+Matrix-matrix multiplication is an operation which is quite easy to parallelize
+with a GPU. If we consider that a matrix is represented using a two dimensional
+array, A[i][j] represents the the element of the $i^{th}$ row and of the
+$j^{th}$ column. In many case, it is easier to manipulate 1D array instead of 2D
+array. With Cuda, even if it is possible to manipulate 2D arrays, in the
+following we present an example based on 1D array. For sake of simplicity we
+consider we have a squared matrix of size \texttt{size}. So with a 1D
+array, \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$
+row and of the $j^{th}$ column.
+
+In sequential the matrix multiplication is performed using three loops. Supposing that $A$, $B$ represent two square matrices, the result of the multiplication of $A \times B$ is
+
+On C2070M Tesla card, this code take 37.68ms to perform the multiplication. On a
+Intel Xeon E31245 at 3.30GHz, it takes 2465ms without any parallelization (using
+only one core). Consequently the speed up between the CPU and GPU version is
+about 65 which is very good regarding the difficulty of parallelizing this code.
+
+\lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}