+CUBLAS function \texttt{cublasSetVector}. This function takes six
+arguments: the first is the number of elements to transfer, the second is the
+size of each element, the third is the source array to transfer (in the CPU),
+the fourth is the stride between consecutive elements of the source (usually
+set to 1), the fifth is the destination (in the GPU), and the last is the
+stride between consecutive elements of the destination. Then we call the
+kernel \texttt{addition}, which computes the element-wise sum of arrays $A$
+and $B$. The \texttt{inverse} kernel is called twice, once to invert the
+elements of array $C$ and once for $A$. Finally, we call the function
+\texttt{cublasDdot}, which computes the dot product of two vectors. To use
+this routine, we must specify the handle initialized by CUBLAS, the number of
+elements to consider, and then each vector followed by its stride. After the
+GPU computation, it is possible to check that both computations produce the
+same result.
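+
+As a minimal sketch of the two CUBLAS calls described above (and not the code
+of Listing~\ref{ch2:lst:ex2} itself), the following program copies two small
+arrays to the GPU and computes their dot product. The names \texttt{h\_A},
+\texttt{h\_B}, \texttt{d\_A}, \texttt{d\_B}, and \texttt{N} are illustrative
+assumptions, the \texttt{addition} and \texttt{inverse} kernels are omitted,
+and error checking is left out for brevity.
+
+\begin{lstlisting}
+#include <stdio.h>
+#include <cuda_runtime.h>
+#include <cublas_v2.h>
+
+#define N 4  /* illustrative size, not taken from the listing */
+
+int main(void) {
+  double h_A[N] = {1.0, 2.0, 3.0, 4.0};
+  double h_B[N] = {4.0, 3.0, 2.0, 1.0};
+  double *d_A, *d_B, result = 0.0;
+  cublasHandle_t handle;
+
+  cublasCreate(&handle);
+  cudaMalloc((void **)&d_A, N * sizeof(double));
+  cudaMalloc((void **)&d_B, N * sizeof(double));
+
+  /* number of elements, element size, source (CPU), source stride,
+     destination (GPU), destination stride */
+  cublasSetVector(N, sizeof(double), h_A, 1, d_A, 1);
+  cublasSetVector(N, sizeof(double), h_B, 1, d_B, 1);
+
+  /* handle, number of elements, first vector and its stride,
+     second vector and its stride, address receiving the result */
+  cublasDdot(handle, N, d_A, 1, d_B, 1, &result);
+
+  printf("dot product = %f\n", result);
+
+  cudaFree(d_A);
+  cudaFree(d_B);
+  cublasDestroy(handle);
+  return 0;
+}
+\end{lstlisting}
+
+Such a file would typically be compiled with \texttt{nvcc} and linked against
+the CUBLAS library (option \texttt{-lcublas}).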
+
+\lstinputlisting[label=ch2:lst:ex2,caption=simple example with CUBLAS]{Chapters/chapter2/ex2.cu}
+
+\section{Third example: matrix-matrix multiplication}
+\label{ch2:3ex}
+
+
+
+Matrix-matrix multiplication is an operation that is quite easy to parallelize
+on a GPU. If a matrix is represented by a two-dimensional (2D) array,
+$A[i][j]$ denotes the element in row $i$ and column $j$. In many cases, it is
+easier to manipulate a one-dimensional (1D) array than a 2D array. Although
+CUDA makes it possible to manipulate 2D arrays, the example presented below is
+based on 1D arrays. For the sake of simplicity, we consider square matrices of
+order \texttt{size}. With a 1D array, \texttt{A[i*size+j]} gives access to the
+element in row $i$ and column $j$.
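+
+This row-by-row storage can be sketched with a few lines of host code. The
+helper name \texttt{idx} and the $3 \times 3$ values are hypothetical and
+serve only as an illustration.
+
+\begin{lstlisting}
+#include <stdio.h>
+
+/* Hypothetical helper: position of element (i,j) of a square matrix
+   of order size stored row by row in a 1D array. */
+int idx(int i, int j, int size) { return i * size + j; }
+
+int main(void) {
+  const int size = 3;
+  double A[9];
+  for (int i = 0; i < size; i++)
+    for (int j = 0; j < size; j++)
+      A[idx(i, j, size)] = 10.0 * i + j;  /* value encodes row and column */
+  /* element in row 1 and column 2, stored at position 1*3+2 = 5 */
+  printf("%f\n", A[idx(1, 2, size)]);
+  return 0;
+}
+\end{lstlisting}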
+
+With sequential programming, the matrix-matrix multiplication is performed using
+three loops. We assume that $A$ and $B$ are two square matrices and that $C$
+is the result of the product $A \times B$. The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[i \times \mathit{size} + j] = \sum_{k=0}^{\mathit{size}-1} A[i \times \mathit{size} + k] \times B[k \times \mathit{size} + j]
+\end{equation}
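+
+A direct transcription of this formula into sequential code might look like
+the sketch below. The function name \texttt{matmul\_cpu}, the element type
+\texttt{double}, and the $2 \times 2$ test data are assumptions and need not
+match Listing~\ref{ch2:lst:ex3}.
+
+\begin{lstlisting}
+#include <stdio.h>
+
+/* Sequential product C = A x B of square matrices of order size,
+   all three stored row by row in 1D arrays. */
+void matmul_cpu(const double *A, const double *B, double *C, int size) {
+  for (int i = 0; i < size; i++) {
+    for (int j = 0; j < size; j++) {
+      double sum = 0.0;
+      for (int k = 0; k < size; k++)
+        sum += A[i * size + k] * B[k * size + j];
+      C[i * size + j] = sum;
+    }
+  }
+}
+
+int main(void) {
+  const int size = 2;
+  double A[4] = {1.0, 2.0, 3.0, 4.0};
+  double B[4] = {1.0, 0.0, 0.0, 1.0};  /* identity, so C equals A */
+  double C[4];
+  matmul_cpu(A, B, C, size);
+  for (int n = 0; n < size * size; n++)
+    printf("%f ", C[n]);
+  printf("\n");
+  return 0;
+}
+\end{lstlisting}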
+
+In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using three
+loops, one for $i$, one for $j$, and one for $k$. To perform the same
+computation on a GPU, a naive solution consists of splitting the matrix
+$C$ into 2-dimensional blocks. The size of each block must be chosen
+such that the number of threads per block does not exceed $1{,}024$.
+
+
+In Listing~\ref{ch2:lst:ex3}, a block contains 16 threads in each dimension;
+the variable \texttt{width} is used for that. The variable \texttt{nbTh}
+represents the number of threads per block. To compute the matrix-matrix
+product on a GPU, each block of threads is assigned the computation of the
+corresponding block of elements of $C$. The main part of the code is quite
+similar to the previous code. Arrays are allocated in the CPU and the GPU.
+Matrices $A$ and $B$ are randomly initialized. Then the arrays are transferred
+to the GPU memory with calls to \texttt{cudaMemcpy}.
+
+In the kernel, the first step for each thread of a block is to compute its row
+and column. With a 2-dimensional decomposition, \texttt{int i=
+blockIdx.y*blockDim.y+ threadIdx.y;} gives the row
+and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} gives the
+column. Then each thread computes the sum of the products of the elements of
+row $i$ of $A$ and column $j$ of $B$. In order to use a register, the
+kernel \texttt{matmul} accumulates this sum in a local variable
+called \texttt{sum}. The result is then stored at the right place in the
+matrix $C$.
+
+The CPU matrix-matrix multiplication is performed as described previously. A
+timer measures the execution time. In order to use 2-dimensional
+blocks, \texttt{dim3 dimGrid(size/width,size/width);}
+creates \texttt{size/width} blocks in each dimension.
+Likewise, \texttt{dim3 dimBlock(width,width);} creates \texttt{width} threads
+in each dimension. After that, the kernel for the matrix multiplication is
+called. At the end of the listing, the matrix $C$ computed by the GPU is
+transferred back to the CPU, and we check that the matrices $C$ computed by
+the CPU and the GPU agree to within a tolerance of $10^{-4}$.
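+
+The kernel and its launch configuration can be sketched as follows. This is an
+illustration under stated assumptions rather than the code of
+Listing~\ref{ch2:lst:ex3}: the element type \texttt{float}, the matrix order
+of 64, and the names \texttt{h\_A}, \texttt{d\_A}, and so on are hypothetical.
+The sketch also assumes that \texttt{size} is a multiple of \texttt{width}, so
+the kernel needs no bounds test, and it omits error checking and the
+comparison with the CPU result.
+
+\begin{lstlisting}
+#include <stdio.h>
+#include <stdlib.h>
+#include <cuda_runtime.h>
+
+/* Each thread computes one element of C. */
+__global__ void matmul(const float *A, const float *B, float *C, int size) {
+  int i = blockIdx.y * blockDim.y + threadIdx.y;  /* row */
+  int j = blockIdx.x * blockDim.x + threadIdx.x;  /* column */
+  float sum = 0.0f;                    /* local accumulator (register) */
+  for (int k = 0; k < size; k++)
+    sum += A[i * size + k] * B[k * size + j];
+  C[i * size + j] = sum;
+}
+
+int main(void) {
+  const int size = 64, width = 16;  /* assumed: size multiple of width */
+  size_t bytes = (size_t)size * size * sizeof(float);
+  float *h_A = (float *)malloc(bytes);
+  float *h_B = (float *)malloc(bytes);
+  float *h_C = (float *)malloc(bytes);
+  float *d_A, *d_B, *d_C;
+
+  for (int n = 0; n < size * size; n++) {  /* random initialization */
+    h_A[n] = rand() / (float)RAND_MAX;
+    h_B[n] = rand() / (float)RAND_MAX;
+  }
+
+  cudaMalloc((void **)&d_A, bytes);
+  cudaMalloc((void **)&d_B, bytes);
+  cudaMalloc((void **)&d_C, bytes);
+  cudaMemcpy(d_A, h_A, bytes, cudaMemcpyHostToDevice);
+  cudaMemcpy(d_B, h_B, bytes, cudaMemcpyHostToDevice);
+
+  dim3 dimGrid(size / width, size / width);  /* blocks per dimension */
+  dim3 dimBlock(width, width);               /* 16 x 16 = 256 threads */
+  matmul<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, size);
+
+  /* this copy waits for the kernel to finish before reading C */
+  cudaMemcpy(h_C, d_C, bytes, cudaMemcpyDeviceToHost);
+  printf("C[0] = %f\n", h_C[0]);
+
+  cudaFree(d_A); cudaFree(d_B); cudaFree(d_C);
+  free(h_A); free(h_B); free(h_C);
+  return 0;
+}
+\end{lstlisting}
+
+If \texttt{size} were not a multiple of \texttt{width}, the grid would have to
+be rounded up and the kernel would need a test such as
+\texttt{if (i < size \&\& j < size)} before accessing the arrays.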
+
+
+With $1{,}024 \times 1{,}024$ matrices, on a C2070M Tesla card, this code takes
+$37.68$~ms to perform the multiplication. With an Intel Xeon E31245 at
+$3.30$~GHz, it takes $2{,}465$~ms without any parallelization (using only one
+core). Consequently, the speedup of the GPU version over the CPU version is
+about $65$, which is very good considering how little effort is required to
+parallelize this code.
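+
+For reference, this ratio follows directly from the two measured times, where
+$T_{\mathrm{CPU}}$ and $T_{\mathrm{GPU}}$ are notations introduced here for the
+CPU and GPU execution times:
+\[
+\frac{T_{\mathrm{CPU}}}{T_{\mathrm{GPU}}}
+  = \frac{2465\,\mathrm{ms}}{37.68\,\mathrm{ms}} \approx 65.4
+\]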
+
+\lstinputlisting[label=ch2:lst:ex3,caption=simple matrix-matrix multiplication with CUDA]{Chapters/chapter2/ex3.cu}
+
+\section{Conclusion}
+In this chapter, three simple CUDA examples have been presented. As we cannot
+present all the possibilities of CUDA programming, interested readers are
+invited to consult introductory books on CUDA programming if some aspects
+remain unclear.