+With a sequential programming, the matrix multiplication is performed using
+three loops. Supposing that $A$, $B$ represent two square matrices and that the
+result of the multiplication of $A \times B$ is $C$. The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j];
+\end{equation}
+
+In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of code is
+performed using 3 loops, one for $i$, one for $j$ and one for $k$. In order to
+perform the same computation on a GPU, a naive solution consists in considering
+that the matrix $C$ is split into 2 dimensional blocks. The size of each block
+must be chosen such as the number of threads per block is inferior to $1,024$.
+
+
+In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
+each dimension, the variable \texttt{width} is used for that. The
+variable \texttt{nbTh} represents the number of threads per block. So to be able
+to compute the matrix-matrix product on a GPU, each block of threads is assigned
+to compute the result of the product for the elements of this block. The main
+part of the code is quite similar to the previous code. Arrays are allocated in
+the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then
+arrays are transfered inside the GPU memory with call to \texttt{cudaMemcpy}.
+So the first step for each thread of a block is to compute the corresponding row
+and column. With a 2 dimensional decomposition, \texttt{int i=
+blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
+and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
+column. Then each thread has to compute the sum of the product of the line of
+$A$ per the column of $B$. In order to use a register, the
+kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
+sum. Then the result is set into the matrix at the right place. The computation
+of CPU matrix-matrix multiplication is performed as described previously. A
+timer measures the time. In order to use 2 dimensional blocks, \texttt{dim3
+dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
+in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
+create \texttt{width} thread in each dimension. After that, the kernel for the
+matrix multiplication is called. At the end of the listing, the matrix $C$
+computed by the GPU is transfered back in the CPU and we check if both matrices
+C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
+
+
+On C2070M Tesla card, this code take $37.68$ms to perform the multiplication. On
+a Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any parallelization
+(using only one core). Consequently the speed up between the CPU and GPU version
+is about $65$ which is very good regarding the difficulty of parallelizing this
+code.