+column. Then each thread has to compute the sum of the product of the line of
+$A$ per the column of $B$. In order to use a register, the
+kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
+sum. Then the result is set into the matrix at the right place. The computation
+of CPU matrix-matrix multiplication is performed as described previously. A
+timer measures the time. In order to use 2 dimensional blocks, \texttt{dim3
+dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
+in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
+create \texttt{width} thread in each dimension. After that, the kernel for the
+matrix multiplication is called. At the end of the listing, the matrix $C$
+computed by the GPU is transfered back in the CPU and we check if both matrices
+C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.