-column.
-
-
-On C2070M Tesla card, this code take $37.68$ms to perform the multiplication. On
-a Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any parallelization
-(using only one core). Consequently the speed up between the CPU and GPU version
-is about $65$ which is very good regarding the difficulty of parallelizing this
-code.
+column. Then each thread has to compute the sum of the product of the line of
+$A$ by the column of $B$. In order to use a register, the
+kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
+sum. Then the result is set into the matrix at the right place. The computation
+of CPU matrix-matrix multiplication is performed as described previously. A
+timer measures the time. In order to use 2 dimensional blocks, \texttt{dim3
+dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
+in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
+create \texttt{width} thread in each dimension. After that, the kernel for the
+matrix multiplication is called. At the end of the listing, the matrix $C$
+computed by the GPU is transferred back into the CPU and we check if both matrices
+C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
+
+
+With $1,024 \times 1,024$ matrices, on a C2070M Tesla card, this code takes
+$37.68$ms to perform the multiplication. With an Intel Xeon E31245 at $3.30$GHz, it
+takes $2465$ms without any parallelization (using only one core). Consequently
+the speed up between the CPU and GPU version is about $65$ which is very good
+regarding the difficulty of parallelizing this code.