From c45fe27f00d63f76b7590e3077f0c2179d225bff Mon Sep 17 00:00:00 2001
From: Raphael Couturier
Date: Mon, 22 Oct 2012 18:00:47 +0200
Subject: [PATCH 1/1] suite

---
 BookGPU/Chapters/chapter2/ch2.tex | 35 +++++++++++++++++++++++++------
 1 file changed, 29 insertions(+), 6 deletions(-)

diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index c33ac50..e80b670 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -137,12 +137,35 @@ consider we have a squared matrix of size \texttt{size}. So with a 1D array,
 \texttt{A[i*size+j]} allows us to access to the element of the $i^{th}$ row and
 of the $j^{th}$ column.
 
-In sequential the matrix multiplication is performed using three loops. Supposing that $A$, $B$ represent two square matrices, the result of the multiplication of $A \times B$ is
-
-On C2070M Tesla card, this code take 37.68ms to perform the multiplication. On a
-Intel Xeon E31245 at 3.30GHz, it takes 2465ms without any parallelization (using
-only one core). Consequently the speed up between the CPU and GPU version is
-about 65 which is very good regarding the difficulty of parallelizing this code.
+With sequential programming, the matrix multiplication is performed using
+three loops. Suppose that $A$ and $B$ are two square matrices and that the
+result of the multiplication $A \times B$ is $C$. The
+element \texttt{C[i*size+j]} is computed as follows:
+\begin{equation}
+C[i*size+j]=\sum_{k=0}^{size-1} A[i*size+k]*B[k*size+j].
+\end{equation}
+
+In Listing~\ref{ch2:lst:ex3}, in the CPU computation, this part of the code is
+performed using 3 loops, one for $i$, one for $j$ and one for $k$. In order to
+perform the same computation on a GPU, a naive solution consists in considering
+that the matrix $C$ is split into 2-dimensional blocks. The size of each block
+must be chosen such that the number of threads per block does not exceed $1,024$.
+In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
+each dimension. The variable \texttt{nbTh} represents the number of threads per
+block. To compute the matrix-matrix product on a GPU, each block
+of threads is assigned to compute the result of the product for the elements of
+this block. The first step for each thread of a block is thus to compute its
+corresponding row and column. With a 2-dimensional decomposition, \texttt{int i=
+blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding row
+and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
+column.
+
+
+On a C2070M Tesla card, this code takes $37.68$ms to perform the multiplication. On
+an Intel Xeon E31245 at $3.30$GHz, it takes $2465$ms without any parallelization
+(using only one core). Consequently the speedup between the CPU and GPU versions
+is about $65$, which is very good considering the difficulty of parallelizing this
+code.
 
 \lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
 
-- 
2.39.5