\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
-\chapter{Presentation of the GPU architecture and of the CUDA environment}
+\chapter{Presentation of the GPU architecture and of the Cuda environment}
\label{chapter1}
\section{Introduction}\label{ch1:intro}
the development of Graphics card until they have been used in order to make
general purpose computation. Then the architecture of a GPU is
illustrated. There are many fundamental differences between a GPU and a
-tradition processor. In order to benefit from the power of a GPU, a CUDA
+tradition processor. In order to benefit from the power of a GPU, a Cuda
programmer needs to use threads. They have some particularities which enable the
-CUDA model to be efficient and scalable when some constraints are addressed.
+Cuda model to be efficient and scalable when some constraints are addressed.
\section{GPGPU}
-In order to benefit from the computing power of more recent video cards, CUDA
+In order to benefit from the computing power of more recent video cards, Cuda
was first proposed in 2007 by NVidia. It unifies the programming model for some
of their most performant video cards. Cuda~\cite{ch1:cuda} has quickly been
considered by the scientific community as a great advance for general purpose
there is a context switch that allows the CPU to run concurrent applications
and/or multi-threaded applications. Memory latencies are longer in a GPU, the
the principle to obtain a high throughput is to have many tasks to
-compute. Later we will see that those tasks are called threads with CUDA. With
+compute. Later we will see that those tasks are called threads with Cuda. With
this principle, as soon as a task is finished the next one is ready to be
executed while the wait for data for the previous task is overlapped by
computation of other tasks.
high performance architectures where different tasks are executed by different
computing units.
-\section{CUDA Multithreading}
+\section{Cuda Multithreading}
-The data parallelism of CUDA is more precisely based on the Single Instruction
+The data parallelism of Cuda is more precisely based on the Single Instruction
Multiple Thread (SIMT) model. This is due to the fact that a programmer accesses
-to the cores by the intermediate of threads. In the CUDA model, all cores
+to the cores by the intermediate of threads. In the Cuda model, all cores
execute the same set of instructions but with different data. This model has
similarities with the vector programming model proposed for vector machines through
-the 1970s into the 90s, notably the various Cray platforms. On the CUDA
+the 1970s into the 90s, notably the various Cray platforms. On the Cuda
architecture, the performance is led by the use of a huge number of threads
(from thousands up to to millions). The particularity of the model is that there
is no context switching as in CPUs and each thread has its own registers. In
``active warps'' and warps becoming temporarily inactive due to waiting of data
(as shown in Figure~\ref{ch1:fig:latency_throughput}).
-The key to scalability in the CUDA model is the use of a huge number of threads.
+The key to scalability in the Cuda model is the use of a huge number of threads.
In practice, threads are not only gathered in warps but also in thread blocks. A
thread block is executed by only one SM and it cannot migrate. The typical size of
a thread block is a number power of two (for example: 64, 128, 256 or 512).
-In this case, without changing anything inside a CUDA code, it is possible to
-run your code with a small CUDA device or the most performing Tesla CUDA cards.
+In this case, without changing anything inside a Cuda code, it is possible to
+run your code with a small Cuda device or the most performing Tesla Cuda cards.
Blocks are executed in any order depending on the number of SMs available. So
the programmer must conceive its code having this issue in mind. This
-independence between thread blocks provides the scalability of CUDA codes.
+independence between thread blocks provides the scalability of Cuda codes.
\begin{figure}[b!]
\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
example of a kernel composed of 8 thread blocks. Then this kernel is executed on
a small device containing only 2 SMs. So in this case, blocks are executed 2
-by 2 in any order. If the kernel is executed on a larger CUDA device containing
+by 2 in any order. If the kernel is executed on a larger Cuda device containing
4 SMs, blocks are executed 4 by 4 simultaneously. The execution times should be
approximately twice faster in the latter case. Of course, that depends on other
parameters that will be described later.
latency and threads. In order to design an efficient algorithm for GPU, it is
essential to have all these parameters in mind.
-%%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
-%%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf
-%%http://forum.wttsnxt.com/my_forum/viewtopic.php?f=5&t=9519
-%%http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf
-%%http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
-%%http://people.maths.ox.ac.uk/~gilesm/cuda/
-
\putbib[Chapters/chapter1/biblio]
\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
-\chapter{Introduction to CUDA}
+\chapter{Introduction to Cuda}
\label{chapter2}
\section{Introduction}
\label{ch2:intro}
-In this chapter we give some simple examples on CUDA programming. The goal is
-not to provide an exhaustive presentation of all the functionalities of CUDA but
-rather giving some basic elements. Of course, readers that do not know CUDA are
-invited to read other books that are specialized on CUDA programming (for
+In this chapter we give some simple examples on Cuda programming. The goal is
+not to provide an exhaustive presentation of all the functionalities of Cuda but
+rather giving some basic elements. Of course, readers that do not know Cuda are
+invited to read other books that are specialized on Cuda programming (for
example: \cite{ch2:Sanders:2010:CEI}).
\label{ch2:1ex}
This first example is intented to show how to build a very simple example with
-CUDA. The goal of this example is to performed the sum of two arrays and
-putting the result into a third array. A cuda program consists in a C code
-which calls CUDA kernels that are executed on a GPU. The listing of this code is
+Cuda. The goal of this example is to perform the sum of two arrays and
+put the result into a third array. A Cuda program consists in a C code
+which calls Cuda kernels that are executed on a GPU. The listing of this code is
in Listing~\ref{ch2:lst:ex1}.
As GPUs have their own memory, the first step consists in allocating memory on
-the GPU. A call to \texttt{cudaMalloc} allows to allocate memory on the GPU. The
-first parameter of this function is a pointer on a memory on the device
-(i.e. the GPU). In this example, \texttt{d\_} is added on each variable allocated
-on the GPU meaning this variable is on the GPU. The second parameter represents
-the size of the allocated variables, this size is in bits.
+the GPU. A call to \texttt{cudaMalloc}\index{Cuda~functions!cudaMalloc} allows
+to allocate memory on the GPU. The first parameter of this function is a pointer
+on a memory on the device (i.e. the GPU). In this example, \texttt{d\_} is added
+on each variable allocated on the GPU, meaning this variable is on the GPU. The
+second parameter represents the size of the allocated variables, this size is in
+bits.
In this example, we want to compare the execution time of the additions of two
arrays in CPU and GPU. So for both these operations, a timer is created to
-measure the time. CUDA proposes to manipulate timers quick easily. The first
-step is to create the timer, then to start it and at the end to stop it. For
-each of these operations a dedicated functions is used.
+measure the time. Cuda proposes to manipulate timers quite easily. The first
+step is to create the timer\index{Cuda~functions!timer}, then to start it and at
+the end to stop it. For each of these operations a dedicated functions is used.
-In order to compute the same sum with a GPU, the first step consits in
-transferring the data from the CPU (considered as the host with CUDA) to the GPU
-(considered as the device with CUDA). A call to \texttt{cudaMalloc} allows to
+In order to compute the same sum with a GPU, the first step consists in
+transferring the data from the CPU (considered as the host with Cuda) to the GPU
+(considered as the device with Cuda). A call to \texttt{cudaMemcpy} allows to
copy the content of an array allocated in the host to the device when the fourth
-parameter is set to \texttt{cudaMemcpyHostToDevice}. The first parameter of the
-function is the destination array, the second is the source array and the third
-is the number of elements to copy (exprimed in bytes).
-
-Now that the GPU contains the data needed to perform the addition. In sequential
-such addition is achieved out with a loop on all the elements. With a GPU, it
-is possible to perform the addition of all elements of the arrays in parallel
+parameter is set
+to \texttt{cudaMemcpyHostToDevice}\index{Cuda~functions!cudaMemcpy}. The first
+parameter of the function is the destination array, the second is the
+source array and the third is the number of elements to copy (exprimed in
+bytes).
+
+Now the GPU contains the data needed to perform the addition. In sequential such
+addition is achieved out with a loop on all the elements. With a GPU, it is
+possible to perform the addition of all elements of the two arrays in parallel
(if the number of blocks and threads per blocks is sufficient). In
Listing\ref{ch2:lst:ex1} at the beginning, a simple kernel,
called \texttt{addition} is defined to compute in parallel the summation of the
-two arrays. With CUDA, a kernel starts with the
-keyword \texttt{\_\_global\_\_} \index{CUDA~keywords!\_\_shared\_\_} which
+two arrays. With Cuda, a kernel starts with the
+keyword \texttt{\_\_global\_\_} \index{Cuda~keywords!\_\_shared\_\_} which
indicates that this kernel can be called from the C code. The first instruction
in this kernel is used to compute the variable \texttt{tid} which represents the
thread index. This thread index\index{thread index} is computed according to
-the values of the block index (it is a variable of CUDA
-called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
-be decomposed into 1 dimension, 2 dimensions or 3 dimensions. According to the
-dimension of data manipulated, the appropriate dimension can be useful. In our
-example, only one dimension is used. Then using notation \texttt{.x} we can
-access to the first dimension (\texttt{.y} and \texttt{.z} allow respectively to
-access to the second and third dimension). The
-variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
-block.
+the values of the block index
+(called \texttt{blockIdx} \index{Cuda~keywords!blockIdx} in Cuda) and of the
+thread index (called \texttt{blockIdx}\index{Cuda~keywords!threadIdx} in
+Cuda). Blocks of threads and thread indexes can be decomposed into 1 dimension, 2
+dimensions or 3 dimensions. According to the dimension of data manipulated, the
+appropriate dimension can be useful. In our example, only one dimension is used.
+Then using notation \texttt{.x} we can access to the first dimension
+(\texttt{.y} and \texttt{.z} allow respectively to access to the second and
+third dimension). The variable \texttt{blockDim}\index{Cuda~keywords!blockDim}
+gives the size of each block.
\section{Second example: using CUBLAS}
\label{ch2:2ex}
-The Basic Linear Algebra Subprograms (BLAS) allows programmer to use performant
-routines that are often used. Those routines are heavily used in many scientific
-applications and are very optimized for vector operations, matrix-vector
-operations and matrix-matrix
+The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient
+routines that are often required. Those routines are heavily used in many
+scientific applications and are very optimized for vector operations,
+matrix-vector operations and matrix-matrix
operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem
-to be easy to implement with CUDA. Nevertheless, as soon as a reduction is
-needed, implementing an efficient reduction routines with CUDA is far from being
+to be easy to implement with Cuda. Nevertheless, as soon as a reduction is
+needed, implementing an efficient reduction routines with Cuda is far from being
simple. Roughly speaking, a reduction operation\index{reduction~operation} is an
operation which combines all the elements of an array and extract a number
computed with all the elements. For example, a sum, a maximum or a dot product
-are reduction operations.
+are reduction operations.
In this second example, we consider that we have two vectors $A$ and $B$. First
of all, we want to compute the sum of both vectors in a vector $C$. Then we want
to compute the scalar product between $1/C$ and $1/A$. This is just an example
-which has no direct interest except to show how to program it with CUDA.
+which has no direct interest except to show how to program it with Cuda.
-Listing~\ref{ch2:lst:ex2} shows this example with CUDA. The first kernel for the
+Listing~\ref{ch2:lst:ex2} shows this example with Cuda. The first kernel for the
addition of two arrays is exactly the same as the one described in the
previous example.
perform the same computation on a GPU, a naive solution consists in considering
that the matrix $C$ is split into 2 dimensional blocks. The size of each block
must be chosen such as the number of threads per block is inferior to $1,024$.
+
+
In Listing~\ref{ch2:lst:ex3}, we consider that a block contains 16 threads in
-each dimension. The variable \texttt{nbTh} represents the number of threads per
-block. So to be able to compute the matrix-matrix product on a GPU, each block
-of threads is assigned to compute the result of the product for the elements of
-this block. So the first step for each thread of a block is to compute the
-corresponding row and column. With a 2 dimensional decomposition, \texttt{int i=
+each dimension, the variable \texttt{width} is used for that. The
+variable \texttt{nbTh} represents the number of threads per block. So to be able
+to compute the matrix-matrix product on a GPU, each block of threads is assigned
+to compute the result of the product for the elements of this block. The main
+part of the code is quite similar to the previous code. Arrays are allocated in
+the CPU and the GPU. Matrices $A$ and $B$ are randomly initialized. Then
+arrays are transfered inside the GPU memory with call to \texttt{cudaMemcpy}.
+So the first step for each thread of a block is to compute the corresponding row
+and column. With a 2 dimensional decomposition, \texttt{int i=
blockIdx.y*blockDim.y+ threadIdx.y;} allows us to compute the corresponding line
and \texttt{int j= blockIdx.x*blockDim.x+ threadIdx.x;} the corresponding
-column.
+column. Then each thread has to compute the sum of the product of the line of
+$A$ per the column of $B$. In order to use a register, the
+kernel \texttt{matmul} uses a variable called \texttt{sum} to compute the
+sum. Then the result is set into the matrix at the right place. The computation
+of CPU matrix-matrix multiplication is performed as described previously. A
+timer measures the time. In order to use 2 dimensional blocks, \texttt{dim3
+dimGrid(size/width,size/width);} allows us to create \texttt{size/width} blocks
+in each dimension. Likewise, \texttt{dim3 dimBlock(width,width);} is used to
+create \texttt{width} thread in each dimension. After that, the kernel for the
+matrix multiplication is called. At the end of the listing, the matrix $C$
+computed by the GPU is transfered back in the CPU and we check if both matrices
+C computed by the CPU and the GPU are identical with a precision of $10^{-4}$.
On C2070M Tesla card, this code take $37.68$ms to perform the multiplication. On
\lstinputlisting[label=ch2:lst:ex3,caption=simple Matrix-matrix multiplication with cuda]{Chapters/chapter2/ex3.cu}
+\section{Conclusion}
+In this chapter 3 simple Cuda examples have been presented. Those examples are
+quite simple and they cannot present all the possibilities of the Cuda
+programming. Interested readers are invited to consult Cuda programming
+introduction books if some issues regarding the Cuda programming is not clear.
+
\putbib[Chapters/chapter2/biblio]
#include "cutil_inline.h"
#include <cublas_v2.h>
-
const int width=16;
const int nbTh=width*width;
const int size=1024;
const int sizeMat=size*size;
-
-
-
__global__
void matmul(float *d_A, float *d_B, float *d_C) {
int i= blockIdx.y*blockDim.y+ threadIdx.y;
sum+=d_A[i*size+k]*d_B[k*size+j];
}
d_C[i*size+j]=sum;
-
}
-
-
-
int main( int argc, char** argv)
{
-
float *h_arrayA=(float*)malloc(sizeMat*sizeof(float));
float *h_arrayB=(float*)malloc(sizeMat*sizeof(float));
float *h_arrayC=(float*)malloc(sizeMat*sizeof(float));
cudaMalloc((void**)&d_arrayB,sizeMat*sizeof(float));
cudaMalloc((void**)&d_arrayC,sizeMat*sizeof(float));
-
srand48(32);
-
for(int i=0;i<sizeMat;i++) {
h_arrayA[i]=drand48();
h_arrayB[i]=drand48();
cudaMemcpy(d_arrayB,h_arrayB, sizeMat * sizeof(float), cudaMemcpyHostToDevice);
cudaMemcpy(d_arrayC,h_arrayC, sizeMat * sizeof(float), cudaMemcpyHostToDevice);
-
unsigned int timer_cpu = 0;
cutilCheckError(cutCreateTimer(&timer_cpu));
cutilCheckError(cutStartTimer(timer_cpu));
printf("CPU processing time : %f (ms) \n", cutGetTimerValue(timer_cpu));
cutDeleteTimer(timer_cpu);
-
-
-
unsigned int timer_gpu = 0;
cutilCheckError(cutCreateTimer(&timer_gpu));
cutilCheckError(cutStartTimer(timer_gpu));
-
-
dim3 dimGrid(size/width,size/width);
dim3 dimBlock(width,width);
- printf("%d %d\n",dimGrid.x,dimBlock.x);
-
matmul<<<dimGrid,dimBlock>>>(d_arrayA,d_arrayB,d_arrayC);
cudaThreadSynchronize();
cudaMemcpy(h_arrayCgpu,d_arrayC, sizeMat * sizeof(float), cudaMemcpyDeviceToHost);
- int good=1;
for(int i=0;i<sizeMat;i++)
if (fabs(h_arrayC[i]-h_arrayCgpu[i])>1e-4)
printf("%f %f\n",h_arrayC[i],h_arrayCgpu[i]);
-
cudaFree(d_arrayA);
cudaFree(d_arrayB);
cudaFree(d_arrayC);
free(h_arrayB);
free(h_arrayC);
free(h_arrayCgpu);
-
return 0;
-
}