This chapter introduces the Graphics Processing Unit (GPU) architecture and all
the concepts needed to understand how GPUs work and can be used to speed up the
execution of some algorithms. First, this chapter gives a brief history of
-the development of the graphics cards up to the point when they started being used in order to make
-general purpose computation. Then the architecture of a GPU is
-illustrated. There are many fundamental differences between a GPU and a
-tradition processor. In order to benefit from the power of a GPU, a CUDA
+the development of graphics cards up to the point when they started being
+used to perform general-purpose computations. Then the architecture of
+a GPU is illustrated. There are many fundamental differences between a GPU and
+a traditional processor. In order to benefit from the power of a GPU, a CUDA
programmer needs to use threads. They have some particularities which enable the
CUDA model to be efficient and scalable when some constraints are addressed.
\section{Brief history of the video card}
Video cards or graphics cards have been introduced in personal computers to
repetitive and very specific. Hence, some manufacturers have produced more and
more sophisticated video cards, providing 2D accelerations, then 3D accelerations,
then some light transforms. Video cards have their own memory to perform their
may be more expensive than a CPU.
Since 2000, video cards have allowed users to apply arithmetic operations
Some researchers tried to apply those operations on other data, representing
something different from pixels, and consequently this resulted in the first
model was not easy to use at all and was very dependent on the hardware
constraints. More precisely, it consisted of using either DirectX or OpenGL
functions providing an interface to some classical operations for videos
In order to benefit from the computing power of more recent video cards, CUDA
was first proposed in 2007 by NVIDIA. It unifies the programming model for some
considered by the scientific community as a great advance for general purpose
graphics processing unit (GPGPU) computing. Of course other programming models
have been proposed. The other well-known alternative is OpenCL which aims at
-traditional CPUs. The main drawback is that it is less tight with the hardware
-and consequently sometimes provides less efficient programs. Moreover, CUDA
+traditional CPUs. The main drawback is that it is less close to the hardware
+and consequently it sometimes provides less efficient programs. Moreover, CUDA
-known environments have been proposed, but most of them have been discontinued, for
-example we can cite, FireStream by ATI which is not maintained anymore and
-has been replaced by OpenCL, BrookGPU by Standford University~\cite{ch1:Buck:2004:BGS}.
-Another environment based on pragma (insertion of pragma directives inside the
-code to help the compiler to generate efficient code) is called OpenACC. For a
+known environments have been proposed, but most of them have been discontinued,
+such as FireStream by ATI, which is no longer maintained and has been replaced by
+OpenCL, and BrookGPU by Stanford University~\cite{ch1:Buck:2004:BGS}. Another
+environment based on pragmas (insertion of pragma directives inside the code to
+help the compiler to generate efficient code) is called OpenACC. For a
comparison with OpenCL, interested readers may refer to~\cite{ch1:Dongarra}.
\section{Architecture of current GPUs}
evolving. Nevertheless some trends remain constant throughout this evolution.
Processing units composing a GPU are far simpler than a traditional CPU and
it is much easier to integrate many computing units inside a GPU card than to do
inside a GPU have 32 cores. Later we will see that these 32 cores need to do the
same work to get maximum performance.
\centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
\caption{Comparison of number of cores in a CPU and in a GPU.}
%[Comparison of number of cores in a CPU and in a GPU]
On the most powerful GPU cards, called Fermi, multiprocessors are called streaming
multiprocessors (SMs). Each SM contains 32 cores and is able to perform 32
-floating points or integer operations per clock on 32 bit numbers or 16 floating
-points per clock on 64 bit numbers. SMs have their own registers, execution
+floating-point or integer operations per clock on 32-bit numbers or 16
+floating-point operations per clock on 64-bit numbers. SMs have their own registers, execution
decide what amounts of shared memory and L1 cache an SM is to use. The constraint is
that the sum of both amounts should be less than or equal to 64~KB.
threads are different from traditional threads for a CPU. In
Chapter~\ref{chapter2}, some examples of GPU programming will explain the
details of the GPU threads. Threads are gathered into blocks of 32
performance optimizations such as speculative execution which roughly speaking
consists of executing a small part of the code in advance even if later this work
reveals itself to be useless. GPUs do not have low latency
architecture of GPUs is optimized for throughput computation and it takes into
account the memory latency.
Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of
memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
by one with a short memory latency to get the data to process. After some tasks,
there is a context switch that allows the CPU to run concurrent applications
principle to obtain a high throughput is to have many tasks to
compute. Later we will see that these tasks are called threads with CUDA. With
this principle, as soon as a task is finished the next one is ready to be
-executed while the wait for data for the previous task is overlapped by
-computation of other tasks. {\bf HERE}
+executed while the wait for data for the previous task is overlapped by the
+computation of other tasks.
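
The effect of overlapping data waits with computation can be sketched with a
toy timing model. It is written in plain Python rather than CUDA, and it is
only an illustration under strong assumptions: every task has the same compute
time and the same memory wait, all data requests can be in flight at once, and
a single unit performs the computations. The function names
\texttt{serial\_time} and \texttt{overlapped\_time} and all the numbers are
hypothetical.

```python
# Toy timing model for latency hiding; every number here is illustrative.


def serial_time(n_tasks: int, compute: int, wait: int) -> int:
    """Each task first waits for its data, then computes; nothing overlaps."""
    return n_tasks * (wait + compute)


def overlapped_time(n_tasks: int, compute: int, wait: int) -> int:
    """All data requests are issued at time 0 and are served concurrently.

    Under that assumption every task has its data at time `wait`, so the
    compute unit idles once for `wait` and then runs the tasks back to back.
    """
    if n_tasks == 0:
        return 0
    return wait + n_tasks * compute


if __name__ == "__main__":
    # 100 tasks, 4 time units of computation each, 400 time units of memory wait.
    print(serial_time(100, 4, 400))      # 40400
    print(overlapped_time(100, 4, 400))  # 800
```

In this model the memory wait is paid once instead of once per task, which is
why a throughput-oriented design needs many tasks ready to run.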
Task parallelism is the common parallelism achieved on clusters and grids and
high performance architectures where different tasks are executed by different
computing units.
\section{CUDA multithreading}
The data parallelism of CUDA is more precisely based on the Single Instruction
active warps and warps becoming temporarily inactive while waiting for data
(as shown in Figure~\ref{ch1:fig:latency_throughput}).
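
As a small worked example of the grouping of threads into warps, the following
plain Python sketch (not CUDA code) computes how many warps of 32 threads a
group of threads occupies and how many thread slots of the last warp stay
unused. The helper names \texttt{warps\_needed} and \texttt{idle\_slots} are
hypothetical and only serve the illustration.

```python
WARP_SIZE = 32  # threads per warp, as stated in the text


def warps_needed(n_threads: int) -> int:
    """Number of warps required to hold n_threads (rounded up)."""
    return (n_threads + WARP_SIZE - 1) // WARP_SIZE


def idle_slots(n_threads: int) -> int:
    """Thread slots left unused in the last, partially filled warp."""
    return warps_needed(n_threads) * WARP_SIZE - n_threads


if __name__ == "__main__":
    print(warps_needed(256), idle_slots(256))  # 8 0
    print(warps_needed(100), idle_slots(100))  # 4 28
```

A thread count that is a multiple of 32 leaves no slot unused, which is one
reason such sizes are a common choice.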
The key to scalability in the CUDA model is the use of a huge number of threads.
In practice, threads are gathered not only in warps but also in thread blocks. A
-by the threads of a GPU. When the problem considered is a two-dimensional or three-dimensional problem, it is possible to group thread blocks into a grid. In
-practice, the number of thread blocks and the size of thread blocks are given as
-parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
+by the threads of a GPU. When the problem considered is a two-dimensional or
+three-dimensional problem, it is possible to group thread blocks into a grid.
+In practice, the number of thread blocks and the size of thread blocks are given
+as parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
-a small device containing only 2 SMs. {\bf RELIRE} So in this case, blocks are executed 2
-by 2 in any order. If the kernel is executed on a larger CUDA device containing
-4 SMs, blocks are executed 4 by 4 simultaneously. The execution times should be
-approximately twice faster in the latter case. Of course, that depends on other
+a small device containing only 2 SMs. So in this case, blocks are executed 2 by
+2 in any order. If the kernel is executed on a larger CUDA device containing 4
+SMs, blocks are executed 4 by 4 simultaneously. The execution should be
+approximately twice as fast in the latter case. Of course, that depends on other
-{\bf RELIRE}
-Thread blocks provide a way to cooperation in the sense that threads of the same
+
+\begin{figure}[t!]
+\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
+\caption{Scalability of GPU.}
+\label{ch1:fig:scalability}
+\end{figure}
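
The scaling argument can be checked with a few lines of arithmetic. The sketch
below is plain Python, not CUDA, and rests on simplifying assumptions: each SM
runs one thread block at a time, all blocks take the same time, and the kernel
is launched with 8 thread blocks (an illustrative value). The function names
\texttt{blocks\_needed} and \texttt{waves} are hypothetical.

```python
def blocks_needed(n_elements: int, threads_per_block: int) -> int:
    """Thread blocks required when each thread handles one element."""
    return (n_elements + threads_per_block - 1) // threads_per_block


def waves(n_blocks: int, n_sms: int) -> int:
    """Rounds of execution when each SM runs one block at a time."""
    return (n_blocks + n_sms - 1) // n_sms


if __name__ == "__main__":
    n_blocks = 8  # illustrative kernel launch
    print(waves(n_blocks, 2))  # 4 rounds on a device with 2 SMs
    print(waves(n_blocks, 4))  # 2 rounds on a device with 4 SMs
    print(blocks_needed(1000, 256))  # 4 blocks cover 1000 elements
```

Halving the number of rounds is what makes the larger device roughly twice as
fast in this idealized setting. The same kernel launch runs unchanged on both
devices.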
+
+Thread blocks provide a way to cooperate in the sense that threads of the same
block cooperatively load and store blocks of memory they all
use. Synchronizations of threads in the same block are possible (but not between
threads of different blocks). Threads of the same block can also share results
-The memory hierarchy of GPUs\index{memory~hierarchy} is different from that of CPUs. In practice, there are registers\index{memory~hierarchy!registers}, local
-memory\index{memory~hierarchy!local~memory}, shared
-memory\index{memory~hierarchy!shared~memory}, cache
-memory\index{memory~hierarchy!cache~memory}, and global
-memory\index{memory~hierarchy!global~memory}.
+The memory hierarchy of GPUs\index{memory hierarchy} is different from that of CPUs. In practice, there are registers\index{memory hierarchy!registers}, local
+memory\index{memory hierarchy!local memory}, shared
+memory\index{memory hierarchy!shared memory}, cache
+memory\index{memory hierarchy!cache memory}, and global
+memory\index{memory hierarchy!global memory}.
Likewise each thread can access local memory which, in practice, is much slower
than registers. Local memory is automatically used by the compiler when all the
\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
\caption{Memory hierarchy of a GPU.}
\label{ch1:fig:memory_hierarchy}
used very frequently, then threads can access it for their computation. Threads
can obviously change the content of this shared memory either with computation
or by loading other data and they can store its content in the global memory. So
obviously requires an effort from the programmer.
On recent cards, the programmer may decide what amount of cache memory and
the shared memory of that block. The cache memory is not represented here but it
is local to a multiprocessor. Then each block can access the global memory of the
GPU.