1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
4 \chapter{Presentation of the GPU architecture and of the CUDA environment}
7 \section{Introduction}\label{ch1:intro}
8 This chapter introduces the Graphics Processing Unit (GPU) architecture and all
9 the concepts needed to understand how GPUs work and can be used to speed up the
10 execution of some algorithms. First of all this chapter gives a brief history of
11 the development of the graphics cards up to the point when they started being used in order to make
12 general purpose computation. Then the architecture of a GPU is
13 illustrated. There are many fundamental differences between a GPU and a
14 tradition processor. In order to benefit from the power of a GPU, a CUDA
15 programmer needs to use threads. They have some particularities which enable the
16 CUDA model to be efficient and scalable when some constraints are addressed.
20 \section{Brief history of the video card}
Video cards or graphics cards were introduced in personal computers to produce high quality graphics faster than classical Central Processing Units (CPUs) and to free the CPU from this task. In general, display tasks are very repetitive and very specific. Hence, some manufacturers have produced more and more sophisticated video cards, providing 2D acceleration, then 3D acceleration, then transform and lighting operations. Video cards have their own memory to perform their computations. For at least two decades, every personal computer has had a video card, either a simple one in desktop computers or one providing many accelerations in game- and graphics-oriented computers. In the latter case, the graphics card may be more expensive than the CPU.
33 Since 2000, video cards have allowed users to apply arithmetic operations
34 simultaneously on a sequence of pixels, later called stream processing. In
this case, the pixel information (color, location, and other attributes) is combined in order to produce a pixel color that can be displayed on a screen.
37 Simultaneous computations are provided by shaders which calculate rendering
38 effects on graphics hardware with a high degree of flexibility. These shaders
39 handle the stream data with pipelines.
Some researchers tried to apply those operations to other data, representing something other than pixels, and this resulted in the first uses of video cards for general-purpose computation. The programming model was not at all easy to use and was very dependent on hardware constraints. More precisely, it consisted in using either DirectX or OpenGL functions, which provide an interface to classical video operations (memory transfers, texture manipulation, etc.). Floating point operations were most of the time out of the question. Obviously, when something went wrong, programmers had no way (and no tools) to detect it.
54 In order to benefit from the computing power of more recent video cards, CUDA
55 was first proposed in 2007 by NVIDIA. It unifies the programming model for some
of their most efficient video cards. CUDA~\cite{ch1:cuda} was quickly considered by the scientific community as a great advance for general purpose graphics processing unit (GPGPU) computing. Of course, other programming models have been proposed. The other well-known one is OpenCL, which aims at providing an alternative to CUDA and which is multiplatform and portable. This is a great advantage since it is even possible to execute OpenCL programs on traditional CPUs. The main drawback is that it is less closely tied to the hardware and consequently sometimes yields less efficient programs. Moreover, CUDA benefits from more mature compilation and optimization procedures. Other less-known environments have been proposed, but most of them have been discontinued; for example, FireStream by ATI, which is not maintained anymore and has been replaced by OpenCL, and BrookGPU by Stanford University~\cite{ch1:Buck:2004:BGS}. Another environment, based on pragmas (directives inserted inside the code to help the compiler generate efficient code), is called OpenACC. For a comparison with OpenCL, interested readers may refer to~\cite{ch1:Dongarra}.
74 \section{Architecture of current GPUs}
76 The architecture \index{GPU!architecture of a} of current GPUs is constantly
77 evolving. Nevertheless some trends remain constant throughout this evolution.
The processing units composing a GPU are far simpler than those of a traditional CPU, and it is much easier to integrate many computing units inside a GPU card than to do
80 so with many cores inside a CPU. In 2012, the most powerful GPUs contained more than 500
81 cores and the most powerful CPUs had 8
82 cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows the number of cores inside
83 a CPU and inside a GPU. In fact, in a current NVIDIA GPU, there are
84 multiprocessors which have 32 cores (for example, on Fermi cards). The core clock
of a CPU is generally around 3GHz, whereas that of a GPU is about 1.5GHz. Although the clock of a GPU core is slower, the number of cores inside a GPU provides more computational power. This power is commonly measured in floating point operations per second. Nowadays the most powerful GPUs provide more than 1TFlops, i.e., $10^{12}$ floating point operations per second. Nevertheless, GPUs are efficient mainly at executing repetitive work in which only the data change. It is important to keep in mind that multiprocessors inside a GPU have 32 cores. Later we will see that these 32 cores need to do the same work to get maximum performance.
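As an illustration of these figures, the following sketch (not part of the original text; it relies only on the standard CUDA runtime call \texttt{cudaGetDeviceProperties}) queries a card for its number of multiprocessors and its core clock, reported in kHz. The number of cores per multiprocessor depends on the architecture (32 on Fermi cards).

\begin{verbatim}
/* Minimal sketch: query the number of multiprocessors and the core
   clock of the first CUDA device (device 0).                        */
#include <stdio.h>
#include <cuda_runtime.h>

int main(void) {
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);
  printf("multiprocessors: %d\n", prop.multiProcessorCount);
  printf("core clock     : %d kHz\n", prop.clockRate);
  return 0;
}
\end{verbatim}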
96 \centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
97 \caption{Comparison of number of cores in a CPU and in a GPU.}
98 %[Comparison of number of cores in a CPU and in a GPU]
99 \label{ch1:fig:comparison_cpu_gpu}
On the most powerful GPU cards, based on the Fermi architecture, multiprocessors are called streaming multiprocessors (SMs). Each SM contains 32 cores and is able to perform 32 floating point or integer operations per clock cycle on 32-bit numbers, or 16 floating point operations per clock cycle on 64-bit numbers. SMs have their own registers, execution pipelines, and caches. On the Fermi architecture, each SM has 64KB of memory shared between the shared memory and the L1 cache, and 32,768 32-bit registers. More precisely, the programmer can decide how much of this memory is used as shared memory and how much as L1 cache. The constraint is that the sum of both amounts must be less than or equal to 64KB.
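As a simple sketch (one possible way of expressing this choice, not a requirement of the model), the CUDA runtime lets the host code state a preference for this split for the whole device:

\begin{verbatim}
/* Host code sketch: select how the 64KB of a Fermi SM are split
   between shared memory and L1 cache for the whole device.       */
cudaDeviceSetCacheConfig(cudaFuncCachePreferShared); /* 48KB shared, 16KB L1 */
/* cudaDeviceSetCacheConfig(cudaFuncCachePreferL1);     16KB shared, 48KB L1 */
\end{verbatim}

A per-kernel variant of this call is shown in the section on the memory hierarchy.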
111 Threads are used to benefit from the large number of cores of a GPU. These
112 threads are different from traditional threads for a CPU. In
113 Chapter~\ref{chapter2}, some examples of GPU programming will explain the
details of GPU threads. Threads are gathered into groups of 32 threads, called ``warps''. These warps are important when designing an algorithm
119 Another big difference between a CPU and a GPU is the latency of memory. In a CPU,
120 everything is optimized to obtain a low latency architecture. This is possible
121 through the use of cache memories. Moreover, nowadays CPUs carry out many
performance optimizations such as speculative execution, which, roughly speaking, consists of executing a small part of the code in advance even if this work later turns out to be useless. GPUs do not have low latency memory; in comparison, they have only small cache memories. Nevertheless, the architecture of GPUs is optimized for throughput computation and takes memory latency into account.
132 \centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
133 \caption{Comparison of low latency of a CPU and high throughput of a GPU.}
134 \label{ch1:fig:latency_throughput}
Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference in memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
139 by one with a short memory latency to get the data to process. After some tasks,
140 there is a context switch that allows the CPU to run concurrent applications
and/or multi-threaded applications. Memory latencies are longer in a GPU. The principle used to obtain a high throughput is to have many tasks to compute. Later we will see that with CUDA these tasks are called threads. With this principle, as soon as a task is finished, the next one is ready to be executed, while the wait for data of the previous task is overlapped by the computation of other tasks.
150 \section{Kinds of parallelism}
152 Many kinds of parallelism are available according to the type of hardware.
153 Roughly speaking, there are three classes of parallelism: instruction-level
154 parallelism, data parallelism, and task parallelism.
156 Instruction-level parallelism consists in reordering some instructions in order
157 to execute some of them in parallel without changing the result of the code.
158 In modern CPUs, instruction pipelines allow the processor to execute instructions
faster. With a pipeline, a processor can execute multiple instructions simultaneously because the output of a task is the input of the
163 Data parallelism consists in executing the same program with different data on
164 different computing units. Of course, no dependency should exist among the
data. For example, it is easy to parallelize loops without dependencies using the data parallelism paradigm. This paradigm is linked with the Single Instruction Multiple Data (SIMD) architecture. This is the kind of parallelism provided by
Task parallelism is the common parallelism achieved on clusters, grids, and other high performance architectures where different tasks are executed by different
174 \section{CUDA multithreading}
176 The data parallelism of CUDA is more precisely based on the Single Instruction
Multiple Thread (SIMT) model, because a programmer accesses the cores through threads. In the CUDA model, all cores execute the same set of instructions but with different data. This model has similarities with the vector programming model proposed for vector machines through the 1970s and into the 90s, notably the various Cray platforms. On the CUDA architecture, performance is driven by the use of a huge number of threads (from thousands up to millions). A particularity of the model is that there is no context switching as in CPUs; each thread has its own registers. In practice, threads are executed by SMs and are gathered into groups of 32 threads, called warps. Each SM alternates between executing active warps and setting aside warps that have become temporarily inactive while waiting for data (as shown in Figure~\ref{ch1:fig:latency_throughput}).
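To make the SIMT model concrete, the following minimal kernel (an illustrative sketch, not an excerpt from a later chapter) is executed by a large number of threads; all of them run the same instructions, but each one works on the element selected by its own thread and block indices.

\begin{verbatim}
/* Sketch of a SIMT kernel: the same instructions for all threads,
   different data selected through the thread index.               */
__global__ void scale(float *a, float coef, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x; /* global thread index */
  if (i < n)              /* guard threads beyond the end of the array  */
    a[i] = coef * a[i];
}
\end{verbatim}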
191 \centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
192 \caption{Scalability of GPU.}
193 \label{ch1:fig:scalability}
196 The key to scalability in the CUDA model is the use of a huge number of threads.
In practice, threads are gathered not only into warps but also into thread blocks. A thread block is executed by only one SM and it cannot migrate. The typical size of a thread block is a power of two (for example, 64, 128, 256, or 512).
In this case, without changing anything inside a CUDA code, it is possible to run it either on a small CUDA device or on the best performing Tesla CUDA cards. Blocks are executed in any order depending on the number of SMs available, so the programmer must design the code with this issue in mind. This independence between thread blocks provides the scalability of CUDA codes.
212 A kernel is a function which contains a block of instructions that are executed
by the threads of a GPU. When the problem considered is two- or three-dimensional, it is possible to group thread blocks into a grid. In
214 practice, the number of thread blocks and the size of thread blocks are given as
215 parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
216 example of a kernel composed of 8 thread blocks. Then this kernel is executed on
217 a small device containing only 2 SMs. So in this case, blocks are executed 2
218 by 2 in any order. If the kernel is executed on a larger CUDA device containing
4 SMs, blocks are executed 4 by 4 simultaneously. The execution should be approximately twice as fast in the latter case. Of course, that depends on other parameters that will be described later (in this chapter and other chapters).
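As a hedged illustration of this example (reusing the hypothetical kernel \texttt{scale} sketched earlier; \texttt{d\_a} denotes an array previously allocated on the GPU), the number of blocks and the number of threads per block are passed at launch time between \texttt{<<<} and \texttt{>>>}; the same launch runs unchanged on a device with 2 SMs or with 4 SMs.

\begin{verbatim}
/* Host code sketch: launch 8 thread blocks of 256 threads each; the
   blocks are scheduled 2 by 2 on a device with 2 SMs, 4 by 4 with 4 SMs. */
int  n = 8 * 256;
dim3 block(256);                      /* threads per block            */
dim3 grid(8);                         /* number of blocks in the grid */
scale<<<grid, block>>>(d_a, 2.0f, n);
\end{verbatim}

For a two-dimensional problem, two-dimensional blocks and grids can be declared in the same way, for example \texttt{dim3 block(16,16)}.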
224 Thread blocks provide a way to cooperate in the sense that threads of the same
225 block cooperatively load and store blocks of memory they all
use. Synchronization of threads within the same block is possible (but not between threads of different blocks). Threads of the same block can also share results
228 in order to compute a single result. In Chapter~\ref{chapter2}, some examples
232 \section{Memory hierarchy}
234 The memory hierarchy of GPUs\index{memory hierarchy} is different from that of CPUs. In practice, there are registers\index{memory hierarchy!registers}, local
235 memory\index{memory hierarchy!local memory}, shared
236 memory\index{memory hierarchy!shared memory}, cache
237 memory\index{memory hierarchy!cache memory}, and global
238 memory\index{memory hierarchy!global memory}.
As previously mentioned, each thread can access its own registers. It is important to keep in mind that the number of registers per block is limited. On recent cards, this number is limited to 64K registers per SM. Access to registers is very fast, so it is a good idea to use them whenever possible.
Likewise, each thread can access local memory which, in practice, is much slower
247 than registers. Local memory is automatically used by the compiler when all the
248 registers are occupied. So the best idea is to optimize the use of registers
249 even if this involves reducing the number of threads per block.
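One hedged way of doing so (a sketch only; the limit of 128 threads is arbitrary) is the \texttt{\_\_launch\_bounds\_\_} qualifier, which tells the compiler that the kernel will never be launched with more than a given number of threads per block so that it can budget registers accordingly; the \texttt{nvcc} option \texttt{--maxrregcount} plays a similar role at compilation time.

\begin{verbatim}
/* Sketch: bound the block size so that the compiler can keep register
   usage under control; otherwise spilling to slow local memory may occur. */
__global__ void __launch_bounds__(128)
square(float *data, int n) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] = data[i] * data[i];
}
\end{verbatim}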
251 \begin{figure}[hbtp!]
252 \centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
253 \caption{Memory hierarchy of a GPU.}
254 \label{ch1:fig:memory_hierarchy}
259 Shared memory allows cooperation between threads of the same block. This kind
260 of memory is fast but it needs to be manipulated manually and its size is
limited. It is accessible during the execution of a kernel. So the idea is to fill the shared memory at the start of the kernel with global data that are used very frequently; then the threads can access it for their computations. Threads can obviously change the content of this shared memory, either through computation or by loading other data, and they can store its content in the global memory. Shared memory can thus be seen as a cache memory that is managed manually. This obviously requires an effort from the programmer.
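The following sketch (illustrative only; it assumes the array length is a multiple of the block size) shows this pattern: each thread of a block loads one element into shared memory, the block synchronizes, and then every thread reuses data loaded by the other threads of its block.

\begin{verbatim}
#define TILE 256  /* must match the block size used at launch time */

/* Sketch: reverse each tile of the input array using shared memory. */
__global__ void reverse_tile(float *in, float *out) {
  __shared__ float tile[TILE];                 /* fast, per-block memory       */
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  tile[threadIdx.x] = in[i];                   /* cooperative load             */
  __syncthreads();                             /* wait for the whole block     */
  out[i] = tile[blockDim.x - 1 - threadIdx.x]; /* data loaded by other threads */
}
\end{verbatim}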
269 On recent cards, the programmer may decide what amount of cache memory and
270 shared memory is attributed to a kernel. The cache memory is an L1 cache which is
directly managed by the GPU. Sometimes, this cache provides very efficient results and sometimes the use of shared memory is a better solution.
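The same preference can also be expressed kernel by kernel, as in the following sketch (\texttt{reverse\_tile} is the hypothetical kernel sketched above):

\begin{verbatim}
/* Host code sketch: favor shared memory over L1 cache for one kernel. */
cudaFuncSetCacheConfig(reverse_tile, cudaFuncCachePreferShared);
\end{verbatim}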
277 Figure~\ref{ch1:fig:memory_hierarchy} illustrates the memory hierarchy of a
GPU. Threads are represented at the top of the figure. They have access to their own registers and to their local memory. Threads of the same block can access
280 the shared memory of that block. The cache memory is not represented here but it
281 is local to a thread. Then each block can access the global memory of the
In this chapter, a brief presentation of the video card and of how it later came to be used to perform general-purpose computation has been given. The architecture of a GPU has been illustrated, focusing on the particularities of GPUs in terms of parallelism, memory latency, and threads. In order to design an efficient algorithm for a GPU, it is essential to keep all these characteristics in mind.
293 \putbib[Chapters/chapter1/biblio]