1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
4 \chapter{Presentation of the GPU architecture and of the CUDA environment}
7 \section{Introduction}\label{ch1:intro}
9 This chapter introduces the Graphics Processing Unit (GPU) architecture and all
10 the concepts needed to understand how GPUs work and can be used to speed up the
11 execution of some algorithms. First of all this chapter gives a brief history of
12 the development of Graphics card until they can be used in order to make general
13 purpose computation. Then the
17 \section{Brief history of Video Card}
19 Video card or Graphics card have been introduced in personnal computers to
20 produce high quality graphics faster than classical Central Processing Unit
21 (CPU) and to alleviate CPU from this task. In general, display tasks are very
22 repetitive and very specific. Hence, some manufacturers have produced more and
23 more sofisticated video cards, providing 2D accelerations then 3D accelerations,
24 then some light transforms. Video cards own their own memory to perform their
25 computation. From at least two dedaces, every personnal computer has a video
26 card which is simple for desktop computers or which provides many accelerations
27 for game and/or graphic oriented computers. In the latter case, graphic cards
28 may be more expensive than a CPU.
30 After 2000, video cards allowed to apply arithmetics operations simulatenously
31 on a sequence of pixels, also later called stream processing. In this case,
32 information of the pixels (color, location and other information) are combined
33 in order to produce a pixel color that can be displayed on a
34 screen. Simultaneous computations are provided by shaders which calculate
35 rendering effects on graphics hardware with a high degree of flexibility. These
36 shaders handles the stream data with pipelines.
39 Some reasearchers tried to apply those operations on other data, representing
40 something different from pixels, and consequently this resulted in the first
41 uses of video cards for performing general purpose computation. The programming
42 model was not easy to use at all and was very dependent of the hardware
43 constraints. More precisely it consisted in using either DirectX of OpenGL
44 functions providing an interface to some classical operations for videos
45 operations (memory transfers, texture manipulation, ...). Floating point
46 operations were most of the time unimaginable. Obviously when something bad
47 happened, programmers had no way (and tools) to detect it.
51 In order to benefit from the computing power of more recent video cards, CUDA
52 was first proposed in 2007 by NVidia. It unifies the programming model for some
53 of their most performant video cards. Cuda~\cite{ch1:cuda} has quickly been
54 considered by the scientific community as a great advance for general purpose
55 graphics processing unit (GPGPU) computing. Of course other programming model
56 have been proposed. The other well-known alternative is OpenCL which aims at
57 proposing an alternative to Cuda and which is multi-platform and portable. This
58 is a great advantage since it is even possible to execute OpenCL programs on
59 traditionnal CPUs. The main drawbacks is that it is less tight with the
60 hardware and consequently provides sometimes less efficient programs. Moreover,
61 Cuda benefits from more mature compilation and optimization procedures. Other
62 less known environment have been proposed, but most of them have been stopped,
63 for example we can cited: FireStream by ATI which is not maintened anymore and
64 replaced by OpenCL, BrookGPU by Standford University~\cite{ch1:Buck:2004:BGS}.
65 Another environment based on pragma (insertion of pragma directives inside the
66 code to help the compiler to generate efficient code) is call OpenACC. For a
67 comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}.
71 \section{Architecture of current GPUs}
73 Architecture \index{Architecture of a GPU} of current GPUs is constantly
74 evolving. Nevertheless some trends remains true through this
75 evolution. Processing units composing a GPU are far more simpler than a
76 traditional CPU but it is much easier to integrate many computing units inside a
77 GPU card than many cores inside a CPU. This is due to the fact that cores of a
78 GPU are simpler than cores of a CPU. In 2012, the most powerful GPUs own more
79 than 500 cores and the most powerful CPUs have 8
80 cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows the number of cores inside
81 a CPU and inside a GPU. In fact, in a current NVidia GPU, there are
82 multiprocessors which have 32 cores (for example on Fermi cards). The core clock
83 of CPU is generally around 3GHz and the one of GPU is about 1.5GHz. Although the
84 core clock of GPU cores is slower, the amount of cores inside a GPU provides
85 more computational power. This measure is commonly represented by the number of
86 floating point operation per seconds. Nowadays most powerful GPUs provide more
87 than 1TFlops, i.e. $10^{12}$ floating point operations per second. Nevertheless
88 GPUs are very efficient to perform some operations but not all kinds of
89 operations. They are very efficient to execute repetitive work in which only the
90 data change. It is important to keep in mind that multiprocessors inside a GPU
91 have 32 cores. Later we will see that these 32 cores need to do the same work to
92 get maximum performance.
95 \centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
96 \caption[Comparison of number of cores in a CPU and in a GPU]{Comparison of number of cores in a CPU and in a GPU.}
97 \label{ch1:fig:comparison_cpu_gpu}
100 On most powerful GPU cards, called Fermi, multiprocessors are called streaming
101 multiprocessors (SM). Each SM contains 32 cores and is able to perform 32
102 floating point or integer operations on 32bits numbers per clock or 16 floating
103 point on 64bits number per clock. SMs have their own registers, execution
104 pipelines and caches. On Fermi architecture, there are 64Kb shared memory + L1
105 cache and 32,536 32bits registers per SM. More precisely the programmer can
106 decide what amount of shared memory and L1 cache SM can use. The constaint is
107 that the sum of both amounts is less or equal to 64Kb.
109 Threads are used to benefit from the important number of cores of a GPU. Those
110 threads are different from traditional threads for CPU. In
111 chapter~\ref{chapter2}, some examples of GPU programming will explicit the
112 details of the GPU threads. However, threads are gathered into blocks of 32
113 threads, called ``warp''. Those warps are important when designing an algorithm
117 Another big difference between CPU and GPU is the latency of memory. In CPU,
118 everything is optimized to obtain a low latency architecture. This is possible
119 through the use of cache memories. Moreover, nowadays CPUs perform many
120 performance optimizations such as speculative execution which roughly speaking
121 consists in executing a small part of code in advance even if later this work
122 reveals to be useless. In opposite, GPUs do not have low latency memory. In
123 comparison GPUs have ridiculous cache memories. Nevertheless the architecture of
124 GPUs is optimized for throughtput computation and it takes into account the
130 \centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
131 \caption[Comparison of low latency of CPU and highthroughput of GPU]{Comparison of low latency of CPU and highthroughput of GPU.}
132 \label{ch1:fig:latency_throughput}
135 Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of
136 memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
137 by one with a short memory latency to get the data to process. After some tasks,
138 there is a context switch that allows the CPU to run concurrent applications
139 and/or multi-threaded applications. Memory latencies are longer in a GPU, the
140 the principle to obtain a high throughput is to have many tasks to
141 compute. Later we will see that those tasks are called threads with CUDA. With
142 this principle, as soon as a task is finished the next one is ready to be
143 executed while the waiting for data for the previous task is overlapped by
144 computation of other tasks.
148 \section{Kinds of parallelism}
150 Many kinds of parallelism are avaible according to the type of hardware.
151 Roughtly speaking, there are three classes of parallelism: instruction-level
152 parallelism, data parallelism and task parallelism.
154 Instruction-level parallelism consists in re-ordering some instructions in order
155 to execute some of them in parallel without changing the result of the code.
156 In modern CPUs, instruction pipelines allow processor to execute instruction
157 faster. With a pipeline a processor can execute multiple instructions
158 simultaneously due to the fact that the output of a task is the input of the
161 Data parallelism consists in executing the same program with different data on
162 different computing units. Of course, no depency should exist between the the
163 data. For example, it is easy to parallelize loops without dependency using the
164 data parallelism paradigm. This paradigm is linked with the Single Instructions
165 Multiple Data (SIMD) architecture. This is the kind of parallism providing by
168 Taks parallelism is the common parallism achieved out on cluster and grid and
169 high performance architecture where different tasks are executed by different
172 \section{CUDA Multithreading}
174 The data parallelism of CUDA is more precisely based on the Single Instruction
175 Multiple Thread (SIMT) model. This is due to the fact that the programmer access
176 to the cores by the intermediate of threads. In the CUDA model, all cores
177 execute the same set of instructions but with different data. This model has
178 similarities with vector programming model proposed for vector machines through
179 the 1970s into the 90s, notably the various Cray platforms. On the CUDA
180 architecture, the performance is led by the use of a huge number of threads
181 (from thousand upto to millions). The particularity of the model is that there
182 is no context switching as in CPUs and each thread has its own registers. In practice, threads are executed by SM and are gathered into groups of 32 threads. Those groups are call ``warps''. Each SM alternatively executes ``active warps'' and warps becoming temporaly inactive due to waiting of data (as shown in Figure~\ref{ch1:fig:latency_throughput}).
184 The key to scalability in the CUDA model is the use of a huge number of threads.
185 In practice threads are not only gathered in warps but also in thread blocks. A
186 thread block is executed by only one SM and it cannot migrate. Typical size of
187 thread block is a number power of two (for example: 64, 128, 256 or 512).
191 In this case, without changing anything inside a CUDA code, it is possible to
192 run your code with a small CUDA device or most performant Tesla CUDA cards.
193 Blocks are executed in any number depending on the number of SM available. So
194 the programmer must conceive its code having this issue in mind. This
195 independence between threads blocks provides the scalability of CUDA codes.
198 \centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
199 \caption[Scalability of GPU]{Scalability of GPU.}
200 \label{ch1:fig:scalability}
204 A kernel is a function which contains a block a instruction that are executed by
205 threads of a GPU. When the problem considered is a 2 dimensions or 3 dimensions
206 problem, it is possible to group thread blocks into grid. In practice, the
207 number of thread blocks and the size of thread block is given in parameter to
208 each kernel. Figure~\ref{ch1:fig:scalability} illustrates an example of a
209 kernel composed of 8 thread blocks. Then this kernel is executed on a small
210 device containing only 2 SMs. So in in this case, blocks are executed 2 by 2 in
211 any order. If the kernel is executed on a larger CUDA device containing 4 SMs,
212 blocks are executed 4 by 4 simultaneously. The execution times should be
213 approximately twice faster in the latter case. Of course, that depends on other
214 parameters that will be described later.
216 Thread blocks provide a way to cooperation in the sens that threads of the same
217 block cooperatively load and store blocks of memory they all
218 use. Synchronizations of threads in the same block are possible (but not between
219 thread of different blocks). Threads of the same block can also share results in
220 order to compute a single result. In chapter~\ref{chapter2}, some examples will
224 \section{Memory hierarchy}
226 The memory hierarchy of GPUs\index{Memory hierarchy of a GPU} is different from
227 the CPUs one. In practice, there are registers, local memory, shared memory,
228 cache memroy and global memory.
230 As previously mentioned each thread can access its own registers. It is
231 important to keep in mind that the number of registers per block is limited. On
232 recent cards, this number is limited to 64Kb per SM. Access to registers is
233 very fast, so when possible it is a good idea to use them.
235 Likewise each thread can access local memory which, in practice, is much slower
236 than registers. Local memory is automatically used by the compiler when all the
237 registers are occupied. So the best idea is to optimize the use of registers
238 even if this implies to reduce the number of threads per block.
240 Shared memory allows cooperation between threads of the same block. This kind
241 of memory is fast by it requires to be manipulated manually and its size is
242 limited. It is accessible during the execution of a kernel. So the principle is
243 to fill the shared memory at the start of the kernel with global data that are
244 used very frequently, then threads can access it for their computation. They
245 can obviously change the content of this shared memory either with computation
246 or load of other data and they can store its content in the global memory. So
247 shared memory can be seen as a cache memory manageable manually. This requires
248 obviously an effort from the programmer.
250 On recent cards, the programmer may decide what amount of cache memory and
251 shared memory is attributed to a kernel. The cache memory is a L1 cache which is
252 directly managed by the GPU. Sometimes, this cache provides very efficient
253 result and sometimes the use of shared memory is a better solution.
256 \centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
257 \caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.}
258 \label{ch1:fig:memory_hierarchy}
262 Figure~\ref{ch1:fig:memory_hierarchy} illustrates the memory hierarchy of a
263 GPU. Threads are represented on the top of the figure. They can access to their
264 own registers and their local memory. Threads of the same block can access to
265 the shared memory of this block. The cache memory is not represented here but it
266 is local to a thread. Then each block can access to the global memory of the
270 %%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
271 %%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf
272 %%http://forum.wttsnxt.com/my_forum/viewtopic.php?f=5&t=9519
273 %%http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf
274 %%http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
275 %%http://people.maths.ox.ac.uk/~gilesm/cuda/
278 \putbib[Chapters/chapter1/biblio]