\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}


\chapter{Presentation of the GPU architecture and of the CUDA environment}
\label{chapter1}

\section{Introduction}\label{ch1:intro}
This chapter introduces the Graphics Processing Unit (GPU) architecture and the
concepts needed to understand how GPUs work and how they can be used to speed up
the execution of some algorithms. First, it gives a brief history of the
development of graphics cards up to the point when they started being used to
perform general-purpose computations. Then the architecture of a GPU is
described. There are many fundamental differences between a GPU and a
traditional processor. In order to benefit from the power of a GPU, a CUDA
programmer needs to use threads. These threads have particular properties that
make the CUDA model efficient and scalable, provided that some constraints are
respected.

\clearpage
\section{Brief history of the video card}

Video cards, or graphics cards, were introduced in personal computers to
produce high-quality graphics faster than classical Central Processing Units
(CPUs) and to free the CPU from this task. In general, display tasks are very
repetitive and very specific. Hence, some manufacturers have produced more and
more sophisticated video cards, providing 2D accelerations, then 3D
accelerations, then some light transforms. Video cards have their own memory to
perform their computations. For at least two decades, every personal computer
has had a video card, which is simple for desktop computers and which provides
many accelerations for game and/or graphic-oriented computers. In the latter
case, graphics cards may be more expensive than a CPU.

Since 2000, video cards have allowed users to apply arithmetic operations
simultaneously on a sequence of pixels, an approach later called stream
processing. In this case, the information of the pixels (color, location, and
other information) is combined in order to produce a pixel color that can be
displayed on a screen. Simultaneous computations are provided by shaders, which
calculate rendering effects on graphics hardware with a high degree of
flexibility. These shaders handle the stream data with pipelines.

Some researchers tried to apply those operations to other data, representing
something different from pixels, and this resulted in the first uses of video
cards for performing general-purpose computations. The programming model was
not easy to use at all and was very dependent on the hardware constraints. More
precisely, it consisted of using either DirectX or OpenGL functions, which
provide an interface to some classical video operations (memory transfers,
texture manipulation, etc.). Floating-point operations were, most of the time,
unimaginable. Obviously, when something went wrong, programmers had no way (and
no tools) to detect it.

\section{GPGPU}

In order to benefit from the computing power of more recent video cards, CUDA
was first proposed in 2007 by NVIDIA. It unifies the programming model for some
of their most efficient video cards. CUDA~\cite{ch1:cuda} was quickly
recognized by the scientific community as a great advance for general-purpose
graphics processing unit (GPGPU) computing. Of course, other programming models
have been proposed. The best-known alternative is OpenCL, which aims to be
multiplatform and portable. This is a great advantage since it is even possible
to execute OpenCL programs on traditional CPUs. The main drawback is that it is
further from the hardware and consequently it sometimes produces less efficient
programs. Moreover, CUDA benefits from more mature compilation and optimization
procedures. Other less-known environments have been proposed, but most of them
have been discontinued, such as FireStream by ATI, which is no longer
maintained and has been replaced by OpenCL, and BrookGPU by Stanford
University~\cite{ch1:Buck:2004:BGS}. Another environment, based on pragmas
(insertion of pragma directives inside the code to help the compiler generate
efficient code), is called OpenACC. For a comparison with OpenCL, interested
readers may refer to~\cite{ch1:Dongarra}.



\section{Architecture of current GPUs}

The architecture \index{GPU!architecture of a} of current GPUs is constantly
evolving. Nevertheless, some trends remain constant throughout this evolution.
The processing units composing a GPU are far simpler than those of a
traditional CPU, and it is much easier to integrate many computing units inside
a GPU card than to do so with many cores inside a CPU. In 2012, the most
powerful GPUs contained more than 500 cores and the most powerful CPUs had 8
cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows the number of cores inside
a CPU and inside a GPU. In fact, in a current NVIDIA GPU, there are
multiprocessors which have 32 cores (for example, on Fermi cards). The core
clock of a CPU is generally around 3~GHz and that of a GPU is about 1.5~GHz.
Although the core clock of GPU cores is slower, the number of cores inside a GPU
provides more computational power. This measure is commonly expressed as the
number of floating-point operations per second. Nowadays, the most powerful GPUs
provide more than 1~TFlops, i.e., $10^{12}$ floating-point operations per
second. Nevertheless, GPUs reach this performance mainly on repetitive work in
which only the data change. It is important to keep in mind that
multiprocessors inside a GPU have 32 cores. Later we will see that these 32
cores need to do the same work to get maximum performance.
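
As a rough sanity check of the order of magnitude quoted above, peak throughput
can be estimated as the number of cores times the clock frequency times the
number of floating-point operations each core completes per clock. The sketch
below assumes 512 cores at 1.5~GHz and assumes that each core completes one
multiply-add, counted as two floating-point operations, per clock. These values
are illustrative assumptions, not the specification of a particular card.
\[
  512 \times \left(1.5 \times 10^{9}\right) \times 2
  \;\approx\; 1.5 \times 10^{12}\ \mathrm{Flops}
\]
Under these assumptions the estimate is consistent with the figure of more than
1~TFlops; it is a theoretical peak, which real programs approach only when all
cores are kept busy.
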

\begin{figure}[t!]
\centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
\caption{Comparison of number of cores in a CPU and in a GPU.}
%[Comparison of number of cores in a CPU and in a GPU]
\label{ch1:fig:comparison_cpu_gpu}
\end{figure}

On the most powerful GPU cards, based on the Fermi architecture,
multiprocessors are called streaming multiprocessors (SMs). Each SM contains 32
cores and is able to perform 32 floating-point or integer operations per clock
on 32-bit numbers, or 16 floating-point operations per clock on 64-bit numbers.
SMs have their own registers, execution pipelines, and caches. On the Fermi
architecture, there are 64~KB of shared memory plus L1 cache and 32,768 32-bit
registers per SM. More precisely, the programmer can decide how much shared
memory and how much L1 cache each SM uses. The constraint is that the sum of
both amounts should be less than or equal to 64~KB.

Threads are used to benefit from the large number of cores of a GPU. These
threads are different from traditional CPU threads. In
Chapter~\ref{chapter2}, some examples of GPU programming will explain the
details of the GPU threads. Threads are gathered into groups of 32 threads,
called ``warps''. These warps are important when designing an algorithm for a
GPU.
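
The quantities discussed above (number of SMs, warp size, shared memory, and
registers) can be read from the CUDA runtime rather than taken from a data
sheet. The following is a minimal sketch, assuming a machine with the CUDA
toolkit installed and querying device 0 only; the values printed depend on the
card and are not necessarily those quoted for Fermi.

\begin{verbatim}
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    // Check that at least one CUDA-capable device is visible.
    int deviceCount = 0;
    cudaError_t err = cudaGetDeviceCount(&deviceCount);
    if (err != cudaSuccess || deviceCount == 0) {
        std::printf("No CUDA device available\n");
        return 1;
    }

    // Query device 0 (an arbitrary choice for this sketch).
    cudaDeviceProp prop;
    err = cudaGetDeviceProperties(&prop, 0);
    if (err != cudaSuccess) {
        std::printf("Query failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    std::printf("Device:                     %s\n", prop.name);
    std::printf("Streaming multiprocessors:  %d\n", prop.multiProcessorCount);
    std::printf("Warp size (threads):        %d\n", prop.warpSize);
    std::printf("Shared memory per block:    %zu bytes\n",
                prop.sharedMemPerBlock);
    std::printf("32-bit registers per block: %d\n", prop.regsPerBlock);
    return 0;
}
\end{verbatim}

Such a program would typically be compiled with \verb|nvcc| and run once on
each target machine before choosing block sizes.
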


Another big difference between a CPU and a GPU is the latency of memory. In a
CPU, everything is optimized to obtain a low-latency architecture. This is
possible through the use of cache memories. Moreover, modern CPUs carry out
many performance optimizations such as speculative execution, which, roughly
speaking, consists of executing a small part of the code in advance even if
this work later turns out to be useless. GPUs do not have low-latency memory.
In comparison, GPUs have small cache memories; nevertheless, the architecture
of GPUs is optimized for throughput computation and it takes the memory latency
into account.





Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference in
memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
by one with a short memory latency to get the data to process. After some
tasks, there is a context switch that allows the CPU to run concurrent
applications and/or multi-threaded applications. Memory latencies are longer in
a GPU. The principle for obtaining a high throughput is to have many tasks to
compute. Later we will see that these tasks are called threads with CUDA. With
this principle, as soon as a task is finished the next one is ready to be
executed, while the wait for data for the previous task is overlapped by the
computation of other tasks.

\clearpage

\begin{figure}[t!]
\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
\caption{Comparison of low latency of a CPU and high throughput of a GPU.}
\label{ch1:fig:latency_throughput}
\end{figure}

\section{Kinds of parallelism}

Many kinds of parallelism are available according to the type of hardware.
Roughly speaking, there are three classes of parallelism: instruction-level
parallelism, data parallelism, and task parallelism.

Instruction-level parallelism consists of reordering some instructions in order
to execute some of them in parallel without changing the result of the code.
In modern CPUs, instruction pipelines allow the processor to execute
instructions faster. With a pipeline, a processor can work on multiple
instructions simultaneously because each instruction is split into stages and
the output of one stage is the input of the next one.

Data parallelism consists of executing the same program with different data on
different computing units. Of course, no dependency should exist among the
data. For example, it is easy to parallelize loops without dependency using the
data parallelism paradigm. This paradigm is linked with the Single Instruction
Multiple Data (SIMD) architecture. This is the kind of parallelism provided by
GPUs.
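
To make the idea of a loop without dependency concrete, the sketch below shows
the same element-wise addition written first as a sequential loop and then as a
CUDA kernel in which each thread handles one element. The function names
\verb|add_cpu| and \verb|add_gpu| are hypothetical and chosen for illustration;
the launch code is omitted, and the details of kernels are covered in
Chapter~\ref{chapter2}.

\begin{verbatim}
// Sequential version: iteration i touches only a[i], b[i], and c[i],
// so no iteration depends on another.
void add_cpu(const float* a, const float* b, float* c, int n) {
    for (int i = 0; i < n; ++i) {
        c[i] = a[i] + b[i];
    }
}

// Data-parallel version: the loop disappears and each thread computes
// the single element whose index it derives from its position.
__global__ void add_gpu(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {  // guard: the last block may contain surplus threads
        c[i] = a[i] + b[i];
    }
}
\end{verbatim}
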

Task parallelism is the common form of parallelism on clusters, grids, and
high-performance architectures, where different tasks are executed by different
computing units.
\clearpage
\section{CUDA multithreading}

The data parallelism of CUDA is more precisely based on the Single Instruction
Multiple Thread (SIMT) model, because a programmer accesses the cores through
threads. In the CUDA model, all cores execute the same set of instructions but
with different data. This model has similarities with the vector programming
model proposed for vector machines from the 1970s into the 1990s, notably the
various Cray platforms. On the CUDA architecture, performance is driven by the
use of a huge number of threads (from thousands up to millions). The
particularity of the model is that there is no context switching as in CPUs and
each thread has its own registers. In practice, threads are executed by SMs and
gathered into groups of 32 threads, called warps. Each SM switches among its
warps, executing those that are active while others are temporarily inactive
because they are waiting for data (as shown in
Figure~\ref{ch1:fig:latency_throughput}).



The key to scalability in the CUDA model is the use of a huge number of threads.
In practice, threads are gathered not only in warps but also in thread blocks.
A thread block is executed by only one SM and it cannot migrate. The typical
size of a thread block is a power of two (for example, 64, 128, 256, or 512).



With this organization, without changing anything inside a CUDA code, it is
possible to run the same code on a small CUDA device or on the best-performing
Tesla cards. Blocks are executed in any order, depending on the number of SMs
available, so the programmer must design code with this in mind. This
independence between thread blocks provides the scalability of CUDA codes.




A kernel is a function which contains a block of instructions that are executed
by the threads of a GPU. When the problem considered is a two-dimensional or
three-dimensional problem, it is possible to group thread blocks into a grid.
In practice, the number of thread blocks and the size of thread blocks are given
as parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
example of a kernel composed of 8 thread blocks. This kernel is first executed
on a small device containing only 2 SMs, so in this case blocks are executed 2
by 2 in any order. If the kernel is executed on a larger CUDA device containing
4 SMs, blocks are executed 4 by 4 simultaneously. The execution should be
approximately twice as fast in the latter case. Of course, that depends on other
parameters that will be described later (in this chapter and other chapters).
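
The way the number of thread blocks and the size of a thread block are passed
to a kernel can be sketched as follows. This is a minimal illustration under
stated assumptions: the kernel name \verb|scale|, the array size of 2048
elements, and the block size of 256 threads are all hypothetical choices, picked
so that the launch uses 8 thread blocks as in the example above. Error handling
is reduced to a single check for brevity.

\begin{verbatim}
#include <cstdio>
#include <cuda_runtime.h>

// Each thread multiplies one element of x by factor.
__global__ void scale(float* x, float factor, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        x[i] *= factor;
    }
}

int main() {
    const int n = 2048;
    const int threadsPerBlock = 256;
    // Round up so that every element is covered: here 2048 / 256 = 8.
    const int blocks = (n + threadsPerBlock - 1) / threadsPerBlock;

    float h[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;

    float* d = nullptr;
    if (cudaMalloc(&d, n * sizeof(float)) != cudaSuccess) {
        std::printf("cudaMalloc failed (is a CUDA device present?)\n");
        return 1;
    }
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);

    // Launch parameters: number of blocks, then threads per block.
    scale<<<blocks, threadsPerBlock>>>(d, 2.0f, n);

    // This copy also waits for the kernel to finish.
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    std::printf("%d blocks of %d threads, h[0] = %f\n",
                blocks, threadsPerBlock, h[0]);
    return 0;
}
\end{verbatim}

Nothing in this code refers to the number of SMs: the same launch of 8 blocks
runs unchanged on a device with 2 SMs or with 4 SMs, which is the scalability
property described above.
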


\begin{figure}[t!]
\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
\caption{Scalability of GPU.}
\label{ch1:fig:scalability}
\end{figure}

Thread blocks provide a way to cooperate, in the sense that threads of the same
block cooperatively load and store blocks of memory they all use.
Synchronization of threads in the same block is possible (but not between
threads of different blocks). Threads of the same block can also share results
in order to compute a single result. In Chapter~\ref{chapter2}, some examples
will explain that.
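
As an illustration of threads of one block sharing results to compute a single
result, the sketch below sums the elements handled by a block. It is one common
way to write such a reduction, not necessarily the one used later in this book.
The kernel name \verb|block_sum| and the constant \verb|BLOCK_SIZE| are
hypothetical, and the kernel assumes it is launched with exactly
\verb|BLOCK_SIZE| threads per block, a power of two.

\begin{verbatim}
#define BLOCK_SIZE 256  // assumed block size; must be a power of two

// Each block adds up BLOCK_SIZE consecutive elements of in and writes
// one partial sum to out[blockIdx.x].
__global__ void block_sum(const float* in, float* out, int n) {
    __shared__ float buf[BLOCK_SIZE];  // visible to the whole block

    int tid = threadIdx.x;
    int i = blockIdx.x * blockDim.x + tid;

    // Cooperative load: each thread brings one element into shared memory.
    buf[tid] = (i < n) ? in[i] : 0.0f;
    __syncthreads();  // wait until the whole block has loaded its data

    // Tree reduction: the number of active threads halves at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) {
            buf[tid] += buf[tid + stride];
        }
        __syncthreads();  // every thread reaches this barrier
    }

    // Thread 0 publishes the result of its block.
    if (tid == 0) {
        out[blockIdx.x] = buf[0];
    }
}
\end{verbatim}

Because threads of different blocks cannot synchronize inside a kernel, the
partial sums in \verb|out| would have to be combined in a further step, for
example by a second kernel launch or on the CPU.
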


\section{Memory hierarchy}

The memory hierarchy of GPUs\index{memory hierarchy} is different from that of
CPUs. In practice, there are registers\index{memory hierarchy!registers}, local
memory\index{memory hierarchy!local memory}, shared
memory\index{memory hierarchy!shared memory}, cache
memory\index{memory hierarchy!cache memory}, and global
memory\index{memory hierarchy!global memory}.


As previously mentioned, each thread can access its own registers. It is
important to keep in mind that the number of registers per block is limited. On
recent cards, this number is limited to 64K registers per SM. Access to
registers is very fast, so it is a good idea to use them whenever possible.

Likewise, each thread can access local memory which, in practice, is much
slower than registers. Local memory is automatically used by the compiler when
all the registers are occupied, so the best idea is to optimize the use of
registers even if this involves reducing the number of threads per block.

\begin{figure}[b!]
\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
\caption{Memory hierarchy of a GPU.}
\label{ch1:fig:memory_hierarchy}
\end{figure}



Shared memory allows cooperation between threads of the same block. This kind
of memory is fast, but it needs to be managed manually and its size is limited.
It is accessible only during the execution of a kernel. So the idea is to fill
the shared memory at the start of the kernel with global data that are used
very frequently; threads can then access it for their computation. Threads can
obviously change the content of this shared memory, either by computation or by
loading other data, and they can store its content in the global memory. Shared
memory can thus be seen as a manually managed cache memory, which obviously
requires an effort from the programmer.

On recent cards, the programmer may decide what amount of cache memory and
shared memory is attributed to a kernel. The cache memory is an L1 cache which
is directly managed by the GPU. Sometimes this cache provides very efficient
results, and sometimes the use of shared memory is a better solution.
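
In the CUDA runtime, this choice is expressed per kernel as a preference. The
following minimal sketch assumes a hypothetical kernel named \verb|my_kernel|
and only shows how the preference is set. The runtime treats it as a hint, and
on devices where the split between L1 cache and shared memory is not
configurable it may have no effect.

\begin{verbatim}
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel used only to have something to configure.
__global__ void my_kernel(float* data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        data[i] += 1.0f;
    }
}

int main() {
    // Ask for more shared memory and less L1 cache for this kernel.
    // cudaFuncCachePreferL1 would request the opposite split.
    cudaError_t err =
        cudaFuncSetCacheConfig(my_kernel, cudaFuncCachePreferShared);
    std::printf("cudaFuncSetCacheConfig: %s\n", cudaGetErrorString(err));
    return 0;
}
\end{verbatim}

Which setting is faster depends on the kernel, so in practice both are usually
timed on the target card.
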




Figure~\ref{ch1:fig:memory_hierarchy} illustrates the memory hierarchy of a
GPU. Threads are represented at the top of the figure. They can access their
own registers and their local memory. Threads of the same block can access the
shared memory of that block. The cache memory is not represented here, but it
is local to each SM. Finally, each block can access the global memory of the
GPU.
\clearpage
\section{Conclusion}

In this chapter, a brief history of the video card, which later came to be used
for general-purpose computation, has been given. The architecture of a GPU has
been illustrated, focusing on the particularities of GPUs in terms of
parallelism, memory latency, and threads. In order to design an efficient
algorithm for a GPU, it is essential to keep all these parameters in mind.


\putbib[Chapters/chapter1/biblio]
