BookGPU/Chapters/chapter1/ch1.tex

   1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
   2
   3
   4 \chapter{Presentation of the GPU architecture and of the CUDA environment}
   5 \label{chapter1}
   6
   7 \section{Introduction}\label{ch1:intro}
   8
   9 This chapter introduces the Graphics  Processing Unit (GPU) architecture and all
  10 the concepts needed to understand how GPUs  work and can be used to speed up the
  11 execution of some algorithms. First of all this chapter gives a brief history of
  12 the development of Graphics card until they can be used in order to make general
  13 purpose computation.  Then  the architecture of a GPU  is illustrated. There are
  14 many fundamental differences  between a GPU and a  tradition processor. In order
  15 to benefit from the power of a GPU, a CUDA programmer needs to use threads. They
  16 have  some particularities  which enables  the CUDA  model to  be  efficient and
  17 scalable when some constraints are addressed.
  18
  19
  20
  21 \section{Brief history of Video Card}
  22
  23 Video  card or  Graphics card  have been  introduced in  personnal  computers to
  24 produce  high quality  graphics faster  than classical  Central  Processing Unit
  25 (CPU) and  to alleviate CPU from this  task. In general, display  tasks are very
  26 repetitive and very specific.  Hence,  some manufacturers have produced more and
  27 more sofisticated video cards, providing 2D accelerations then 3D accelerations,
  28 then some  light transforms. Video cards  own their own memory  to perform their
  29 computation.  From  at least two dedaces,  every personnal computer  has a video
  30 card which is  simple for desktop computers or  which provides many accelerations
  31 for game  and/or graphic oriented computers.  In the latter  case, graphic cards
  32 may be more expensive than a CPU.
  33
  34 After 2000,  video cards allowed to apply  arithmetics operations simulatenously
  35 on a  sequence of  pixels, also  later called stream  processing. In  this case,
  36 information of the  pixels (color, location and other  information) are combined
  37 in   order  to   produce   a  pixel   color   that  can   be   displayed  on   a
  38 screen.  Simultaneous  computations  are  provided by  shaders  which  calculate
  39 rendering effects on graphics hardware  with a high degree of flexibility. These
  40 shaders handles the stream data with pipelines.
  41
  42
  43 Some reasearchers  tried to apply  those operations on other  data, representing
  44 something different  from pixels,  and consequently this  resulted in  the first
  45 uses of video cards for  performing general purpose computation. The programming
  46 model  was not  easy  to use  at  all and  was very  dependent  of the  hardware
  47 constraints.  More precisely  it consisted  in  using either  DirectX of  OpenGL
  48 functions  providing  an  interface  to  some classical  operations  for  videos
  49 operations  (memory  transfers,   texture  manipulation,  ...).  Floating  point
  50 operations were  most of  the time unimaginable.   Obviously when  something bad
  51 happened, programmers had no way (and tools) to detect it.
  52
  53 \section{GPGPU}
  54
  55 In order  to benefit from the computing  power of more recent  video cards, CUDA
  56 was first proposed in 2007 by  NVidia. It unifies the programming model for some
  57 of  their most performant  video cards.   Cuda~\cite{ch1:cuda} has  quickly been
  58 considered by  the scientific community as  a great advance  for general purpose
  59 graphics processing  unit (GPGPU) computing.  Of course  other programming model
  60 have been  proposed. The  other well-known alternative  is OpenCL which  aims at
  61 proposing an alternative to Cuda  and which is multi-platform and portable. This
  62 is a  great advantage since  it is even  possible to execute OpenCL  programs on
  63 traditionnal  CPUs.  The  main  drawbacks is  that  it is  less  tight with  the
  64 hardware and consequently provides  sometimes less efficient programs. Moreover,
  65 Cuda benefits  from more mature compilation and  optimization procedures.  Other
  66 less known environment  have been proposed, but most of  them have been stopped,
  67 for example we  can cited: FireStream by ATI which is  not maintened anymore and
  68 replaced  by  OpenCL,  BrookGPU  by  Standford  University~\cite{ch1:Buck:2004:BGS}.
  69 Another environment based  on pragma (insertion of pragma  directives inside the
  70 code to  help the compiler  to generate efficient  code) is call OpenACC.  For a
  71 comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}.
  72
  73
  74
  75 \section{Architecture of current GPUs}
  76
  77 Architecture  \index{architecture  of  a  GPU}  of current  GPUs  is  constantly
  78 evolving.    Nevertheless    some    trends    remains   true    through    this
  79 evolution.  Processing  units  composing a  GPU  are  far  more simpler  than  a
  80 traditional CPU but it is much easier to integrate many computing units inside a
  81 GPU card than many  cores inside a CPU. This is due to the  fact that cores of a
  82 GPU are simpler than  cores of a CPU.  In 2012, the  most powerful GPUs own more
  83 than     500    cores     and    the     most    powerful     CPUs     have    8
  84 cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows  the number of cores inside
  85 a  CPU  and  inside a  GPU.   In  fact,  in  a  current NVidia  GPU,  there  are
  86 multiprocessors which have 32 cores (for example on Fermi cards). The core clock
  87 of CPU is generally around 3GHz and the one of GPU is about 1.5GHz. Although the
  88 core clock  of GPU cores is  slower, the amount  of cores inside a  GPU provides
  89 more computational power. This measure  is commonly represented by the number of
  90 floating point operation  per seconds. Nowadays most powerful  GPUs provide more
  91 than 1TFlops, i.e. $10^{12}$ floating point operations per second.  Nevertheless
  92 GPUs  are  very efficient  to  perform  some operations  but  not  all kinds  of
  93 operations. They are very efficient to execute repetitive work in which only the
  94 data change. It  is important to keep in mind that  multiprocessors inside a GPU
  95 have 32 cores. Later we will see that these 32 cores need to do the same work to
  96 get maximum performance.
  97
  98 \begin{figure}[b!]
  99 \centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
 100 \caption[Comparison of number of cores in a CPU and in a GPU]{Comparison of number of cores in a CPU and in a GPU.}
 101 \label{ch1:fig:comparison_cpu_gpu}
 102 \end{figure}
 103
 104 On most powerful  GPU cards, called Fermi, multiprocessors  are called streaming
 105 multiprocessors  (SM). Each  SM contains  32  cores and  is able  to perform  32
 106 floating point or integer operations on  32bits numbers per clock or 16 floating
 107 point  on  64bits number  per  clock. SMs  have  their  own registers,  execution
 108 pipelines and caches.  On Fermi architecture,  there are 64Kb shared memory + L1
 109 cache  and 32,536 32bits  registers per  SM. More  precisely the  programmer can
 110 decide what amount  of shared memory and  L1 cache SM can use.  The constaint is
 111 that the sum of both amounts is less or equal to 64Kb.
 112
 113 Threads are used to  benefit from the important number of cores  of a GPU. Those
 114 threads    are   different    from    traditional   threads    for   CPU.     In
 115 chapter~\ref{chapter2},  some  examples of  GPU  programming  will explicit  the
 116 details of  the GPU  threads. However,  threads are gathered  into blocks  of 32
 117 threads, called ``warp''. Those warps  are important when designing an algorithm
 118 for GPU.
 119
 120
 121 Another big  difference between CPU and GPU  is the latency of  memory.  In CPU,
 122 everything is optimized  to obtain a low latency  architecture. This is possible
 123 through  the  use  of  cache  memories. Moreover,  nowadays  CPUs  perform  many
 124 performance optimizations  such as speculative execution  which roughly speaking
 125 consists in executing  a small part of  code in advance even if  later this work
 126 reveals to  be useless. In  opposite, GPUs do  not have low latency  memory.  In
 127 comparison GPUs have ridiculous cache memories. Nevertheless the architecture of
 128 GPUs  is optimized for  throughtput computation  and it  takes into  account the
 129 memory latency.
 130
 131
 132
 133 \begin{figure}[b!]
 134 \centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
 135 \caption[Comparison of low latency of CPU and highthroughput of GPU]{Comparison of low latency of CPU and highthroughput of GPU.}
 136 \label{ch1:fig:latency_throughput}
 137 \end{figure}
 138
 139 Figure~\ref{ch1:fig:latency_throughput}  illustrates   the  main  difference  of
 140 memory latency between a CPU and a  GPU. In a CPU, tasks ``ti'' are executed one
 141 by one with a short memory latency to get the data to process. After some tasks,
 142 there is  a context switch  that allows the  CPU to run  concurrent applications
 143 and/or multi-threaded  applications. Memory latencies  are longer in a  GPU, the
 144 the  principle  to   obtain  a  high  throughput  is  to   have  many  tasks  to
 145 compute. Later we  will see that those tasks are called  threads with CUDA. With
 146 this  principle, as soon  as a  task is  finished the  next one  is ready  to be
 147 executed  while the  waiting for  data for  the previous  task is  overlapped by
 148 computation of other tasks.
 149
 150
 151
 152 \section{Kinds of parallelism}
 153
 154 Many  kinds  of parallelism  are  avaible according  to  the  type of  hardware.
 155 Roughtly  speaking, there  are three  classes of  parallelism: instruction-level
 156 parallelism, data parallelism and task parallelism.
 157
 158 Instruction-level parallelism consists in re-ordering some instructions in order
 159 to execute  some of them in parallel  without changing the result  of the code.
 160 In  modern CPUs, instruction  pipelines allow  processor to  execute instruction
 161 faster.   With   a  pipeline  a  processor  can   execute  multiple  instructions
 162 simultaneously due  to the fact that  the output of a  task is the  input of the
 163 next one.
 164
 165 Data parallelism consists  in executing the same program  with different data on
 166 different computing  units. Of course, no  depency should exist  between the the
 167 data. For example, it is easy  to parallelize loops without dependency using the
 168 data parallelism paradigm. This paradigm  is linked with the Single Instructions
 169 Multiple Data  (SIMD) architecture. This is  the kind of  parallism providing by
 170 GPUs.
 171
 172 Taks parallelism is the common parallism  achieved out on clusters and grids and
 173 high performance  architectures where different tasks are  executed by different
 174 computing units.
 175
 176 \section{CUDA Multithreading}
 177
 178 The data parallelism  of CUDA is more precisely based  on the Single Instruction
 179 Multiple Thread (SIMT) model. This is due to the fact that a programmer accesses
 180 to  the cores  by the  intermediate of  threads. In  the CUDA  model,  all cores
 181 execute the  same set of  instructions but with  different data. This  model has
 182 similarities with vector programming  model proposed for vector machines through
 183 the  1970s into  the  90s, notably  the  various Cray  platforms.   On the  CUDA
 184 architecture, the  performance is  led by the  use of  a huge number  of threads
 185 (from thousand upto  to millions). The particularity of the  model is that there
 186 is no  context switching as in  CPUs and each  thread has its own  registers. In
 187 practice,  threads  are executed  by  SM  and are  gathered  into  groups of  32
 188 threads.  Those  groups  are  call  ``warps''. Each  SM  alternatively  executes
 189 ``active warps''  and warps becoming temporaly  inactive due to  waiting of data
 190 (as shown in Figure~\ref{ch1:fig:latency_throughput}).
 191
 192 The key to scalability in the CUDA model is the use of a huge number of threads.
 193 In practice, threads are not only gathered  in warps but also in thread blocks. A
 194 thread block is executed  by only one SM and it cannot  migrate. Typical size of
 195 thread block is a number power of two (for example: 64, 128, 256 or 512).
 196
 197
 198
 199 In this  case, without changing anything inside  a CUDA code, it  is possible to
 200 run your  code with  a small CUDA  device or  most performant Tesla  CUDA cards.
 201 Blocks are  executed in any order depending  on the number of  SMs available. So
 202 the  programmer  must  conceive  its  code  having this  issue  in  mind.   This
 203 independence between threads blocks provides the scalability of CUDA codes.
 204
 205 \begin{figure}[b!]
 206 \centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
 207 \caption[Scalability of GPU]{Scalability of GPU.}
 208 \label{ch1:fig:scalability}
 209 \end{figure}
 210
 211
 212 A kernel is a function which  contains a block of instructions that are executed
 213 by the  threads of a GPU.   When the problem considered  is a 2  dimensions or 3
 214 dimensions  problem,  it is  possible  to group  thread  blocks  into grid.   In
 215 practice, the number of  thread blocks and the size of thread  block is given in
 216 parameter  to  each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an
 217 example of a kernel composed of 8 thread blocks. Then this kernel is executed on
 218 a small device containing only 2 SMs.  So in in this case, blocks are executed 2
 219 by 2 in any order.  If the kernel is executed on a larger CUDA device containing
 220 4 SMs, blocks are executed 4 by 4 simultaneously.  The execution times should be
 221 approximately twice faster in the latter  case. Of course, that depends on other
 222 parameters that will be described later.
 223
 224 Thread blocks provide a way to cooperation  in the sens that threads of the same
 225 block   cooperatively    load   and   store   blocks   of    memory   they   all
 226 use. Synchronizations of threads in the same block are possible (but not between
 227 threads of different  blocks). Threads of the same block  can also share results
 228 in order  to compute a  single result. In chapter~\ref{chapter2},  some examples
 229 will explicit that.
 230
 231
 232 \section{Memory hierarchy}
 233
 234 The memory hierarchy of  GPUs\index{memory~hierarchy} is different from the CPUs
 235 one.  In practice,  there are registers\index{memory~hierarchy!registers}, local
 236 memory\index{memory~hierarchy!local~memory},                               shared
 237 memory\index{memory~hierarchy!shared~memory},                               cache
 238 memory\index{memory~hierarchy!cache~memory}              and              global
 239 memory\index{memory~hierarchy!global~memory}.
 240
 241
 242 As  previously  mentioned each  thread  can access  its  own  registers.  It  is
 243 important to keep in mind that the  number of registers per block is limited. On
 244 recent cards,  this number is  limited to 64Kb  per SM.  Access to  registers is
 245 very fast, so when possible it is a good idea to use them.
 246
 247 Likewise each thread can access local  memory which, in practice, is much slower
 248 than registers.  Local memory is automatically used by the compiler when all the
 249 registers are  occupied. So the  best idea is  to optimize the use  of registers
 250 even if this implies to reduce the number of threads per block.
 251
 252 \begin{figure}[hbtp!]
 253 \centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
 254 \caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.}
 255 \label{ch1:fig:memory_hierarchy}
 256 \end{figure}
 257
 258
 259
 260 Shared memory allows  cooperation between threads of the  same block.  This kind
 261 of memory is fast because it requires to be manipulated manually and its size is
 262 limited.  It is accessible during the execution of a kernel. So the principle is
 263 to fill the shared  memory at the start of the kernel  with global data that are
 264 used very  frequently, then threads can  access it for  their computation.  They
 265 can obviously change  the content of this shared  memory either with computation
 266 or load of  other data and they can  store its content in the  global memory. So
 267 shared memory can  be seen as a cache memory  manageable manually. This requires
 268 obviously an effort from the programmer.
 269
 270 On  recent cards,  the programmer  may decide  what amount  of cache  memory and
 271 shared memory is attributed to a kernel. The cache memory is a L1 cache which is
 272 directly  managed by  the GPU.  Sometimes,  this cache  provides very  efficient
 273 result and sometimes the use of shared memory is a better solution.
 274
 275
 276
 277
 278 Figure~\ref{ch1:fig:memory_hierarchy}  illustrates  the  memory hierarchy  of  a
 279 GPU. Threads are represented on the top  of the figure. They can access to their
 280 own registers  and their local memory. Threads  of the same block  can access to
 281 the shared memory of this block. The cache memory is not represented here but it
 282 is local  to a thread. Then  each block can access  to the global  memory of the
 283 GPU.
 284
 285  \section{Conclusion}
 286
 287 In this chapter,  a brief presentation of the video card,  which later have been
 288 used to perform computation, has been  given. The architecture of a GPU has been
 289 illustrated focusing on the particularity of GPUs in term of parallelism, memory
 290 latency and  threads. In order to design  an efficient algorithm for  GPU, it is
 291 essential to have all these parameters in mind.
 292
 293 %%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
 294 %%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf
 295 %%http://forum.wttsnxt.com/my_forum/viewtopic.php?f=5&t=9519
 296 %%http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf
 297 %%http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
 298 %%http://people.maths.ox.ac.uk/~gilesm/cuda/
 299
 300
 301 \putbib[Chapters/chapter1/biblio]
 302