BookGPU/Chapters/chapter1/ch1.tex

   1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
   2
   3
   4 \chapter{Presentation of the GPU architecture and of the CUDA environment}
   5 \label{chapter1}
   6
   7 \section{Introduction}\label{ch1:intro}
   8
   9 This chapter introduces the Graphics  Processing Unit (GPU) architecture and all
  10 the concepts needed to understand how GPUs  work and can be used to speed up the
  11 execution of some algorithms. First of all this chapter gives a brief history of
  12 the development of Graphics card until they can be used in order to make general
  13 purpose computation.
  14
  15
  16
  17 \section{Brief history of Video Card}
  18
  19 Video  card or  Graphics card  have been  introduced in  personnal  computers to
  20 produce  high quality  graphics faster  than classical  Central  Processing Unit
  21 (CPU) and  to alleviate CPU from this  task. In general, display  tasks are very
  22 repetitive and very specific.  Hence,  some manufacturers have produced more and
  23 more sofisticated video cards, providing 2D accelerations then 3D accelerations,
  24 then some  light transforms. Video cards  own their own memory  to perform their
  25 computation.  From  at least two dedaces,  every personnal computer  has a video
  26 card which a  simple for desktop computers or  which provides many accelerations
  27 for game  and/or graphic oriented computers.  In the latter  case, graphic cards
  28 may be more expensive than the CPU.
  29
  30 After 2000,  video cards allowed to apply  arithmetics operations simulatenously
  31 on a  sequence of  pixels, also  later called stream  processing. In  this case,
  32 information of the  pixels (color, location and other  information) are combined
  33 in   order  to   produce   a  pixel   color   that  can   be   displayed  on   a
  34 screen.  Simultaneous  computations  are  provided by  shaders  which  calculate
  35 rendering effects on graphics hardware  with a high degree of flexibility. These
  36 shaders handles the stream data with pipelines
  37
  38
  39 Some reasearchers  tried to apply  those operations on other  data, representing
  40 something different  from pixels,  and consequently this  resulted in  the first
  41 uses of video cards for  performing general purpose computation. The programming
  42 model  was not  easy  to use  at  all and  was very  dependent  of the  hardware
  43 constraints.  More precisely  it consisted  in  using either  DirectX of  OpenGL
  44 functions  providing  an  interface  to  some classical  operations  for  videos
  45 operations  (memory  transfers,   texture  manipulation,  ...).  Floating  point
  46 operations were  most of  the time unimaginable.   Obviously when  something bad
  47 happened, programmers had no way (and tools) to detect it.
  48
  49 \section{GPGPU}
  50
  51 In order  to benefit from the computing  power of more recent  video cards, CUDA
  52 was first proposed in 2007 by  NVidia. It unifies the programming model for some
  53 of  their most performant  video cards.   Cuda~\cite{ch1:cuda} has  quickly been
  54 considered by  the scientific community as  a great advance  for general purpose
  55 graphics processing  unit (GPGPU) computing.  Of course  other programming model
  56 have been  proposed. The  other well-known alternative  is OpenCL which  aims at
  57 proposing an alternative to Cuda  and which is multi-platform and portable. This
  58 is a  great advantage since  it is even  possible to execute OpenCL  programs on
  59 traditionnal  CPUs.  The  main  drawbacks is  that  it is  less  tight with  the
  60 hardware and consequently provides  sometimes less efficient programs. Moreover,
  61 Cuda benefits  from more mature compilation and  optimization procedures.  Other
  62 less known environment  have been proposed, but most of  them have been stopped,
  63 for example we  can cited: FireStream by ATI which is  not maintened anymore and
  64 replaced  by  OpenCL,  BrookGPU  by  Standford  University~\cite{ch1:Buck:2004:BGS}.
  65 Another environment based  on pragma (insertion of pragma  directives inside the
  66 code to  help the compiler  to generate efficient  code) is call OpenACC.  For a
  67 comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}.
  68
  69
  70
  71 \section{Architecture of current GPUs}
  72
  73 Architecure  of current GPUs  is constantly  evolving. Nevertheless  some trends
  74 remains true  through this evolution. Processing  units composing a  GPU are far
  75 more simpler  than a  traditional CPU but  it is  much easier to  integrate many
  76 computing units inside a  GPU card than many cores inside a  CPU. This is due to
  77 the fact that cores  of a GPU a simpler than cores of a  CPU.  In 2012, the most
  78 powerful  GPUs own  more  than  500 cores  and  the most  powerful  CPUs have  8
  79 cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows  the number of cores inside
  80 a  CPU  and  inside a  GPU.   In  fact,  in  a  current NVidia  GPU,  there  are
  81 multiprocessors which have 32 cores (for example on Fermi cards). The core clock
  82 of CPU is generally around 3GHz and the one of GPU is about 1.5GHz. Although the
  83 core clock  of GPU cores is  slower, the amount  of cores inside a  GPU provides
  84 more computational power. This measure  is commonly represented by the number of
  85 floating point operation  per seconds. Nowadays most powerful  GPUs provide more
  86 than 1TFlops, i.e. $10^{12}$ floating point operations per second.  Nevertheless
  87 GPUs  are  very efficient  to  perform  some operations  but  not  all kinds  of
  88 operations. They are very efficient to execute repetitive work in which only the
  89 data change. It  is important to keep in mind that  multiprocessors inside a GPU
  90 have 32 cores. Later we will see that these 32 cores need to do the same work to
  91 get maximum performance.
  92
  93 \begin{figure}[b!]
  94 \centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
  95 \caption[Comparison of number of cores in a CPU and in a GPU]{Comparison of number of cores in a CPU and in a GPU.}
  96 \label{ch1:fig:comparison_cpu_gpu}
  97 \end{figure}
  98
  99 On most powerful  GPU cards, called Fermi, multiprocessors  are called streaming
 100 multiprocessors  (SM). Each  SM contains  32  cores and  is able  to perform  32
 101 floating point or integer operations on  32bits numbers per clock or 16 floating
 102 point  on  64bits number  per  clock. SM  have  their  own registers,  execution
 103 pipelines and caches.  On Fermi architecture,  there are 64Kb shared memory + L1
 104 cache  and 32,536 32bits  registers per  SM. More  precisely the  programmer can
 105 decide what amount  of shared memory and  L1 cache SM can use.  The constaint is
 106 that the sum of both amounts is less or equal to 64Kb.
 107
 108 Threads are used to  benefit from the important number of cores  of a GPU. Those
 109 threads    are   different    from    traditional   threads    for   CPU.     In
 110 chapter~\ref{chapter2},  some  examples of  GPU  programming  will explicit  the
 111 details of  the GPU  threads. However,  threads are gathered  into blocks  of 32
 112 threads, called ``warp''. Those warps  are important when designing an algorithm
 113 for GPU.
 114
 115
 116 Another big  difference between CPU and GPU  is the latency of  memory.  In CPU,
 117 everything is optimized  to obtain a low latency  architecture. This is possible
 118 through  the  use  of  cache  memories. Moreover,  nowadays  CPUs  perform  many
 119 performance optimizations  such as speculative execution  which roughly speaking
 120 consists in executing  a small part of  code in advance even if  later this work
 121 reveals to  be useless. In  opposite, GPUs do  not have low latency  memory.  In
 122 comparison GPUs have ridiculous cache memories. Nevertheless the architecture of GPUs is optimized for throughtput computation and it takes into account the memory latency.
 123
 124
 125
 126 \begin{figure}[b!]
 127 \centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
 128 \caption[Comparison of low latency of CPU and highthroughput of GPU]{Comparison of low latency of CPU and highthroughput of GPU.}
 129 \label{ch1:fig:latency_throughput}
 130 \end{figure}
 131
 132 Figure~\ref{ch1:fig:latency_throughput}  illustrates   the  main  difference  of
 133 memory latency between a CPU and a  GPU. In a CPU, tasks ``ti'' are executed one
 134 by one with a short memory latency to get the data to process. After some tasks,
 135 there is  a context switch  that allows the  CPU to run  concurrent applications
 136 and/or multi-threaded  applications. Memory latencies  are longer in a  GPU, the
 137 the  principle  to   obtain  a  high  throughput  is  to   have  many  tasks  to
 138 compute. Later we  will see that those tasks are called  threads with CUDA. With
 139 this  principle, as soon  as a  task is  finished the  next one  is ready  to be
 140 executed  while the  waiting for  data for  the previous  task is  overlapped by
 141 computation of other tasks.
 142
 143
 144
 145 \section{Kinds of parallelism}
 146
 147 Many  kinds  of parallelism  are  avaible according  to  the  type of  hardware.
 148 Roughtly  speaking,  there are  three  classes  of parallism:  instruction-level
 149 parallelism,   data  parallelism   and   task  parallelism.
 150
 151 Instruction-level parallelism consists in re-ordering some instructions in order
 152 to executed  some of them in parallel  without changing the result  of the code.
 153 In  modern CPUs, instruction  pipelines allow  processor to  execute instruction
 154 faster.   With   a  pipeline  a  processor  can   execute  multiple  instruction
 155 simultaneously due  to the fact that  the output of a  task is the  input of the
 156 next one.
 157
 158 Data parallelism consists  in executing the same program  with different data on
 159 different computing  units. Of course, no  depency should exist  between the the
 160 data. For example, it is easy  to parallelize loops without dependency using the
 161 data parallelism paradigm. This paradigm  is linked with the Single Instructions
 162 Multiple Data  (SIMD) architecture. This is  the kind of  parallism providing by
 163 GPUs.
 164
 165 Taks parallelism  is the common parallism  achieved out on cluster  and grid and
 166 high performance  architecture where different  tasks are executed  by different
 167 computing units.
 168
 169 \section{CUDA Multithreading}
 170
 171 The data parallelism  of CUDA is more precisely based  on the Single Instruction
 172 Multiple Thread (SIMT) model. This is due to the fact that the programmer access
 173 to  the cores  by the  intermediate of  threads. In  the CUDA  model,  all cores
 174 execute the  same set of  instructions but with  different data. This  model has
 175 similarities with vector programming  model proposed for vector machines through
 176 the  1970s into  the  90s, notably  the  various Cray  platforms.   On the  CUDA
 177 architecture, the  performance is  led by the  use of  a huge number  of threads
 178 (from thousand upto  to millions). The particularity of the  model is that there
 179 is no context switching as in CPUs and each thread has its own registers. In practice, threads are executed by SM and are gathered into groups of 32 threads. Those groups are call ``warps''. Each SM alternatively executes ``active warps'' and warps becoming temporaly inactive due to waiting of data (as shown in Figure~\ref{ch1:fig:latency_throughput}).
 180
 181 The key to scalability in the CUDA model is the use of a huge number of threads.
 182 In practice threads are not only gathered  in warps but also in thread blocks. A
 183 thread block is executed  by only one SM and it cannot  migrate. Typical size of
 184 thread block is a number power of two (for example: 64, 128, 256 or 512).
 185
 186
 187
 188 In this  case, without changing anything inside  a CUDA code, it  is possible to
 189 run your  code with  a small CUDA  device or  most performant Tesla  CUDA cards.
 190 Blocks are  executed in any number depending  on the number of  SM available. So
 191 the  programmer  must  conceive  its  code  having this  issue  in  mind.   This
 192 independence between threads blocks provides the scalability of CUDA codes.
 193
 194 \begin{figure}[b!]
 195 \centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
 196 \caption[Scalability of GPU]{Scalability of GPU.}
 197 \label{ch1:fig:scalability}
 198 \end{figure}
 199
 200
 201 A kernel is a function which contains a block a instruction that are executed by
 202 threads of a GPU.  When the problem considered is a 2 dimensions or 3 dimensions
 203 problem,  it is  possible to  group thread  blocks into  grid. In  practice, the
 204 number of thread  blocks and the size  of thread block is given  in parameter to
 205 each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an example  of  a
 206 kernel composed  of 8  thread blocks. Then  this kernel  is executed on  a small
 207 device containing only 2 SMs. So in in  this case, blocks are executed 2 by 2 in
 208 any order.  If the kernel is executed  on a larger CUDA device containing 4 SMs,
 209 blocks  are executed  4  by 4  simultaneously.   The execution  times should  be
 210 approximately twice faster in the latter  case. Of course, that depends on other
 211 parameters that will be described later.
 212
 213 Thread blocks provide a way to cooperation  in the sens that threads of the same
 214 block   cooperatively    load   and   store   blocks   of    memory   they   all
 215 use. Synchronizations of threads in the same block are possible (but not between
 216 thread of different blocks). Threads of the same block can also share results in
 217 order to compute a single  result. In chapter~\ref{chapter2}, some examples will
 218 explicit that.
 219
 220
 221 \section{Memory hierarchy}
 222
 223 The memory  hierarchy of GPUs  is different from  the one of CPUs.  In practice,
 224 there is registers, local memory, shared memory, cache memroy and global memory.
 225
 226 As  previously  mentioned each  thread  can access  its  own  registers.  It  is
 227 important to keep in mind that the  number of registers per block is limited. On
 228 recent cards,  this number is  limited to 64Kb  per SM.  Access to  registers is
 229 very fast, so when possible it is a good idea to use them.
 230
 231 Likewise each thread can access local  memory which in practice much slower than
 232 registers.  In practice, local memory is automatically used by the compiler when
 233 all the  registers are  occupied. So  the best idea  is to  optimize the  use of
 234 registers even if this implies to reduce the number of threads per block.
 235
 236 Shared memory allows  cooperation between threads of the  same block.  This kind
 237 of memory  is fast by  it requires  to be manipulated  manually and its  size is
 238 limited.  It is accessible during the execution of a kernel. So the principle is
 239 to fill the shared  memory at the start of the kernel  with global data that are
 240 used very  frequently, then threads can  access it for  their computation.  They
 241 can obviously change  the content of this shared  memory either with computation
 242 or load of  other data and they can  store its content in the  global memory. So
 243 shared memory can  be seen as a cache memory  manageable manually. This requires
 244 obviously an effort from the programmer.
 245
 246 On  recent cards,  the programmer  may decide  what amount  of cache  memory and
 247 shared memory is attributed to a kernel. The cache memory is a L1 cache which is
 248 directly  managed by  the GPU.  Sometimes,  this cache  provides very  efficient
 249 result and sometimes the use of shared memory is a better solution.
 250
 251 \begin{figure}[b!]
 252 \centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
 253 \caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.}
 254 \label{ch1:fig:memory_hierarchy}
 255 \end{figure}
 256
 257
 258 Figure~\ref{ch1:fig:memory_hierarchy}  illustrates  the  memory hierarchy  of  a
 259 GPU. Threads are represented on the top  of the figure. They can access to their
 260 own registers  and their local memory. Threads  of the same block  can access to
 261 the shared memory of this block. The cache memory is not represented here but it
 262 is local  to a thread. Then  each block can access  to the global  memory of the
 263 GPU.
 264
 265
 266 %%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
 267 %%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf
 268 %%http://forum.wttsnxt.com/my_forum/viewtopic.php?f=5&t=9519
 269 %%http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf
 270 %%http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
 271 %%http://people.maths.ox.ac.uk/~gilesm/cuda/
 272
 273
 274 \putbib[Chapters/chapter1/biblio]
 275