BookGPU/Chapters/chapter1/ch1.tex

   1 \chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
   2
   3
   4 \chapter{Presentation of the GPU architecture and of the Cuda environment}
   5 \label{chapter1}
   6
   7 \section{Introduction}\label{ch1:intro}
   8
   9 This chapter introduces the Graphics  Processing Unit (GPU) architecture and all
  10 the concepts needed to understand how GPUs  work and can be used to speed up the
  11 execution of some algorithms. First of all this chapter gives a brief history of
  12 the development  of Graphics  card until they  have been  used in order  to make
  13 general   purpose   computation.    Then   the   architecture  of   a   GPU   is
  14 illustrated.  There  are  many  fundamental  differences between  a  GPU  and  a
  15 tradition  processor. In  order  to benefit  from the  power  of a  GPU, a  Cuda
  16 programmer needs to use threads. They have some particularities which enable the
  17 Cuda model to be efficient and scalable when some constraints are addressed.
  18
  19
  20
  21 \section{Brief history of Video Card}
  22
  23 Video  cards or Graphics  cards have  been introduced  in personal  computers to
  24 produce  high quality graphics  faster than  classical Central  Processing Units
  25 (CPU) and  to alleviate CPU from this  task. In general, display  tasks are very
  26 repetitive and very specific.  Hence,  some manufacturers have produced more and
  27 more sophisticated video cards, providing 2D accelerations then 3D accelerations,
  28 then some  light transforms. Video cards  own their own memory  to perform their
  29 computation.  For at least two decades, every personal computer has had a video
  30 card which is simple for  desktop computers or which provides many accelerations
  31 for game and/or  graphic oriented computers.  In the  latter case, graphic cards
  32 may be more expensive than a CPU.
  33
  34 Since  2000, video  cards have  allowed  users to  apply arithmetic  operations
  35 simultaneously on a sequence of  pixels, also later called stream processing. In
  36 this case, the information of the pixels (color, location and other information) are
  37 combined in order  to produce a pixel  color that can be displayed  on a screen.
  38 Simultaneous  computations are  provided  by shaders  which calculate  rendering
  39 effects on  graphics hardware with a  high degree of  flexibility. These shaders
  40 handles the stream data with pipelines.
  41
  42
  43 Some researchers  tried to  apply those operations  on other  data, representing
  44 something different  from pixels,  and consequently this  resulted in  the first
  45 uses of video cards for  performing general purpose computation. The programming
  46 model  was not  easy  to use  at  all and  was very  dependent  of the  hardware
  47 constraints.   More precisely  it consisted  in using  either DirectX  of OpenGL
  48 functions  providing  an  interface  to  some classical  operations  for  videos
  49 operations  (memory  transfers,  texture  manipulation,  ...).   Floating  point
  50 operations were  most of the  time unimaginable.  Obviously when  something went
  51 wrong, programmers had no way (and neither the tools) to detect it.
  52
  53 \section{GPGPU}
  54
  55 In order  to benefit from the computing  power of more recent  video cards, Cuda
  56 was first proposed in 2007 by  NVidia. It unifies the programming model for some
  57 of  their most performant  video cards.   Cuda~\cite{ch1:cuda} has  quickly been
  58 considered by  the scientific community as  a great advance  for general purpose
  59 graphics processing unit (GPGPU)  computing.  Of course other programming models
  60 have been  proposed. The  other well-known alternative  is OpenCL which  aims at
  61 proposing an alternative to Cuda  and which is multi-platform and portable. This
  62 is a  great advantage since  it is even  possible to execute OpenCL  programs on
  63 traditional CPUs.  The main drawback is that it is less tight with the hardware
  64 and  consequently sometimes  provides  less efficient  programs. Moreover,  Cuda
  65 benefits from  more mature compilation and optimization  procedures.  Other less
  66 known environments  have been proposed, but  most of them have  been stopped, for
  67 example  we can  cite: FireStream  by ATI  which is  not maintained  anymore and
  68 replaced by  OpenCL, BrookGPU by  Standford University~\cite{ch1:Buck:2004:BGS}.
  69 Another environment based  on pragma (insertion of pragma  directives inside the
  70 code to  help the compiler to generate  efficient code) is call  OpenACC.  For a
  71 comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}.
  72
  73
  74
  75 \section{Architecture of current GPUs}
  76
  77 The architecture  \index{architecture of  a GPU} of  current GPUs  is constantly
  78 evolving.  Nevertheless  some trends remain constant  throughout this evolution.
  79 Processing units composing a GPU are  far more simple than a traditional CPU but
  80 it is much easier to integrate many computing units inside a GPU card than to do
  81 so with many cores inside a CPU. This is due to the fact that the cores of a GPU are
  82 simpler than the cores of a CPU.  In  2012, the most powerful GPUs own more than 500
  83 cores       and       the       most       powerful      CPUs       have       8
  84 cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows  the number of cores inside
  85 a  CPU  and  inside a  GPU.   In  fact,  in  a  current NVidia  GPU,  there  are
  86 multiprocessors which have 32 cores (for example on Fermi cards). The core clock
  87 of CPU is  generally around 3GHz and  the one of GPU is  about 1.5GHz.  Although
  88 the core clock of GPU cores is slower, the amount of cores inside a GPU provides
  89 more computational power.  This measure is commonly represented by the number of
  90 floating point operation  per seconds. Nowadays the most powerful  GPUs provide more
  91 than   1TFlops,  i.e.    $10^{12}$   floating  point   operations  per   second.
  92 Nevertheless  GPUs are very  efficient to  perform some  operations but  not all
  93 kinds of operations. They are very efficient to execute repetitive work in which
  94 only  the data  change. It  is important  to keep  in mind  that multiprocessors
  95 inside a GPU have 32 cores. Later we will see that these 32 cores need to do the
  96 same work to get maximum performance.
  97
  98 \begin{figure}[b!]
  99 \centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
 100 \caption{Comparison of number of cores in a CPU and in a GPU.}
 101 %[Comparison of number of cores in a CPU and in a GPU]
 102 \label{ch1:fig:comparison_cpu_gpu}
 103 \end{figure}
 104
 105 On the most powerful  GPU cards, called Fermi, multiprocessors  are called streaming
 106 multiprocessors  (SM). Each  SM contains  32  cores and  is able  to perform  32
 107 floating points or integer operations on  32 bits numbers per clock or 16 floating
 108 points  on  64 bits number  per  clock. SMs  have  their  own registers,  execution
 109 pipelines and caches.  On Fermi architecture,  there are 64Kb shared memory + L1
 110 cache  and 32,536 32bits  registers per  SM. More  precisely the  programmer can
 111 decide what amount  of shared memory and  L1 cache SM can use.  The constraint is
 112 that the sum of both amounts should be less or equal to 64Kb.
 113
 114 Threads are used to  benefit from the important number of cores  of a GPU. Those
 115 threads    are   different    from    traditional   threads    for   CPU.     In
 116 chapter~\ref{chapter2},  some  examples of  GPU  programming  will explicit  the
 117 details of  the GPU  threads. However,  threads are gathered  into blocks  of 32
 118 threads, called ``warps''. Those warps  are important when designing an algorithm
 119 for GPU.
 120
 121
 122 Another big  difference between CPU and GPU  is the latency of  memory.  In CPU,
 123 everything is optimized  to obtain a low latency  architecture. This is possible
 124 through  the  use  of  cache  memories. Moreover,  nowadays  CPUs  perform  many
 125 performance optimizations  such as speculative execution  which roughly speaking
 126 consists in executing  a small part of  code in advance even if  later this work
 127 reveals itself  to be  useless. On the  contrary, GPUs  do not have  low latency
 128 memory.   In comparison GPUs  have small  cache memories.  Nevertheless the
 129 architecture of GPUs is optimized  for throughput computation and it takes into
 130 account the memory latency.
 131
 132
 133
 134 \begin{figure}[b!]
 135 \centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
 136 \caption{Comparison of low latency of CPU and high throughput of GPU.}
 137 \label{ch1:fig:latency_throughput}
 138 \end{figure}
 139
 140 Figure~\ref{ch1:fig:latency_throughput}  illustrates   the  main  difference  of
 141 memory latency between a CPU and a  GPU. In a CPU, tasks ``ti'' are executed one
 142 by one with a short memory latency to get the data to process. After some tasks,
 143 there is  a context switch  that allows the  CPU to run  concurrent applications
 144 and/or multi-threaded  applications. Memory latencies  are longer in a  GPU, the
 145 the  principle  to   obtain  a  high  throughput  is  to   have  many  tasks  to
 146 compute. Later we  will see that those tasks are called  threads with Cuda. With
 147 this  principle, as soon  as a  task is  finished the  next one  is ready  to be
 148 executed  while the  wait for  data for  the previous  task is  overlapped by
 149 computation of other tasks.
 150
 151
 152
 153 \section{Kinds of parallelism}
 154
 155 Many  kinds  of parallelism  are  amiable according  to  the  type of  hardware.
 156 Roughly  speaking, there  are three  classes of  parallelism: instruction-level
 157 parallelism, data parallelism and task parallelism.
 158
 159 Instruction-level parallelism consists in re-ordering some instructions in order
 160 to execute  some of them in parallel  without changing the result  of the code.
 161 In  modern CPUs, instruction  pipelines allow  processor to  execute instructions
 162 faster.   With   a  pipeline  a  processor  can   execute  multiple  instructions
 163 simultaneously due  to the fact that  the output of a  task is the  input of the
 164 next one.
 165
 166 Data parallelism consists  in executing the same program  with different data on
 167 different computing  units.  Of course,  no dependency should exist  between the
 168 data. For example, it is easy  to parallelize loops without dependency using the
 169 data parallelism paradigm. This paradigm  is linked with the Single Instructions
 170 Multiple Data (SIMD)  architecture. This is the kind  of parallelism provided by
 171 GPUs.
 172
 173 Task parallelism is the common parallelism  achieved out on clusters and grids and
 174 high performance  architectures where different tasks are  executed by different
 175 computing units.
 176
 177 \section{Cuda Multithreading}
 178
 179 The data parallelism  of Cuda is more precisely based  on the Single Instruction
 180 Multiple Thread (SIMT) model. This is due to the fact that a programmer accesses
 181 to  the cores  by the  intermediate of  threads. In  the Cuda  model,  all cores
 182 execute the  same set of  instructions but with  different data. This  model has
 183 similarities with the vector programming  model proposed for vector machines through
 184 the  1970s into  the  90s, notably  the  various Cray  platforms.   On the  Cuda
 185 architecture, the  performance is  led by the  use of  a huge number  of threads
 186 (from thousands up to  to millions). The particularity of the  model is that there
 187 is no  context switching as in  CPUs and each  thread has its own  registers. In
 188 practice,  threads  are executed  by  SM  and are  gathered  into  groups of  32
 189 threads.  Those  groups  are  called  ``warps''. Each  SM  alternatively  executes
 190 ``active warps''  and warps becoming temporarily  inactive due to  waiting of data
 191 (as shown in Figure~\ref{ch1:fig:latency_throughput}).
 192
 193 The key to scalability in the Cuda model is the use of a huge number of threads.
 194 In practice, threads are not only gathered  in warps but also in thread blocks. A
 195 thread block is executed  by only one SM and it cannot  migrate. The typical size of
 196 a thread block is a number power of two (for example: 64, 128, 256 or 512).
 197
 198
 199
 200 In this  case, without changing anything inside  a Cuda code, it  is possible to
 201 run your  code with  a small Cuda  device or  the most performing Tesla  Cuda cards.
 202 Blocks are  executed in any order depending  on the number of  SMs available. So
 203 the  programmer  must  conceive  its  code  having this  issue  in  mind.   This
 204 independence between thread blocks provides the scalability of Cuda codes.
 205
 206 \begin{figure}[b!]
 207 \centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
 208 \caption{Scalability of GPU.}
 209 \label{ch1:fig:scalability}
 210 \end{figure}
 211
 212
 213 A kernel is a function which  contains a block of instructions that are executed
 214 by the  threads of a GPU.   When the problem considered  is a two  dimensional or three
 215 dimensional  problem,  it is  possible  to group  thread  blocks  into a grid.   In
 216 practice, the number of  thread blocks and the size of thread  blocks is given as
 217 parameters  to  each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an
 218 example of a kernel composed of 8 thread blocks. Then this kernel is executed on
 219 a small device containing only 2 SMs.  So in  this case, blocks are executed 2
 220 by 2 in any order.  If the kernel is executed on a larger Cuda device containing
 221 4 SMs, blocks are executed 4 by 4 simultaneously.  The execution times should be
 222 approximately twice faster in the latter  case. Of course, that depends on other
 223 parameters that will be described later.
 224
 225 Thread blocks provide a way to cooperation  in the sense that threads of the same
 226 block   cooperatively    load   and   store   blocks   of    memory   they   all
 227 use. Synchronizations of threads in the same block are possible (but not between
 228 threads of different  blocks). Threads of the same block  can also share results
 229 in order  to compute a  single result. In chapter~\ref{chapter2},  some examples
 230 will explicit that.
 231
 232
 233 \section{Memory hierarchy}
 234
 235 The memory hierarchy of  GPUs\index{memory~hierarchy} is different from the CPUs
 236 one.  In practice,  there are registers\index{memory~hierarchy!registers}, local
 237 memory\index{memory~hierarchy!local~memory},                               shared
 238 memory\index{memory~hierarchy!shared~memory},                               cache
 239 memory\index{memory~hierarchy!cache~memory}              and              global
 240 memory\index{memory~hierarchy!global~memory}.
 241
 242
 243 As  previously  mentioned each  thread  can access  its  own  registers.  It  is
 244 important to keep in mind that the  number of registers per block is limited. On
 245 recent cards,  this number is  limited to 64Kb  per SM.  Access to  registers is
 246 very fast, so it is a good idea to use them whenever possible.
 247
 248 Likewise each thread can access local  memory which, in practice, is much slower
 249 than registers.  Local memory is automatically used by the compiler when all the
 250 registers are  occupied. So the  best idea is  to optimize the use  of registers
 251 even if this implies to reduce the number of threads per block.
 252
 253 \begin{figure}[hbtp!]
 254 \centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
 255 \caption{Memory hierarchy of a GPU.}
 256 \label{ch1:fig:memory_hierarchy}
 257 \end{figure}
 258
 259
 260
 261 Shared memory allows  cooperation between threads of the  same block.  This kind
 262 of memory is fast because it requires to be manipulated manually and its size is
 263 limited.  It is accessible during the execution of a kernel. So the principle is
 264 to fill the shared  memory at the start of the kernel  with global data that are
 265 used very  frequently, then threads can  access it for  their computation.  They
 266 can obviously change  the content of this shared  memory either with computation
 267 or load of  other data and they can  store its content in the  global memory. So
 268 shared memory can  be seen as a cache memory  manageable manually. This requires
 269 obviously an effort from the programmer.
 270
 271 On  recent cards,  the programmer  may decide  what amount  of cache  memory and
 272 shared memory is attributed to a kernel. The cache memory is a L1 cache which is
 273 directly  managed by  the GPU.  Sometimes,  this cache  provides very  efficient
 274 result and sometimes the use of shared memory is a better solution.
 275
 276
 277
 278
 279 Figure~\ref{ch1:fig:memory_hierarchy}  illustrates  the  memory hierarchy  of  a
 280 GPU. Threads are represented on the top  of the figure. They can access to their
 281 own registers  and their local memory. Threads  of the same block  can access to
 282 the shared memory of this block. The cache memory is not represented here but it
 283 is local  to a thread. Then  each block can access  to the global  memory of the
 284 GPU.
 285
 286  \section{Conclusion}
 287
 288 In this chapter,  a brief presentation of the video card,  which has later been
 289 used to perform computation, has been  given. The architecture of a GPU has been
 290 illustrated focusing on the particularity of GPUs in term of parallelism, memory
 291 latency and  threads. In order to design  an efficient algorithm for  GPU, it is
 292 essential to have all these parameters in mind.
 293
 294
 295 \putbib[Chapters/chapter1/biblio]
 296