suite

[book_gpu.git] / BookGPU / Chapters / chapter1 / ch1.tex
diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex

index 88c9361596ccac7d0783bc8958e268afd9550834..cf7a8b59e4b4119a7b729931f45a8ad3a4f5913e 100755 (executable)
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -10,7 +10,11 @@ This chapter introduces the Graphics  Processing Unit (GPU) architecture and all
  the concepts needed to understand how GPUs  work and can be used to speed up the
  execution of some algorithms. First of all this chapter gives a brief history of
  the development of Graphics card until they can be used in order to make general
-purpose computation. Then the 
+purpose computation.  Then  the architecture of a GPU  is illustrated. There are
+many fundamental differences  between a GPU and a  tradition processor. In order
+to benefit from the power of a GPU, a CUDA programmer needs to use threads. They
+have  some particularities  which enables  the CUDA  model to  be  efficient and
+scalable when some constraints are addressed.
  
  
  
@@ -70,7 +74,7 @@ comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}.
  
  \section{Architecture of current GPUs}
  
-Architecture  \index{Architecture  of  a  GPU}  of current  GPUs  is  constantly
+Architecture  \index{architecture  of  a  GPU}  of current  GPUs  is  constantly
  evolving.    Nevertheless    some    trends    remains   true    through    this
  evolution.  Processing  units  composing a  GPU  are  far  more simpler  than  a
  traditional CPU but it is much easier to integrate many computing units inside a
@@ -148,8 +152,8 @@ computation of other tasks.
  \section{Kinds of parallelism}
  
  Many  kinds  of parallelism  are  avaible according  to  the  type of  hardware.
-Roughtly  speaking,  there are  three  classes  of parallelism:  instruction-level
-parallelism,   data  parallelism   and   task  parallelism.   
+Roughtly  speaking, there  are three  classes of  parallelism: instruction-level
+parallelism, data parallelism and task parallelism.
  
  Instruction-level parallelism consists in re-ordering some instructions in order
  to execute  some of them in parallel  without changing the result  of the code.
@@ -165,24 +169,28 @@ data parallelism paradigm. This paradigm  is linked with the Single Instructions
  Multiple Data  (SIMD) architecture. This is  the kind of  parallism providing by
  GPUs.
  
-Taks parallelism  is the common parallism  achieved out on cluster  and grid and
-high performance  architecture where different  tasks are executed  by different
+Taks parallelism is the common parallism  achieved out on clusters and grids and
+high performance  architectures where different tasks are  executed by different
  computing units.
  
  \section{CUDA Multithreading}
  
  The data parallelism  of CUDA is more precisely based  on the Single Instruction
-Multiple Thread (SIMT) model. This is due to the fact that the programmer access
+Multiple Thread (SIMT) model. This is due to the fact that a programmer accesses
  to  the cores  by the  intermediate of  threads. In  the CUDA  model,  all cores
  execute the  same set of  instructions but with  different data. This  model has
  similarities with vector programming  model proposed for vector machines through
  the  1970s into  the  90s, notably  the  various Cray  platforms.   On the  CUDA
  architecture, the  performance is  led by the  use of  a huge number  of threads
  (from thousand upto  to millions). The particularity of the  model is that there
-is no context switching as in CPUs and each thread has its own registers. In practice, threads are executed by SM and are gathered into groups of 32 threads. Those groups are call ``warps''. Each SM alternatively executes ``active warps'' and warps becoming temporaly inactive due to waiting of data (as shown in Figure~\ref{ch1:fig:latency_throughput}).
+is no  context switching as in  CPUs and each  thread has its own  registers. In
+practice,  threads  are executed  by  SM  and are  gathered  into  groups of  32
+threads.  Those  groups  are  call  ``warps''. Each  SM  alternatively  executes
+``active warps''  and warps becoming temporaly  inactive due to  waiting of data
+(as shown in Figure~\ref{ch1:fig:latency_throughput}).
  
  The key to scalability in the CUDA model is the use of a huge number of threads.
-In practice threads are not only gathered  in warps but also in thread blocks. A
+In practice, threads are not only gathered  in warps but also in thread blocks. A
  thread block is executed  by only one SM and it cannot  migrate. Typical size of
  thread block is a number power of two (for example: 64, 128, 256 or 512).
  
@@ -190,7 +198,7 @@ thread block is a number power of two (for example: 64, 128, 256 or 512).
  
  In this  case, without changing anything inside  a CUDA code, it  is possible to
  run your  code with  a small CUDA  device or  most performant Tesla  CUDA cards.
-Blocks are  executed in any number depending  on the number of  SM available. So
+Blocks are  executed in any order depending  on the number of  SMs available. So
  the  programmer  must  conceive  its  code  having this  issue  in  mind.   This
  independence between threads blocks provides the scalability of CUDA codes.
  
@@ -201,31 +209,35 @@ independence between threads blocks provides the scalability of CUDA codes.
  \end{figure}
  
  
-A kernel is a function which contains a block a instruction that are executed by
-threads of a GPU.  When the problem considered is a 2 dimensions or 3 dimensions
-problem,  it is  possible to  group thread  blocks into  grid. In  practice, the
-number of thread  blocks and the size  of thread block is given  in parameter to
-each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an example  of  a
-kernel composed  of 8  thread blocks. Then  this kernel  is executed on  a small
-device containing only 2 SMs. So in in  this case, blocks are executed 2 by 2 in
-any order.  If the kernel is executed  on a larger CUDA device containing 4 SMs,
-blocks  are executed  4  by 4  simultaneously.   The execution  times should  be
+A kernel is a function which  contains a block of instructions that are executed
+by the  threads of a GPU.   When the problem considered  is a 2  dimensions or 3
+dimensions  problem,  it is  possible  to group  thread  blocks  into grid.   In
+practice, the number of  thread blocks and the size of thread  block is given in
+parameter  to  each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an
+example of a kernel composed of 8 thread blocks. Then this kernel is executed on
+a small device containing only 2 SMs.  So in in this case, blocks are executed 2
+by 2 in any order.  If the kernel is executed on a larger CUDA device containing
+4 SMs, blocks are executed 4 by 4 simultaneously.  The execution times should be
  approximately twice faster in the latter  case. Of course, that depends on other
  parameters that will be described later.
  
  Thread blocks provide a way to cooperation  in the sens that threads of the same
  block   cooperatively    load   and   store   blocks   of    memory   they   all
  use. Synchronizations of threads in the same block are possible (but not between
-thread of different blocks). Threads of the same block can also share results in
-order to compute a single  result. In chapter~\ref{chapter2}, some examples will
-explicit that.
+threads of different  blocks). Threads of the same block  can also share results
+in order  to compute a  single result. In chapter~\ref{chapter2},  some examples
+will explicit that.
  
  
  \section{Memory hierarchy}
  
-The memory hierarchy of GPUs\index{Memory  hierarchy of a GPU} is different from
-the CPUs  one.  In  practice, there are  registers, local memory,  shared memory,
-cache memroy and global memory.
+The memory hierarchy of  GPUs\index{memory~hierarchy} is different from the CPUs
+one.  In practice,  there are registers\index{memory~hierarchy!registers}, local
+memory\index{memory~hierarchy!local~memory},                               shared
+memory\index{memory~hierarchy!shared~memory},                               cache
+memory\index{memory~hierarchy!cache~memory}              and              global
+memory\index{memory~hierarchy!global~memory}.
+
  
  As  previously  mentioned each  thread  can access  its  own  registers.  It  is
  important to keep in mind that the  number of registers per block is limited. On
@@ -237,8 +249,16 @@ than registers.  Local memory is automatically used by the compiler when all the
  registers are  occupied. So the  best idea is  to optimize the use  of registers
  even if this implies to reduce the number of threads per block.
  
+\begin{figure}[hbtp!]
+\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
+\caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.}
+\label{ch1:fig:memory_hierarchy}
+\end{figure}
+
+
+
  Shared memory allows  cooperation between threads of the  same block.  This kind
-of memory  is fast by  it requires  to be manipulated  manually and its  size is
+of memory is fast because it requires to be manipulated manually and its size is
  limited.  It is accessible during the execution of a kernel. So the principle is
  to fill the shared  memory at the start of the kernel  with global data that are
  used very  frequently, then threads can  access it for  their computation.  They
@@ -252,11 +272,7 @@ shared memory is attributed to a kernel. The cache memory is a L1 cache which is
  directly  managed by  the GPU.  Sometimes,  this cache  provides very  efficient
  result and sometimes the use of shared memory is a better solution.
  
-\begin{figure}[b!]
-\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
-\caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.}
-\label{ch1:fig:memory_hierarchy}
-\end{figure}
+
  
  
  Figure~\ref{ch1:fig:memory_hierarchy}  illustrates  the  memory hierarchy  of  a
@@ -266,6 +282,13 @@ the shared memory of this block. The cache memory is not represented here but it
  is local  to a thread. Then  each block can access  to the global  memory of the
  GPU.
  
+ \section{Conclusion}
+
+In this chapter,  a brief presentation of the video card,  which later have been
+used to perform computation, has been  given. The architecture of a GPU has been
+illustrated focusing on the particularity of GPUs in term of parallelism, memory
+latency and  threads. In order to design  an efficient algorithm for  GPU, it is
+essential to have all these parameters in mind.
  
  %%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
  %%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf