last version

[book_gpu.git] / BookGPU / Chapters / chapter1 / ch1.tex
diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex

index e3cbd81fd7cecfc0e2044fe4f54492739abc7ce9..bfa57c8ffae1175f8e66ee1e9633be12ae37b650 100755 (executable)
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -5,19 +5,17 @@
  \label{chapter1}
  
  \section{Introduction}\label{ch1:intro}
-``test"  "test" ``test''
  This chapter introduces the Graphics  Processing Unit (GPU) architecture and all
  the concepts needed to understand how GPUs  work and can be used to speed up the
  execution of some algorithms. First of all this chapter gives a brief history of
-the development  of the graphics  cards up to the point when they started being  used in order  to make
-general   purpose   computation.    Then   the   architecture  of   a   GPU   is
-illustrated.  There  are  many  fundamental  differences between  a  GPU  and  a
-tradition  processor. In  order  to benefit  from the  power  of a  GPU, a  CUDA
+the development  of the graphics cards up  to the point when  they started being
+used in order to perform general purpose computations.  Then the architecture of
+a GPU is illustrated.  There are  many fundamental differences between a GPU and
+a traditional  processor. In order to  benefit from the  power of a GPU,  a CUDA
  programmer needs to use threads. They have some particularities which enable the
  CUDA model to be efficient and scalable when some constraints are addressed.
  
-
-
+\clearpage
  \section{Brief history of the video card}
  
  Video  cards or graphics  cards have  been introduced  in personal  computers to
@@ -26,9 +24,9 @@ produce  high quality graphics  faster than  classical Central  Processing Units
  repetitive and very specific.  Hence,  some manufacturers have produced more and
  more sophisticated video cards, providing 2D accelerations, then 3D accelerations,
  then some  light transforms. Video cards  own their own memory  to perform their
-computation.  For at least two decades, every personal computer has had a video
+computations.  For at least two decades, every personal computer has had a video
  card which is simple for  desktop computers or which provides many accelerations
-for game and/or  graphic-oriented computers.  In the  latter case, graphic cards
+for game and/or  graphic-oriented computers.  In the  latter case, graphics cards
  may be more expensive than a CPU.
  
  Since  2000, video  cards have  allowed  users to  apply arithmetic  operations
@@ -42,7 +40,7 @@ handle the stream data with pipelines.
  
  Some researchers  tried to  apply those operations  on other  data, representing
  something different  from pixels,  and consequently this  resulted in  the first
-uses of video cards for  performing general purpose computation. The programming
+uses of video cards for  performing general purpose computations. The programming
  model  was not  easy  to use  at  all and  was very  dependent  on the  hardware
  constraints.   More precisely  it consisted  in using  either DirectX  of OpenGL
  functions  providing  an  interface  to  some classical  operations  for  videos
@@ -54,20 +52,20 @@ wrong, programmers had no way (and no tools) to detect it.
  
  In order  to benefit from the computing  power of more recent  video cards, CUDA
  was first proposed in 2007 by  NVIDIA. It unifies the programming model for some
-of  their most efficient  video cards.   CUDA~\cite{ch1:cuda} has  quickly been
+of  their most  efficient video  cards.  CUDA~\cite{ch1:cuda}  has  quickly been
  considered by  the scientific community as  a great advance  for general purpose
  graphics processing unit (GPGPU)  computing.  Of course other programming models
  have been  proposed. The  other well-known alternative  is OpenCL which  aims at
-proposing an alternative to CUDA  and which is multiplatform and portable. This
+proposing an alternative  to CUDA and which is  multiplatform and portable. This
  is a  great advantage since  it is even  possible to execute OpenCL  programs on
-traditional CPUs.  The main drawback is that it is less tight with the hardware
-and  consequently sometimes  provides  less efficient  programs. Moreover,  CUDA
+traditional CPUs.  The main drawback is  that it is less close to the hardware
+and,  consequently, it sometimes  provides  less efficient  programs. Moreover,  CUDA
  benefits from  more mature compilation and optimization  procedures.  Other less
-known environments  have been proposed, but  most of them have  been discontinued, for
-example  we can  cite, FireStream  by ATI  which is  not maintained  anymore and
-has been replaced by  OpenCL, BrookGPU by  Standford University~\cite{ch1:Buck:2004:BGS}.
-Another environment based  on pragma (insertion of pragma  directives inside the
-code to  help the compiler to generate  efficient code) is called  OpenACC.  For a
+known environments have been proposed,  but most of them have been discontinued,
+such as FireStream by ATI, which is  not maintained anymore and has been replaced by
+OpenCL and  BrookGPU  by  Stanford  University~\cite{ch1:Buck:2004:BGS}.   Another
+environment based on  pragma (insertion of pragma directives  inside the code to
+help  the  compiler  to generate  efficient  code)  is  called OpenACC.   For  a
  comparison with OpenCL, interested readers may refer to~\cite{ch1:Dongarra}.
  
  
@@ -93,7 +91,7 @@ only  the data  change. It  is important  to keep  in mind  that multiprocessors
  inside a GPU have 32 cores. Later we will see that these 32 cores need to do the
  same work to get maximum performance.
  
-\begin{figure}[b!]
+\begin{figure}[t!]
  \centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}}
  \caption{Comparison of number of cores in a CPU and in a GPU.}
  %[Comparison of number of cores in a CPU and in a GPU]
@@ -102,10 +100,10 @@ same work to get maximum performance.
  
  On the most powerful  GPU cards, called Fermi, multiprocessors  are called streaming
  multiprocessors  (SMs). Each  SM contains  32  cores and  is able  to perform  32
-floating points or integer operations per clock on  32 bit numbers  or 16 floating
-points per clock  on  64 bit numbers. SMs  have  their  own registers,  execution
+floating points or integer operations per clock on  32-bit numbers  or 16 floating
+points per clock  on  64-bit numbers. SMs  have  their  own registers,  execution
  pipelines and caches.  On Fermi architecture,  there are 64Kb shared memory plus L1
-cache  and 32,536 32 bit  registers per  SM. More  precisely the  programmer can
+cache  and 32,536 32-bit  registers per  SM. More  precisely the  programmer can
  decide what amounts  of shared memory and  L1 cache SM are to be used.  The constraint is
  that the sum of both amounts should be less than or equal to 64Kb.
  
@@ -123,30 +121,32 @@ through  the  use  of  cache  memories. Moreover,  nowadays  CPUs  carry out  ma
  performance optimizations  such as speculative execution  which roughly speaking
  consists of executing  a small part of the code in advance even if  later this work
  reveals itself  to be  useless. GPUs  do not have  low latency
-memory.   In comparison GPUs  have small  cache memories.  Nevertheless the
+memory.   In comparison GPUs  have small  cache memories; nevertheless the
  architecture of GPUs is optimized  for throughput computation and it takes into
  account the memory latency.
  
  
  
-\begin{figure}[b!]
-\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
-\caption{Comparison of low latency of a CPU and high throughput of a GPU.}
-\label{ch1:fig:latency_throughput}
-\end{figure}
+
  
  Figure~\ref{ch1:fig:latency_throughput}  illustrates   the  main  difference  of
  memory latency between a CPU and a  GPU. In a CPU, tasks ``ti'' are executed one
  by one with a short memory latency to get the data to process. After some tasks,
  there is  a context switch  that allows the  CPU to run  concurrent applications
-and/or multi-threaded  applications.  Memory latencies  are longer in a  GPU. Thhe
+and/or multi-threaded  applications.  Memory latencies  are longer in a  GPU. The
   principle  to   obtain  a  high  throughput  is  to   have  many  tasks  to
  compute. Later we  will see that these tasks are called  threads with CUDA. With
  this  principle, as soon  as a  task is  finished the  next one  is ready  to be
  executed  while the  wait for  data for  the previous  task is  overlapped by the
  computation of other tasks. 
  
+\clearpage
  
+\begin{figure}[t!]
+\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
+\caption{Comparison of low latency of a CPU and high throughput of a GPU.}
+\label{ch1:fig:latency_throughput}
+\end{figure}
  
  \section{Kinds of parallelism}
  
@@ -171,7 +171,7 @@ GPUs.
  Task parallelism is the common parallelism  achieved  on clusters and grids and
  high performance  architectures where different tasks are  executed by different
  computing units.
-
+\clearpage
  \section{CUDA multithreading}
  
  The data parallelism  of CUDA is more precisely based  on the Single Instruction
@@ -188,11 +188,7 @@ threads,  called  warps. Each  SM  alternatively  executes
  active warps  and warps becoming temporarily  inactive due to waiting of data
  (as shown in Figure~\ref{ch1:fig:latency_throughput}).
  
-\begin{figure}[b!]
-\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
-\caption{Scalability of GPU.}
-\label{ch1:fig:scalability}
-\end{figure}
+
  
  The key to scalability in the CUDA model is the use of a huge number of threads.
  In practice, threads are  gathered not only in warps but also in thread blocks. A
@@ -211,17 +207,24 @@ independence between thread blocks provides the scalability of CUDA codes.
  
  
  A kernel is a function which  contains a block of instructions that are executed
-by the  threads of a GPU.   When the problem considered  is a two-dimensional or three-dimensional  problem,  it is  possible  to group  thread  blocks  into a grid.   In
-practice, the number of  thread blocks and the size of thread  blocks are given as
-parameters  to  each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an
+by the  threads of a GPU.  When  the problem considered is  a two-dimensional or
+three-dimensional problem,  it is possible to  group thread blocks  into a grid.
+In practice, the number of thread blocks and the size of thread blocks are given
+as parameters  to each kernel.   Figure~\ref{ch1:fig:scalability} illustrates an
  example of a kernel composed of 8 thread blocks. Then this kernel is executed on
-a small device containing only 2 SMs.  So in  this case, blocks are executed 2
-by 2 in any order.  If the kernel is executed on a larger CUDA device containing
-4 SMs, blocks are executed 4 by 4 simultaneously.  The execution times should be
-approximately twice faster in the latter  case. Of course, that depends on other
+a small device containing only 2 SMs.  So in this case, blocks are executed 2 by
+2 in any order.  If the kernel  is executed on a larger CUDA device containing 4
+SMs, blocks are  executed 4 by 4 simultaneously.  The  execution times should be
+approximately twice as fast in the latter case. Of course, that depends on other
  parameters that will be described later (in this chapter and other chapters).
  
  
+\begin{figure}[t!]
+\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
+\caption{Scalability of GPU.}
+\label{ch1:fig:scalability}
+\end{figure}
+
  Thread blocks provide a way to cooperate  in the sense that threads of the same
  block   cooperatively    load   and   store   blocks   of    memory   they   all
  use. Synchronizations of threads in the same block are possible (but not between
@@ -246,10 +249,10 @@ very fast, so it is a good idea to use them whenever possible.
  
  Likewise each thread can access local  memory which, in practice, is much slower
  than registers.  Local memory is automatically used by the compiler when all the
-registers are  occupied. So the  best idea is  to optimize the use  of registers
+registers are  occupied, so the  best idea is  to optimize the use  of registers
  even if this involves reducing the number of threads per block.
  
-\begin{figure}[hbtp!]
+\begin{figure}[b!]
  \centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
  \caption{Memory hierarchy of a GPU.}
  \label{ch1:fig:memory_hierarchy}
@@ -264,8 +267,8 @@ to fill the shared  memory at the start of the kernel  with global data that are
  used very  frequently, then threads can  access it for  their computation.  Threads
  can obviously change  the content of this shared  memory either with computation
  or by loading  other data and they can  store its content in the  global memory. So
-shared memory can  be seen as a cache memory  manageable manually. This
-obviously  requires an effort from the programmer.
+shared memory can  be seen as a cache memory, which is manually managed. This
+obviously  requires effort from the programmer.
  
  On  recent cards,  the programmer  may decide  what amount  of cache  memory and
  shared memory is attributed to a kernel. The cache memory is an L1 cache which is
@@ -281,7 +284,7 @@ own registers  and their local memory. Threads  of the same block  can access
  the shared memory of that block. The cache memory is not represented here but it
  is local  to a thread. Then  each block can access  the global  memory of the
  GPU.
-
+\clearpage
   \section{Conclusion}
  
  In this chapter,  a brief presentation of the video card,  which has later been