diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex

index fc891b5f7311338c3bf99c97290f99958366941e..36820fd978b2221f032eafa6ebb2f14a49d5c61e 100755 (executable)
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -5,7 +5,6 @@
  \label{chapter1}
  
  \section{Introduction}\label{ch1:intro}
-``test"  "test" ``test''
  This chapter introduces the Graphics  Processing Unit (GPU) architecture and all
  the concepts needed to understand how GPUs  work and can be used to speed up the
  execution of some algorithms. First of all this chapter gives a brief history of
@@ -74,7 +73,7 @@ comparison with OpenCL, interested readers may refer to~\cite{ch1:Dongarra}.
  
  \section{Architecture of current GPUs}
  
-The architecture  \index{architecture of  a GPU} of  current GPUs  is constantly
+The architecture  \index{GPU!architecture of a} of  current GPUs  is constantly
  evolving.  Nevertheless  some trends remain constant  throughout this evolution.
  Processing units composing a GPU are  far simpler than a traditional CPU and
  it is much easier to integrate many computing units inside a GPU card than to do
@@ -139,12 +138,12 @@ Figure~\ref{ch1:fig:latency_throughput}  illustrates   the  main  difference  of
  memory latency between a CPU and a  GPU. In a CPU, tasks ``ti'' are executed one
  by one with a short memory latency to get the data to process. After some tasks,
  there is  a context switch  that allows the  CPU to run  concurrent applications
-and/or multi-threaded  applications. {\bf REPHRASE} Memory latencies  are longer in a  GPU, the
+and/or multi-threaded  applications.  Memory latencies  are longer in a  GPU. The
   principle  to   obtain  a  high  throughput  is  to   have  many  tasks  to
  compute. Later we  will see that these tasks are called  threads with CUDA. With
  this  principle, as soon  as a  task is  finished the  next one  is ready  to be
-executed  while the  wait for  data for  the previous  task is  overlapped by
-computation of other tasks. {\bf HERE}
+executed  while the  wait for  data for  the previous  task is  overlapped by the
+computation of other tasks. 
  
  
  
@@ -215,14 +214,14 @@ by the  threads of a GPU.   When the problem considered  is a two-dimensional or
  practice, the number of  thread blocks and the size of thread  blocks are given as
  parameters  to  each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an
  example of a kernel composed of 8 thread blocks. Then this kernel is executed on
-a small device containing only 2 SMs. {\bf RELIRE} So in  this case, blocks are executed 2
+a small device containing only 2 SMs.  So in  this case, blocks are executed 2
  by 2 in any order.  If the kernel is executed on a larger CUDA device containing
  4 SMs, blocks are executed 4 by 4 simultaneously.  The execution times should be
  approximately twice faster in the latter  case. Of course, that depends on other
  parameters that will be described later (in this chapter and other chapters).
  
-{\bf RELIRE}
-Thread blocks provide a way to cooperation  in the sense that threads of the same
+
+Thread blocks provide a way to cooperate  in the sense that threads of the same
  block   cooperatively    load   and   store   blocks   of    memory   they   all
  use. Synchronizations of threads in the same block are possible (but not between
  threads of different  blocks). Threads of the same block  can also share results
@@ -232,11 +231,11 @@ will explain that.
  
  \section{Memory hierarchy}
  
-The memory hierarchy of  GPUs\index{memory~hierarchy} is different from that of CPUs.  In practice,  there are registers\index{memory~hierarchy!registers}, local
-memory\index{memory~hierarchy!local~memory},                               shared
-memory\index{memory~hierarchy!shared~memory},                               cache
-memory\index{memory~hierarchy!cache~memory},              and              global
-memory\index{memory~hierarchy!global~memory}.
+The memory hierarchy of  GPUs\index{memory hierarchy} is different from that of CPUs.  In practice,  there are registers\index{memory hierarchy!registers}, local
+memory\index{memory hierarchy!local memory},                               shared
+memory\index{memory hierarchy!shared memory},                               cache
+memory\index{memory hierarchy!cache memory},              and              global
+memory\index{memory hierarchy!global memory}.
  
  
  As  previously  mentioned each  thread  can access  its  own  registers.  It  is