diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex

index 9c3d8af900e764ebde20e2419476ffc5783aa4b2..e3cbd81fd7cecfc0e2044fe4f54492739abc7ce9 100755
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -139,12 +139,12 @@ Figure~\ref{ch1:fig:latency_throughput}  illustrates   the  main  difference  of
  memory latency between a CPU and a  GPU. In a CPU, tasks ``ti'' are executed one
  by one with a short memory latency to get the data to process. After some tasks,
  there is  a context switch  that allows the  CPU to run  concurrent applications
-and/or multi-threaded  applications. {\bf REPHRASE} Memory latencies  are longer in a  GPU, the
+and/or multi-threaded  applications.  Memory latencies  are longer in a  GPU. The
   principle  to   obtain  a  high  throughput  is  to   have  many  tasks  to
  compute. Later we  will see that these tasks are called  threads with CUDA. With
  this  principle, as soon  as a  task is  finished the  next one  is ready  to be
-executed  while the  wait for  data for  the previous  task is  overlapped by
-computation of other tasks. {\bf HERE}
+executed  while the  wait for  data for  the previous  task is  overlapped by the
+computation of other tasks. 
  
  
  
  
  
  
@@ -215,14 +215,14 @@ by the  threads of a GPU.   When the problem considered  is a two-dimensional or
  practice, the number of  thread blocks and the size of thread  blocks are given as
  parameters  to  each  kernel.   Figure~\ref{ch1:fig:scalability}  illustrates  an
  example of a kernel composed of 8 thread blocks. Then this kernel is executed on
-a small device containing only 2 SMs. {\bf RELIRE} So in  this case, blocks are executed 2
+a small device containing only 2 SMs.  So in  this case, blocks are executed 2
  by 2 in any order.  If the kernel is executed on a larger CUDA device containing
  4 SMs, blocks are executed 4 by 4 simultaneously.  The execution should be
  approximately twice as fast in the latter  case. Of course, that depends on other
  parameters that will be described later (in this chapter and other chapters).
  
-{\bf RELIRE}
-Thread blocks provide a way to cooperation  in the sense that threads of the same
+
+Thread blocks provide a way to cooperate  in the sense that threads of the same
  block   cooperatively    load   and   store   blocks   of    memory   they   all
  use. Synchronizations of threads in the same block are possible (but not between
  threads of different  blocks). Threads of the same block  can also share results