have been proposed. The other well-known alternative is OpenCL, which aims at
providing an alternative to CUDA and which is multiplatform and portable. This
is a great advantage since it is even possible to execute OpenCL programs on
traditional CPUs. The main drawback is that it is less close to the hardware
and consequently it sometimes provides less efficient programs. Moreover, CUDA
benefits from more mature compilation and optimization procedures. Other less
known environments have been proposed, but most of them have been discontinued,
such as FireStream by ATI, which is not maintained anymore and has been replaced by
OpenCL.

Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of
memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
after another, each one waiting for its data with a short latency. In a GPU, many tasks are
executed while the wait for data for the previous task is overlapped by the
computation of other tasks.
\clearpage
\begin{figure}[t!]
\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
\caption{Comparison of low latency of a CPU and high throughput of a GPU.}
\label{ch1:fig:latency_throughput}
\end{figure}
\section{Kinds of parallelism}
Task parallelism is the common parallelism achieved on clusters, grids, and other
high performance architectures, where different tasks are executed by different
computing units.
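Inside a GPU, a limited form of task parallelism can also be obtained with CUDA
streams: kernels submitted to different streams may be executed concurrently. The
following sketch only illustrates this principle; the two kernels and the device
arrays \texttt{d\_x} and \texttt{d\_y} (at least 256 floats each) are hypothetical
and not taken from this chapter.
\begin{lstlisting}[language=C]
#include <cuda_runtime.h>

__global__ void taskA(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void taskB(float *y) { y[threadIdx.x] += 1.0f; }

/* d_x and d_y are device arrays of at least 256 floats (hypothetical). */
void run_two_tasks(float *d_x, float *d_y)
{
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  /* Two independent tasks are submitted to two different streams;
     the hardware is free to execute them concurrently. */
  taskA<<<1, 256, 0, s1>>>(d_x);
  taskB<<<1, 256, 0, s2>>>(d_y);
  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
}
\end{lstlisting}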
\clearpage
\section{CUDA multithreading}
The data parallelism of CUDA is more precisely based on the Single Instruction
Multiple Thread (SIMT) model. Threads of the same block can cooperate through a
fast shared memory: if some data are
used very frequently, then threads can access them for their computation. Threads
can obviously change the content of this shared memory, either with computation
or by loading other data, and they can store its content in the global memory. So
shared memory can be seen as a cache memory that is managed manually. This
obviously requires an effort from the programmer.
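As a minimal sketch of this manual management (the kernel and the three-point
smoothing below are only illustrative, not taken from this chapter), each block
stages a tile of the input array in shared memory so that a value loaded once
from global memory is then read by three different threads:
\begin{lstlisting}[language=C]
#define BLOCK 256  /* the kernel must be launched with BLOCK threads per block */

__global__ void smooth(const float *in, float *out, int n)
{
  __shared__ float tile[BLOCK + 2];               /* one halo cell on each side */
  int g = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
  int l = threadIdx.x + 1;                        /* index inside the tile */

  tile[l] = (g < n) ? in[g] : 0.0f;               /* each thread loads one value */
  if (threadIdx.x == 0)                           /* first thread loads the left halo */
    tile[0] = (g > 0) ? in[g - 1] : 0.0f;
  if (threadIdx.x == blockDim.x - 1)              /* last thread loads the right halo */
    tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
  __syncthreads();                                /* the whole tile is now loaded */

  if (g < n)                                      /* three shared reads, one global write */
    out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
\end{lstlisting}
Each element of the input is read from global memory once per block instead of
three times, which is precisely the cache-like behavior described above.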
On recent cards, the programmer may decide what amount of cache memory and
shared memory is allocated to a kernel. Inside a block, each thread can access its
own registers and the shared memory of that block. The cache memory is not detailed
here, but it is local to a thread. Then each block can access the global memory of the
GPU.
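With the CUDA runtime, this choice can be expressed per kernel with
\texttt{cudaFuncSetCacheConfig()}. The sketch below reuses the hypothetical
\texttt{smooth} kernel of the previous section; \texttt{d\_in} and \texttt{d\_out}
are assumed to be device arrays of \texttt{n} floats.
\begin{lstlisting}[language=C]
/* Hint to the runtime: favor shared memory over L1 cache for this
   kernel (cudaFuncCachePreferL1 would request the opposite split). */
cudaFuncSetCacheConfig(smooth, cudaFuncCachePreferShared);
smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
cudaDeviceSynchronize();
\end{lstlisting}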
\clearpage
\section{Conclusion}
In this chapter, a brief presentation of the video card, which has later been
used to perform computation, has been given.