X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/0d39f3bfb1736ae41805f75a779e0bb01f4f5139..b1fd489e34a8d46d286a0d271c38cbfb442f511f:/BookGPU/Chapters/chapter1/ch1.tex?ds=inline diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex index 68605f6..bfa57c8 100755 --- a/BookGPU/Chapters/chapter1/ch1.tex +++ b/BookGPU/Chapters/chapter1/ch1.tex @@ -58,11 +58,11 @@ graphics processing unit (GPGPU) computing. Of course other programming models have been proposed. The other well-known alternative is OpenCL which aims at proposing an alternative to CUDA and which is multiplatform and portable. This is a great advantage since it is even possible to execute OpenCL programs on -traditional CPUs. The main drawback is that it is less tight with the hardware -and consequently sometimes provides less efficient programs. Moreover, CUDA +traditional CPUs. The main drawback is that it is less close to the hardware +and, consequently, it sometimes provides less efficient programs. Moreover, CUDA benefits from more mature compilation and optimization procedures. Other less known environments have been proposed, but most of them have been discontinued, -such FireStream by ATI which is not maintained anymore and has been replaced by +such as FireStream by ATI, which is not maintained anymore and has been replaced by OpenCL and BrookGPU by Stanford University~\cite{ch1:Buck:2004:BGS}. Another environment based on pragma (insertion of pragma directives inside the code to help the compiler to generate efficient code) is called OpenACC. For a @@ -127,11 +127,7 @@ account the memory latency. -\begin{figure}[t!] -\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}} -\caption{Comparison of low latency of a CPU and high throughput of a GPU.} -\label{ch1:fig:latency_throughput} -\end{figure} + Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one @@ -144,7 +140,13 @@ this principle, as soon as a task is finished the next one is ready to be executed while the wait for data for the previous task is overlapped by the computation of other tasks. +\clearpage +\begin{figure}[t!] +\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}} +\caption{Comparison of low latency of a CPU and high throughput of a GPU.} +\label{ch1:fig:latency_throughput} +\end{figure} \section{Kinds of parallelism} @@ -169,7 +171,7 @@ GPUs. Task parallelism is the common parallelism achieved on clusters and grids and high performance architectures where different tasks are executed by different computing units. - +\clearpage \section{CUDA multithreading} The data parallelism of CUDA is more precisely based on the Single Instruction @@ -265,8 +267,8 @@ to fill the shared memory at the start of the kernel with global data that are used very frequently, then threads can access it for their computation. Threads can obviously change the content of this shared memory either with computation or by loading other data and they can store its content in the global memory. So -shared memory can be seen as a cache memory manageable manually. This -obviously requires an effort from the programmer. +shared memory can be seen as a cache memory, which is manually managed. This +obviously requires effort from the programmer. On recent cards, the programmer may decide what amount of cache memory and shared memory is attributed to a kernel. The cache memory is an L1 cache which is @@ -282,7 +284,7 @@ own registers and their local memory. Threads of the same block can access the shared memory of that block. The cache memory is not represented here but it is local to a thread. Then each block can access the global memory of the GPU. - +\clearpage \section{Conclusion} In this chapter, a brief presentation of the video card, which has later been