X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/0d39f3bfb1736ae41805f75a779e0bb01f4f5139..21ada473e001305cb2a05a34a2da6c2fb1ecd126:/BookGPU/Chapters/chapter1/ch1.tex
diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex
index 68605f6..2fef3a4 100755
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -58,8 +58,8 @@ graphics processing unit (GPGPU) computing. Of course other programming models have been proposed. The other well-known alternative is OpenCL which aims at proposing an alternative to CUDA and which is multiplatform and portable. This is a great advantage since it is even possible to execute OpenCL programs on
-traditional CPUs. The main drawback is that it is less tight with the hardware
-and consequently sometimes provides less efficient programs. Moreover, CUDA
+traditional CPUs. The main drawback is that it is less close to the hardware
+and consequently it sometimes provides less efficient programs. Moreover, CUDA
 benefits from more mature compilation and optimization procedures. Other less known environments have been proposed, but most of them have been discontinued, such FireStream by ATI which is not maintained anymore and has been replaced by
@@ -127,11 +127,7 @@ account the memory latency.
-\begin{figure}[t!]
-\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
-\caption{Comparison of low latency of a CPU and high throughput of a GPU.}
-\label{ch1:fig:latency_throughput}
-\end{figure}
+
 Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
@@ -144,7 +140,13 @@ this principle, as soon as a task is finished the next one is ready to be executed while the wait for data for the previous task is overlapped by the computation of other tasks.
+\clearpage
+\begin{figure}[t!]
+\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
+\caption{Comparison of low latency of a CPU and high throughput of a GPU.}
+\label{ch1:fig:latency_throughput}
+\end{figure}
 \section{Kinds of parallelism}
@@ -169,7 +171,7 @@ GPUs. Task parallelism is the common parallelism achieved on clusters and grids and high performance architectures where different tasks are executed by different computing units.
-
+\clearpage
 \section{CUDA multithreading}
 The data parallelism of CUDA is more precisely based on the Single Instruction
@@ -265,7 +267,7 @@ to fill the shared memory at the start of the kernel with global data that are used very frequently, then threads can access it for their computation. Threads can obviously change the content of this shared memory either with computation or by loading other data and they can store its content in the global memory. So
-shared memory can be seen as a cache memory manageable manually. This
+shared memory can be seen as a cache memory which is manageable manually. This
 obviously requires an effort from the programmer. On recent cards, the programmer may decide what amount of cache memory and
@@ -282,7 +284,7 @@ own registers and their local memory. Threads of the same block can access the shared memory of that block. The cache memory is not represented here but it is local to a thread. Then each block can access the global memory of the GPU.
-
+\clearpage
 \section{Conclusion}
 In this chapter, a brief presentation of the video card, which has later been