X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/4df8b859fb5445134295e3e7a1df43d911d6d9dd..9378973df8e8a9aac4a7c212a7efb7d831bfae94:/BookGPU/Chapters/chapter1/ch1.tex?ds=sidebyside diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex index a70f420..36820fd 100755 --- a/BookGPU/Chapters/chapter1/ch1.tex +++ b/BookGPU/Chapters/chapter1/ch1.tex @@ -5,7 +5,6 @@ \label{chapter1} \section{Introduction}\label{ch1:intro} - This chapter introduces the Graphics Processing Unit (GPU) architecture and all the concepts needed to understand how GPUs work and can be used to speed up the execution of some algorithms. First of all this chapter gives a brief history of @@ -68,13 +67,13 @@ example we can cite, FireStream by ATI which is not maintained anymore and has been replaced by OpenCL, BrookGPU by Standford University~\cite{ch1:Buck:2004:BGS}. Another environment based on pragma (insertion of pragma directives inside the code to help the compiler to generate efficient code) is called OpenACC. For a -comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}. +comparison with OpenCL, interested readers may refer to~\cite{ch1:Dongarra}. \section{Architecture of current GPUs} -The architecture \index{architecture of a GPU} of current GPUs is constantly +The architecture \index{GPU!architecture of a} of current GPUs is constantly evolving. Nevertheless some trends remain constant throughout this evolution. Processing units composing a GPU are far simpler than a traditional CPU and it is much easier to integrate many computing units inside a GPU card than to do @@ -113,7 +112,7 @@ Threads are used to benefit from the large number of cores of a GPU. These threads are different from traditional threads for a CPU. In Chapter~\ref{chapter2}, some examples of GPU programming will explain the details of the GPU threads. Threads are gathered into blocks of 32 -threads, called warps. 
These warps are important when designing an algorithm +threads, called ``warps''. These warps are important when designing an algorithm for GPU. @@ -139,12 +138,12 @@ Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one by one with a short memory latency to get the data to process. After some tasks, there is a context switch that allows the CPU to run concurrent applications -and/or multi-threaded applications. {\bf REPHRASE} Memory latencies are longer in a GPU, the +and/or multi-threaded applications. Memory latencies are longer in a GPU. The principle to obtain a high throughput is to have many tasks to compute. Later we will see that these tasks are called threads with CUDA. With this principle, as soon as a task is finished the next one is ready to be -executed while the wait for data for the previous task is overlapped by -computation of other tasks. {\bf HERE} +executed while the wait for data for the previous task is overlapped by the +computation of other tasks. @@ -215,14 +214,14 @@ by the threads of a GPU. When the problem considered is a two-dimensional or practice, the number of thread blocks and the size of thread blocks are given as parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an example of a kernel composed of 8 thread blocks. Then this kernel is executed on -a small device containing only 2 SMs. {\bf RELIRE} So in this case, blocks are executed 2 +a small device containing only 2 SMs. So in this case, blocks are executed 2 by 2 in any order. If the kernel is executed on a larger CUDA device containing 4 SMs, blocks are executed 4 by 4 simultaneously. The execution times should be approximately twice faster in the latter case. Of course, that depends on other parameters that will be described later (in this chapter and other chapters). 
-{\bf RELIRE} -Thread blocks provide a way to cooperation in the sense that threads of the same + +Thread blocks provide a way to cooperate in the sense that threads of the same block cooperatively load and store blocks of memory they all use. Synchronizations of threads in the same block are possible (but not between threads of different blocks). Threads of the same block can also share results @@ -232,11 +231,11 @@ will explain that. \section{Memory hierarchy} -The memory hierarchy of GPUs\index{memory~hierarchy} is different from that of CPUs. In practice, there are registers\index{memory~hierarchy!registers}, local -memory\index{memory~hierarchy!local~memory}, shared -memory\index{memory~hierarchy!shared~memory}, cache -memory\index{memory~hierarchy!cache~memory}, and global -memory\index{memory~hierarchy!global~memory}. +The memory hierarchy of GPUs\index{memory hierarchy} is different from that of CPUs. In practice, there are registers\index{memory hierarchy!registers}, local +memory\index{memory hierarchy!local memory}, shared +memory\index{memory hierarchy!shared memory}, cache +memory\index{memory hierarchy!cache memory}, and global +memory\index{memory hierarchy!global memory}. As previously mentioned each thread can access its own registers. It is