From: Raphael Couturier
Date: Thu, 4 Oct 2012 14:08:52 +0000 (+0200)
Subject: new
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/f947c46397ce66a0f014e5e11653f0d34b3b7d50?ds=sidebyside

new
---

diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex
index 9a3f4eb..ec639cb 100755
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -10,7 +10,7 @@ This chapter introduces the Graphics Processing Unit (GPU) architecture and all
 the concepts needed to understand how GPUs work and can be used to speed up the execution of some algorithms. First of all this chapter gives a brief history of the development of Graphics card until they can be used in order to make general
-purpose computation.
+purpose computation.
 Then the
@@ -23,9 +23,9 @@ repetitive and very specific. Hence, some manufacturers have produced more and
 more sofisticated video cards, providing 2D accelerations then 3D accelerations, then some light transforms. Video cards own their own memory to perform their computation. From at least two dedaces, every personnal computer has a video
-card which a simple for desktop computers or which provides many accelerations
+card which is simple for desktop computers or which provides many accelerations
 for game and/or graphic oriented computers. In the latter case, graphic cards
-may be more expensive than the CPU.
+may be more expensive than a CPU.
 After 2000, video cards allowed to apply arithmetics operations simulatenously on a sequence of pixels, also later called stream processing. In this case,
@@ -33,7 +33,7 @@ information of the pixels (color, location and other information) are combined
 in order to produce a pixel color that can be displayed on a screen. Simultaneous computations are provided by shaders which calculate rendering effects on graphics hardware with a high degree of flexibility. These
-shaders handles the stream data with pipelines
+shaders handles the stream data with pipelines.
 Some reasearchers tried to apply those operations on other data, representing
@@ -70,12 +70,13 @@ comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}.
 \section{Architecture of current GPUs}
-Architecure of current GPUs is constantly evolving. Nevertheless some trends
-remains true through this evolution. Processing units composing a GPU are far
-more simpler than a traditional CPU but it is much easier to integrate many
-computing units inside a GPU card than many cores inside a CPU. This is due to
-the fact that cores of a GPU a simpler than cores of a CPU. In 2012, the most
-powerful GPUs own more than 500 cores and the most powerful CPUs have 8
+Architecture \index{Architecture of a GPU} of current GPUs is constantly
+evolving. Nevertheless some trends remains true through this
+evolution. Processing units composing a GPU are far more simpler than a
+traditional CPU but it is much easier to integrate many computing units inside a
+GPU card than many cores inside a CPU. This is due to the fact that cores of a
+GPU are simpler than cores of a CPU. In 2012, the most powerful GPUs own more
+than 500 cores and the most powerful CPUs have 8
 cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows the number of cores inside a CPU and inside a GPU. In fact, in a current NVidia GPU, there are multiprocessors which have 32 cores (for example on Fermi cards). The core clock
@@ -220,18 +221,19 @@ explicit that.
 \section{Memory hierarchy}
-The memory hierarchy of GPUs is different from the one of CPUs. In practice,
-there is registers, local memory, shared memory, cache memroy and global memory.
+The memory hierarchy of GPUs\index{Memory hierarchy of a GPU} is different from
+the CPUs one. In practice, there are registers, local memory, shared memory,
+cache memroy and global memory.
 As previously mentioned each thread can access its own registers. It is important to keep in mind that the number of registers per block is limited. On recent cards, this number is limited to 64Kb per SM. Access to registers is very fast, so when possible it is a good idea to use them.
-Likewise each thread can access local memory which in practice much slower than
-registers. In practice, local memory is automatically used by the compiler when
-all the registers are occupied. So the best idea is to optimize the use of
-registers even if this implies to reduce the number of threads per block.
+Likewise each thread can access local memory which, in practice, is much slower
+than registers. Local memory is automatically used by the compiler when all the
+registers are occupied. So the best idea is to optimize the use of registers
+even if this implies to reduce the number of threads per block.
 Shared memory allows cooperation between threads of the same block. This kind of memory is fast by it requires to be manipulated manually and its size is
diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex
index 0640708..bff6d44 100755
--- a/BookGPU/Chapters/chapter2/ch2.tex
+++ b/BookGPU/Chapters/chapter2/ch2.tex
@@ -52,7 +52,7 @@ which indicates that this kernel can be call from the C code. The first
 instruction in this kernel is used to computed the \texttt{tid} which representes the thread index. This thread index is computed according to the values of the block index (it is a variable of CUDA
-called \texttt{blockIdx\index{CUDA~keywords!blockIdx}}). Blocks of threads can
+called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}). Blocks of threads can
 be decomposed into 1 dimension, 2 dimensions or 3 dimensions. According to the dimension of data manipulated, the appropriate dimension can be useful. In our example, only one dimension is used. Then using notation \texttt{.x} we can