From: couturie Date: Wed, 28 Aug 2013 18:33:33 +0000 (+0200) Subject: modif X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/0d39f3bfb1736ae41805f75a779e0bb01f4f5139?ds=sidebyside modif --- diff --git a/BookGPU/BookGPU.tex b/BookGPU/BookGPU.tex index 2990040..16c1f8f 100755 --- a/BookGPU/BookGPU.tex +++ b/BookGPU/BookGPU.tex @@ -99,8 +99,8 @@ frame=single, % keywordstyle=[1]\textbf, %identifierstyle=\textbf, - commentstyle=\color{white}\textbf, - stringstyle=\color{white}\ttfamily, + commentstyle=\color{darkgray}\textbf, + stringstyle=\color{darkgray}\ttfamily, % xleftmargin=17pt, % framexleftmargin=17pt, % framexrightmargin=5pt, diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex index 36820fd..68605f6 100755 --- a/BookGPU/Chapters/chapter1/ch1.tex +++ b/BookGPU/Chapters/chapter1/ch1.tex @@ -8,15 +8,14 @@ This chapter introduces the Graphics Processing Unit (GPU) architecture and all the concepts needed to understand how GPUs work and can be used to speed up the execution of some algorithms. First of all this chapter gives a brief history of -the development of the graphics cards up to the point when they started being used in order to make -general purpose computation. Then the architecture of a GPU is -illustrated. There are many fundamental differences between a GPU and a -tradition processor. In order to benefit from the power of a GPU, a CUDA +the development of the graphics cards up to the point when they started being +used in order to perform general purpose computations. Then the architecture of +a GPU is illustrated. There are many fundamental differences between a GPU and +a traditional processor. In order to benefit from the power of a GPU, a CUDA programmer needs to use threads. They have some particularities which enable the CUDA model to be efficient and scalable when some constraints are addressed. 
- - +\clearpage \section{Brief history of the video card} Video cards or graphics cards have been introduced in personal computers to @@ -25,9 +24,9 @@ produce high quality graphics faster than classical Central Processing Units repetitive and very specific. Hence, some manufacturers have produced more and more sophisticated video cards, providing 2D accelerations, then 3D accelerations, then some light transforms. Video cards own their own memory to perform their -computation. For at least two decades, every personal computer has had a video +computations. For at least two decades, every personal computer has had a video card which is simple for desktop computers or which provides many accelerations -for game and/or graphic-oriented computers. In the latter case, graphic cards +for game and/or graphic-oriented computers. In the latter case, graphics cards may be more expensive than a CPU. Since 2000, video cards have allowed users to apply arithmetic operations @@ -41,7 +40,7 @@ handle the stream data with pipelines. Some researchers tried to apply those operations on other data, representing something different from pixels, and consequently this resulted in the first -uses of video cards for performing general purpose computation. The programming +uses of video cards for performing general purpose computations. The programming model was not easy to use at all and was very dependent on the hardware constraints. More precisely it consisted in using either DirectX of OpenGL functions providing an interface to some classical operations for videos @@ -53,20 +52,20 @@ wrong, programmers had no way (and no tools) to detect it. In order to benefit from the computing power of more recent video cards, CUDA was first proposed in 2007 by NVIDIA. It unifies the programming model for some -of their most efficient video cards. CUDA~\cite{ch1:cuda} has quickly been +of their most efficient video cards. 
CUDA~\cite{ch1:cuda} has quickly been considered by the scientific community as a great advance for general purpose graphics processing unit (GPGPU) computing. Of course other programming models have been proposed. The other well-known alternative is OpenCL which aims at -proposing an alternative to CUDA and which is multiplatform and portable. This +proposing an alternative to CUDA and which is multiplatform and portable. This is a great advantage since it is even possible to execute OpenCL programs on -traditional CPUs. The main drawback is that it is less tight with the hardware +traditional CPUs. The main drawback is that it is less tightly coupled to the hardware and consequently sometimes provides less efficient programs. Moreover, CUDA benefits from more mature compilation and optimization procedures. Other less -known environments have been proposed, but most of them have been discontinued, for -example we can cite, FireStream by ATI which is not maintained anymore and -has been replaced by OpenCL, BrookGPU by Standford University~\cite{ch1:Buck:2004:BGS}. -Another environment based on pragma (insertion of pragma directives inside the -code to help the compiler to generate efficient code) is called OpenACC. For a +known environments have been proposed, but most of them have been discontinued, +such as FireStream by ATI, which is not maintained anymore and has been replaced +by OpenCL, and BrookGPU by Stanford University~\cite{ch1:Buck:2004:BGS}. Another +environment based on pragmas (insertion of pragma directives inside the code to +help the compiler to generate efficient code) is called OpenACC. For a comparison with OpenCL, interested readers may refer to~\cite{ch1:Dongarra}. @@ -92,7 +91,7 @@ only the data change. It is important to keep in mind that multiprocessors inside a GPU have 32 cores. Later we will see that these 32 cores need to do the same work to get maximum performance. -\begin{figure}[b!] +\begin{figure}[t!] 
\centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}} \caption{Comparison of number of cores in a CPU and in a GPU.} %[Comparison of number of cores in a CPU and in a GPU] @@ -101,10 +100,10 @@ same work to get maximum performance. On the most powerful GPU cards, called Fermi, multiprocessors are called streaming multiprocessors (SMs). Each SM contains 32 cores and is able to perform 32 -floating points or integer operations per clock on 32 bit numbers or 16 floating -points per clock on 64 bit numbers. SMs have their own registers, execution +floating-point or integer operations per clock on 32-bit numbers or 16 +floating-point operations per clock on 64-bit numbers. SMs have their own registers, execution pipelines and caches. On Fermi architecture, there are 64Kb shared memory plus L1 -cache and 32,536 32 bit registers per SM. More precisely the programmer can +cache and 32,768 32-bit registers per SM. More precisely the programmer can decide what amounts of shared memory and L1 cache SM are to be used. The constraint is that the sum of both amounts should be less than or equal to 64Kb. @@ -122,13 +121,13 @@ through the use of cache memories. Moreover, nowadays CPUs carry out ma performance optimizations such as speculative execution which roughly speaking consists of executing a small part of the code in advance even if later this work reveals itself to be useless. GPUs do not have low latency -memory. In comparison GPUs have small cache memories. Nevertheless the +memory. In comparison GPUs have small cache memories; nevertheless the architecture of GPUs is optimized for throughput computation and it takes into account the memory latency. -\begin{figure}[b!] +\begin{figure}[t!] 
\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}} \caption{Comparison of low latency of a CPU and high throughput of a GPU.} \label{ch1:fig:latency_throughput} @@ -138,7 +137,7 @@ Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one by one with a short memory latency to get the data to process. After some tasks, there is a context switch that allows the CPU to run concurrent applications -and/or multi-threaded applications. Memory latencies are longer in a GPU. Thhe +and/or multi-threaded applications. Memory latencies are longer in a GPU. The principle to obtain a high throughput is to have many tasks to compute. Later we will see that these tasks are called threads with CUDA. With this principle, as soon as a task is finished the next one is ready to be @@ -187,11 +186,7 @@ threads, called warps. Each SM alternatively executes active warps and warps becoming temporarily inactive due to waiting of data (as shown in Figure~\ref{ch1:fig:latency_throughput}). -\begin{figure}[b!] -\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}} -\caption{Scalability of GPU.} -\label{ch1:fig:scalability} -\end{figure} + The key to scalability in the CUDA model is the use of a huge number of threads. In practice, threads are gathered not only in warps but also in thread blocks. A @@ -210,17 +205,24 @@ independence between thread blocks provides the scalability of CUDA codes. A kernel is a function which contains a block of instructions that are executed -by the threads of a GPU. When the problem considered is a two-dimensional or three-dimensional problem, it is possible to group thread blocks into a grid. In -practice, the number of thread blocks and the size of thread blocks are given as -parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an +by the threads of a GPU. 
When the problem considered is a two-dimensional or +three-dimensional problem, it is possible to group thread blocks into a grid. +In practice, the number of thread blocks and the size of thread blocks are given +as parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an example of a kernel composed of 8 thread blocks. Then this kernel is executed on -a small device containing only 2 SMs. So in this case, blocks are executed 2 -by 2 in any order. If the kernel is executed on a larger CUDA device containing -4 SMs, blocks are executed 4 by 4 simultaneously. The execution times should be -approximately twice faster in the latter case. Of course, that depends on other +a small device containing only 2 SMs. So in this case, blocks are executed 2 by +2 in any order. If the kernel is executed on a larger CUDA device containing 4 +SMs, blocks are executed 4 by 4 simultaneously. The execution times should be +approximately twice as fast in the latter case. Of course, that depends on other parameters that will be described later (in this chapter and other chapters). +\begin{figure}[t!] +\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}} +\caption{Scalability of GPU.} +\label{ch1:fig:scalability} +\end{figure} + Thread blocks provide a way to cooperate in the sense that threads of the same block cooperatively load and store blocks of memory they all use. Synchronizations of threads in the same block are possible (but not between @@ -245,10 +247,10 @@ very fast, so it is a good idea to use them whenever possible. Likewise each thread can access local memory which, in practice, is much slower than registers. Local memory is automatically used by the compiler when all the -registers are occupied. So the best idea is to optimize the use of registers +registers are occupied, so the best idea is to optimize the use of registers even if this involves reducing the number of threads per block. -\begin{figure}[hbtp!] 
+\begin{figure}[b!] \centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}} \caption{Memory hierarchy of a GPU.} \label{ch1:fig:memory_hierarchy} diff --git a/BookGPU/Chapters/chapter2/biblio.bib b/BookGPU/Chapters/chapter2/biblio.bib index 0f3ad7e..70a2c17 100644 --- a/BookGPU/Chapters/chapter2/biblio.bib +++ b/BookGPU/Chapters/chapter2/biblio.bib @@ -3,7 +3,7 @@ title = "{CUDA} by example: An Introduction To General-Purpose {GPU} Programming", publisher = "Ad{\-d}i{\-s}on-Wes{\-l}ey", - address = "pub-AW:adr", + address = "Upper Saddle River, NJ", pages = "xix + 290", year = "2010", LCCN = "QA76.76.A65", diff --git a/BookGPU/Chapters/chapter2/ch2.tex b/BookGPU/Chapters/chapter2/ch2.tex index 75be84b..7fc8471 100755 --- a/BookGPU/Chapters/chapter2/ch2.tex +++ b/BookGPU/Chapters/chapter2/ch2.tex @@ -23,16 +23,17 @@ are executed on a GPU. This code is in Listing~\ref{ch2:lst:ex1}. As GPUs have their own memory, the first step consists of allocating memory on -the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc} -allocates memory on the GPU. The second parameter represents the size of the -allocated variables, this size is expressed in bits. +the GPU. A call to \texttt{cudaMalloc}\index{CUDA functions!cudaMalloc} +allocates memory on the GPU. {\bf REREAD The first parameter of this function is the address of +a pointer that receives the location of the allocated device (GPU) memory.} The second parameter represents the +size of the allocated variables; this size is expressed in bytes. \pagebreak \lstinputlisting[label=ch2:lst:ex1,caption=simple example]{Chapters/chapter2/ex1.cu} In this example, we want to compare the execution time of the additions of two arrays in CPU and GPU. So for both these operations, a timer is created to -measure the time. CUDA proposes to manipulate timers quite easily. The first +measure the time. CUDA makes it quite easy to manipulate timers. 
The first step is to create the timer\index{CUDA functions!timer}, then to start it, and at the end to stop it. For each of these operations a dedicated function is used. @@ -64,7 +65,7 @@ CUDA). Blocks of threads and thread indexes can be decomposed into 1 dimension, the dimension of blocks of threads must be chosen carefully. In our example, only one dimension is used. Then using the notation \texttt{.x}, we can access the first dimension (\texttt{.y} and \texttt{.z}, respectively allow access to the second and -third dimension). The variable \texttt{blockDim}\index{CUDA keywords!blockDim} +third dimensions). The variable \texttt{blockDim}\index{CUDA keywords!blockDim} gives the size of each block. @@ -74,12 +75,12 @@ gives the size of each block. \section{Second example: using CUBLAS \index{CUBLAS}} \label{ch2:2ex} -The Basic Linear Algebra Subprograms (BLAS) allows programmers to use efficient +The Basic Linear Algebra Subprograms (BLAS) allow programmers to use efficient routines for basic linear operations. Those routines are heavily used in many scientific applications and are optimized for vector operations, matrix-vector operations, and matrix-matrix operations~\cite{ch2:journals/ijhpca/Dongarra02}. Some of those operations seem -to be easy to implement with CUDA. Nevertheless, as soon as a reduction is +to be easy to implement with CUDA; however, as soon as a reduction is needed, implementing an efficient reduction routine with CUDA is far from being simple. Roughly speaking, a reduction operation\index{reduction operation} is an operation which combines all the elements of an array and extracts a number @@ -144,7 +145,7 @@ three loops. We assume that $A$, $B$ represent two square matrices and the result of the multiplication of $A \times B$ is $C$. The element \texttt{C[i*size+j]} is computed as follows: \begin{equation} -C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j]; +C[size*i+j]=\sum_{k=0}^{size-1} A[size*i+k]*B[size*k+j]. 
\end{equation} In Listing~\ref{ch2:lst:ex3}, the CPU computation is performed using 3 loops, diff --git a/BookGPU/Chapters/chapter2/ex1.cu b/BookGPU/Chapters/chapter2/ex1.cu index 8f2b404..64c08dd 100644 --- a/BookGPU/Chapters/chapter2/ex1.cu +++ b/BookGPU/Chapters/chapter2/ex1.cu @@ -42,7 +42,7 @@ int main( int argc, char** argv) unsigned int timer_cpu = 0; cutilCheckError(cutCreateTimer(&timer_cpu)); - cutilCheckError(cutStartTimer(timer_cpu)); + cutilCheckError(cutStartTimer(timer_cpu)); for(i=0;i