X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/f947c46397ce66a0f014e5e11653f0d34b3b7d50..b1fd489e34a8d46d286a0d271c38cbfb442f511f:/BookGPU/Chapters/chapter1/ch1.tex?ds=inline diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex index ec639cb..bfa57c8 100755 --- a/BookGPU/Chapters/chapter1/ch1.tex +++ b/BookGPU/Chapters/chapter1/ch1.tex @@ -1,276 +1,297 @@ -\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte} +\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France} \chapter{Presentation of the GPU architecture and of the CUDA environment} \label{chapter1} \section{Introduction}\label{ch1:intro} - This chapter introduces the Graphics Processing Unit (GPU) architecture and all the concepts needed to understand how GPUs work and can be used to speed up the execution of some algorithms. First of all this chapter gives a brief history of -the development of Graphics card until they can be used in order to make general -purpose computation. Then the - - - -\section{Brief history of Video Card} - -Video card or Graphics card have been introduced in personnal computers to -produce high quality graphics faster than classical Central Processing Unit -(CPU) and to alleviate CPU from this task. In general, display tasks are very +the development of the graphics cards up to the point when they started being +used in order to perform general purpose computations. Then the architecture of +a GPU is illustrated. There are many fundamental differences between a GPU and +a traditional processor. In order to benefit from the power of a GPU, a CUDA +programmer needs to use threads. They have some particularities which enable the +CUDA model to be efficient and scalable when some constraints are addressed. + +\clearpage +\section{Brief history of the video card} + +Video cards or graphics cards have been introduced in personal computers to +produce high quality graphics faster than classical Central Processing Units +(CPU) and to free the CPU from this task. In general, display tasks are very repetitive and very specific. Hence, some manufacturers have produced more and -more sofisticated video cards, providing 2D accelerations then 3D accelerations, +more sophisticated video cards, providing 2D accelerations, then 3D accelerations, then some light transforms. Video cards own their own memory to perform their -computation. From at least two dedaces, every personnal computer has a video -card which is simple for desktop computers or which provides many accelerations -for game and/or graphic oriented computers. In the latter case, graphic cards +computations. For at least two decades, every personal computer has had a video +card which is simple for desktop computers or which provides many accelerations +for game and/or graphic-oriented computers. In the latter case, graphics cards may be more expensive than a CPU. -After 2000, video cards allowed to apply arithmetics operations simulatenously -on a sequence of pixels, also later called stream processing. In this case, -information of the pixels (color, location and other information) are combined -in order to produce a pixel color that can be displayed on a -screen. Simultaneous computations are provided by shaders which calculate -rendering effects on graphics hardware with a high degree of flexibility. These -shaders handles the stream data with pipelines. +Since 2000, video cards have allowed users to apply arithmetic operations +simultaneously on a sequence of pixels, later called stream processing. In +this case, the information of the pixels (color, location and other information) is +combined in order to produce a pixel color that can be displayed on a screen. +Simultaneous computations are provided by shaders which calculate rendering +effects on graphics hardware with a high degree of flexibility. These shaders +handle the stream data with pipelines. -Some reasearchers tried to apply those operations on other data, representing +Some researchers tried to apply those operations on other data, representing something different from pixels, and consequently this resulted in the first -uses of video cards for performing general purpose computation. The programming -model was not easy to use at all and was very dependent of the hardware -constraints. More precisely it consisted in using either DirectX of OpenGL +uses of video cards for performing general purpose computations. The programming +model was not easy to use at all and was very dependent on the hardware +constraints. More precisely it consisted in using either DirectX of OpenGL functions providing an interface to some classical operations for videos -operations (memory transfers, texture manipulation, ...). Floating point -operations were most of the time unimaginable. Obviously when something bad -happened, programmers had no way (and tools) to detect it. +operations (memory transfers, texture manipulation, etc.). Floating point +operations were most of the time unimaginable. Obviously when something went +wrong, programmers had no way (and no tools) to detect it. \section{GPGPU} In order to benefit from the computing power of more recent video cards, CUDA -was first proposed in 2007 by NVidia. It unifies the programming model for some -of their most performant video cards. Cuda~\cite{ch1:cuda} has quickly been +was first proposed in 2007 by NVIDIA. It unifies the programming model for some +of their most efficient video cards. CUDA~\cite{ch1:cuda} has quickly been considered by the scientific community as a great advance for general purpose -graphics processing unit (GPGPU) computing. Of course other programming model +graphics processing unit (GPGPU) computing. Of course other programming models have been proposed. The other well-known alternative is OpenCL which aims at -proposing an alternative to Cuda and which is multi-platform and portable. This +proposing an alternative to CUDA and which is multiplatform and portable. This is a great advantage since it is even possible to execute OpenCL programs on -traditionnal CPUs. The main drawbacks is that it is less tight with the -hardware and consequently provides sometimes less efficient programs. Moreover, -Cuda benefits from more mature compilation and optimization procedures. Other -less known environment have been proposed, but most of them have been stopped, -for example we can cited: FireStream by ATI which is not maintened anymore and -replaced by OpenCL, BrookGPU by Standford University~\cite{ch1:Buck:2004:BGS}. -Another environment based on pragma (insertion of pragma directives inside the -code to help the compiler to generate efficient code) is call OpenACC. For a -comparison with OpenCL, interested readers may refer to~\cite{ch1:CMR:12}. +traditional CPUs. The main drawback is that it is less close to the hardware +and, consequently, it sometimes provides less efficient programs. Moreover, CUDA +benefits from more mature compilation and optimization procedures. Other less +known environments have been proposed, but most of them have been discontinued, +such as FireStream by ATI, which is not maintained anymore and has been replaced by +OpenCL and BrookGPU by Stanford University~\cite{ch1:Buck:2004:BGS}. Another +environment based on pragma (insertion of pragma directives inside the code to +help the compiler to generate efficient code) is called OpenACC. For a +comparison with OpenCL, interested readers may refer to~\cite{ch1:Dongarra}. \section{Architecture of current GPUs} -Architecture \index{Architecture of a GPU} of current GPUs is constantly -evolving. Nevertheless some trends remains true through this -evolution. Processing units composing a GPU are far more simpler than a -traditional CPU but it is much easier to integrate many computing units inside a -GPU card than many cores inside a CPU. This is due to the fact that cores of a -GPU are simpler than cores of a CPU. In 2012, the most powerful GPUs own more -than 500 cores and the most powerful CPUs have 8 +The architecture \index{GPU!architecture of a} of current GPUs is constantly +evolving. Nevertheless some trends remain constant throughout this evolution. +Processing units composing a GPU are far simpler than a traditional CPU and +it is much easier to integrate many computing units inside a GPU card than to do +so with many cores inside a CPU. In 2012, the most powerful GPUs contained more than 500 +cores and the most powerful CPUs had 8 cores. Figure~\ref{ch1:fig:comparison_cpu_gpu} shows the number of cores inside -a CPU and inside a GPU. In fact, in a current NVidia GPU, there are -multiprocessors which have 32 cores (for example on Fermi cards). The core clock -of CPU is generally around 3GHz and the one of GPU is about 1.5GHz. Although the -core clock of GPU cores is slower, the amount of cores inside a GPU provides -more computational power. This measure is commonly represented by the number of -floating point operation per seconds. Nowadays most powerful GPUs provide more -than 1TFlops, i.e. $10^{12}$ floating point operations per second. Nevertheless -GPUs are very efficient to perform some operations but not all kinds of -operations. They are very efficient to execute repetitive work in which only the -data change. It is important to keep in mind that multiprocessors inside a GPU -have 32 cores. Later we will see that these 32 cores need to do the same work to -get maximum performance. - -\begin{figure}[b!] +a CPU and inside a GPU. In fact, in a current NVIDIA GPU, there are +multiprocessors which have 32 cores (for example, on Fermi cards). The core clock +of a CPU is generally around 3GHz and the one of a GPU is about 1.5GHz. Although +the core clock of GPU cores is slower, the number of cores inside a GPU provides +more computational power. This measure is commonly represented by the number of +floating point operation per seconds. Nowadays the most powerful GPUs provide more +than 1TFlops, i.e., $10^{12}$ floating point operations per second. +Nevertheless GPUs are very efficient at executing repetitive work in which +only the data change. It is important to keep in mind that multiprocessors +inside a GPU have 32 cores. Later we will see that these 32 cores need to do the +same work to get maximum performance. + +\begin{figure}[t!] \centerline{\includegraphics[]{Chapters/chapter1/figures/nb_cores_CPU_GPU.pdf}} -\caption[Comparison of number of cores in a CPU and in a GPU]{Comparison of number of cores in a CPU and in a GPU.} +\caption{Comparison of number of cores in a CPU and in a GPU.} +%[Comparison of number of cores in a CPU and in a GPU] \label{ch1:fig:comparison_cpu_gpu} \end{figure} -On most powerful GPU cards, called Fermi, multiprocessors are called streaming -multiprocessors (SM). Each SM contains 32 cores and is able to perform 32 -floating point or integer operations on 32bits numbers per clock or 16 floating -point on 64bits number per clock. SM have their own registers, execution -pipelines and caches. On Fermi architecture, there are 64Kb shared memory + L1 -cache and 32,536 32bits registers per SM. More precisely the programmer can -decide what amount of shared memory and L1 cache SM can use. The constaint is -that the sum of both amounts is less or equal to 64Kb. - -Threads are used to benefit from the important number of cores of a GPU. Those -threads are different from traditional threads for CPU. In -chapter~\ref{chapter2}, some examples of GPU programming will explicit the -details of the GPU threads. However, threads are gathered into blocks of 32 -threads, called ``warp''. Those warps are important when designing an algorithm +On the most powerful GPU cards, called Fermi, multiprocessors are called streaming +multiprocessors (SMs). Each SM contains 32 cores and is able to perform 32 +floating points or integer operations per clock on 32-bit numbers or 16 floating +points per clock on 64-bit numbers. SMs have their own registers, execution +pipelines and caches. On Fermi architecture, there are 64Kb shared memory plus L1 +cache and 32,536 32-bit registers per SM. More precisely the programmer can +decide what amounts of shared memory and L1 cache SM are to be used. The constraint is +that the sum of both amounts should be less than or equal to 64Kb. + +Threads are used to benefit from the large number of cores of a GPU. These +threads are different from traditional threads for a CPU. In +Chapter~\ref{chapter2}, some examples of GPU programming will explain the +details of the GPU threads. Threads are gathered into blocks of 32 +threads, called ``warps''. These warps are important when designing an algorithm for GPU. -Another big difference between CPU and GPU is the latency of memory. In CPU, +Another big difference between a CPU and a GPU is the latency of memory. In a CPU, everything is optimized to obtain a low latency architecture. This is possible -through the use of cache memories. Moreover, nowadays CPUs perform many +through the use of cache memories. Moreover, nowadays CPUs carry out many performance optimizations such as speculative execution which roughly speaking -consists in executing a small part of code in advance even if later this work -reveals to be useless. In opposite, GPUs do not have low latency memory. In -comparison GPUs have ridiculous cache memories. Nevertheless the architecture of GPUs is optimized for throughtput computation and it takes into account the memory latency. +consists of executing a small part of the code in advance even if later this work +reveals itself to be useless. GPUs do not have low latency +memory. In comparison GPUs have small cache memories; nevertheless the +architecture of GPUs is optimized for throughput computation and it takes into +account the memory latency. + -\begin{figure}[b!] -\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}} -\caption[Comparison of low latency of CPU and highthroughput of GPU]{Comparison of low latency of CPU and highthroughput of GPU.} -\label{ch1:fig:latency_throughput} -\end{figure} Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one by one with a short memory latency to get the data to process. After some tasks, there is a context switch that allows the CPU to run concurrent applications -and/or multi-threaded applications. Memory latencies are longer in a GPU, the -the principle to obtain a high throughput is to have many tasks to -compute. Later we will see that those tasks are called threads with CUDA. With +and/or multi-threaded applications. Memory latencies are longer in a GPU. The + principle to obtain a high throughput is to have many tasks to +compute. Later we will see that these tasks are called threads with CUDA. With this principle, as soon as a task is finished the next one is ready to be -executed while the waiting for data for the previous task is overlapped by -computation of other tasks. +executed while the wait for data for the previous task is overlapped by the +computation of other tasks. +\clearpage +\begin{figure}[t!] +\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}} +\caption{Comparison of low latency of a CPU and high throughput of a GPU.} +\label{ch1:fig:latency_throughput} +\end{figure} \section{Kinds of parallelism} -Many kinds of parallelism are avaible according to the type of hardware. -Roughtly speaking, there are three classes of parallism: instruction-level -parallelism, data parallelism and task parallelism. +Many kinds of parallelism are available according to the type of hardware. +Roughly speaking, there are three classes of parallelism: instruction-level +parallelism, data parallelism, and task parallelism. -Instruction-level parallelism consists in re-ordering some instructions in order -to executed some of them in parallel without changing the result of the code. -In modern CPUs, instruction pipelines allow processor to execute instruction -faster. With a pipeline a processor can execute multiple instruction -simultaneously due to the fact that the output of a task is the input of the +Instruction-level parallelism consists in reordering some instructions in order +to execute some of them in parallel without changing the result of the code. +In modern CPUs, instruction pipelines allow the processor to execute instructions +faster. With a pipeline a processor can execute multiple instructions +simultaneously because the output of a task is the input of the next one. Data parallelism consists in executing the same program with different data on -different computing units. Of course, no depency should exist between the the +different computing units. Of course, no dependency should exist among the data. For example, it is easy to parallelize loops without dependency using the data parallelism paradigm. This paradigm is linked with the Single Instructions -Multiple Data (SIMD) architecture. This is the kind of parallism providing by +Multiple Data (SIMD) architecture. This is the kind of parallelism provided by GPUs. -Taks parallelism is the common parallism achieved out on cluster and grid and -high performance architecture where different tasks are executed by different +Task parallelism is the common parallelism achieved on clusters and grids and +high performance architectures where different tasks are executed by different computing units. - -\section{CUDA Multithreading} +\clearpage +\section{CUDA multithreading} The data parallelism of CUDA is more precisely based on the Single Instruction -Multiple Thread (SIMT) model. This is due to the fact that the programmer access -to the cores by the intermediate of threads. In the CUDA model, all cores +Multiple Thread (SIMT) model, because a programmer accesses + the cores by the intermediate of threads. In the CUDA model, all cores execute the same set of instructions but with different data. This model has -similarities with vector programming model proposed for vector machines through -the 1970s into the 90s, notably the various Cray platforms. On the CUDA +similarities with the vector programming model proposed for vector machines through +the 1970s and into the 90s, notably the various Cray platforms. On the CUDA architecture, the performance is led by the use of a huge number of threads -(from thousand upto to millions). The particularity of the model is that there -is no context switching as in CPUs and each thread has its own registers. In practice, threads are executed by SM and are gathered into groups of 32 threads. Those groups are call ``warps''. Each SM alternatively executes ``active warps'' and warps becoming temporaly inactive due to waiting of data (as shown in Figure~\ref{ch1:fig:latency_throughput}). +(from thousands up to millions). The particularity of the model is that there +is no context switching as in CPUs and each thread has its own registers. In +practice, threads are executed by SM and gathered into groups of 32 +threads, called warps. Each SM alternatively executes +active warps and warps becoming temporarily inactive due to waiting of data +(as shown in Figure~\ref{ch1:fig:latency_throughput}). + + The key to scalability in the CUDA model is the use of a huge number of threads. -In practice threads are not only gathered in warps but also in thread blocks. A -thread block is executed by only one SM and it cannot migrate. Typical size of -thread block is a number power of two (for example: 64, 128, 256 or 512). +In practice, threads are gathered not only in warps but also in thread blocks. A +thread block is executed by only one SM and it cannot migrate. The typical size of +a thread block is a power of two (for example, 64, 128, 256, or 512). In this case, without changing anything inside a CUDA code, it is possible to -run your code with a small CUDA device or most performant Tesla CUDA cards. -Blocks are executed in any number depending on the number of SM available. So -the programmer must conceive its code having this issue in mind. This -independence between threads blocks provides the scalability of CUDA codes. +run code with a small CUDA device or the best performing Tesla CUDA cards. +Blocks are executed in any order depending on the number of SMs available. So +the programmer must conceive code having this issue in mind. This +independence between thread blocks provides the scalability of CUDA codes. -\begin{figure}[b!] + + + +A kernel is a function which contains a block of instructions that are executed +by the threads of a GPU. When the problem considered is a two-dimensional or +three-dimensional problem, it is possible to group thread blocks into a grid. +In practice, the number of thread blocks and the size of thread blocks are given +as parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an +example of a kernel composed of 8 thread blocks. Then this kernel is executed on +a small device containing only 2 SMs. So in this case, blocks are executed 2 by +2 in any order. If the kernel is executed on a larger CUDA device containing 4 +SMs, blocks are executed 4 by 4 simultaneously. The execution times should be +approximately twice as fast in the latter case. Of course, that depends on other +parameters that will be described later (in this chapter and other chapters). + + +\begin{figure}[t!] \centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}} -\caption[Scalability of GPU]{Scalability of GPU.} +\caption{Scalability of GPU.} \label{ch1:fig:scalability} \end{figure} - -A kernel is a function which contains a block a instruction that are executed by -threads of a GPU. When the problem considered is a 2 dimensions or 3 dimensions -problem, it is possible to group thread blocks into grid. In practice, the -number of thread blocks and the size of thread block is given in parameter to -each kernel. Figure~\ref{ch1:fig:scalability} illustrates an example of a -kernel composed of 8 thread blocks. Then this kernel is executed on a small -device containing only 2 SMs. So in in this case, blocks are executed 2 by 2 in -any order. If the kernel is executed on a larger CUDA device containing 4 SMs, -blocks are executed 4 by 4 simultaneously. The execution times should be -approximately twice faster in the latter case. Of course, that depends on other -parameters that will be described later. - -Thread blocks provide a way to cooperation in the sens that threads of the same +Thread blocks provide a way to cooperate in the sense that threads of the same block cooperatively load and store blocks of memory they all use. Synchronizations of threads in the same block are possible (but not between -thread of different blocks). Threads of the same block can also share results in -order to compute a single result. In chapter~\ref{chapter2}, some examples will -explicit that. +threads of different blocks). Threads of the same block can also share results +in order to compute a single result. In Chapter~\ref{chapter2}, some examples +will explain that. \section{Memory hierarchy} -The memory hierarchy of GPUs\index{Memory hierarchy of a GPU} is different from -the CPUs one. In practice, there are registers, local memory, shared memory, -cache memroy and global memory. +The memory hierarchy of GPUs\index{memory hierarchy} is different from that of CPUs. In practice, there are registers\index{memory hierarchy!registers}, local +memory\index{memory hierarchy!local memory}, shared +memory\index{memory hierarchy!shared memory}, cache +memory\index{memory hierarchy!cache memory}, and global +memory\index{memory hierarchy!global memory}. + As previously mentioned each thread can access its own registers. It is important to keep in mind that the number of registers per block is limited. On recent cards, this number is limited to 64Kb per SM. Access to registers is -very fast, so when possible it is a good idea to use them. +very fast, so it is a good idea to use them whenever possible. Likewise each thread can access local memory which, in practice, is much slower than registers. Local memory is automatically used by the compiler when all the -registers are occupied. So the best idea is to optimize the use of registers -even if this implies to reduce the number of threads per block. +registers are occupied, so the best idea is to optimize the use of registers +even if this involves reducing the number of threads per block. + +\begin{figure}[b!] +\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}} +\caption{Memory hierarchy of a GPU.} +\label{ch1:fig:memory_hierarchy} +\end{figure} + + Shared memory allows cooperation between threads of the same block. This kind -of memory is fast by it requires to be manipulated manually and its size is -limited. It is accessible during the execution of a kernel. So the principle is +of memory is fast but it needs to be manipulated manually and its size is +limited. It is accessible during the execution of a kernel. So the idea is to fill the shared memory at the start of the kernel with global data that are -used very frequently, then threads can access it for their computation. They +used very frequently, then threads can access it for their computation. Threads can obviously change the content of this shared memory either with computation -or load of other data and they can store its content in the global memory. So -shared memory can be seen as a cache memory manageable manually. This requires -obviously an effort from the programmer. +or by loading other data and they can store its content in the global memory. So +shared memory can be seen as a cache memory, which is manually managed. This +obviously requires effort from the programmer. On recent cards, the programmer may decide what amount of cache memory and -shared memory is attributed to a kernel. The cache memory is a L1 cache which is +shared memory is attributed to a kernel. The cache memory is an L1 cache which is directly managed by the GPU. Sometimes, this cache provides very efficient result and sometimes the use of shared memory is a better solution. -\begin{figure}[b!] -\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}} -\caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.} -\label{ch1:fig:memory_hierarchy} -\end{figure} + Figure~\ref{ch1:fig:memory_hierarchy} illustrates the memory hierarchy of a -GPU. Threads are represented on the top of the figure. They can access to their -own registers and their local memory. Threads of the same block can access to -the shared memory of this block. The cache memory is not represented here but it -is local to a thread. Then each block can access to the global memory of the +GPU. Threads are represented on the top of the figure. They can have access to their +own registers and their local memory. Threads of the same block can access +the shared memory of that block. The cache memory is not represented here but it +is local to a thread. Then each block can access the global memory of the GPU. - - -%%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf -%%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf -%%http://forum.wttsnxt.com/my_forum/viewtopic.php?f=5&t=9519 -%%http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf -%%http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf -%%http://people.maths.ox.ac.uk/~gilesm/cuda/ +\clearpage + \section{Conclusion} + +In this chapter, a brief presentation of the video card, which has later been +used to perform computation, has been given. The architecture of a GPU has been +illustrated focusing on the particularity of GPUs in terms of parallelism, memory +latency, and threads. In order to design an efficient algorithm for GPU, it is +essential to keep all these parameters in mind. \putbib[Chapters/chapter1/biblio]