X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/2d004d3498accbbc57604339ae5815d96f2e3bf2..fa1939d2294b408e8a62f2d91149f369f8710113:/BookGPU/Chapters/chapter1/ch1.tex?ds=sidebyside
diff --git a/BookGPU/Chapters/chapter1/ch1.tex b/BookGPU/Chapters/chapter1/ch1.tex
index b15918e..17a7e47 100755
--- a/BookGPU/Chapters/chapter1/ch1.tex
+++ b/BookGPU/Chapters/chapter1/ch1.tex
@@ -1,7 +1,7 @@
-\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
+\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte, France}
-\chapter{Presentation of the GPU architecture and of the CUDA environment}
+\chapter{Presentation of the GPU architecture and of the Cuda environment}
 \label{chapter1}
 \section{Introduction}\label{ch1:intro}
@@ -12,9 +12,9 @@ execution of some algorithms. First of all this chapter gives a brief history of
 the development of Graphics card until they have been used in order to make
 general purpose computation. Then the architecture of a GPU is
 illustrated. There are many fundamental differences between a GPU and a
-tradition processor. In order to benefit from the power of a GPU, a CUDA
+tradition processor. In order to benefit from the power of a GPU, a Cuda
 programmer needs to use threads. They have some particularities which enable the
-CUDA model to be efficient and scalable when some constraints are addressed.
+Cuda model to be efficient and scalable when some constraints are addressed.
@@ -52,7 +52,7 @@ wrong, programmers had no way (and neither the tools) to detect it.
 \section{GPGPU}
-In order to benefit from the computing power of more recent video cards, CUDA
+In order to benefit from the computing power of more recent video cards, Cuda
 was first proposed in 2007 by NVidia. It unifies the programming model for some
 of their most performant video cards. Cuda~\cite{ch1:cuda} has quickly been
 considered by the scientific community as a great advance for general purpose
@@ -143,7 +143,7 @@ by one with a short memory latency to get the data to process. After some tasks,
 there is a context switch that allows the CPU to run concurrent applications
 and/or multi-threaded applications. Memory latencies are longer in a GPU, the
 the principle to obtain a high throughput is to have many tasks to
-compute. Later we will see that those tasks are called threads with CUDA. With
+compute. Later we will see that those tasks are called threads with Cuda. With
 this principle, as soon as a task is finished the next one is ready to be
 executed while the wait for data for the previous task is overlapped by
 computation of other tasks.
@@ -174,14 +174,14 @@ Task parallelism is the common parallelism achieved out on clusters and grids a
 high performance architectures where different tasks are executed by different
 computing units.
-\section{CUDA Multithreading}
+\section{Cuda Multithreading}
-The data parallelism of CUDA is more precisely based on the Single Instruction
+The data parallelism of Cuda is more precisely based on the Single Instruction
 Multiple Thread (SIMT) model. This is due to the fact that a programmer accesses
-to the cores by the intermediate of threads. In the CUDA model, all cores
+to the cores by the intermediate of threads. In the Cuda model, all cores
 execute the same set of instructions but with different data. This model has
 similarities with the vector programming model proposed for vector machines through
-the 1970s into the 90s, notably the various Cray platforms. On the CUDA
+the 1970s into the 90s, notably the various Cray platforms. On the Cuda
 architecture, the performance is led by the use of a huge number of threads
 (from thousands up to to millions). The particularity of the model is that there
 is no context switching as in CPUs and each thread has its own registers. In
@@ -190,18 +190,18 @@ threads. Those groups are called ``warps''. Each SM alternatively execut
 ``active warps'' and warps becoming temporarily inactive due to waiting of data
 (as shown in Figure~\ref{ch1:fig:latency_throughput}).
-The key to scalability in the CUDA model is the use of a huge number of threads.
+The key to scalability in the Cuda model is the use of a huge number of threads.
 In practice, threads are not only gathered in warps but also in thread blocks. A
 thread block is executed by only one SM and it cannot migrate. The typical size of
 a thread block is a number power of two (for example: 64, 128, 256 or 512).
-In this case, without changing anything inside a CUDA code, it is possible to
-run your code with a small CUDA device or the most performing Tesla CUDA cards.
+In this case, without changing anything inside a Cuda code, it is possible to
+run your code with a small Cuda device or the most performing Tesla Cuda cards.
 Blocks are executed in any order depending on the number of SMs available. So
 the programmer must conceive its code having this issue in mind. This
-independence between thread blocks provides the scalability of CUDA codes.
+independence between thread blocks provides the scalability of Cuda codes.
 \begin{figure}[b!]
 \centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
@@ -217,7 +217,7 @@ practice, the number of thread blocks and the size of thread blocks is given a
 parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
 example of a kernel composed of 8 thread blocks. Then this kernel is executed on
 a small device containing only 2 SMs. So in this case, blocks are executed 2
-by 2 in any order. If the kernel is executed on a larger CUDA device containing
+by 2 in any order. If the kernel is executed on a larger Cuda device containing
 4 SMs, blocks are executed 4 by 4 simultaneously. The execution times should be
 approximately twice faster in the latter case. Of course, that depends on other
 parameters that will be described later.
@@ -291,13 +291,6 @@ illustrated focusing on the particularity of GPUs in term of parallelism, memory
 latency and threads. In order to design an efficient algorithm for GPU, it is
 essential to have all these parameters in mind.
-%%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
-%%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf
-%%http://forum.wttsnxt.com/my_forum/viewtopic.php?f=5&t=9519
-%%http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf
-%%http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
-%%http://people.maths.ox.ac.uk/~gilesm/cuda/
-
 \putbib[Chapters/chapter1/biblio]
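% Note on the scalability example in the context of hunk -217,7 +217,7
% (a kernel of 8 thread blocks run on 2 SMs, then on 4 SMs).
% The calculation below is only a sketch under assumptions that are not stated
% in the chapter: each SM runs one thread block at a time, and all blocks take
% the same time. The symbols $B$ (number of thread blocks), $S$ (number of SMs)
% and $W$ (number of successive rounds of blocks) are introduced here for
% illustration only.
\[
  W(B, S) = \left\lceil \frac{B}{S} \right\rceil, \qquad
  W(8, 2) = 4, \qquad
  W(8, 4) = 2, \qquad
  \frac{W(8, 2)}{W(8, 4)} = 2 .
\]
% Under these assumptions the 4-SM device needs half as many rounds, which is
% consistent with the text's estimate that execution should be approximately
% twice as fast. On real hardware an SM can often hold several blocks at once,
% so the ratio also depends on the other parameters the chapter mentions.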