\chapterauthor{Raphaël Couturier}{Femto-ST Institute, University of Franche-Comte}
-\chapter{Presentation of the GPU architecture and of the CUDA environment}
+\chapter{Presentation of the GPU architecture and of the Cuda environment}
\label{chapter1}
\section{Introduction}\label{ch1:intro}
This chapter introduces the Graphics Processing Unit (GPU) architecture and all
the concepts needed to understand how GPUs work and can be used to speed up the
execution of some algorithms. First of all this chapter gives a brief history of
-the development of Graphics card until they can be used in order to make general
-purpose computation. Then the architecture of a GPU is illustrated. There are
-many fundamental differences between a GPU and a tradition processor. In order
-to benefit from the power of a GPU, a CUDA programmer needs to use threads. They
-have some particularities which enable the CUDA model to be efficient and
-scalable when some constraints are addressed.
+the development of graphics cards up to the point where they could be used for
+general purpose computation. Then the architecture of a GPU is
+illustrated. There are many fundamental differences between a GPU and a
+traditional processor. In order to benefit from the power of a GPU, a Cuda
+programmer needs to use threads. They have some particularities which enable the
+Cuda model to be efficient and scalable when some constraints are addressed.
produce high quality graphics faster than classical Central Processing Units
(CPU) and to relieve the CPU of this task. In general, display tasks are very
repetitive and very specific. Hence, some manufacturers have produced more and
-more sofisticated video cards, providing 2D accelerations then 3D accelerations,
+more sophisticated video cards, providing 2D accelerations then 3D accelerations,
then some light transforms. Video cards have their own memory to perform their
-computation. For at least two dedaces, every personnal computer has had a video
+computation. For at least two decades, every personal computer has had a video
card which is simple for desktop computers or which provides many accelerations
for game and/or graphic oriented computers. In the latter case, graphic cards
may be more expensive than a CPU.
-Since 2000, video cards have allowed users to apply arithmetics operations
+Since 2000, video cards have allowed users to apply arithmetic operations
simultaneously on a sequence of pixels, also later called stream processing. In
this case, the information of the pixels (color, location and other information) is
combined in order to produce a pixel color that can be displayed on a screen.
\section{GPGPU}
-In order to benefit from the computing power of more recent video cards, CUDA
+In order to benefit from the computing power of more recent video cards, Cuda
was first proposed in 2007 by NVidia. It unifies the programming model for some
of their most performant video cards. Cuda~\cite{ch1:cuda} has quickly been
considered by the scientific community as a great advance for general purpose
have been proposed. The other well-known alternative is OpenCL which aims at
proposing an alternative to Cuda and which is multi-platform and portable. This
is a great advantage since it is even possible to execute OpenCL programs on
-traditionnal CPUs. The main drawback is that it is less tight with the hardware
+traditional CPUs. The main drawback is that it is less tightly coupled to the hardware
and consequently sometimes provides less efficient programs. Moreover, Cuda
benefits from more mature compilation and optimization procedures. Other less
known environments have been proposed, but most of them have been stopped, for
-example we can cite: FireStream by ATI which is not maintened anymore and
+example we can cite: FireStream by ATI which is not maintained anymore and
replaced by OpenCL, BrookGPU by Stanford University~\cite{ch1:Buck:2004:BGS}.
Another environment based on pragma (insertion of pragma directives inside the
code to help the compiler to generate efficient code) is called OpenACC. For a
consists in executing a small part of code in advance even if later this work
reveals itself to be useless. On the contrary, GPUs do not have low latency
memory. In comparison GPUs have small cache memories. Nevertheless the
-architecture of GPUs is optimized for throughtput computation and it takes into
+architecture of GPUs is optimized for throughput computation and it takes into
account the memory latency.
there is a context switch that allows the CPU to run concurrent applications
and/or multi-threaded applications. Memory latencies are longer in a GPU;
the principle to obtain a high throughput is to have many tasks to
-compute. Later we will see that those tasks are called threads with CUDA. With
+compute. Later we will see that those tasks are called threads with Cuda. With
this principle, as soon as a task is finished the next one is ready to be
executed while the wait for data for the previous task is overlapped by
computation of other tasks.
\section{Kinds of parallelism}
-Many kinds of parallelism are avaible according to the type of hardware.
+Many kinds of parallelism are available according to the type of hardware.
Roughly speaking, there are three classes of parallelism: instruction-level
parallelism, data parallelism and task parallelism.
Multiple Data (SIMD) architecture. This is the kind of parallelism provided by
GPUs.
-Taks parallelism is the common parallism achieved out on clusters and grids and
+Task parallelism is the common parallelism achieved on clusters and grids and
high performance architectures where different tasks are executed by different
computing units.
-\section{CUDA Multithreading}
+\section{Cuda Multithreading}
-The data parallelism of CUDA is more precisely based on the Single Instruction
+The data parallelism of Cuda is more precisely based on the Single Instruction
Multiple Thread (SIMT) model. This is due to the fact that a programmer accesses
-to the cores by the intermediate of threads. In the CUDA model, all cores
+the cores through threads. In the Cuda model, all cores
execute the same set of instructions but with different data. This model has
similarities with the vector programming model proposed for vector machines through
-the 1970s into the 90s, notably the various Cray platforms. On the CUDA
+the 1970s into the 90s, notably the various Cray platforms. On the Cuda
architecture, the performance is led by the use of a huge number of threads
(from thousands up to millions). The particularity of the model is that there
is no context switching as in CPUs and each thread has its own registers. In
practice, threads are executed by SM and are gathered into groups of 32
threads. Those groups are called ``warps''. Each SM alternately executes
-``active warps'' and warps becoming temporarilly inactive due to waiting of data
+``active warps'' and warps becoming temporarily inactive while waiting for data
(as shown in Figure~\ref{ch1:fig:latency_throughput}).
-The key to scalability in the CUDA model is the use of a huge number of threads.
+The key to scalability in the Cuda model is the use of a huge number of threads.
In practice, threads are not only gathered in warps but also in thread blocks. A
thread block is executed by only one SM and it cannot migrate. The typical size of
a thread block is a power of two (for example: 64, 128, 256 or 512).
-In this case, without changing anything inside a CUDA code, it is possible to
-run your code with a small CUDA device or the most performing Tesla CUDA cards.
+In this case, without changing anything inside a Cuda code, it is possible to
+run your code with a small Cuda device or the highest-performing Tesla Cuda cards.
Blocks are executed in any order depending on the number of SMs available. So
the programmer must design the code with this issue in mind. This
-independence between thread blocks provides the scalability of CUDA codes.
+independence between thread blocks provides the scalability of Cuda codes.
\begin{figure}[b!]
\centerline{\includegraphics[scale=0.65]{Chapters/chapter1/figures/scalability.pdf}}
parameters to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
example of a kernel composed of 8 thread blocks. Then this kernel is executed on
a small device containing only 2 SMs. So in this case, blocks are executed 2
-by 2 in any order. If the kernel is executed on a larger CUDA device containing
+by 2 in any order. If the kernel is executed on a larger Cuda device containing
4 SMs, blocks are executed 4 by 4 simultaneously. The execution should be
approximately twice as fast in the latter case. Of course, that depends on other
parameters that will be described later.
latency and threads. In order to design an efficient algorithm for GPU, it is
essential to have all these parameters in mind.
-%%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
-%%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf
-%%http://forum.wttsnxt.com/my_forum/viewtopic.php?f=5&t=9519
-%%http://www.cs.nyu.edu/manycores/cuda_many_cores.pdf
-%%http://www.cc.gatech.edu/~vetter/keeneland/tutorial-2011-04-14/02-cuda-overview.pdf
-%%http://people.maths.ox.ac.uk/~gilesm/cuda/
-
\putbib[Chapters/chapter1/biblio]