the concepts needed to understand how GPUs work and can be used to speed up the
execution of some algorithms. First of all this chapter gives a brief history of
the development of Graphics card until they can be used in order to make general
-purpose computation. Then the
+purpose computation. Then the architecture of a GPU is illustrated. There are
+many fundamental differences between a GPU and a tradition processor. In order
+to benefit from the power of a GPU, a CUDA programmer needs to use threads. They
+have some particularities which enables the CUDA model to be efficient and
+scalable when some constraints are addressed.
\section{Architecture of current GPUs}
-Architecture \index{Architecture of a GPU} of current GPUs is constantly
+Architecture \index{architecture of a GPU} of current GPUs is constantly
evolving. Nevertheless some trends remains true through this
evolution. Processing units composing a GPU are far more simpler than a
traditional CPU but it is much easier to integrate many computing units inside a
\section{Kinds of parallelism}
Many kinds of parallelism are avaible according to the type of hardware.
-Roughtly speaking, there are three classes of parallelism: instruction-level
-parallelism, data parallelism and task parallelism.
+Roughtly speaking, there are three classes of parallelism: instruction-level
+parallelism, data parallelism and task parallelism.
Instruction-level parallelism consists in re-ordering some instructions in order
to execute some of them in parallel without changing the result of the code.
Multiple Data (SIMD) architecture. This is the kind of parallism providing by
GPUs.
-Taks parallelism is the common parallism achieved out on cluster and grid and
-high performance architecture where different tasks are executed by different
+Taks parallelism is the common parallism achieved out on clusters and grids and
+high performance architectures where different tasks are executed by different
computing units.
\section{CUDA Multithreading}
The data parallelism of CUDA is more precisely based on the Single Instruction
-Multiple Thread (SIMT) model. This is due to the fact that the programmer access
+Multiple Thread (SIMT) model. This is due to the fact that a programmer accesses
to the cores by the intermediate of threads. In the CUDA model, all cores
execute the same set of instructions but with different data. This model has
similarities with vector programming model proposed for vector machines through
the 1970s into the 90s, notably the various Cray platforms. On the CUDA
architecture, the performance is led by the use of a huge number of threads
(from thousand upto to millions). The particularity of the model is that there
-is no context switching as in CPUs and each thread has its own registers. In practice, threads are executed by SM and are gathered into groups of 32 threads. Those groups are call ``warps''. Each SM alternatively executes ``active warps'' and warps becoming temporaly inactive due to waiting of data (as shown in Figure~\ref{ch1:fig:latency_throughput}).
+is no context switching as in CPUs and each thread has its own registers. In
+practice, threads are executed by SM and are gathered into groups of 32
+threads. Those groups are call ``warps''. Each SM alternatively executes
+``active warps'' and warps becoming temporaly inactive due to waiting of data
+(as shown in Figure~\ref{ch1:fig:latency_throughput}).
The key to scalability in the CUDA model is the use of a huge number of threads.
-In practice threads are not only gathered in warps but also in thread blocks. A
+In practice, threads are not only gathered in warps but also in thread blocks. A
thread block is executed by only one SM and it cannot migrate. Typical size of
thread block is a number power of two (for example: 64, 128, 256 or 512).
In this case, without changing anything inside a CUDA code, it is possible to
run your code with a small CUDA device or most performant Tesla CUDA cards.
-Blocks are executed in any number depending on the number of SM available. So
+Blocks are executed in any order depending on the number of SMs available. So
the programmer must conceive its code having this issue in mind. This
independence between threads blocks provides the scalability of CUDA codes.
\end{figure}
-A kernel is a function which contains a block a instruction that are executed by
-threads of a GPU. When the problem considered is a 2 dimensions or 3 dimensions
-problem, it is possible to group thread blocks into grid. In practice, the
-number of thread blocks and the size of thread block is given in parameter to
-each kernel. Figure~\ref{ch1:fig:scalability} illustrates an example of a
-kernel composed of 8 thread blocks. Then this kernel is executed on a small
-device containing only 2 SMs. So in in this case, blocks are executed 2 by 2 in
-any order. If the kernel is executed on a larger CUDA device containing 4 SMs,
-blocks are executed 4 by 4 simultaneously. The execution times should be
+A kernel is a function which contains a block of instructions that are executed
+by the threads of a GPU. When the problem considered is a 2 dimensions or 3
+dimensions problem, it is possible to group thread blocks into grid. In
+practice, the number of thread blocks and the size of thread block is given in
+parameter to each kernel. Figure~\ref{ch1:fig:scalability} illustrates an
+example of a kernel composed of 8 thread blocks. Then this kernel is executed on
+a small device containing only 2 SMs. So in in this case, blocks are executed 2
+by 2 in any order. If the kernel is executed on a larger CUDA device containing
+4 SMs, blocks are executed 4 by 4 simultaneously. The execution times should be
approximately twice faster in the latter case. Of course, that depends on other
parameters that will be described later.
Thread blocks provide a way to cooperation in the sens that threads of the same
block cooperatively load and store blocks of memory they all
use. Synchronizations of threads in the same block are possible (but not between
-thread of different blocks). Threads of the same block can also share results in
-order to compute a single result. In chapter~\ref{chapter2}, some examples will
-explicit that.
+threads of different blocks). Threads of the same block can also share results
+in order to compute a single result. In chapter~\ref{chapter2}, some examples
+will explicit that.
\section{Memory hierarchy}
-The memory hierarchy of GPUs\index{Memory hierarchy of a GPU} is different from
-the CPUs one. In practice, there are registers, local memory, shared memory,
-cache memroy and global memory.
+The memory hierarchy of GPUs\index{memory~hierarchy} is different from the CPUs
+one. In practice, there are registers\index{memory~hierarchy!registers}, local
+memory\index{memory~hierarchy!local~memory}, shared
+memory\index{memory~hierarchy!shared~memory}, cache
+memory\index{memory~hierarchy!cache~memory} and global
+memory\index{memory~hierarchy!global~memory}.
+
As previously mentioned each thread can access its own registers. It is
important to keep in mind that the number of registers per block is limited. On
registers are occupied. So the best idea is to optimize the use of registers
even if this implies to reduce the number of threads per block.
+\begin{figure}[hbtp!]
+\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
+\caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.}
+\label{ch1:fig:memory_hierarchy}
+\end{figure}
+
+
+
Shared memory allows cooperation between threads of the same block. This kind
-of memory is fast by it requires to be manipulated manually and its size is
+of memory is fast because it requires to be manipulated manually and its size is
limited. It is accessible during the execution of a kernel. So the principle is
to fill the shared memory at the start of the kernel with global data that are
used very frequently, then threads can access it for their computation. They
directly managed by the GPU. Sometimes, this cache provides very efficient
result and sometimes the use of shared memory is a better solution.
-\begin{figure}[b!]
-\centerline{\includegraphics[scale=0.60]{Chapters/chapter1/figures/memory_hierarchy.pdf}}
-\caption[Memory hierarchy of a GPU]{Memory hierarchy of a GPU.}
-\label{ch1:fig:memory_hierarchy}
-\end{figure}
+
Figure~\ref{ch1:fig:memory_hierarchy} illustrates the memory hierarchy of a
is local to a thread. Then each block can access to the global memory of the
GPU.
+ \section{Conclusion}
+
+In this chapter, a brief presentation of the video card, which later have been
+used to perform computation, has been given. The architecture of a GPU has been
+illustrated focusing on the particularity of GPUs in term of parallelism, memory
+latency and threads. In order to design an efficient algorithm for GPU, it is
+essential to have all these parameters in mind.
%%http://people.maths.ox.ac.uk/gilesm/pp10/lec2_2x2.pdf
%%https://people.maths.ox.ac.uk/erban/papers/paperCUDA.pdf