have been proposed. The other well-known alternative is OpenCL, which aims at
providing an alternative to CUDA and which is multiplatform and portable. This
is a great advantage since it is even possible to execute OpenCL programs on
traditional CPUs. The main drawback is that it is less close to the hardware
and consequently it sometimes provides less efficient programs. Moreover, CUDA
benefits from more mature compilation and optimization procedures. Other less
known environments have been proposed, but most of them have been discontinued,
such as FireStream by ATI, which is not maintained anymore and has been replaced by
OpenCL.

Figure~\ref{ch1:fig:latency_throughput} illustrates the main difference of
memory latency between a CPU and a GPU. In a CPU, tasks ``ti'' are executed one
after another, each one waiting for its data with a short latency. In a GPU, many tasks are
executed while the wait for data for the previous task is overlapped by the
computation of other tasks.
\clearpage
\begin{figure}[t!]
\centerline{\includegraphics[scale=0.7]{Chapters/chapter1/figures/low_latency_vs_high_throughput.pdf}}
\caption{Comparison of low latency of a CPU and high throughput of a GPU.}
\label{ch1:fig:latency_throughput}
\end{figure}
\section{Kinds of parallelism}
Task parallelism is the common parallelism achieved on clusters, grids, and other
high performance architectures, where different tasks are executed by different
computing units.
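Inside a GPU, a limited form of task parallelism can also be obtained with CUDA
streams: kernels submitted to different streams may be executed concurrently. The
following sketch only illustrates this principle; the two kernels and the device
arrays \texttt{d\_x} and \texttt{d\_y} (at least 256 floats each) are hypothetical
and not taken from this chapter.
\begin{lstlisting}[language=C]
#include <cuda_runtime.h>

__global__ void taskA(float *x) { x[threadIdx.x] *= 2.0f; }
__global__ void taskB(float *y) { y[threadIdx.x] += 1.0f; }

/* d_x and d_y are device arrays of at least 256 floats (hypothetical). */
void run_two_tasks(float *d_x, float *d_y)
{
  cudaStream_t s1, s2;
  cudaStreamCreate(&s1);
  cudaStreamCreate(&s2);
  /* Two independent tasks are submitted to two different streams;
     the hardware is free to execute them concurrently. */
  taskA<<<1, 256, 0, s1>>>(d_x);
  taskB<<<1, 256, 0, s2>>>(d_y);
  cudaStreamSynchronize(s1);
  cudaStreamSynchronize(s2);
  cudaStreamDestroy(s1);
  cudaStreamDestroy(s2);
}
\end{lstlisting}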
\clearpage
\section{CUDA multithreading}
The data parallelism of CUDA is more precisely based on the Single Instruction
Multiple Thread (SIMT) model. Threads of the same block can cooperate through a
fast shared memory: if some data are
used very frequently, then threads can access them for their computation. Threads
can obviously change the content of this shared memory, either with computation
or by loading other data, and they can store its content in the global memory. So
shared memory can be seen as a cache memory that is managed manually. This
obviously requires an effort from the programmer.
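As a minimal sketch of this manual management (the kernel and the three-point
smoothing below are only illustrative, not taken from this chapter), each block
stages a tile of the input array in shared memory so that a value loaded once
from global memory is then read by three different threads:
\begin{lstlisting}[language=C]
#define BLOCK 256  /* the kernel must be launched with BLOCK threads per block */

__global__ void smooth(const float *in, float *out, int n)
{
  __shared__ float tile[BLOCK + 2];               /* one halo cell on each side */
  int g = blockIdx.x * blockDim.x + threadIdx.x;  /* global index */
  int l = threadIdx.x + 1;                        /* index inside the tile */

  tile[l] = (g < n) ? in[g] : 0.0f;               /* each thread loads one value */
  if (threadIdx.x == 0)                           /* first thread loads the left halo */
    tile[0] = (g > 0) ? in[g - 1] : 0.0f;
  if (threadIdx.x == blockDim.x - 1)              /* last thread loads the right halo */
    tile[l + 1] = (g + 1 < n) ? in[g + 1] : 0.0f;
  __syncthreads();                                /* the whole tile is now loaded */

  if (g < n)                                      /* three shared reads, one global write */
    out[g] = (tile[l - 1] + tile[l] + tile[l + 1]) / 3.0f;
}
\end{lstlisting}
Each element of the input is read from global memory once per block instead of
three times, which is precisely the cache-like behavior described above.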
On recent cards, the programmer may decide what amount of cache memory and
shared memory is allocated to a kernel. Inside a block, each thread can access its
own registers and the shared memory of that block. The cache memory is not detailed
here, but it is local to a thread. Then each block can access the global memory of the
GPU.
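With the CUDA runtime, this choice can be expressed per kernel with
\texttt{cudaFuncSetCacheConfig()}. The sketch below reuses the hypothetical
\texttt{smooth} kernel of the previous section; \texttt{d\_in} and \texttt{d\_out}
are assumed to be device arrays of \texttt{n} floats.
\begin{lstlisting}[language=C]
/* Hint to the runtime: favor shared memory over L1 cache for this
   kernel (cudaFuncCachePreferL1 would request the opposite split). */
cudaFuncSetCacheConfig(smooth, cudaFuncCachePreferShared);
smooth<<<(n + BLOCK - 1) / BLOCK, BLOCK>>>(d_in, d_out, n);
cudaDeviceSynchronize();
\end{lstlisting}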
\clearpage
\section{Conclusion}
In this chapter, a brief presentation of the video card, which has later been
used to perform computation, has been given.