\section{Architecture of current GPUs}
The architecture\index{architecture of a GPU} of current GPUs is constantly
evolving. Nevertheless, some trends remain true throughout this
evolution. The processing units composing a GPU are far simpler than those of a
traditional CPU, but it is much easier to integrate many computing units inside
a GPU.
\section{Memory hierarchy}
The memory hierarchy of GPUs\index{memory~hierarchy} is different from that of
CPUs. In practice, there are registers\index{memory~hierarchy!registers}, local
memory\index{memory~hierarchy!local~memory}, shared
memory\index{memory~hierarchy!shared~memory}, cache
memory\index{memory~hierarchy!cache~memory}, and global
memory\index{memory~hierarchy!global~memory}.
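As a rough illustration, the following sketch (not taken from the chapter
examples; the kernel and array names are hypothetical) shows where some of these
memory spaces appear in CUDA code:

\begin{lstlisting}
__global__ void illustrate(float *d_in)  // d_in resides in global memory
{
  int i = threadIdx.x;                   // scalar locals usually map to registers
  __shared__ float tile[256];            // shared memory, visible to the whole block
  tile[i] = d_in[i];                     // copy from global to shared memory
  __syncthreads();                       // wait until the whole block has written
}
\end{lstlisting}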
As previously mentioned, each thread can access its own registers.

The first parameter of the \texttt{cudaMemcpy} function is the destination
array, the second is the source array, and the third is the size of the data to
copy (expressed in bytes).
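For instance, copying two hypothetical host arrays \texttt{h\_a}
and \texttt{h\_b} to the GPU might look like the following sketch (the array
names and size are assumptions, not the exact code of the chapter examples):

\begin{lstlisting}
int    size  = 1024;                    // number of elements (assumption)
size_t bytes = size * sizeof(float);
float *h_a = (float *)malloc(bytes);    // host (CPU) arrays
float *h_b = (float *)malloc(bytes);
float *d_a, *d_b;                       // device (GPU) arrays
cudaMalloc((void **)&d_a, bytes);
cudaMalloc((void **)&d_b, bytes);
// arguments: destination, source, size in bytes, direction of the copy
cudaMemcpy(d_a, h_a, bytes, cudaMemcpyHostToDevice);
cudaMemcpy(d_b, h_b, bytes, cudaMemcpyHostToDevice);
\end{lstlisting}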
Now the GPU contains the data needed to perform the addition. In a sequential
version, such an addition is achieved with a loop over all the elements. With a
GPU, it is possible to perform the addition of all elements of the two arrays in
parallel (provided the numbers of blocks and threads per block are sufficient).
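The sequential CPU version is simply the following (a sketch with hypothetical
array names):

\begin{lstlisting}
// sequential CPU version: one loop over all the elements
for (int i = 0; i < size; i++)
  c[i] = a[i] + b[i];
\end{lstlisting}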
In Listing~\ref{ch2:lst:ex1} at the beginning, a simple kernel
called \texttt{addition} is defined to compute in parallel the sum of the
two arrays. With CUDA, a kernel starts with the
keyword \texttt{\_\_global\_\_}\index{CUDA~keywords!\_\_global\_\_},
which indicates that this kernel can be called from the C code. The first
instruction in this kernel is used to compute the variable \texttt{tid}, which
represents the thread index. This thread index\index{thread index} is computed
according to the values of the block index (a built-in CUDA variable
called \texttt{blockIdx}\index{CUDA~keywords!blockIdx}), as sketched below.
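For a one-dimensional grid of one-dimensional blocks, the standard idiom is the
following (a minimal sketch of such a kernel; the parameter names are
assumptions, not necessarily those of Listing~\ref{ch2:lst:ex1}):

\begin{lstlisting}
__global__ void addition(int size, int *d_c, int *d_a, int *d_b)
{
  // global thread index: block offset plus position within the block
  int tid = blockIdx.x * blockDim.x + threadIdx.x;
  if (tid < size)                // guard threads beyond the array bounds
    d_c[tid] = d_a[tid] + d_b[tid];
}
\end{lstlisting}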
Blocks of threads can be organized in one, two, or three dimensions. Depending
on the dimensionality of the data being manipulated, the appropriate layout can
be chosen. In our