+Now the GPU contains the data needed to perform the addition. In a sequential
+version, such an addition is achieved with a loop over all the elements. With a
+GPU, it is possible to perform the addition of all the elements of the two
+arrays in parallel (provided that the number of blocks and threads per block is
+sufficient). At the beginning of Listing~\ref{ch2:lst:ex1}, a simple kernel,
+called \texttt{addition}, is defined to compute the summation of the two arrays
+in parallel. With CUDA, a kernel starts with the
+keyword \texttt{\_\_global\_\_}, which indicates that it can be called from the
+host C code. The first instruction in this kernel computes \texttt{tid}, which
+represents the global thread index. This index is computed from the block index
+(a CUDA built-in variable
+called \texttt{blockIdx\index{CUDA~keywords!blockIdx}}). Blocks of threads can
+be organized into 1, 2, or 3 dimensions; the dimensionality is chosen according
+to the dimensionality of the data being manipulated. In our example, only one
+dimension is used, and the notation \texttt{.x} gives access to the first
+dimension (\texttt{.y} and \texttt{.z} give access to the second and third
+dimensions, respectively). The
+variable \texttt{blockDim}\index{CUDA~keywords!blockDim} gives the size of each
+block.
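+
+As an illustration of the pattern described above, a minimal addition kernel
+can be sketched as follows. This sketch assumes the standard CUDA idiom for
+computing the global thread index; the names \texttt{N}, \texttt{a},
+\texttt{b}, and \texttt{c} are illustrative and may differ from those used
+in \texttt{ex1.cu}.
+
+\begin{lstlisting}
+// Illustrative sketch, not a verbatim copy of ex1.cu.
+__global__ void addition(int N, int *a, int *b, int *c) {
+  // Global thread index: block index times block size,
+  // plus the index of the thread within its block.
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  // Guard: the grid may contain more threads than elements.
+  if (tid < N)
+    c[tid] = a[tid] + b[tid];
+}
+\end{lstlisting}
+
+Such a kernel would typically be launched from the host with a call
+like \texttt{addition<<<nbBlocks,nbThreadsPerBlock>>>(N,d\_a,d\_b,d\_c)}, where
+the grid is sized so
+that \texttt{nbBlocks}~$\times$~\texttt{nbThreadsPerBlock}~$\ge$~\texttt{N};
+the variable names here are again illustrative.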
+
+\lstinputlisting[label=ch2:lst:ex1,caption=A simple example of parallel addition with CUDA]{Chapters/chapter2/ex1.cu}
+
+\putbib[Chapters/chapter2/biblio]