-Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices of $(NX\times ny)$ vector elements are computed in
-a {\bf for} loop. In this case, each thread is in charge of one vector element from each slice (in total $nz$ vector elements
-along $z$-axis). We can notice from the formula~(\ref{eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$, by
-a thread at iteration $p+1$, requires seven vector elements computed at the previous iteration $p$: two vector elements in
-each dimension plus the vector element at the intersection of the three axises $x$, $y$ and $z$ (see Figure~\ref{fig:04}).
-So, to reduce the memory accesses to the high-latency global memory, the vector elements of the current slice can be stored
-in the low-latency shared memories of thread blocks, as is described in~\cite{ref9}. Nevertheless, the fact that the computation
-of a vector element requires only two elements in each dimension does not allow to maximize the data reuse from the shared memories.
-The computation of a slice involves in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block, to fill the
-required vector elements in the shared memory where $bx$ and $by$ are the dimensions of a thread block. Then, in order to optimize
-the memory accesses on GPUs, the elements of the iterate vector $U$ are filled in the cache texture memory (see~\cite{ref10}).
-In new GPU generations as Fermi or Kepler, the global memory accesses are always cached in L1 and L2 caches. For example, for
-a given kernel, we can favour the use of the L1 cache to that of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+.
-So, the initial access to the global memory loads the vector elements required by the threads of a block into the cache memory
-(texture or L1/L2 caches). Then, all the following memory accesses read from this cache memory. In Listing~\ref{list:02}, the
-function \verb+fetch_double(v,i)+ is used to read from the texture memory the $i^{th}$ element of the double-precision vector
-\verb+v+ (see Listing~\ref{list:03}). Moreover, the seven constant coefficients of matrix $A$ can be stored in the constant memory
-but, since they are reused $nz$ times by each thread, it is more interesting to fill them on the low-latency registers of each thread.
-
-\lstinputlisting[label=list:03,caption=Memory access to the cache texture memory]{Chapters/chapter13/ex3.cu}
-
-The function $Convergence()$ (line~$11$ in Algorithm~\ref{alg:02}) allows to detect the convergence of the parallel iterative algorithm
-and is based on the tolerance threshold $\varepsilon$ and the maximum number of relaxations $MaxRelax$. We take into account the number
-of relaxations since that of iterations cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{eq:13}) of
-a local iterate vector $U_i$ according to $F_i$. Then, counting the number of relaxations is possible in both synchronous and asynchronous
-cases. On the other hand, an iteration is the update of at least all vector components with $F_i$.
-
-In the synchronous algorithm, the global convergence is detected when the maximal value of the absolute error, $error$, is sufficiently small
-and/or the maximum number of relaxations, $MaxRelax$, is reached, as follows:
+Each kernel is executed by $NX\times ny$ GPU threads so that $nz$ slices
+of $(NX\times ny)$ vector elements are computed in a {\bf for} loop. In
+this case, each thread is in charge of one vector element from each slice
+(in total $nz$ vector elements along the $z$-axis). We can notice from the
+formula~(\ref{ch13:eq:17}) that the computation of a vector element $u^{p+1}(x,y,z)$,
+by a thread at iteration $p+1$, requires seven vector elements computed
+at the previous iteration $p$: two vector elements in each dimension plus
+the vector element at the intersection of the three axes $x$, $y$, and $z$
+(see Figure~\ref{ch13:fig:04}). So, to reduce the memory accesses to the
+high-latency global memory, the vector elements of the current slice can
+be stored in the low-latency shared memories of thread blocks, as is described
+in~\cite{ch13:ref9}. Nevertheless, the fact that the computation of a vector
+element requires only two elements in each dimension does not allow us to maximize
+the data reuse from the shared memories. The computation of a slice involves
+in total $(bx+2)\times(by+2)$ accesses to the global memory per thread block,
+to fill the required vector elements in the shared memory where $bx$ and $by$
+are the dimensions of a thread block. Then, in order to optimize the memory
+accesses on GPUs, the elements of the iterate vector $U$ are filled in the
+cache texture memory (see~\cite{ch13:ref10}). In new GPU hardware and software as Fermi
+or Kepler, the global memory accesses are always cached in L1 and L2 caches.
+For example, for a given kernel, we can favor the use of the L1 cache to that
+of the shared memory by using the function \verb+cudaFuncSetCacheConfig(Kernel,cudaFuncCachePreferL1)+.
+So, the initial access to the global memory loads the vector elements required
+by the threads of a block into the cache memory (texture or L1/L2 caches). Then,
+all the following memory accesses read from this cache memory. In Listing~\ref{ch13:list:02},
+the function \verb+fetch_double(v,i)+ is used to read from the texture memory
+the $ith$ element of the double-precision vector \verb+v+ (see Listing~\ref{ch13:list:03}).
+Moreover, the seven constant coefficients of matrix $A$ can be stored in the
+constant memory but, since they are reused $nz$ times by each thread, it is more
+efficient to fill them on the low-latency registers of each thread.
+
+\pagebreak
+\lstinputlisting[label=ch13:list:03,caption=memory access to the cache texture memory]{Chapters/chapter13/ex3.cu}
+
+The function $Convergence()$ (line~$11$ in Algorithm~\ref{ch13:alg:02}) allows us
+to detect the convergence of the parallel iterative algorithm and is based on
+the tolerance threshold\index{Convergence!Tolerance~threshold} $\varepsilon$
+and the maximum number of relaxations\index{Convergence!Maximum~number~of~relaxations}
+$MaxRelax$. We take into account the number of relaxations since that of iterations
+cannot be computed in the asynchronous case. Indeed, a relaxation is the update~(\ref{ch13:eq:13})
+of a local iterate vector $U_i$ according to $F_i$. Then, counting the number
+of relaxations is possible in both synchronous and asynchronous cases. On the
+other hand, an iteration is the update of at least all vector components with
+$F_i$.
+
+In the synchronous\index{Synchronous} algorithm, the global convergence is detected
+when the maximal value of the absolute error, $error$, is sufficiently small and/or
+the maximum number of relaxations, $MaxRelax$, is reached, as follows: