-(for example the result of one \emph{reduction} per iteration), this transfer is achieved
-synchronously on the default stream (no particular stream is specified) on lines 51--54.
-Availability of the result values is ensured by the synchronization implemented on line 49.
-However, if a partial result has to be transferred on the CPU on each stream, then $NbS$ asynchronous data
-transfers could be started in parallel (one per stream), and should be implemented before the
-synchronization operation on line 49. The end of the computation loop includes a synchronization
-barrier of the two OpenMP threads, waiting they have finished to access the different data
-buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, like
-in the previous algorithm. However, after the computation loop, we have added the
-destruction of the CUDA streams (lines 63--65).
-
-Finally, CUDA streams\index{CUDA!stream} have been used to extend \Lst{algo:ch6p1overlapseqsequence}
-with respect to its global scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an
-OpenMP parallel region, with two CPU threads, one in charge of MPI communications, and the other
-managing data transfers and GPU computations. Unfortunately, using GPU streams require to be able to
-split a GPU computation in independent subparts, working on independent subsets of data.
-\Lst{algo:ch6p1overlapstreamsequence} is not so generic than \Lst{algo:ch6p1overlapseqsequence}.
-
-
-\subsection{Interleaved communications-transfers-computations overlapping}
-
-\begin{figure}[t]
- \centering
- \includegraphics{Chapters/chapter6/figures/Sync-CompleteInterleaveOverlap.pdf}
- \caption{Complete overlap of internode CPU communications, CPU/GPU data transfers and GPU
- computations, interleaving computation-communication iterations}
- \label{fig:ch6p1overlapinterleaved}
-\end{figure}
-
-Many algorithms do not support to split data transfers and kernel calls, and can
-not exploit CUDA streams. For example, when each GPU thread requires to access
+(for example, the result of one \emph{reduction} per iteration), this transfer is achieved
+synchronously on the default stream (no particular stream is specified) on lines~51--54.
+Availability of the result values is ensured by the synchronization implemented on line~49.
+However, if a partial result has to be transferred to the CPU on each stream, then $NbS$ asynchronous data
+transfers could be started in parallel (one per stream) and should be implemented before the
+synchronization operation on line~49. The end of the computation loop includes a synchronization
+barrier for the two OpenMP threads, ensuring that both have finished accessing the different data
+buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, as
+in the previous algorithm. After the computation loop, we have added the
+destruction of the CUDA streams (lines~64--65).
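+
+The following sketch summarizes this variant. It is not extracted from
+\Lst{algo:ch6p1overlapstreamsequence}: the names and constants (\texttt{NbS},
+\texttt{streams}, \texttt{d\_partial}, \texttt{h\_partial}, \texttt{ResSize},
+\texttt{NbIter}) are illustrative, and the host buffer is assumed to be
+allocated with \texttt{cudaMallocHost} so that the copies are truly
+asynchronous.
+\begin{verbatim}
+#include <cuda_runtime.h>
+
+#define NbS     4    /* number of CUDA streams (illustrative value)       */
+#define ResSize 256  /* size of one partial result (illustrative value)   */
+#define NbIter  100  /* number of computation iterations (illustrative)   */
+
+/* d_partial: device buffer holding NbS partial results;
+   h_partial: pinned host buffer receiving them asynchronously.           */
+void retrieve_partial_results(float *d_partial, float *h_partial)
+{
+  cudaStream_t streams[NbS];
+  for (int s = 0; s < NbS; s++)
+    cudaStreamCreate(&streams[s]);
+
+  for (int iter = 0; iter < NbIter; iter++) {
+    for (int s = 0; s < NbS; s++) {
+      /* ... kernel launch on streams[s], producing its partial result ... */
+      /* asynchronous transfer of this partial result to the CPU, issued
+         before the global synchronization point                           */
+      cudaMemcpyAsync(h_partial + s * ResSize, d_partial + s * ResSize,
+                      ResSize * sizeof(float), cudaMemcpyDeviceToHost,
+                      streams[s]);
+    }
+    cudaDeviceSynchronize();  /* all kernels and transfers of the iteration */
+    /* ... OpenMP barrier and buffer-pointer exchange ...                   */
+  }
+
+  /* destruction of the CUDA streams after the computation loop             */
+  for (int s = 0; s < NbS; s++)
+    cudaStreamDestroy(streams[s]);
+}
+\end{verbatim}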
+
+In conclusion, CUDA streams\index{CUDA!stream} have been used to extend
+\Lst{algo:ch6p1overlapseqsequence} while preserving its global
+scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an OpenMP parallel
+region, with two CPU threads: one in charge of MPI communications and the other
+managing data transfers and GPU computations. Unfortunately, using GPU streams
+requires the ability to split a GPU computation into independent subparts that
+work on independent subsets of data. \Lst{algo:ch6p1overlapstreamsequence}
+is therefore not as generic as \Lst{algo:ch6p1overlapseqsequence}.
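+
+As a minimal illustration of this requirement (again with purely illustrative
+names and sizes, not taken from the book listings: \texttt{DataSize},
+\texttt{ChunkSize}, \texttt{chunk\_kernel}, \texttt{process\_in\_streams}),
+splitting a computation over $NbS$ streams amounts to giving each stream its
+own transfer and its own kernel launch on an independent chunk of the data:
+\begin{verbatim}
+#include <cuda_runtime.h>
+
+#define DataSize  (1 << 20)            /* total data size (illustrative)    */
+#define NbS       4                    /* number of CUDA streams            */
+#define ChunkSize (DataSize / NbS)     /* each stream works on one chunk    */
+
+/* Placeholder kernel working only on its own chunk of the data. */
+__global__ void chunk_kernel(float *d_chunk, int size)
+{
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  if (i < size)
+    d_chunk[i] = 2.0f * d_chunk[i];
+}
+
+/* Splitting one GPU computation into NbS independent subparts:
+   each stream transfers and processes its own subset of the data. */
+void process_in_streams(float *h_data, float *d_data, cudaStream_t *streams)
+{
+  for (int s = 0; s < NbS; s++) {
+    int offset = s * ChunkSize;
+    /* asynchronous CPU-to-GPU transfer of this stream's subset        */
+    cudaMemcpyAsync(d_data + offset, h_data + offset,
+                    ChunkSize * sizeof(float), cudaMemcpyHostToDevice,
+                    streams[s]);
+    /* kernel launch restricted to the same subset, on the same stream */
+    chunk_kernel<<<ChunkSize / 256, 256, 0, streams[s]>>>(d_data + offset,
+                                                          ChunkSize);
+  }
+}
+\end{verbatim}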
+
+
+\subsection{Interleaved communications-transfers-computations\\overlapping}
+
+Many algorithms do not lend themselves to splitting data transfers and kernel calls and thus
+cannot exploit CUDA streams, for example, when each GPU thread requires access to