-(for example the result of one \emph{reduction} per iteration), this transfer is achieved
-synchronously on the default stream (no particular stream is specified) on lines 51--54.
-Availability of the result values is ensured by the synchronization implemented on line 49.
-However, if a partial result has to be transferred on the CPU on each stream, then $NbS$ asynchronous data
-transfers could be started in parallel (one per stream), and should be implemented before the
-synchronization operation on line 49. The end of the computation loop includes a synchronization
-barrier of the two OpenMP threads, waiting they have finished to access the different data
-buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, like
-in the previous algorithm. However, after the computation loop, we have added the
-destruction of the CUDA streams (lines 63--65).
-
-Finally, CUDA streams\index{CUDA!stream} have been used to extend \Lst{algo:ch6p1overlapseqsequence}
-with respect to its global scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an
-OpenMP parallel region, with two CPU threads, one in charge of MPI communications, and the other
-managing data transfers and GPU computations. Unfortunately, using GPU streams require to be able to
-split a GPU computation in independent subparts, working on independent subsets of data.
-\Lst{algo:ch6p1overlapstreamsequence} is not so generic than \Lst{algo:ch6p1overlapseqsequence}.
+(for example, the result of one \emph{reduction} per iteration), this transfer is achieved
+synchronously on the default stream (no particular stream is specified) on lines~51--54.
+Availability of the result values is ensured by the synchronization implemented on line~49.
+However, if a partial result has to be transferred to the CPU on each stream, then $NbS$ asynchronous data
+transfers could be started in parallel (one per stream) and should be implemented before the
+synchronization operation on line~49. The end of the computation loop includes a synchronization
+barrier of the two OpenMP threads, waiting until they have finished accessing the different data
+buffers in the current iteration. Then, each OpenMP thread exchanges its local buffer pointers, as
+in the previous algorithm. After the computation loop, we have added the
+destruction of the CUDA streams (lines~64--65).
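+
+As an illustration, the short CUDA~C sketch below issues one asynchronous
+partial-result transfer per stream before the global synchronization and
+destroys the streams afterwards. It is only a sketch of this strategy: the
+names \texttt{part\_h}, \texttt{part\_d} and \texttt{partSize}, as well as the
+pinned allocation of the host buffer, are our own assumptions and do not come
+from the listing.
+\begin{verbatim}
+#include <stdio.h>
+#include <cuda_runtime.h>
+
+#define NbS      4      /* number of CUDA streams (illustrative value)       */
+#define partSize 1024   /* partial-result elements per stream (illustrative) */
+
+int main(void)
+{
+  cudaStream_t streams[NbS];
+  double *part_h, *part_d;
+
+  for (int s = 0; s < NbS; s++)
+    cudaStreamCreate(&streams[s]);
+
+  /* Pinned host memory, so that the copies can really be asynchronous. */
+  cudaMallocHost((void **)&part_h, NbS * partSize * sizeof(double));
+  cudaMalloc((void **)&part_d, NbS * partSize * sizeof(double));
+  cudaMemset(part_d, 0, NbS * partSize * sizeof(double));
+
+  /* ... one reduction kernel per stream would write part_d here ... */
+
+  /* One asynchronous device-to-host transfer per stream, started before
+     the global synchronization (the one on line 49 of the listing).    */
+  for (int s = 0; s < NbS; s++)
+    cudaMemcpyAsync(part_h + s * partSize, part_d + s * partSize,
+                    partSize * sizeof(double), cudaMemcpyDeviceToHost,
+                    streams[s]);
+
+  cudaDeviceSynchronize();       /* partial results are now usable on the CPU */
+  printf("first partial result: %g\n", part_h[0]);
+
+  /* Stream destruction, after the computation loop. */
+  for (int s = 0; s < NbS; s++)
+    cudaStreamDestroy(streams[s]);
+
+  cudaFree(part_d);
+  cudaFreeHost(part_h);
+  return 0;
+}
+\end{verbatim}
+The host buffer is allocated with \texttt{cudaMallocHost} because asynchronous
+copies can only overlap with GPU computations when the host memory is pinned.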
+
+In conclusion, CUDA streams\index{CUDA!stream} have been used to extend
+\Lst{algo:ch6p1overlapseqsequence} with respect to its global
+scheme. \Lst{algo:ch6p1overlapstreamsequence} still creates an OpenMP parallel
+region, with two CPU threads, one in charge of MPI communications and the other
+managing data transfers and GPU computations. Unfortunately, using GPU streams
+requires the ability to split a GPU computation into independent subparts,
+working on independent subsets of data. \Lst{algo:ch6p1overlapstreamsequence}
+is not as generic as \Lst{algo:ch6p1overlapseqsequence}.
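+
+To make this global scheme more concrete, the self-contained sketch below
+reproduces its structure under simplifying assumptions: the kernel
+\texttt{dummyKernel}, the ring-style \texttt{MPI\_Sendrecv} exchange and names
+such as \texttt{NbS} or \texttt{DataSize} are purely illustrative and are not
+taken from \Lst{algo:ch6p1overlapstreamsequence}. Thread~0 performs the MPI
+communications, thread~1 launches one transfer and one kernel per stream on its
+own independent subset of the data, and both threads then meet at the barrier
+and exchange their local buffer pointers.
+\begin{verbatim}
+#include <mpi.h>
+#include <omp.h>
+#include <cuda_runtime.h>
+
+#define NbS      4                   /* number of CUDA streams (illustrative) */
+#define NbIter   10                  /* number of iterations (illustrative)   */
+#define DataSize (NbS * 1024)        /* total elements, split across streams  */
+
+__global__ void dummyKernel(double *d, int n) /* stand-in for the real kernels */
+{
+  int i = blockIdx.x * blockDim.x + threadIdx.x;
+  if (i < n) d[i] += 1.0;
+}
+
+int main(int argc, char *argv[])
+{
+  int provided, rank, size;
+  MPI_Init_thread(&argc, &argv, MPI_THREAD_FUNNELED, &provided);
+  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
+  MPI_Comm_size(MPI_COMM_WORLD, &size);
+
+  cudaStream_t streams[NbS];
+  for (int s = 0; s < NbS; s++) cudaStreamCreate(&streams[s]);
+
+  double *cur_h, *next_h, *data_d;
+  cudaMallocHost((void **)&cur_h,  DataSize * sizeof(double)); /* pinned */
+  cudaMallocHost((void **)&next_h, DataSize * sizeof(double)); /* pinned */
+  cudaMalloc((void **)&data_d, DataSize * sizeof(double));
+  int subSize = DataSize / NbS;      /* independent data subset per stream */
+
+#pragma omp parallel num_threads(2) firstprivate(cur_h, next_h)
+  {
+    for (int it = 0; it < NbIter; it++) {
+      if (omp_get_thread_num() == 0) {
+        /* CPU thread 0: MPI communications of the current iteration
+           (here, a simple ring exchange filling the next buffer).   */
+        MPI_Sendrecv(cur_h,  DataSize, MPI_DOUBLE, (rank + 1) % size, 0,
+                     next_h, DataSize, MPI_DOUBLE, (rank + size - 1) % size, 0,
+                     MPI_COMM_WORLD, MPI_STATUS_IGNORE);
+      } else {
+        /* CPU thread 1: one transfer and one kernel per stream, each
+           working on its own independent subset of the data.        */
+        for (int s = 0; s < NbS; s++) {
+          cudaMemcpyAsync(data_d + s * subSize, cur_h + s * subSize,
+                          subSize * sizeof(double), cudaMemcpyHostToDevice,
+                          streams[s]);
+          dummyKernel<<<(subSize + 255) / 256, 256, 0, streams[s]>>>
+                     (data_d + s * subSize, subSize);
+        }
+        cudaDeviceSynchronize();     /* all streams of this iteration done */
+      }
+#pragma omp barrier                  /* both threads are done with the buffers */
+      /* Each thread swaps its private buffer pointers for the next iteration. */
+      double *tmp = cur_h; cur_h = next_h; next_h = tmp;
+    }
+  }
+
+  for (int s = 0; s < NbS; s++) cudaStreamDestroy(streams[s]);
+  cudaFree(data_d);
+  cudaFreeHost(cur_h);
+  cudaFreeHost(next_h);
+  MPI_Finalize();
+  return 0;
+}
+\end{verbatim}
+The pointer swap can safely be performed on private copies because both threads
+execute the same exchange right after the barrier, so they keep a consistent
+view of which buffer is the current one.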