% Reduce operation
\subsection{Parallel reduction}
\label{chXXX:subsec:reduction}
-A parallel reduction\index{parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU.
+A parallel reduction\index{parallel!reduction} can be performed efficiently inside a GPU block, as shown in Figure~\ref{chXXX:fig:reduc}, because shared memory gives the threads of a block a fast and reliable way to communicate. At the grid level, however, a reduction is harder to implement because blocks cannot communicate directly with one another. The usual workaround is to apply the reduction in two separate steps. In the first step, a GPU kernel reduces the data over multiple blocks, and each block stores its local result on completion. The second step finishes the reduction on a single block or on the CPU.
An optimized implementation of the reduction can be found in the CUDA samples\footnote{Available at http://docs.nvidia.com/cuda/cuda-samples/index.html\#advanced} provided by NVIDIA.
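
The two-step scheme described above can be illustrated with a minimal sketch for a sum reduction. It is not the optimized NVIDIA version and is given only under stated assumptions: the names \texttt{blockSum}, \texttt{reduceSum}, and \texttt{BLOCK\_SIZE} are hypothetical, the block size is assumed to be a power of two, the second step is done on the CPU, and error checking is omitted for brevity.

```cuda
#include <vector>
#include <cuda_runtime.h>

// Hypothetical block size; the tree reduction below assumes a power of two.
#define BLOCK_SIZE 256

// Step 1: each block reduces its share of `in` through shared memory and
// stores one partial sum in partial[blockIdx.x].
__global__ void blockSum(const float *in, float *partial, int n)
{
    __shared__ float buf[BLOCK_SIZE];

    // Grid-stride loop: every thread first accumulates several elements.
    float acc = 0.0f;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x)
        acc += in[i];
    buf[threadIdx.x] = acc;
    __syncthreads();

    // Tree reduction inside the block: halve the active threads each pass.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            buf[threadIdx.x] += buf[threadIdx.x + s];
        __syncthreads();
    }

    // Thread 0 stores the local result of this block.
    if (threadIdx.x == 0)
        partial[blockIdx.x] = buf[0];
}

// Host wrapper: launches step 1 on the GPU, then performs step 2 on the CPU.
double reduceSum(const std::vector<float> &h_in, int numBlocks = 64)
{
    const int n = static_cast<int>(h_in.size());
    std::vector<float> h_partial(numBlocks);

    float *d_in = nullptr, *d_partial = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_partial, numBlocks * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    // The kernel must be launched with exactly BLOCK_SIZE threads per block.
    blockSum<<<numBlocks, BLOCK_SIZE>>>(d_in, d_partial, n);

    // This copy waits for the kernel to finish before reading the partials.
    cudaMemcpy(h_partial.data(), d_partial, numBlocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_partial);

    // Step 2: finish the reduction over the per-block results on the CPU.
    double total = 0.0;
    for (float p : h_partial)
        total += p;
    return total;
}
```

Calling \texttt{reduceSum} on a host vector returns the sum of its elements. Finishing on the CPU keeps the sketch short; the per-block results could equally be reduced by a second kernel launched on a single block, as mentioned above.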
%In order to keep code listings compact hereafter, the reduction of values among a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}).