new

[book_gpu.git] / BookGPU / Chapters / chapter3 / ch3.tex
diff --git a/BookGPU/Chapters/chapter3/ch3.tex b/BookGPU/Chapters/chapter3/ch3.tex

index 64709ea4e51f510d3925a37e43f552f5852e8068..1b2e263ffa5f3dbaebeea25d86c25383b41f6aa7 100755 (executable)
--- a/BookGPU/Chapters/chapter3/ch3.tex
+++ b/BookGPU/Chapters/chapter3/ch3.tex
@@ -18,10 +18,10 @@ However, so as to propose concise and more readable code, we will assume the fol
  \section{Data transfers, memory management.}
  This section deals with the following issues: 
  \begin{enumerate}
  \section{Data transfers, memory management.}
  This section deals with the following issues: 
  \begin{enumerate}
-\item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory~hierarchy!texture~memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images.
-\item Data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. \index{memory~hierarchy!texture~memory} Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching \index{prefetching} in shared memory. \index{memory~hierarchy!shared~memory}
+\item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory hierarchy!texture memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images.
+\item Data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. \index{memory hierarchy!texture memory} Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching \index{prefetching} in shared memory. \index{memory hierarchy!shared memory}
  \item Data outputting from kernels to GPU memory: there is actually no alternative to global memory, as kernels cannot directly write into texture memory and as copying from texture to CPU memory would not be faster than from simple global memory.
  \item Data outputting from kernels to GPU memory: there is actually no alternative to global memory, as kernels cannot directly write into texture memory and as copying from texture to CPU memory would not be faster than from simple global memory.
-\item Data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, \index{memory~hierarchy!pinned~memory} keeping in mind it has to be used sparingly.
+\item Data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, \index{memory hierarchy!pinned memory} keeping in mind it has to be used sparingly.
  \end{enumerate}
  Algorithm \ref{algo:memcopy} summarizes all the above considerations and describes how data are handled in our examples. For more information on how to handle the different types of GPU memory, we suggest referring to the CUDA programmer's guide. 
  
  \end{enumerate}
  Algorithm \ref{algo:memcopy} summarizes all the above considerations and describes how data are handled in our examples. For more information on how to handle the different types of GPU memory, we suggest referring to the CUDA programmer's guide. 
  
@@ -61,7 +61,7 @@ The Makefile given in Listing \ref{lst:mkfile} shows how to adapt examples given
  \section{Performance measurements}
  As our goal is to design very fast implementations of basic image processing algorithms, we need to make quite accurate time-measurements, within the order of magnitude of $0.01$~ms. Again, the easiest way of doing so is to use the helper functions of the \textbf{cutil} library. As usual, because the durations we are measuring are short and possibly subject to non negligible variations, a good practice is to measure multiple executions and report the mean runtime. All time results given in this chapter have been obtained through 1000 calls to each kernel.
  
  \section{Performance measurements}
  As our goal is to design very fast implementations of basic image processing algorithms, we need to make quite accurate time-measurements, within the order of magnitude of $0.01$~ms. Again, the easiest way of doing so is to use the helper functions of the \textbf{cutil} library. As usual, because the durations we are measuring are short and possibly subject to non negligible variations, a good practice is to measure multiple executions and report the mean runtime. All time results given in this chapter have been obtained through 1000 calls to each kernel.
  
-Listing \ref{lst:chronos} shows how to use the dedicated \textbf{cutil} functions \index{Cutil library!Timer usage}. Timer declaration and creation need to be performed only once while reset, start and stop functions can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid runtime measurement being biased.
+Listing \ref{lst:chronos} shows how to use the dedicated \textbf{cutil} functions \index{Cutil library!timer usage}. Timer declaration and creation need to be performed only once while reset, start and stop functions can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid runtime measurement being biased.
  \lstinputlisting[label={lst:chronos},caption=Time measurement technique using cutil functions]{Chapters/chapter3/code/exChronos.cu}
  
  In an attempt to provide relevant speedup values, we either implemented CPU versions of the algorithms studied or used the values found in existing literature. Still, the large number and diversity of hardware platforms and GPU cards makes it impossible to benchmark every possible combination and significant differences may occur between the speedups we report and those obtained with different devices. As a reference, our developing platform details as follows:
  \lstinputlisting[label={lst:chronos},caption=Time measurement technique using cutil functions]{Chapters/chapter3/code/exChronos.cu}
  
  In an attempt to provide relevant speedup values, we either implemented CPU versions of the algorithms studied or used the values found in existing literature. Still, the large number and diversity of hardware platforms and GPU cards makes it impossible to benchmark every possible combination and significant differences may occur between the speedups we report and those obtained with different devices. As a reference, our developing platform details as follows:
@@ -127,7 +127,7 @@ copy data from GPU global memory to CPU memory\label{algoMedianGeneric:memcpyD2H
  As mentioned earlier, the selection of the median value can be performed by more than one technique, using either histogram-based or sorting methods, each having its own benefits and drawbacks as will be discussed further down. 
  
  \subsection{A naive implementation}
  As mentioned earlier, the selection of the median value can be performed by more than one technique, using either histogram-based or sorting methods, each having its own benefits and drawbacks as will be discussed further down. 
  
  \subsection{A naive implementation}
-As a reference, Listing \ref{lst:medianGeneric} gives a simple, not to say simplistic, implementation of a CUDA kernel (\texttt{kernel\_medianR}) achieving generic $n\times n$ histogram-based median filtering. Its runtime has a very low data dependency, but this implementation does not suit  GPU architecture very well. Each pixel loads the whole of its $n\times n$ neighborhood, meaning that one pixel is loaded multiple times inside one single thread block, and even more time-consuming, the use of a local vector (histogram[]) considerably downgrades performance, as the compiler automatically stores such vectors in local memory (slow) \index{memory~hierarchy!local~memory}.
+As a reference, Listing \ref{lst:medianGeneric} gives a simple, not to say simplistic, implementation of a CUDA kernel (\texttt{kernel\_medianR}) achieving generic $n\times n$ histogram-based median filtering. Its runtime has a very low data dependency, but this implementation does not suit  GPU architecture very well. Each pixel loads the whole of its $n\times n$ neighborhood, meaning that one pixel is loaded multiple times inside one single thread block, and even more time-consuming, the use of a local vector (histogram[]) considerably downgrades performance, as the compiler automatically stores such vectors in local memory (slow) \index{memory hierarchy!local memory}.
  
  Table \ref{tab:medianHisto1} displays measured runtimes of \texttt{kernel\_medianR} and pixel throughputs for each GPU version (C2070 and GTX480 targets) and for both CPU and GPU implementations. Usual window sizes of $3\times 3$, $5\times 5$, and $7\times 7$ are shown. Though some specific applications require larger window sizes and dedicated algorithms, such small square window sizes are most widely used in general purpose image processing. GPU runtimes have been obtained with a grid of 64-thread blocks.
  
  
  Table \ref{tab:medianHisto1} displays measured runtimes of \texttt{kernel\_medianR} and pixel throughputs for each GPU version (C2070 and GTX480 targets) and for both CPU and GPU implementations. Usual window sizes of $3\times 3$, $5\times 5$, and $7\times 7$ are shown. Though some specific applications require larger window sizes and dedicated algorithms, such small square window sizes are most widely used in general purpose image processing. GPU runtimes have been obtained with a grid of 64-thread blocks.
  
@@ -181,10 +181,10 @@ On the GPU's side, we note high dependence on window size due to the redundancy
  
  \begin{figure}[t]
  \centering
  
  \begin{figure}[t]
  \centering
-   \subfigure[Airplane image, corrupted by salt and pepper noise of density 0.25]{\label{img:sap_example_ref} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25.png}}\qquad
-   \subfigure[Image denoised by a $3\times 3$ median filter]{\label{img:sap_example_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3.png}}\\
-   \subfigure[Image denoised by a $5\times 5$ median filter]{\label{img:sap_example_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med5.png}}\qquad
-   \subfigure[Image denoised by 2 iterations of a $3\times 3$ median filter]{\label{img:sap_example_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3_it2.png}}\\
+   \subfigure[Airplane image, corrupted by salt and pepper noise of density 0.25]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25.png}}\qquad
+   \subfigure[Image denoised by a $3\times 3$ median filter]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3.png}}\\
+   \subfigure[Image denoised by a $5\times 5$ median filter]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med5.png}}\qquad
+   \subfigure[Image denoised by 2 iterations of a $3\times 3$ median filter]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3_it2.png}}\\
     \caption{Example of median filtering, applied to salt and pepper noise reduction.}
     \label{fig:sap_examples}
  \end{figure}
     \caption{Example of median filtering, applied to salt and pepper noise reduction.}
     \label{fig:sap_examples}
  \end{figure}
@@ -192,9 +192,9 @@ On the GPU's side, we note high dependence on window size due to the redundancy
  \section{NVIDIA GPU tuning recipes}
  When designing GPU code, besides thinking of the actual data computing process, one must choose the memory type in which to store temporary data. Three types of GPU memory are available:
  \begin{enumerate}
  \section{NVIDIA GPU tuning recipes}
  When designing GPU code, besides thinking of the actual data computing process, one must choose the memory type in which to store temporary data. Three types of GPU memory are available:
  \begin{enumerate}
-\item \textbf{Global memory, the most versatile:} \index{memory~hierarchy!global~memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism.
-\item \textbf{Registers, the fastest:} \index{memory~hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count}
-\item \textbf{Shared memory, a complex compromise:} \index{memory~hierarchy!shared~memory}\\All threads in one block can access $48~KBytes$ of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers. 
+\item \textbf{Global memory, the most versatile:} \index{memory hierarchy!global memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism.
+\item \textbf{Registers, the fastest:} \index{memory hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count}
+\item \textbf{Shared memory, a complex compromise:} \index{memory hierarchy!shared memory}\\All threads in one block can access $48~KBytes$ of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers. 
  However, bank conflicts can occur if two threads of a warp try to access data stored in one single memory bank. In such cases, the parallel process is serialized which may cause significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses.  
  \end{enumerate}
  
  However, bank conflicts can occur if two threads of a warp try to access data stored in one single memory bank. In such cases, the parallel process is serialized which may cause significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses.  
  \end{enumerate}
  
@@ -204,7 +204,7 @@ To overcome this, the most frequent choice made in efficient implementations fou
  As for registers, designing a generic median filter that would use only that type of memory seems difficult, due to the above mentioned 63 register-per-thread limitation. \index{register count} 
  Yet, nothing forbids us to design fixed-size filters, each of them specific to one of the most popular window sizes. It might be worth the effort as dramatic increase in performance could be expected.
  
  As for registers, designing a generic median filter that would use only that type of memory seems difficult, due to the above mentioned 63 register-per-thread limitation. \index{register count} 
  Yet, nothing forbids us to design fixed-size filters, each of them specific to one of the most popular window sizes. It might be worth the effort as dramatic increase in performance could be expected.
  
-Another track to follow in order to improve performance of GPU implementations consists of hiding latencies generated by arithmetic instruction calls and memory accesses. Both can be partially hidden by introducing Instruction-Level Parallelism \index{Instruction-Level Parallelism}(ILP) and by increasing the data count outputted by each thread. Though such techniques may seem to break the NVIDIA occupancy paradigm, they can lead to dramatically higher data throughput values.
+Another track to follow in order to improve performance of GPU implementations consists of hiding latencies generated by arithmetic instruction calls and memory accesses. Both can be partially hidden by introducing Instruction-Level Parallelism \index{instruction-level parallelism}(ILP) and by increasing the data count outputted by each thread. Though such techniques may seem to break the NVIDIA occupancy paradigm, they can lead to dramatically higher data throughput values.
  The following sections illustrate these ideas and detail the design of the fastest CUDA median filter known to date.
    
  \section{A 3$\times$3 median filter:  using registers}
  The following sections illustrate these ideas and detail the design of the fastest CUDA median filter known to date.
    
  \section{A 3$\times$3 median filter:  using registers}
@@ -252,7 +252,7 @@ In our $3\times 3$ pixel window example, the minimum register count becomes $k_9
  This iterative process is illustrated in Figure \ref{fig:forgetful3}, where it achieves one entire $3\times 3$ median selection, beginning with $k_9=6$ elements.
  
  The \textit{forgetful selection} method, used in \cite{mcguire2008median}, does not imply full sorting of values, but only selecting minimum and maximum values, which, at the price of a few iteration steps ($n^2-k$), reduces arithmetic complexity.
  This iterative process is illustrated in Figure \ref{fig:forgetful3}, where it achieves one entire $3\times 3$ median selection, beginning with $k_9=6$ elements.
  
  The \textit{forgetful selection} method, used in \cite{mcguire2008median}, does not imply full sorting of values, but only selecting minimum and maximum values, which, at the price of a few iteration steps ($n^2-k$), reduces arithmetic complexity.
-Listing \ref{lst:medianForget1pix3} details this process where forgetful selection is achieved by use of simple 2-value swapping function ($s()$, lines 1 to 5) that swaps input values if necessary, so as to achieve the first steps of an incomplete sorting network \cite{Batcher:1968:SNA:1468075.1468121}. Moreover, whenever possible, in order to increase the ILP, \index{Instruction-Level Parallelism} successive calls to $s()$ are done with independant elements as arguments. This is illustrated by the macro definitions of lines 7 to 12 and by Figure \ref{fig:bitonic} which details the first iteration of the $5\times 5$ selection, starting with $k_{25}=14$ elements. 
+Listing \ref{lst:medianForget1pix3} details this process where forgetful selection is achieved by use of simple 2-value swapping function ($s()$, lines 1 to 5) that swaps input values if necessary, so as to achieve the first steps of an incomplete sorting network \cite{Batcher:1968:SNA:1468075.1468121}. Moreover, whenever possible, in order to increase the ILP, \index{instruction-level parallelism} successive calls to $s()$ are done with independant elements as arguments. This is illustrated by the macro definitions of lines 7 to 12 and by Figure \ref{fig:bitonic} which details the first iteration of the $5\times 5$ selection, starting with $k_{25}=14$ elements. 
  \begin{figure}[b]
     \centering
     \includegraphics[width=6cm]{Chapters/chapter3/img/forgetful_selection.png}
  \begin{figure}[b]
     \centering
     \includegraphics[width=6cm]{Chapters/chapter3/img/forgetful_selection.png}
@@ -351,10 +351,10 @@ Torben Morgensen sorting takes place between lines 37 and 66 and eventually, the
  It has to be noticed that this smoother, unlike the technique we proposed for fixed-size median filters, cannot be considered as a state-of-the-art technique as, for example, the one presented in \cite{4287006}. However, it may be considered as a good, easy to use and efficient alternative as confirmed by the results presented in Table \ref{tab:medianSeparable}. Pixel throughput values achieved by our kernel, though not constant with window size, remain very competitive if window size is kept under $120\times 120$ pixels, especially when outputting 2 pixels per thread (in \cite{4287006}, pixel throughput is around 7MP/s).
  Figure \ref{fig:sap_examples2} shows an example of a $512\times 512$ pixel image, corrupted by a  \textit{salt and pepper} noise, and the denoised versions, outputted respectively by a $3\times 3$, a $5\times 5$, and a $55\times 55 $ separable smoother.
  \begin{figure}
  It has to be noticed that this smoother, unlike the technique we proposed for fixed-size median filters, cannot be considered as a state-of-the-art technique as, for example, the one presented in \cite{4287006}. However, it may be considered as a good, easy to use and efficient alternative as confirmed by the results presented in Table \ref{tab:medianSeparable}. Pixel throughput values achieved by our kernel, though not constant with window size, remain very competitive if window size is kept under $120\times 120$ pixels, especially when outputting 2 pixels per thread (in \cite{4287006}, pixel throughput is around 7MP/s).
  Figure \ref{fig:sap_examples2} shows an example of a $512\times 512$ pixel image, corrupted by a  \textit{salt and pepper} noise, and the denoised versions, outputted respectively by a $3\times 3$, a $5\times 5$, and a $55\times 55 $ separable smoother.
  \begin{figure}
-   \subfigure[Airplane image, corrupted with by salt and pepper noise of density 0.25]{\label{img:sap_example_ref} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25.png}}\qquad
-   \subfigure[Image denoised by a $3\times 3$ separable smoother]{\label{img:sap_example_sep_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med3.png}}\\
-   \subfigure[Image denoised by a $5\times 5$ separable smoother]{\label{img:sap_example_sep_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med5.png}}\qquad
-   \subfigure[Image background estimation by a $55\times 55$ separable smoother]{\label{img:sap_example_sep_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med111.png}}\\
+   \subfigure[Airplane image, corrupted with by salt and pepper noise of density 0.25]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25.png}}\qquad
+   \subfigure[Image denoised by a $3\times 3$ separable smoother]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med3.png}}\\
+   \subfigure[Image denoised by a $5\times 5$ separable smoother]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med5.png}}\qquad
+   \subfigure[Image background estimation by a $55\times 55$ separable smoother]{ \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med111.png}}\\
     \caption{Example of separable median filtering (smoother), applied to salt and pepper noise reduction.}
     \label{fig:sap_examples2}
  \end{figure}
     \caption{Example of separable median filtering (smoother), applied to salt and pepper noise reduction.}
     \label{fig:sap_examples2}
  \end{figure}