ch17

diff --git a/BookGPU/Chapters/chapter3/ch3.tex b/BookGPU/Chapters/chapter3/ch3.tex

index 1b2e263ffa5f3dbaebeea25d86c25383b41f6aa7..1c6453dd2d1f82cd958b6acf1c3daf59edd55dad 100755 (executable)
--- a/BookGPU/Chapters/chapter3/ch3.tex
+++ b/BookGPU/Chapters/chapter3/ch3.tex
@@ -59,9 +59,9 @@ The Makefile given in Listing \ref{lst:mkfile} shows how to adapt examples given
  
  
  \section{Performance measurements}
-As our goal is to design very fast implementations of basic image processing algorithms, we need to make quite accurate time-measurements, within the order of magnitude of $0.01$~ms. Again, the easiest way of doing so is to use the helper functions of the \textbf{cutil} library. As usual, because the durations we are measuring are short and possibly subject to non negligible variations, a good practice is to measure multiple executions and report the mean runtime. All time results given in this chapter have been obtained through 1000 calls to each kernel.
+As our goal is to design very fast implementations of basic image processing algorithms, we need to make quite accurate time measurements, to within about $0.01$~ms. Again, the easiest way of doing so is to use the helper functions of the \textbf{cutil} library. As usual, because the durations we are measuring are short and possibly subject to nonnegligible variations, a good practice is to measure multiple executions and report the mean runtime. All time results given in this chapter have been obtained through 1000 calls to each kernel.
  
  
-Listing \ref{lst:chronos} shows how to use the dedicated \textbf{cutil} functions \index{Cutil library!timer usage}. Timer declaration and creation need to be performed only once while reset, start and stop functions can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid runtime measurement being biased.
+Listing \ref{lst:chronos} shows how to use the dedicated \textbf{cutil} functions\index{Cutil library!timer usage}. Timer declaration and creation need to be performed only once while reset, start and stop functions can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid runtime measurement being biased.
  \lstinputlisting[label={lst:chronos},caption=Time measurement technique using cutil functions]{Chapters/chapter3/code/exChronos.cu}
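The measurement pattern described above (start once, launch many times, synchronize, stop, divide) can be sketched on the host side as follows. This is only an illustrative Python sketch and not the \textbf{cutil} code of Listing \ref{lst:chronos}: the names \texttt{mean\_runtime\_ms}, \texttt{launch} and \texttt{synchronize} are hypothetical, and \texttt{synchronize} stands in for whatever device synchronization call the real code uses before stopping the timer.

```python
import time


def mean_runtime_ms(launch, runs=1000, synchronize=None):
    """Return the mean wall-clock time of one launch() call, in milliseconds.

    launch      -- hypothetical callable standing in for one kernel launch
    runs        -- number of repetitions averaged (1000 in this chapter)
    synchronize -- hypothetical callable that waits for pending work;
                   it is called once, before the timer is stopped
    """
    start = time.perf_counter()
    for _ in range(runs):
        launch()
    if synchronize is not None:
        # Without this step, asynchronous launches could still be running
        # when the timer stops, which would bias the measurement.
        synchronize()
    stop = time.perf_counter()
    return (stop - start) * 1000.0 / runs
```

Averaging over many launches smooths out run-to-run variations, and synchronizing only once keeps the overhead of the measurement itself small.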
  
  In an attempt to provide relevant speedup values, we either implemented CPU versions of the algorithms studied or used the values found in existing literature. Still, the large number and diversity of hardware platforms and GPU cards make it impossible to benchmark every possible combination, and significant differences may occur between the speedups we report and those obtained with different devices. As a reference, our development platform is detailed as follows:
@@ -93,7 +93,7 @@ Median filtering is a well-known method used in a wide range of application fram
  First introduced by Tukey in \cite{tukey77}, it has been widely studied since then, and many researchers have proposed efficient implementations of it, adapted to various hypotheses, architectures and processors. 
  Originally, its main drawbacks were its computational complexity, its nonlinearity and its data-dependent runtime. Several researchers have addressed these issues and designed, for example, efficient histogram-based median filters with predictable runtimes \cite{Huang:1981:TDS:539567, Weiss:2006:FMB:1179352.1141918}.  
  
-More recently, the advent of GPUs opened new perspectives in terms of image processing performance, and some researchers managed to take advantage of the new graphics capabilities: in that respect, we can cite the Branchless Vectorized Median (BVM) filter \cite{5402362, chen09} which allows very interesting runtimes on CUDA-enabled devices but, as far as we know, the fastest implementation to date is the histogram-based PCMF median filter \cite{Sanchez-2-2012}.
+More recently, the advent of GPUs opened new perspectives in terms of image processing performance, and some researchers managed to take advantage of the new graphics capabilities: in that respect, we can cite the Branchless Vectorized Median (BVM) filter \cite{5402362, chen09} which allows very interesting runtimes on CUDA-enabled devices but, as far as we know, the fastest implementation to date is the histogram-based Parallel Ccdf-based Median Filter (PCMF) \cite{Sanchez-2-2012} where Ccdf means Complementary Cumulative Distribution Function.
  
  Some of the following implementations feature very fast runtimes. They target NVIDIA Tesla GPUs (Fermi architecture, compute capability 2.x) but may easily be adapted to other models, e.g., those of compute capability 1.3.
  
  
@@ -135,7 +135,7 @@ The first observation to make when analysing results of Table \ref{tab:medianHis
  Since inner loops that fill the histogram vector contain very few fetching instructions (from 9 to 49, depending on the window size), it is not surprising to note their negligible impact compared to outer loops that fetch image pixels (from 256k to 16M instructions). 
  One could be tempted to claim that CPU has no chance to win, which is not so obvious as it highly depends on what kind of algorithm is run and, above all, how it is implemented. To illustrate this, we can observe that, despite a maximum effective throughput potential that is almost five times higher, measured GTX280 throughput values sometimes prove slower than CPU values, as shown in Table \ref{tab:medianHisto1}.
  
-On the GPU's side, we note high dependence on window size due to the redundancy induced by the multiple fetches of each pixel inside each block, becoming higher with the window size. Figure \ref{fig:median_overlap} shows for example that two $5\times 5$ windows, centered on two neighbor pixels share at least 16 pixels. On C2070 card, thanks to a more efficient caching mechanism, this effect is less. On GPUs, dependency on image size is low, and due to slightly more efficient data transfers when copying larger data amounts, pixel throughputs increases with image size. As an example, transferring a 4096$\times$4096 pixel image (32~MBytes) is a bit faster than transferring  a 512$\times$512 pixel image (0.5~MBytes) 64 times.
+On the GPU's side, we note a high dependence on window size due to the redundancy induced by the multiple fetches of each pixel inside each block, which grows with the window size. Figure \ref{fig:median_overlap} shows, for example, that two $5\times 5$ windows centered on two neighboring pixels share at least 16 pixels. On a C2070 card, thanks to a more efficient caching mechanism, this effect is less pronounced. On GPUs, dependency on image size is low and, due to slightly more efficient data transfers when copying larger amounts of data, pixel throughput increases with image size. As an example, transferring a 4096$\times$4096 pixel image (32~MBytes) is a bit faster than transferring a 512$\times$512 pixel image (0.5~MBytes) 64 times.
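The amount of redundancy between neighboring windows can be checked with a few lines of arithmetic. The sketch below is only illustrative (the function name \texttt{shared\_pixels} is ours and does not appear in the chapter's code): it counts the pixels common to two $n\times n$ windows whose centers are shifted by a given offset.

```python
def shared_pixels(n, dx, dy):
    """Number of pixels common to two n x n windows whose centers
    differ by dx columns and dy rows."""
    overlap_cols = max(0, n - abs(dx))
    overlap_rows = max(0, n - abs(dy))
    return overlap_cols * overlap_rows


# 5x5 windows: horizontal neighbors share 20 pixels (the window minus one
# column), diagonal neighbors share 16, hence the "at least 16" bound.
horizontal = shared_pixels(5, 1, 0)
diagonal = shared_pixels(5, 1, 1)
```

For a $5\times 5$ window, the worst case among the eight immediate neighbors is the diagonal one, which still shares 16 of its 25 pixels.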
  \begin{figure}[h]
     \centering
     \includegraphics[width=5cm]{Chapters/chapter3/img/median_overlap.png}
@@ -192,7 +192,7 @@ On the GPU's side, we note high dependence on window size due to the redundancy
  \section{NVIDIA GPU tuning recipes}
  When designing GPU code, besides thinking of the actual data computing process, one must choose the memory type in which to store temporary data. Three types of GPU memory are available:
  \begin{enumerate}
-\item \textbf{Global memory, the most versatile:} \index{memory hierarchy!global memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism.
+\item \textbf{Global memory, the most versatile:} \index{memory hierarchy!global memory}\\Offers the largest storage space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included in it, but allows access through an efficient 2D caching mechanism.
  \item \textbf{Registers, the fastest:} \index{memory hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count}
  \item \textbf{Shared memory, a complex compromise:} \index{memory hierarchy!shared memory}\\All threads in one block can access 48~KBytes of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers. 
  However, bank conflicts can occur if two threads of a warp try to access different data stored in the same memory bank. In such cases, the accesses are serialized, which may cause a significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses.  
@@ -201,7 +201,7 @@ However, bank conflicts can occur if two threads of a warp try to access data st
  As observed earlier, designing a median filter GPU implementation using only global memory is fairly straightforward, but its performance remains quite low even if it is faster than CPU. 
  To overcome this, the most frequent choice made in efficient implementations found in the literature is to use shared memory. Such an option implies prefetching \index{prefetching}data prior to doing the actual computations, a relevant choice, as each pixel of an image belongs to $n^2$ different neighborhoods. Thus, it can be expected that fetching each gray-level value from global memory only once should be more efficient than doing it each time it is required. One of the most efficient implementations using shared memory is presented in \cite{5402362}. In the case of the generic kernel of Listing \ref{lst:medianGeneric}, using shared memory without further optimization would not bring a valuable speedup because that would just move redundancy from texture to shared memory fetching and would generate bank conflicts. For information, we wrote such a version of the generic median kernel and our measurements showed a speedup of around 3\% (as an example, 32~ms for a $5\times 5$ median on a 1024$^2$ pixel image, i.e., 33~MP/s). 
  
-As for registers, designing a generic median filter that would use only that type of memory seems difficult, due to the above mentioned 63 register-per-thread limitation. \index{register count} 
+As for registers, designing a generic median filter that would only use that type of memory seems difficult, due to the above-mentioned limit of 63 registers per thread. \index{register count} 
  Yet, nothing prevents us from designing fixed-size filters, each of them specific to one of the most popular window sizes. It might be worth the effort, as a dramatic increase in performance could be expected.
  
  Another track to follow in order to improve performance of GPU implementations consists of hiding latencies generated by arithmetic instruction calls and memory accesses. Both can be partially hidden by introducing Instruction-Level Parallelism \index{instruction-level parallelism}(ILP) and by increasing the number of data items output by each thread. Though such techniques may seem to break the NVIDIA occupancy paradigm, they can lead to dramatically higher data throughput values.
@@ -212,7 +212,7 @@ Designing a median filter dedicated to the smallest possible square window size
  One first issue is that the exclusive use of registers prevents us from implementing a naive histogram-based method. Under an \textit{8-bit gray-level pixel per thread} rule, each histogram requires one 256-element vector to store its values, i.e., more than four times the maximum register count allowed per thread (63).\index{register count} Considering that a $3\times 3$ median filter involves only 9 pixel values per thread, it seems obvious they can be sorted within the 63-register limit.
  
  \subsection{The simplest way}
-In the case of a 3$\times$3 median filter, the simplest solution consists of associating one register to each gray-level value, then sorting those 9 values and selecting the fifth one, i.e., the median value.  For such a small amount of data to sort, a simple selection method is well indicated. As shown in Listing \ref{lst:kernelMedian3RegTri9} (\texttt{kernel\_Median3RegSort9()}), the constraint of using only registers forces the adoption of an unusual manner of coding. However, results are persuasive: runtimes are divided by around 120 on GTX280 and 80 on C2070, while only reduced by a 3.5 factor on CPU (CPU median3 bubble sort).
+In the case of a 3$\times$3 median filter, the simplest solution consists of associating one register to each gray-level value, then sorting those 9 values and selecting the fifth one, i.e., the median value.  For such a small amount of data to sort, a simple selection method is well suited. As shown in Listing \ref{lst:kernelMedian3RegTri9} (\texttt{kernel\_Median3RegSort9()}), the constraint of only using registers forces the adoption of an unusual manner of coding. However, results are persuasive: runtimes are divided by around 120 on GTX280 and 80 on C2070, while only reduced by a factor of 3.5 on CPU (CPU median3 bubble sort).
  The diagram of Figure \ref{fig:compMedians1} summarizes these first results for C2070, obtained with a block size of 256 threads, and Xeon CPU. We included the maximum effective pixel throughput in order to see the improvement potential of the different implementations. We also introduced throughput achieved by libJacket, a commercial implementation, as it was the fastest known implementation of a $3\times 3$ median filter to date, as illustrated in \cite{chen09}. One of the authors of libJacket kindly posted the CUDA code of its $3\times 3$ median filter, which we inserted into our own coding structure. The algorithm itself is quite similar to ours, but running it in our own environment produced higher throughput values than those published in \cite{chen09}, not due to different hardware capabilities between our GTX280 and the GTX260 those authors used, but due to the way we perform memory transfers and our register-only method of storing temporary data.
  
  \lstinputlisting[label={lst:kernelMedian3RegTri9},caption= $3\times 3$ median filter kernel using one register per neighborhood pixel and bubble sort]{Chapters/chapter3/code/kernMedianRegTri9.cu}
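As a plain illustration of the "sort 9 values, keep the fifth" idea, the following Python sketch runs a partial selection sort on the 9 gray levels of one window. It is not the kernel of Listing \ref{lst:kernelMedian3RegTri9}: the function name \texttt{median9} is hypothetical, and a Python list stands in for the nine registers that the CUDA kernel has to name individually.

```python
def median9(window):
    """Median of the 9 gray levels of a 3x3 window.

    Only the five smallest values need to be put in order:
    the fifth one is the median.
    """
    v = list(window)
    if len(v) != 9:
        raise ValueError("a 3x3 window holds exactly 9 values")
    for i in range(5):
        # Find the smallest remaining value and move it to position i.
        smallest = i
        for j in range(i + 1, 9):
            if v[j] < v[smallest]:
                smallest = j
        v[i], v[smallest] = v[smallest], v[i]
    return v[4]
```

Stopping after five passes instead of sorting all nine values already saves comparisons, which hints at why selection-based approaches can go further than full sorting.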
@@ -308,7 +308,7 @@ Considering the maximum register count allowed per thread (63) and trying to pus
  The next two sections will first detail the particular case of the 5$\times$5 median through a register-only method and then a generic kernel for larger window sizes.
  
  \subsection{A register-only 5$\times$5 median filter \label{sec:median5}}
-The minimum register count required to apply the forgetful selection method to a 5$\times$5 median filter is $k_{25}=\lceil 25/2\rceil+1 = 14$. Moreover, two adjacent overlapping windows share 20 pixels ($n^2-one\_column$) so that, when processing 2 pixels simultaneously, a count of 7 common selection stages can be carried out from the first selection stage with 14 common values to the processing of the last common value. This allows limiting register count to 22 per thread. Figure \ref{fig:median5overlap} describes the distribution of overlapping pixels, implemented in Listing \ref{lst:medianForget2pix5}: common selection stages take place from line 25 to line 37, while the remaining separate selection stages occur between lines 45 and 62 after the separation of line 40.
+The minimum register count required to apply the forgetful selection method to a 5$\times$5 median filter is $k_{25}=\lceil 25/2\rceil+1 = 14$. Moreover, two adjacent overlapping windows share 20 pixels ($n^2-n$, i.e., the whole window minus one column) so that, when processing 2 pixels simultaneously, a count of 7 common selection stages can be carried out from the first selection stage with 14 common values to the processing of the last common value. This allows limiting the register count to 22 per thread. Figure \ref{fig:median5overlap} describes the distribution of overlapping pixels, implemented in Listing \ref{lst:medianForget2pix5}: common selection stages take place from line 25 to line 37, while the remaining separate selection stages occur between lines 45 and 62 after the separation of line 40.
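To make the register count $k=\lceil n^2/2\rceil+1$ concrete, here is a minimal Python sketch of forgetful selection for a single window. It is our own illustration under simplifying assumptions (one pixel per thread, a Python list instead of registers) and not the two-pixel kernel of Listing \ref{lst:medianForget2pix5}; the names \texttt{min\_registers} and \texttt{forgetful\_median} are hypothetical.

```python
import math


def min_registers(count):
    """Working-set size needed by forgetful selection for `count` values."""
    return math.ceil(count / 2) + 1


def forgetful_median(values):
    """Median of an odd number of values by forgetful selection.

    The working set starts with k values. At each stage its minimum and
    maximum are discarded (neither can be the median), then one not yet
    seen value is brought in, until a single value remains.
    """
    count = len(values)
    if count % 2 == 0:
        raise ValueError("an odd number of values is expected")
    k = min_registers(count)
    work = list(values[:k])
    for nxt in range(k, count + 1):
        work.remove(min(work))
        work.remove(max(work))
        if nxt < count:
            work.append(values[nxt])
    return work[0]
```

With 25 values, the working set never exceeds 14 elements, which matches $k_{25}$ above; the first 7 stages of such a process only involve values that two horizontally adjacent windows have in common, which is what makes sharing them between 2 output pixels possible.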
  \begin{figure}
     \centering
     \includegraphics[width=6cm]{Chapters/chapter3/img/median5_overlap4.png}
@@ -347,7 +347,7 @@ which favors the use of shared memory. The 1D operation almost completely avoids
  Furthermore, the above-described forgetful selection method cannot be used anymore, as too many registers would be required. Instead, the Torben Mogensen sorting algorithm is used, as its required register count is both low and constant, and avoids the use of a local vector, unlike histogram-based methods.
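The reason why this algorithm needs only a small, constant number of registers is easier to see on a sequential version. The sketch below is a Python transcription of the commonly published form of Torben Mogensen's median search, given here as an assumption about the general technique and not as the code of Listing \ref{lst:medianSeparable}; the function name \texttt{torben\_median} is ours, and integer gray levels are assumed.

```python
def torben_median(values):
    """Median of an odd number of integer values without sorting or copying.

    Only a handful of scalars are kept (bounds, a guess, three counters),
    whatever the number of input values.
    """
    n = len(values)
    if n == 0:
        raise ValueError("at least one value is required")
    half = (n + 1) // 2
    lo, hi = min(values), max(values)
    while True:
        guess = (lo + hi) // 2
        less = greater = equal = 0
        max_lt_guess, min_gt_guess = lo, hi
        for v in values:
            if v < guess:
                less += 1
                max_lt_guess = max(max_lt_guess, v)
            elif v > guess:
                greater += 1
                min_gt_guess = min(min_gt_guess, v)
            else:
                equal += 1
        if less <= half and greater <= half:
            break
        # Narrow the search interval towards the side holding more values.
        if less > greater:
            hi = max_lt_guess
        else:
            lo = min_gt_guess
    if less >= half:
        return max_lt_guess
    if less + equal >= half:
        return guess
    return min_gt_guess
```

The input is read several times but never modified, so in a kernel the window can stay in shared memory while each thread only holds the few scalars above.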
  
  Listing \ref{lst:medianSeparable} presents a kernel code that implements the above considerations and achieves a 1D vertical $n \times 1$ median filter. The shared memory vector is declared as \texttt{extern} (Line 16) as its size is determined at runtime and passed to the kernel call as an argument. Lines 20 to 29 perform data prefetching, including the $2n$-row halo ($n$ at the bottom and $n$ at the top of each block). Then one synchronization barrier is mandatory (line 31) to ensure that all needed data is ready prior to its use by the different threads.
-Torben Morgensen sorting takes place between lines 37 and 66 and eventually, the transposed output value is stored in global memory at line 69. Outputting the transposed image in global memory saves time and allows to reuse the same kernel to achieve the second step, e.g 1D horizontal $n \times 1$ median filtering.
+Torben Mogensen sorting takes place between lines 37 and 66 and eventually, the transposed output value is stored in global memory at line 69. Outputting the transposed image in global memory saves time and allows the reuse of the same kernel to achieve the second step, i.e., 1D horizontal $n \times 1$ median filtering.
  It should be noted that this smoother, unlike the technique we proposed for fixed-size median filters, cannot be considered a state-of-the-art technique, unlike, for example, the one presented in \cite{4287006}. However, it may be considered a good, easy-to-use and efficient alternative, as confirmed by the results presented in Table \ref{tab:medianSeparable}. Pixel throughput values achieved by our kernel, though not constant with window size, remain very competitive if window size is kept under $120\times 120$ pixels, especially when outputting 2 pixels per thread (in \cite{4287006}, pixel throughput is around 7~MP/s).
  Figure \ref{fig:sap_examples2} shows an example of a $512\times 512$ pixel image, corrupted by \textit{salt and pepper} noise, and the denoised versions, output respectively by a $3\times 3$, a $5\times 5$, and a $55\times 55$ separable smoother.
  \begin{figure}