   1 \chapterauthor{Gilles Perrot}{Femto-ST Institute, University of Franche-Comte, France}
   2
   3 \newcommand{\kl}{\includegraphics[scale=0.6]{Chapters/chapter3/img/kernLeft.png}~}
   4 \newcommand{\kr}{\includegraphics[scale=0.6]{Chapters/chapter3/img/kernRight.png}}
   5
   6
   7 \chapter{Setting up the environment.}
   8 Image processing using a GPU often means using it as a general purpose computing processor, which soon brings up the issue of data transfers, especially when kernel runtime is fast and/or when large data sets are processed.
   9 The truth is that, in certain cases, data transfers between GPU and CPU are slower than the actual computation on GPU.
  10 It remains that global runtime can still be faster than similar processes run on CPU.
  11 Therefore, to fully optimize global runtimes, it is important to pay attention to how memory transfers are done.
  12 This leads us to propose, in the following section, an overall code structure to be used with all our kernel examples.
  13
  14 Our original code accepts various image dimensions and can process color images when an extrapolated definition of the median filter is chosen.
  15 However, so as to propose concise and more readable code, we will assume the following limitations:
  16 16~bit-coded gray-level input images whose dimensions $H\times W$ are multiples of 512 pixels.
  17
  18 \section{Data transfers, memory management.}
  19 This section deals with the following issues:
  20 \begin{enumerate}
  21 \item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory~hierarchy!texture~memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images.
  22 \item Data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. \index{memory~hierarchy!texture~memory} Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching \index{prefetching} in shared memory. \index{memory~hierarchy!shared~memory}
  23 \item Data outputting from kernels to GPU memory: there is actually no alternative to global memory, as kernels cannot directly write into texture memory and as copying from texture to CPU memory would not be faster than from simple global memory.
  24 \item Data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, \index{memory~hierarchy!pinned~memory} keeping in mind it has to be used sparingly.
  25 \end{enumerate}
  26 Algorithm \ref{algo:memcopy} summarizes all the above considerations and describes how data are handled in our examples. For more information on how to handle the different types of GPU memory, we suggest referring to the CUDA programmer's guide.
  27
  28 \begin{algorithm}
  29 %\SetNlSty{}{}{:}
  30  allocate and populate CPU memory \textbf{h\_in}\;
  31  allocate CPU pinned-memory \textbf{h\_out}\;
  32  allocate GPU global memory \textbf{d\_out}\;
  33  declare GPU texture reference \textbf{tex\_img\_in}\;
  34  allocate GPU array in global memory \textbf{array\_img\_in}\;
  35  bind GPU array \textbf{array\_img\_in} to texture \textbf{tex\_img\_in}\;
  36  copy data from \textbf{h\_in} to \textbf{array\_img\_in}\label{algo:memcopy:H2D}\;
  37  kernel\kl gridDim,blockDim\kr()\tcc*[f]{outputs to d\_out}\label{algo:memcopy:kernel}\;
  38  copy data from \textbf{d\_out} to \textbf{h\_out} \label{algo:memcopy:D2H}\;
  39 \caption{global memory management on CPU and GPU sides}
  40 \label{algo:memcopy}
  41 \end{algorithm}
  42
  43
  44 At debug stage, for simplicity's sake, we use the \textbf{cutil} \index{Cutil library} library supplied by the NVIDIA software development kit (SDK). Thus, in order to easily implement our examples, we suggest readers download and install the latest NVIDIA-SDK (ours is SDK4.0), create a new directory \textit{SDK-root-dir/C/src/fast\_kernels} and adapt the generic \textit{Makefile} that can be found in each subdirectory of \textit{SDK-root-dir/C/src/}. Then, only two more files will be needed to have a fully operational environment: \textit{main.cu} and \textit{fast\_kernels.cu}.
  45 Listings \ref{lst:main1}, \ref{lst:fkern1} and \ref{lst:mkfile} implement all the above considerations minimally, while remaining functional.
  46
  47 The main file of Listing \ref{lst:main1} is a simplified version of our actual main file.
  48 It has to be noted that functions \texttt{cutLoadPGMi} \index{Cutil library!cutLoadPGMi} and \texttt{cutSavePGMi} \index{Cutil library!cutSavePGMi} of the \textbf{cutil} library operate only on unsigned integer data. As data is coded in short integer format for performance reasons, the use of these functions involves one data cast after loading and before saving. This may be overcome by use of a different library. Actually, our choice was to modify the above mentioned cutil functions.
  49
  50 Listing \ref{lst:fkern1} gives a minimal kernel skeleton that will serve as the basis for all other kernels. Lines 5 and 6 determine the coordinates $(i, j)$ of the pixel to be processed, each pixel being associated to one thread.
  51 The instruction in line 8 combines writing the output gray-level value into global memory and fetching the input gray-level value from 2D texture memory.
  52 The Makefile given in Listing \ref{lst:mkfile} shows how to adapt examples given in SDK.
  53
  54 \lstinputlisting[label={lst:main1},caption=generic main.cu file used to launch CUDA kernels]{Chapters/chapter3/code/mainSkel.cu}
  55
  56 \lstinputlisting[label={lst:fkern1},caption=fast\_kernels.cu file featuring one kernel skeleton]{Chapters/chapter3/code/kernSkel.cu}
  57
  58 \lstinputlisting[label={lst:mkfile},caption=generic makefile based on those provided by NVIDIA SDK]{Chapters/chapter3/code/Makefile}
  59
  60
  61 \section{Performance measurements}
  62 As our goal is to design very fast implementations of basic image processing algorithms, we need to make quite accurate time measurements, within the order of magnitude of $0.01$~ms. Again, the easiest way of doing so is to use the helper functions of the \textbf{cutil} library. As usual, because the durations we are measuring are short and possibly subject to non-negligible variations, a good practice is to measure multiple executions and report the mean runtime. All time results given in this chapter have been obtained through 1000 calls to each kernel.
  63
  64 Listing \ref{lst:chronos} shows how to use the dedicated \textbf{cutil} functions \index{Cutil library!Timer usage}. Timer declaration and creation need to be performed only once while reset, start and stop functions can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid runtime measurement being biased.
  65 \lstinputlisting[label={lst:chronos},caption=Time measurement technique using cutil functions]{Chapters/chapter3/code/exChronos.cu}
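
The averaging procedure described above can also be sketched without the \textbf{cutil} helpers. The following host-side C++ fragment is only an illustration: it relies on \texttt{std::chrono} instead of the cutil timer, and the names \texttt{mean\_runtime\_ms}, \texttt{work} and \texttt{sync} are hypothetical. In a CUDA program, \texttt{work} would launch the kernel and \texttt{sync} would wait for the device to finish, so that the timer is not stopped while kernels are still running.

```cpp
#include <cassert>
#include <chrono>

// Mean wall-clock duration, in milliseconds, of `runs` calls to `work`.
// `sync` is called once, after the last call and before the timer is
// stopped. Assumption: for GPU code, `sync` blocks until all launched
// kernels have completed; for plain CPU code it can be an empty callable.
template <typename Work, typename Sync>
double mean_runtime_ms(Work work, Sync sync, int runs) {
    assert(runs > 0);
    typedef std::chrono::steady_clock clock_type;
    const clock_type::time_point start = clock_type::now();
    for (int r = 0; r < runs; ++r) {
        work();  // e.g. one kernel launch
    }
    sync();      // make sure all the work is finished before stopping
    const clock_type::time_point stop = clock_type::now();
    const std::chrono::duration<double, std::milli> total = stop - start;
    return total.count() / runs;
}
```

Calling \texttt{mean\_runtime\_ms(work, sync, 1000)} mirrors the 1000-call averaging used for the results of this chapter.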
  66
  67 In an attempt to provide relevant speedup values, we either implemented CPU versions of the algorithms studied or used the values found in existing literature. Still, the large number and diversity of hardware platforms and GPU cards make it impossible to benchmark every possible combination, and significant differences may occur between the speedups we report and those obtained with different devices. As a reference, our development platform is detailed as follows:
  68
  69 \begin{itemize}
  70 \item CPU codes run on
  71   \begin{itemize}
  72   \item \textbf{Xeon}: a recent and very efficient Quad Core Xeon E31245 at 3.3GHz-8GByte RAM running Linux kernel 3.2.
  73   \end{itemize}
  74 \item GPU codes run on
  75 \begin{itemize}
  76   \item \textbf{C2070}: NVIDIA Tesla C2070 hosted by a PC QuadCore Xeon E5620 at 2.4GHz-12GByte RAM, running Linux kernel 2.6.18
  77     \item \textbf{GTX280}: NVIDIA GeForce GTX 280 hosted by a PC QuadCore Xeon X5482 at 3.20GHz-4GByte RAM, running Linux kernel 2.6.32
  78   \end{itemize}
  79 \end{itemize}
  80
  81 All kernels have also been tested with various image sizes from 512$\times$512 to 4096$\times$4096 pixels. This allows estimating the dependency of runtime on image size.
  82
  83 Last, like many authors, we chose to use the pixel throughput value of each process in Mega Pixels per second (MP/s) as a performance indicator, including data transfers and kernel runtimes.
  84 In order to estimate the potential for improvement of each kernel, a reference throughput measurement, involving the identity kernel of Listing \ref{lst:fkern1}, was performed. As this kernel only fetches input values from texture memory and outputs them to global memory without doing any computation, it represents the smallest, thus fastest, possible process and is taken as the reference throughput value (100\%). The same measurement was performed on CPU, with a maximum effective pixel throughput of 130~MP/s. On GPU, depending on grid parameters, this measurement was 800~MP/s on GTX280 and 1300~MP/s on C2070.
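
To make the two indicators concrete, the short C++ sketch below computes a pixel throughput and its rate relative to a reference value. It is an illustration only: the function names are hypothetical, one MP is assumed to be $10^6$ pixels, and \texttt{total\_ms} is assumed to cover both data transfers and kernel execution.

```cpp
#include <cassert>
#include <cstddef>

// Pixel throughput in MP/s for an image of width x height pixels processed
// in total_ms milliseconds (assumption: 1 MP = 10^6 pixels).
double throughput_mps(std::size_t width, std::size_t height, double total_ms) {
    assert(total_ms > 0.0);
    const double pixels =
        static_cast<double>(width) * static_cast<double>(height);
    const double seconds = total_ms * 1.0e-3;
    return pixels / seconds / 1.0e6;
}

// Rate, in percent, of a measured throughput against the reference
// throughput of the identity kernel on the same device.
double rate_percent(double measured_mps, double reference_mps) {
    assert(reference_mps > 0.0);
    return 100.0 * measured_mps / reference_mps;
}
```

For instance, an image of $1000\times 1000$ pixels processed in 1~ms corresponds to 1000~MP/s, and a process running at 14~MP/s against a 130~MP/s reference reaches a rate of about 10.8\%.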
  85
  86
  87 \chapterauthor{Gilles Perrot}{Femto-ST Institute, University of Franche-Comte, France}
  88
  89 \chapter{Implementing a fast median filter}
  90 \section{Introduction}
  91 Median filtering is a well-known method used in a wide range of application frameworks as well as a standalone filter especially for \textit{salt and pepper} denoising. It is able to greatly reduce the power of noise without blurring edges too much. That is actually why we originally focused on this filtering technique as a preprocessing stage when we were in the process of designing a GPU implementation of one region-based image segmentation algorithm \cite{6036776}.
  92
  93 First introduced by Tukey in \cite{tukey77}, it has been widely studied since then, and many researchers have proposed efficient implementations of it, adapted to various hypotheses, architectures and processors.
  94 Originally, its main drawbacks were its compute complexity, its nonlinearity and its data-dependent runtime. Several researchers have addressed these issues and designed, for example, efficient histogram-based median filters with predictable runtimes \cite{Huang:1981:TDS:539567, Weiss:2006:FMB:1179352.1141918}.
  95
  96 More recently, the advent of GPUs opened new perspectives in terms of image processing performance, and some researchers managed to take advantage of the new graphics capabilities: in that respect, we can cite the Branchless Vectorized Median (BVM) filter \cite{5402362, chen09} which allows very interesting runtimes on CUDA-enabled devices but, as far as we know, the fastest implementation to date is the histogram-based PCMF median filter \cite{Sanchez-2-2012}.
  97
  98 Some of the following implementations feature very fast runtimes. They target NVIDIA Tesla GPUs (Fermi architecture, compute capability 2.x) but may easily be adapted to other models, e.g., those of compute capability 1.3.
  99
 100 The fastest ones are based on one efficient parallel implementation of the BVM algorithm described in \cite{mcguire2008median}, improving its performance through fine tuning of its implementation as presented in \cite{median_zul} and detailed in the following sections.
 101
 102 \section{Median filtering}
 103 \subsection{Basic principles}
 104 Designing a 2D median filter basically consists of defining a square window $H(i,j)$ for each pixel $I(i,j)$ of the input image, containing $n\times n$ pixels and centered on $I(i,j)$. The output value $I'(i,j)$ is the median value of the gray-level values of the $n\times n$ pixels of $H(i,j)$. Figure \ref{fig:median_1} illustrates this principle with an example of a 5x5 median filter applied on pixel $I(5,6)$. The output value is the median value of the 25 values of the dark gray window centered on pixel $I(5,6)$.
 105 \begin{figure}[b]
 106    \centering
 107    \includegraphics[width=8cm]{Chapters/chapter3/img/median_1.png}
 108    \caption{Example of 5x5 median filtering}
 109    \label{fig:median_1}
 110 \end{figure}
 111 Figure \ref{fig:sap_examples} shows an example of a $512\times 512$ pixel image, corrupted by a  \textit{salt and pepper} noise and the denoised versions, output respectively by a $3\times 3$, a $5\times 5$, and 2 iterations of a $3\times 3$ median filter.
 112
 113  The generic filtering method is given by Algorithm \ref{algoMedianGeneric}. After the data transfer stage of the first line, which copies data from CPU memory to GPU texture memory, the actual median computing occurs, before the final transfer which copies data back to CPU memory at the last line. Obviously, one key issue is the selection method that identifies the median value. But, as shown in Figure \ref{fig:median_overlap}, since two neighboring pixels share part of the values to be sorted, a second key issue is how to handle the redundancy between consecutive positions of the running window $H(i,j)$.
 114 \begin{algorithm}
 115  %\SetNlSty{}{}{:}
 116   % \SetLine
 117   %\linesnumbered
 118   copy data from CPU to GPU texture memory\label{algoMedianGeneric:memcpyH2D}\;
 119   \ForEach(\tcc*[f]{in parallel}){pixel at position $(x, y)$}{
 120     Read gray-level values of the n$\times$n neighborhood\label{algoMedianGeneric:cptstart}\;
 121     Select the median value among those n$\times$n values\;
 122     Output the new gray-level value \label{algoMedianGeneric:cptend}\;
 123   }
 124 copy data from GPU global memory to CPU memory\label{algoMedianGeneric:memcpyD2H}\;
 125 \caption{\label{algoMedianGeneric}generic n$\times$n median filter}
 126 \end{algorithm}
 127 As mentioned earlier, the selection of the median value can be performed by more than one technique, using either histogram-based or sorting methods, each having its own benefits and drawbacks as will be discussed further down.
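
As a point of comparison for the GPU kernels, the generic algorithm above can be sketched on the CPU with a sorting-based selection. This fragment is not the CPU code used for the measurements of this chapter: the function name is hypothetical, the selection relies on \texttt{std::nth\_element}, and borders are handled by clamping coordinates, which is an assumption.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// CPU sketch of a generic n x n median filter on a 16-bit gray-level image
// stored row by row. Assumption: pixels outside the image are replaced by
// the nearest border pixel (coordinate clamping).
std::vector<uint16_t> median_filter_ref(const std::vector<uint16_t> &in,
                                        int width, int height, int n) {
    assert(n > 0 && n % 2 == 1);
    assert(width > 0 && height > 0);
    assert(in.size() == static_cast<std::size_t>(width) * height);
    const int r = n / 2;  // window radius
    std::vector<uint16_t> out(in.size());
    std::vector<uint16_t> window(static_cast<std::size_t>(n) * n);
    for (int y = 0; y < height; ++y) {
        for (int x = 0; x < width; ++x) {
            // Read the gray-level values of the n x n neighborhood.
            std::size_t k = 0;
            for (int dy = -r; dy <= r; ++dy) {
                for (int dx = -r; dx <= r; ++dx) {
                    const int yy = std::min(std::max(y + dy, 0), height - 1);
                    const int xx = std::min(std::max(x + dx, 0), width - 1);
                    window[k++] = in[static_cast<std::size_t>(yy) * width + xx];
                }
            }
            // Select the median value (partial sort around the middle).
            std::vector<uint16_t>::iterator mid =
                window.begin() + window.size() / 2;
            std::nth_element(window.begin(), mid, window.end());
            // Output the new gray-level value.
            out[static_cast<std::size_t>(y) * width + x] = *mid;
        }
    }
    return out;
}
```

On a uniform image containing one isolated saturated pixel, a $3\times 3$ window is enough for this sketch to restore the uniform value everywhere.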
 128
 129 \subsection{A naive implementation}
 130 As a reference, Listing \ref{lst:medianGeneric} gives a simple, not to say simplistic, implementation of a CUDA kernel (\texttt{kernel\_medianR}) achieving generic $n\times n$ histogram-based median filtering. Its runtime has a very low data dependency, but this implementation does not suit  GPU architecture very well. Each pixel loads the whole of its $n\times n$ neighborhood, meaning that one pixel is loaded multiple times inside one single thread block, and even more time-consuming, the use of a local vector (histogram[]) considerably downgrades performance, as the compiler automatically stores such vectors in local memory (slow) \index{memory~hierarchy!local~memory}.
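
The histogram-based selection itself can be illustrated by the following host-side C++ sketch, written for 8-bit gray levels. It is not the code of Listing \ref{lst:medianGeneric}; the function name and the early exit of the scan are choices made for this illustration.

```cpp
#include <cassert>
#include <cstdint>

// Median of `count` 8-bit gray-level values (count odd), obtained by
// filling a 256-bin histogram and scanning it until the cumulated
// population reaches the rank of the median.
uint8_t histogram_median(const uint8_t *values, int count) {
    assert(count > 0 && count % 2 == 1);
    int histogram[256] = {0};
    for (int k = 0; k < count; ++k) {
        ++histogram[values[k]];
    }
    const int rank = count / 2 + 1;  // 1-based rank of the median
    int cumulated = 0;
    int level = 0;
    while (level < 256) {
        cumulated += histogram[level];
        if (cumulated >= rank) {
            break;  // the median lies in this bin
        }
        ++level;
    }
    return static_cast<uint8_t>(level);
}
```

The 256-element \texttt{histogram} array is precisely the kind of per-thread vector that a GPU compiler places in local memory, which is the performance issue pointed out above.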
 131
 132 Table \ref{tab:medianHisto1} displays the measured runtimes and pixel throughputs of \texttt{kernel\_medianR} on each GPU (GTX280 and C2070), together with those of the CPU implementation. Usual window sizes of $3\times 3$, $5\times 5$, and $7\times 7$ are shown. Though some specific applications require larger window sizes and dedicated algorithms, such small square window sizes are most widely used in general purpose image processing. GPU runtimes have been obtained with a grid of 64-thread blocks.
 133
 134 The first observation to make when analysing results of Table \ref{tab:medianHisto1} is that, on CPU, window size has almost no influence on the effective pixel throughput.
 135 Since inner loops that fill the histogram vector contain very few fetching instructions (from 9 to 49, depending on the window size), it is not surprising to note their negligible impact compared to outer loops that fetch image pixels (from 256k to 16M instructions).
 136 One could be tempted to claim that CPU has no chance to win, which is not so obvious as it highly depends on what kind of algorithm is run and, above all, how it is implemented. To illustrate this, we can observe that, despite a maximum effective throughput potential that is almost five times higher, measured GTX280 throughput values are sometimes lower than CPU values, as shown in Table \ref{tab:medianHisto1}.
 137
 138 On the GPU side, we note a high dependence on window size, due to the redundancy induced by the multiple fetches of each pixel inside each block, which grows with the window size. Figure \ref{fig:median_overlap} shows for example that two $5\times 5$ windows, centered on two neighbor pixels, share at least 16 pixels. On the C2070 card, thanks to a more efficient caching mechanism, this effect is less pronounced. On GPUs, dependency on image size is low and, due to slightly more efficient data transfers when copying larger data amounts, pixel throughputs increase with image size. As an example, transferring a 4096$\times$4096 pixel image (32~MBytes) is a bit faster than transferring a 512$\times$512 pixel image (0.5~MBytes) 64 times.
 139 \begin{figure}[h]
 140    \centering
 141    \includegraphics[width=5cm]{Chapters/chapter3/img/median_overlap.png}
 142    \caption{Illustration of window overlapping in 5x5 median filtering}
 143    \label{fig:median_overlap}
 144 \end{figure}
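
The overlap figure quoted above can be checked with a short computation. Assuming two $n\times n$ windows centered on pixels separated by an offset $(\delta_x,\delta_y)$ with $|\delta_x|\leq n$ and $|\delta_y|\leq n$ (the symbols $\delta_x$ and $\delta_y$ are introduced here for illustration), the number of pixels they share is
$$ (n-|\delta_x|)\,(n-|\delta_y|) $$
For $n=5$, two horizontally or vertically adjacent windows, i.e., $(\delta_x,\delta_y)=(1,0)$ or $(0,1)$, share $5\times 4=20$ pixels, while two diagonally adjacent windows, $(\delta_x,\delta_y)=(1,1)$, share $4\times 4=16$ pixels, which is the lower bound mentioned for neighbor pixels.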
 145
 146
 147 \lstinputlisting[label={lst:medianGeneric},caption=generic CUDA kernel achieving median filtering]{Chapters/chapter3/code/medianGeneric.cu}
 148
 149 \begin{table}[h]
 150 %\newcolumntype{I}{!{\vrule width 1.5pt}}
 151 \newlength\savedwidth
 152 \newcommand\whline{\noalign{\global\savedwidth
 153   \arrayrulewidth\global\arrayrulewidth 1.5pt}
 154   \hline \noalign{\global\arrayrulewidth
 155   \savedwidth}
 156 }
 157 \renewcommand{\arraystretch}{1.5}
 158 \centering
 159 {\tiny
 160 \begin{tabular}{|c|l||c|c|c|c|c|c|c|c|c|}
 161 \hline
 162 \multicolumn{2}{|l||}{Processor} & \multicolumn{3}{c|}{\textbf{GTX280}} & \multicolumn{3}{c|}{\textbf{C2070}} & \multicolumn{3}{c|}{\textbf{CPU (Xeon)}} \\ \hline
 163 \multicolumn{2}{|l||}{\shortstack{Performances$\rightarrow$\\sizes (pixels)$\downarrow$}} & \shortstack{t\\(ms)}& \shortstack{output\\(MP/s)}& \shortstack{rate\\\% }&\shortstack{t\\(ms)}& \shortstack{output\\(MP/s)}& \shortstack{rate\\\% }&\shortstack{t\\(ms)}& \shortstack{output\\(MP/s)}& \shortstack{rate\\\% }   \\ \whline
 164 \multirow{3}{*}{\rotatebox{90}{512$^2$}} &3$\times$3&11.50 &22 &2.2 &7.58 &33 &3.4 & 19.25& 14&11\\
 165                                          &5$\times$5&19.10 &14 &1.3 &8.60 &30 &3.0 &18.49 &14 &11\\
 166                                          &7$\times$7&31.30 &8 &0.8 &10.60 &24 &2.5 &20.27 &13 &10\\\whline
 167 \multirow{3}{*}{\rotatebox{90}{1024$^2$}}&3$\times$3&44.50 &23 &2.3 &29.60 &34 &3.5 &75.49 &14 &11\\
 168                                          &5$\times$5&71.10 &14 &1.4 &33.00 &31 &3.2 &73.88 &14 &11\\
 169                                          &7$\times$7&114.50 &9 &0.9 &39.10 &26 &2.7 &77.40 &13 &10\\\whline
 170 \multirow{3}{*}{\rotatebox{90}{2048$^2$}}&3$\times$3&166.00 &24 &2.4 &115.20 &36 &3.6 &296.18&14 &11\\
 171                                          &5$\times$5&261.00&16 &1.5 &128.20&32 &3.3 &294.55&14 &11\\
 172                                          &7$\times$7&411.90 &10&1.0 &143.30&28 &2.8 &303.48&14 &11\\\whline
 173 \multirow{3}{*}{\rotatebox{90}{4096$^2$}}&3$\times$3&523.80 &31 &3.0 &435.00 &38 &3.9 &1184.16&14 &11\\
 174                                          &5$\times$5&654.10&25 &2.4 &460.20&36 &3.7 &1158.26&14 &11\\
 175                                          &7$\times$7&951.30 &17&1.7 &509.60&32 &3.3 &1213.55&14 &11\\\whline
 176
 177 \end{tabular}}
 178 \caption{Performance results of \texttt{kernel\_medianR}.}
 179 \label{tab:medianHisto1}
 180 \end{table}
 181
 182 \begin{figure}[t]
 183 \centering
 184    \subfigure[Airplane image, corrupted by salt and pepper noise of density 0.25]{\label{img:sap_example_ref} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25.png}}\qquad
 185    \subfigure[Image denoised by a $3\times 3$ median filter]{\label{img:sap_example_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3.png}}\\
 186    \subfigure[Image denoised by a $5\times 5$ median filter]{\label{img:sap_example_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med5.png}}\qquad
 187    \subfigure[Image denoised by 2 iterations of a $3\times 3$ median filter]{\label{img:sap_example_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3_it2.png}}\\
 188    \caption{Example of median filtering, applied to salt and pepper noise reduction.}
 189    \label{fig:sap_examples}
 190 \end{figure}
 191
 192 \section{NVIDIA GPU tuning recipes}
 193 When designing GPU code, besides thinking of the actual data computing process, one must choose the memory type in which to store temporary data. Three types of GPU memory are available:
 194 \begin{enumerate}
 195 \item \textbf{Global memory, the most versatile:} \index{memory~hierarchy!global~memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism.
 196 \item \textbf{Registers, the fastest:} \index{memory~hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count}
 197 \item \textbf{Shared memory, a complex compromise:} \index{memory~hierarchy!shared~memory}\\All threads in one block can access 48~KBytes of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers.
 198 However, bank conflicts can occur if two threads of a warp try to access data stored in one single memory bank. In such cases, the parallel process is serialized which may cause significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses.
 199 \end{enumerate}
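
The bank conflict rule given for shared memory can be illustrated by a small host-side C++ model. It rests on assumptions that are not detailed in this chapter: 32 banks that are each 32 bits wide, successive 32-bit words mapped to successive banks, and warps of 32 threads, as on devices of compute capability 2.x. The function name is hypothetical.

```cpp
#include <cassert>

// Assumed shared memory organization (compute capability 2.x):
// 32 banks, each 32 bits wide, and warps of 32 threads.
const int kBanks = 32;
const int kWarpSize = 32;

// Largest number of threads of one warp that hit the same bank when
// thread t reads the 32-bit word of index t * stride_words.
// A result of 1 means no conflict; a result of d means that the access
// is serialized into d steps.
int conflict_degree(int stride_words) {
    assert(stride_words > 0);
    int hits[kBanks] = {0};
    int worst = 0;
    for (int t = 0; t < kWarpSize; ++t) {
        const int bank = (t * stride_words) % kBanks;
        ++hits[bank];
        if (hits[bank] > worst) {
            worst = hits[bank];
        }
    }
    return worst;
}
```

Under this model, consecutive threads reading consecutive 32-bit words (stride 1) never conflict, a stride of 2 words yields 2-way conflicts, and a stride of 32 words serializes the whole warp.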
 200
 201 As observed earlier, designing a median filter GPU implementation using only global memory is fairly straightforward, but its performance remains quite low even if it is faster than CPU.
 202 To overcome this, the most frequent choice made in efficient implementations found in literature is to use shared memory. Such an option implies prefetching \index{prefetching} data prior to doing the actual computations, a relevant choice, as each pixel of an image belongs to $n^2$ different neighborhoods. Thus, it can be expected that fetching each gray-level value from global memory only once should be more efficient than doing it each time it is required. One of the most efficient implementations using shared memory is presented in \cite{5402362}. In the case of the generic kernel of Listing \ref{lst:medianGeneric}, using shared memory without further optimization would not bring valuable speedup because that would just move redundancy from texture to shared memory fetching and would generate bank conflicts. For information, we wrote such a version of the generic median kernel and our measurements showed a speedup of around 3\% (as an example, 32~ms for $5\times 5$ median on a 1024$^2$ pixel image, i.e., 33~MP/s).
 203
 204 As for registers, designing a generic median filter that would use only that type of memory seems difficult, due to the above mentioned 63 register-per-thread limitation. \index{register count}
 205 Yet, nothing prevents us from designing fixed-size filters, each of them specific to one of the most popular window sizes. It might be worth the effort, as a dramatic increase in performance could be expected.
 206
 207 Another track to follow in order to improve performance of GPU implementations consists of hiding latencies generated by arithmetic instruction calls and memory accesses. Both can be partially hidden by introducing Instruction-Level Parallelism \index{Instruction-Level Parallelism}(ILP) and by increasing the data count output by each thread. Though such techniques may seem to break the NVIDIA occupancy paradigm, they can lead to dramatically higher data throughput values.
 208 The following sections illustrate these ideas and detail the design of the fastest CUDA median filter known to date.
 209
 210 \section{A 3$\times$3 median filter:  using registers}
 211 Designing a median filter dedicated to the smallest possible square window size is a good challenge to start using registers.
 212 One first issue is that the exclusive use of registers forbids us to implement a naive histogram-based method. In a \textit{8-bit gray-level pixel per thread} rule, each histogram requires one 256-element vector to store its values, i.e., more than four times the maximum register count allowed per thread (63).\index{register count} Considering that a $3\times 3$ median filter involves only 9 pixel values per thread, it seems obvious they can be sorted within the 63-register limit.
 213
 214 \subsection{The simplest way}
 215 In the case of a 3$\times$3 median filter, the simplest solution consists of associating one register to each gray-level value, then sorting those 9 values and selecting the fifth one, i.e., the median value.  For such a small amount of data to sort, a simple selection method is well suited. As shown in Listing \ref{lst:kernelMedian3RegTri9} (\texttt{kernel\_Median3RegSort9()}), the constraint of using only registers forces the adoption of an unusual manner of coding. However, results are persuasive: runtimes are divided by around 120 on GTX280 and 80 on C2070, while they are only reduced by a factor of 3.5 on CPU (CPU median3 bubble sort).
 216 The diagram of Figure \ref{fig:compMedians1} summarizes these first results for C2070, obtained with a block size of 256 threads, and Xeon CPU. We included the maximum effective pixel throughput in order to see the improvement potential of the different implementations. We also introduced throughput achieved by libJacket, a commercial implementation, as it was the fastest known implementation of a $3\times 3$ median filter to date, as illustrated in \cite{chen09}. One of the authors of libJacket kindly posted the CUDA code of its  $3\times 3$ median filter, which we inserted into our own coding structure. The algorithm itself is quite similar to ours, but running it in our own environment produced higher throughput values than those published in \cite{chen09}, not due to different hardware capabilities between our GTX280 and the GTX260 those authors used, but due to the way we perform memory transfers and our register-only method of storing temporary data.
 217
 218 \lstinputlisting[label={lst:kernelMedian3RegTri9},caption= $3\times 3$ median filter kernel using one register per neighborhood pixel and bubble sort]{Chapters/chapter3/code/kernMedianRegTri9.cu}
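
To illustrate what coding with one variable per gray-level value looks like, here is a host-side C++ sketch of a median of 9 values that uses nine scalar variables and a partial bubble sort. It is not the kernel of Listing \ref{lst:kernelMedian3RegTri9}: the function names are hypothetical and the input is read from a plain array rather than from texture memory.

```cpp
#include <cassert>
#include <cstdint>

// Compare-exchange on two scalars: afterwards a <= b.
inline void order(uint16_t &a, uint16_t &b) {
    if (a > b) {
        const uint16_t t = a;
        a = b;
        b = t;
    }
}

// Median of 9 values held in 9 scalar variables. No indexable array is
// used for the sort, so that a compiler is free to keep every value in a
// register.
uint16_t median9_sort(const uint16_t p[9]) {
    uint16_t a0 = p[0], a1 = p[1], a2 = p[2], a3 = p[3], a4 = p[4];
    uint16_t a5 = p[5], a6 = p[6], a7 = p[7], a8 = p[8];
    // Pass 1: the largest value ends in a8.
    order(a0, a1); order(a1, a2); order(a2, a3); order(a3, a4);
    order(a4, a5); order(a5, a6); order(a6, a7); order(a7, a8);
    // Pass 2: the second largest value ends in a7.
    order(a0, a1); order(a1, a2); order(a2, a3); order(a3, a4);
    order(a4, a5); order(a5, a6); order(a6, a7);
    // Pass 3: the third largest value ends in a6.
    order(a0, a1); order(a1, a2); order(a2, a3); order(a3, a4);
    order(a4, a5); order(a5, a6);
    // Pass 4: the fourth largest value ends in a5.
    order(a0, a1); order(a1, a2); order(a2, a3); order(a3, a4);
    order(a4, a5);
    // Pass 5: the fifth largest value, i.e., the median, ends in a4.
    order(a0, a1); order(a1, a2); order(a2, a3); order(a3, a4);
    return a4;
}
```

Only five passes are needed because the sort can stop as soon as the fifth largest value is in place; the four smallest values are left unsorted.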
 219
 220 \begin{figure}
 221    \centering
 222    \includegraphics[width=15cm]{Chapters/chapter3/img/debitPlot1.pdf}
 223    \caption[Comparison of pixel throughputs for CPU generic median, CPU 3$\times$3 median register-only with bubble sort, GPU generic median, GPU 3$\times$3 median register-only with bubble sort, and GPU libJacket.]{Comparison of pixel throughputs for CPU generic median, CPU 3$\times$3 median register-only with bubble sort, GPU generic median, GPU 3$\times$3 median register-only with bubble sort, and GPU libJacket. The GPU is the C2070 card and the CPU is the Xeon processor. The maximum effective C2070 throughput is also shown.}
 224    \label{fig:compMedians1}
 225 \end{figure}
 226
 227 \subsection{Further optimization}
 228 Running the above register-only 3$\times$3 median filter through the NVIDIA CUDA profiler shows that the memory throughput achieved by the kernel remains quite low. To improve this, two methods can be used:
 229 \begin{itemize}
 230 \item increasing the number of concurrent threads, which can be achieved by reducing the number of registers used by each thread.
 231 \item having each thread process more data which can be achieved at thread level by processing and outputting the gray-level value of two pixels or more.
 232 \end{itemize}
 233
 234
 235 \subsubsection{Reducing register count \index{register count}}
 236 Our current kernel (\texttt{kernel\_Median3RegSort9}) uses one register per gray-level value, which amounts to 9 registers for the entire 3$\times$3 window.
 237 This count can be reduced by use of an iterative sorting process called \textit{forgetful selection}, where both \textit{extrema} are eliminated at each sorting stage, until only 3 elements remain. The question is to determine the minimal register count $k_{n^2}$ that allows the selection of the median among $n^2$ values. The answer can be evaluated  considering that, when eliminating the maximum and the minimum values, one has to make sure not to eliminate the global median value. Such a situation is illustrated in Figure \ref{fig:forgetful_selection} for a  $3\times 3$ median filter. For better comprehension, the 9 elements of the  $3\times 3$ pixel window have been represented in a row.
 238
 239 We must remember that by definition, in the fully sorted vector, the median value will have the middle index, i.e., $\lfloor n^2/2\rfloor$.
 240 Moreover, assuming that both \textit{extrema} are eliminated from the first $k$ elements and that the global median is one of them would mean that
 241 \begin{itemize}
 242 \item if the global median was the minimum among the $k$ elements, then at least $k-1$ elements would have a higher index. Considering the above median definition, at least $k-1$ elements should also have a lower index in the entire vector.
 243 \item if the global median was the maximum among the $k$ elements, then at least $k-1$ elements would have a lower index. Considering the above median definition, at least $k-1$ elements should also have a higher index in the entire vector.
 244
 245 Therefore, the number $k$ of elements that are part of the first selection stage can be defined by the condition
 246 $$n^2-k \leq \lfloor \frac{n^2}{2} \rfloor -1$$
 247 which leads to
 248 $$k_{n^2}=\lceil \frac{n^2}{2}\rceil+1 $$
 249
 250 This rule can be applied to the first eliminating stage and remains true with the next ones as each stage suppresses exactly two values, one above and one below the median value.
 251 In our $3\times 3$ pixel window example, the minimum register count becomes $k_9=\lceil 9/2\rceil+1 = 6$.
 252 This iterative process is illustrated in Figure \ref{fig:forgetful3}, where it achieves one entire $3\times 3$ median selection, beginning with $k_9=6$ elements.
 253
 254 The \textit{forgetful selection} method, used in \cite{mcguire2008median}, does not imply full sorting of values, but only selecting minimum and maximum values, which, at the price of a few iteration steps ($n^2-k$), reduces arithmetic complexity.
 255 Listing \ref{lst:medianForget1pix3} details this process where forgetful selection is achieved by use of simple 2-value swapping function ($s()$, lines 1 to 5) that swaps input values if necessary, so as to achieve the first steps of an incomplete sorting network \cite{Batcher:1968:SNA:1468075.1468121}. Moreover, whenever possible, in order to increase the ILP, \index{Instruction-Level Parallelism} successive calls to $s()$ are done with independent elements as arguments. This is illustrated by the macro definitions of lines 7 to 12 and by Figure \ref{fig:bitonic} which details the first iteration of the $5\times 5$ selection, starting with $k_{25}=14$ elements.
 256 \begin{figure}[b]
 257    \centering
 258    \includegraphics[width=6cm]{Chapters/chapter3/img/forgetful_selection.png}
 259    \caption{Forgetful selection with the minimal element register count. Illustration for $3\times 3$ pixel window represented in a row and supposed sorted.}
 260    \label{fig:forgetful_selection}
 261 \end{figure}
 262 \begin{figure}
 263    \centering
 264    \includegraphics[width=5cm]{Chapters/chapter3/img/forgetful_selectionb.png}
 265    \caption{Determination of the median value by the \textit{forgetful selection} process, applied to a $3\times 3$ neighborhood window.}
 266    \label{fig:forgetful3}
 267 \end{figure}
 268 \end{itemize}
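
As a rough illustration of the selection scheme described above, the following host-side C sketch applies forgetful selection to an odd number of values. It is not the kernel of Listing \ref{lst:medianForget1pix3}: the function names (\texttt{min\_register\_count}, \texttt{order2}, \texttt{minmax}, \texttt{median\_forgetful}) are hypothetical, and the min/max extraction relies on plain successive comparisons rather than on the exchange network of the listing, so it only shows the register-count logic.

\begin{lstlisting}[language=C]
/* Minimum number of values held at once: k = ceil(n*n/2) + 1. */
int min_register_count(int n)
{
    return (n * n + 1) / 2 + 1;
}

/* Order two values in place so that *a <= *b. */
static void order2(unsigned short *a, unsigned short *b)
{
    if (*a > *b) {
        unsigned short t = *a;
        *a = *b;
        *b = t;
    }
}

/* Move the minimum of v[0..m-1] to v[0] and the maximum to v[m-1].
   Straightforward comparisons, not an optimized network. */
static void minmax(unsigned short *v, int m)
{
    int i;
    for (i = 1; i < m; i++)
        order2(&v[0], &v[i]);
    for (i = 1; i < m - 1; i++)
        order2(&v[i], &v[m - 1]);
}

/* Median of w[0..count-1] by forgetful selection.
   Assumption of this sketch: count is odd and 3 <= count <= 81;
   0 is returned otherwise. */
unsigned short median_forgetful(const unsigned short *w, int count)
{
    unsigned short r[42];            /* k = 42 for a 9x9 window */
    int k, m, next, i;

    if (count < 3 || count > 81 || count % 2 == 0)
        return 0;
    k = (count + 1) / 2 + 1;         /* same value as k_{n^2} */
    for (i = 0; i < k; i++)
        r[i] = w[i];
    m = k;
    next = k;
    for (;;) {
        minmax(r, m);                /* r[0] = min, r[m-1] = max */
        if (next == count)
            break;                   /* 3 values left, all data read */
        r[0] = w[next++];            /* forget the min, take a new value */
        m--;                         /* forget the max */
    }
    return r[1];                     /* middle of the last 3 values */
}
\end{lstlisting}

For a $3\times 3$ window, \texttt{median\_forgetful(w, 9)} starts from 6 values and goes through $9-6+1=4$ min/max stages.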
 269
 270 \begin{figure}
 271    \centering
 272    \includegraphics[width=6cm]{Chapters/chapter3/img/fig3.jpg}
 273    \caption[First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network.]{First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network. Arrows represent the result of the swapping function, with the lower value at the starting point and the higher value at the end point.}
 274    \label{fig:bitonic}
 275 \end{figure}
 276
 277 \lstinputlisting[label={lst:medianForget1pix3},caption= 3$\times$3 median filter kernel using the minimum register count of 6 to find the median value by forgetful selection method. The optimal thread block size is 128 on GTX280 and 256 on C2070]{Chapters/chapter3/code/kernMedianForget1pix3.cu}
 278
 279 The kernel modified in this way provides significantly improved runtimes: an average speedup of 16\% is obtained, and pixel throughput reaches around $1000~MP/s$ on C2070.
 280
 281
 282 \subsubsection{More data output per thread}
 283 When a kernel achieves an effective memory throughput far below the GPU peak value, and provided that enough threads are run, another technique may help hide memory latency and thus improve performance: making each thread generate multiple pixel outputs.
 284
 285 Attentive readers may note that this increases the register count per thread; this can be compensated for by dividing the thread block size accordingly, thus keeping the same register count per block.
 286 Moreover, it is now possible to take advantage of window overlapping, first illustrated in Figure \ref{fig:median_overlap}, and further detailed in Figure \ref{fig:median3_overlap}. As the selection is first processed on the first 6 gray-level values, i.e., exactly the number of pixels that overlap between the neighborhoods of two adjacent center pixels, 6 texture fetches and one \texttt{minmax6} selection per thread can be saved. Here again, some speedup can be expected from our modified kernel source code, presented in Listing \ref{lst:medianForget2pix3}. One important difference from previous versions lies in the way pixel coordinates are computed from thread indexes. As each thread has to process two pixels, the number of threads in each block is divided by 2, while the grid size remains unchanged. Consequently, in our kernel code, each thread whose block-related coordinates are $(tx, ty)$ is in charge of processing the pixels of block-related coordinates $(2tx, ty)$ and $(2tx+1, ty)$; lines 5 and 6 implement this.
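
The sharing of the first selection stage between two adjacent output pixels can be sketched in host-side C as follows. This is only an illustration under stated assumptions, not the code of Listing \ref{lst:medianForget2pix3}: the names \texttt{minmax}, \texttt{finish3} and \texttt{median3\_two\_pixels} are hypothetical, the image is a plain row-major array instead of a texture, and the caller is assumed to pass a left pixel $(x, y)$ whose two $3\times 3$ windows lie entirely inside the image.

\begin{lstlisting}[language=C]
/* Order two values in place so that *a <= *b. */
static void order2(unsigned short *a, unsigned short *b)
{
    if (*a > *b) {
        unsigned short t = *a;
        *a = *b;
        *b = t;
    }
}

/* Move the minimum of v[0..m-1] to v[0] and the maximum to v[m-1]. */
static void minmax(unsigned short *v, int m)
{
    int i;
    for (i = 1; i < m; i++)
        order2(&v[0], &v[i]);
    for (i = 1; i < m - 1; i++)
        order2(&v[i], &v[m - 1]);
}

/* Finish one 3x3 selection from the 4 values that survived the
   shared stage and the 3 values owned by this output pixel only. */
static unsigned short finish3(const unsigned short *common4,
                              const unsigned short *own3)
{
    unsigned short r[5];
    int i;
    r[0] = own3[0];
    for (i = 0; i < 4; i++)
        r[i + 1] = common4[i];
    minmax(r, 5);          /* forget r[0] and r[4] */
    r[0] = own3[1];
    minmax(r, 4);          /* forget r[0] and r[3] */
    r[0] = own3[2];
    minmax(r, 3);
    return r[1];           /* median of the 9 window values */
}

/* Medians of the 3x3 windows centered on (x, y) and (x+1, y).
   img is row-major with width w; requires 1 <= x, x + 2 < w and
   one valid row above and below y. */
void median3_two_pixels(const unsigned short *img, int w, int x, int y,
                        unsigned short *left, unsigned short *right)
{
    unsigned short c[6], own[3];
    int dy, n = 0;

    /* The 6 values of columns x and x+1 belong to both windows. */
    for (dy = -1; dy <= 1; dy++) {
        c[n++] = img[(y + dy) * w + x];
        c[n++] = img[(y + dy) * w + x + 1];
    }
    minmax(c, 6);          /* shared stage: c[1..4] survive */

    for (dy = -1; dy <= 1; dy++)
        own[dy + 1] = img[(y + dy) * w + x - 1];
    *left = finish3(&c[1], own);

    for (dy = -1; dy <= 1; dy++)
        own[dy + 1] = img[(y + dy) * w + x + 2];
    *right = finish3(&c[1], own);
}
\end{lstlisting}

The six shared values are read once and go through a single min/max stage; only the three remaining stages are run separately for each output pixel.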
 287
 288 \begin{figure}
 289    \centering
 290    \includegraphics[width=4cm]{Chapters/chapter3/img/median3_overlap.png}
 291    \caption{Illustration of how window overlapping is used to combine 2 pixel selections in a $3\times 3$ median kernel.}
 292    \label{fig:median3_overlap}
 293 \end{figure}
 294
 295 \lstinputlisting[label={lst:medianForget2pix3},caption=$3\times 3$ median filter kernel processing 2 output pixel values per thread using combined forgetful selection]{Chapters/chapter3/code/kernMedian2pix3.cu}
 296
 297 Running this $3\times 3$ kernel saves another 10\% runtime, as shown in Figure \ref{fig:compMedians2}, and provides the best peak pixel throughput value known so far on the C2070: $1155~MP/s$, which is 86\% of the maximum effective throughput.
 298
 299 \begin{figure}
 300    \centering
 301    \includegraphics[width=15cm]{Chapters/chapter3/img/debitPlot2.pdf}
 302    \caption{Comparison of pixel throughput on GPU C2070 for the different 3$\times$3 median kernels.}
 303    \label{fig:compMedians2}
 304 \end{figure}
 305
 306 \section{5$\times$5 and larger median filters}
 307 Considering the maximum register count allowed per thread (63), pushing this technique to its limit potentially allows designing up to 9$\times$9 median filters. Such a filter would use $k_{81}=\lceil 81/2\rceil+1 = 42$ registers per thread, plus 9 used by the compiler to complete arithmetic operations, and 9 more when outputting 2 pixels per thread. This leads to a total register count of 60, which would limit the number of concurrent threads per block. For larger window sizes, one option could be to use shared memory.
 308 The next two sections first detail the particular case of the 5$\times$5 median through a register-only method, and then a generic kernel for larger window sizes.
 309
 310 \subsection{A register-only 5$\times$5 median filter \label{sec:median5}}
 311 The minimum register count required to apply the forgetful selection method to a 5$\times$5 median filter is $k_{25}=\lceil 25/2\rceil+1 = 14$. Moreover, two adjacent overlapping windows share 20 pixels ($n^2-n$, i.e., all but one column) so that, when processing 2 pixels simultaneously, 7 common selection stages can be carried out, from the first selection stage with 14 common values to the processing of the last common value. This allows limiting the register count to 22 per thread. Figure \ref{fig:median5overlap} describes the distribution of overlapping pixels, implemented in Listing \ref{lst:medianForget2pix5}: common selection stages take place from line 25 to line 37, while the remaining separate selection stages occur between lines 45 and 62, after the separation of line 40.
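
The stage counts quoted above can be checked with a short computation. The symbols $s$ (total number of selection stages per window) and $c$ (number of stages common to two horizontally adjacent windows) are introduced here for illustration only. Since the first stage operates on $k_{n^2}$ values and each following stage brings in exactly one new value,
$$s = n^2 - k_{n^2} + 1, \qquad c = (n^2-n) - k_{n^2} + 1, \qquad s - c = n.$$
For $n=5$, this gives $s = 25-14+1 = 12$, $c = 20-14+1 = 7$ and $s-c = 5$, i.e., 7 common stages followed by 5 separate ones for each output pixel. For $n=3$, the same relations give $s = 4$ and $c = 1$, which corresponds to the single shared \texttt{minmax6} stage of the $3\times 3$ kernel.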
 312 \begin{figure}
 313    \centering
 314    \includegraphics[width=6cm]{Chapters/chapter3/img/median5_overlap4.png}
 315    \caption[Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously.]{Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously. The first 7 forgetful selection stages are common to both processed center pixels. Only the last 5 selections have to be done separately.}
 316    \label{fig:median5overlap}
 317 \end{figure}
 318
 319 \lstinputlisting[label={lst:medianForget2pix5},caption=5$\times$5 median filter kernel processing 2 output pixel values per thread by a combined forgetful selection]{Chapters/chapter3/code/kernMedian2pix5.cu}
 320
 321 Timing results follow the same variations with image size as in previously presented kernels. That is why Table \ref{tab:median5comp} shows only the throughput values obtained for the C2070 card and a 4096$\times$4096 pixel image.
 322
 323 \begin{table}[h]
 324 %\newlength\savedwidth
 325 \newcommand\whline{\noalign{\global\savedwidth
 326   \arrayrulewidth\global\arrayrulewidth 1.5pt}
 327   \hline \noalign{\global\arrayrulewidth
 328   \savedwidth}
 329 }
 330 \centering
 331 {\scriptsize
 332 \begin{tabular}{|l||c|c|c|c|}
 333 \hline
 334 \textbf{Implementation}&\shortstack{\textbf{registers only}\\\textbf{1 pix/thread}}&\shortstack{\textbf{registers only}\\\textbf{2 pix/thread}}&\shortstack{\textbf{libJacket}\\(interpolated)}&\shortstack{\textbf{shared mem}}\\\whline
 335  \shortstack{\textbf{Throughput}\\\textbf{(MP/s)}}&551&738&152&540\\\hline
 336 \end{tabular}
 337 }
 338 \caption{Performance of various 5$\times$5 median kernel implementations, applied to a 4096$\times$4096 pixel image with a C2070 GPU card.}
 339 \label{tab:median5comp}
 340 \end{table}
 341
 342 \subsection{Fast approximated $n\times n$ median filter }
 343 Large window median filters are less widespread but are used in more specific fields, such as digital microscopy where, for example, background estimation of images is achieved through $64\times 64$ or $128\times 128$ median filters \cite{Wu2010}. In such cases, a possible technique is to split median selection into two separate 1D stages: one in the vertical direction and the other in the horizontal direction. Image processing specialists may object that this method does not select the actual median value. This is true but, in the case of large window sizes and \textit{real-life} images, the value selected in this manner is statistically near the actual median value and often represents an acceptable approximation. Such a filter is sometimes called a \textit{smoother}.
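
The two-stage scheme can be sketched in host-side C as follows. This is a minimal illustration under assumptions that are not taken from the chapter: the function names are hypothetical, out-of-image rows are handled by clamping, and each 1D median is obtained with \texttt{qsort()} for brevity rather than with a selection method suited to a GPU. Each pass filters along columns and writes its result transposed, so that applying the same function twice performs the vertical and then the horizontal stage and restores the original orientation.

\begin{lstlisting}[language=C]
#include <stdlib.h>

static int cmp_u16(const void *a, const void *b)
{
    unsigned short x = *(const unsigned short *)a;
    unsigned short y = *(const unsigned short *)b;
    return (x > y) - (x < y);
}

/* 1D vertical median of window size n = 2*r + 1 on a w x h row-major
   image. The result is stored transposed: out[x*h + y].
   Rows outside the image are clamped (assumption of this sketch).
   Returns 0 on success, -1 on allocation failure. */
int median_vertical_transposed(const unsigned short *in,
                               unsigned short *out,
                               int w, int h, int r)
{
    int n = 2 * r + 1, x, y, d;
    unsigned short *buf = malloc((size_t)n * sizeof *buf);

    if (buf == NULL)
        return -1;
    for (y = 0; y < h; y++) {
        for (x = 0; x < w; x++) {
            for (d = -r; d <= r; d++) {
                int yy = y + d;
                if (yy < 0) yy = 0;
                if (yy >= h) yy = h - 1;
                buf[d + r] = in[yy * w + x];
            }
            qsort(buf, (size_t)n, sizeof *buf, cmp_u16);
            out[x * h + y] = buf[r];      /* transposed store */
        }
    }
    free(buf);
    return 0;
}

/* Separable smoother: vertical pass, then the same pass applied to
   the transposed intermediate image tmp (h wide, w high). */
int separable_smoother(const unsigned short *in, unsigned short *tmp,
                       unsigned short *out, int w, int h, int r)
{
    if (median_vertical_transposed(in, tmp, w, h, r) != 0)
        return -1;
    return median_vertical_transposed(tmp, out, h, w, r);
}
\end{lstlisting}

Both \texttt{tmp} and \texttt{out} must hold $w\times h$ values; after the second pass, \texttt{out} has the same layout as the input image.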
 344
 345 As explained earlier in this section, the use of large window median filters rules out register-only implementation,
 346 which favors the use of shared memory. The 1D operation almost completely avoids bank conflicts in shared memory accesses.
 347 Furthermore, the above-described forgetful selection method cannot be used anymore, as too many registers would be required. Instead, the Torben Mogensen median search algorithm is used, as its required register count is both low and constant, and as it avoids the use of a local vector, unlike histogram-based methods.
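
To show why this method needs only a small, constant number of scalar variables, here is a host-side C sketch of Torben Mogensen's median search as it is commonly published. It is not the code of the kernel discussed in this section; the function name \texttt{torben\_median} and the 16-bit element type are choices made for this illustration. The input is never modified or copied: each iteration only rescans it to count the values below and above the current guess.

\begin{lstlisting}[language=C]
/* Median of m[0..n-1] (lower median if n is even), n >= 1.
   Only scalar variables are used; m is read-only. */
unsigned short torben_median(const unsigned short *m, int n)
{
    int i, less, greater, equal;
    int min, max, guess, maxltguess, mingtguess;

    /* Initial search interval: [min, max] of the data. */
    min = max = m[0];
    for (i = 1; i < n; i++) {
        if (m[i] < min) min = m[i];
        if (m[i] > max) max = m[i];
    }

    for (;;) {
        guess = (min + max) / 2;
        less = 0;
        greater = 0;
        equal = 0;
        maxltguess = min;          /* largest value below guess  */
        mingtguess = max;          /* smallest value above guess */
        for (i = 0; i < n; i++) {
            if (m[i] < guess) {
                less++;
                if (m[i] > maxltguess) maxltguess = m[i];
            } else if (m[i] > guess) {
                greater++;
                if (m[i] < mingtguess) mingtguess = m[i];
            } else {
                equal++;
            }
        }
        if (less <= (n + 1) / 2 && greater <= (n + 1) / 2)
            break;                 /* guess brackets the median */
        else if (less > greater)
            max = maxltguess;      /* median is below guess */
        else
            min = mingtguess;      /* median is above guess */
    }

    if (less >= (n + 1) / 2)
        return (unsigned short)maxltguess;
    else if (less + equal >= (n + 1) / 2)
        return (unsigned short)guess;
    else
        return (unsigned short)mingtguess;
}
\end{lstlisting}

The price of the constant storage is that the data are traversed several times, which fits a window held in shared memory better than one held in registers.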
 348
 349 Listing \ref{lst:medianSeparable} presents a kernel code that implements the above considerations and achieves a 1D vertical $n \times 1$ median filter. The shared memory vector is declared as \texttt{extern} (line 16) as its size is determined at runtime and passed to the kernel call as an argument. Lines 20 to 29 perform data prefetching, including the $2n$-row halo ($n$ at the bottom and $n$ at the top of each block). Then one synchronization barrier is mandatory (line 31) to ensure that all needed data is ready prior to its use by the different threads.
 350 Torben Mogensen selection takes place between lines 37 and 66 and, eventually, the transposed output value is stored in global memory at line 69. Outputting the transposed image in global memory saves time and allows reusing the same kernel to achieve the second step, i.e., the 1D horizontal $n \times 1$ median filtering.
 351 Note that this smoother, unlike the technique we proposed for fixed-size median filters, cannot be regarded as a state-of-the-art technique such as, for example, the one presented in \cite{4287006}. However, it may be considered a good, easy-to-use and efficient alternative, as confirmed by the results presented in Table \ref{tab:medianSeparable}. Pixel throughput values achieved by our kernel, though not constant with window size, remain very competitive if window size is kept under $120\times 120$ pixels, especially when outputting 2 pixels per thread (in \cite{4287006}, pixel throughput is around 7~MP/s).
 352 Figure \ref{fig:sap_examples2} shows an example of a $512\times 512$ pixel image corrupted by \textit{salt and pepper} noise, and the denoised versions output respectively by a $3\times 3$, a $5\times 5$, and a $55\times 55$ separable smoother.
 353 \begin{figure}
 354    \subfigure[Airplane image, corrupted by salt and pepper noise of density 0.25]{\label{img:sap_example_ref} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25.png}}\qquad
 355    \subfigure[Image denoised by a $3\times 3$ separable smoother]{\label{img:sap_example_sep_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med3.png}}\\
 356    \subfigure[Image denoised by a $5\times 5$ separable smoother]{\label{img:sap_example_sep_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med5.png}}\qquad
 357    \subfigure[Image background estimation by a $55\times 55$ separable smoother]{\label{img:sap_example_sep_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med111.png}}\\
 358    \caption{Example of separable median filtering (smoother), applied to salt and pepper noise reduction.}
 359    \label{fig:sap_examples2}
 360 \end{figure}
 361
 362 \begin{table}[h]
 363 %\newlength\savedwidth
 364 \newcommand\whline{\noalign{\global\savedwidth
 365   \arrayrulewidth\global\arrayrulewidth 1.5pt}
 366   \hline \noalign{\global\arrayrulewidth
 367   \savedwidth}
 368 }
 369 \centering
 370 {\scriptsize
 371 \begin{tabular}{|l||c|c|c|c|}
 372 \hline
 373 \shortstack{\textbf{Window edge size}\\(in pixels)}&\textbf{41}&\textbf{81}&\textbf{111}&\textbf{121}\\\whline
 374  \shortstack{\textbf{Throughput}\\\textbf{(MP/s)}}&54 &27 & 20& 18\\\hline
 375 \end{tabular}
 376 }
 377 \caption{Measured performance of one generic pseudo-separable median kernel applied to a 4096$\times$4096 pixel image with various window sizes.}
 378 \label{tab:medianSeparable}
 379 \end{table}
 380
 381 \lstinputlisting[label={lst:medianSeparable},caption= generic pseudo median kernel.]{Chapters/chapter3/code/kernMedianSeparable.cu}
 382
 383 % \section{Glossary}
 384 % \begin{Glossary}
 385 % \item[CUDA] Compute Unified Device Architecture.
 386 % \end{Glossary}
 387
 388 \putbib[Chapters/chapter3/biblio3]
 389