\documentclass[twocolumn, final]{svjour3}

\usepackage[square, numbers, sort]{natbib}

\usepackage{transparent}
%\usepackage[pdftex]{graphicx,color}
%   \graphicspath{{imgfs/}}
%   \DeclareGraphicsExtensions{.pdf,.jpeg,.png}

  %\usepackage[dvips]{graphicx}
  \usepackage{graphicx}
  %\graphicspath{{imgfs/}}
  \DeclareGraphicsExtensions{.jpg}

% *** MATH PACKAGES ***
%
\usepackage[cmex10]{amsmath}

% *** SPECIALIZED LIST PACKAGES ***
%
\usepackage[ruled,lined,linesnumbered]{algorithm2e}

% *** ALIGNMENT PACKAGES ***
%
\usepackage{array}
\usepackage{mdwmath}
\usepackage{mdwtab}

% *** SUBFIGURE PACKAGES ***
\usepackage[caption=false,font=footnotesize]{subfig}

% *** FLOAT PACKAGES ***
%
\usepackage{fixltx2e}

\journalname{Signal Processing Systems}

\newcommand{\kl}{\includegraphics[scale=0.4]{kernLeft.jpg}~}
\newcommand{\kr}{\includegraphics[scale=0.4]{kernRight.jpg}}


\begin{document}
%
% paper title
% can use linebreaks \\ within to get better formatting as desired
\title{Fine-tuned high-speed implementation \\of a GPU-based median filter.}

% author names and affiliations
% use a multiple column layout for up to two different
% affiliations

\author{
Gilles Perrot \and
St\'{e}phane Domas \and
Rapha\"{e}l Couturier}

\institute{
FEMTO-ST institute\\
Rue Engel Gros, 90000 Belfort, France.\\\email{forename.name@univ-fcomte.fr}
}

\date{Received: date / Revised: date}

% make the title area
\maketitle

\keywords{median, filter, GPU}

\begin{abstract}
  Median filtering is a well-known method, used both as a building block in a wide range of applications and as a standalone filter, especially for \textit{salt-and-pepper} denoising. It strongly reduces noise power while minimizing edge blurring.
Existing algorithms and implementations are quite efficient, but their processing speed can still be improved, which led us to investigate the specific features of modern GPUs more closely.
In this paper, we propose a GPU implementation of fixed-size kernel median filters that outputs up to 1.85 billion pixels per second on Tesla C2070 cards.
Based on an algorithm of the Branchless Vectorized Median class, and implemented with fine-tuned memory usage and GPU registers, our median filter drastically outperforms existing implementations and is, as far as we know, the fastest median filter to date.
\end{abstract}

\section{Introduction}
First introduced by Tukey \cite{tukey77}, median filtering has been widely studied since then, and many researchers have proposed efficient implementations adapted to various hypotheses, architectures and processors.
Originally, its main drawbacks were its computational complexity, its nonlinearity and its data-dependent runtime. Several researchers have addressed these issues and designed, for example, efficient histogram-based median filters featuring predictable runtimes \cite{Huang:1981:TDS:539567, Weiss:2006:FMB:1179352.1141918}.
More recently, authors have taken advantage of the new perspectives opened by modern GPUs to develop CUDA-based filters such as the Branchless Vectorized Median filter (BVM) \cite{5402362, chen09}, which achieves very competitive runtimes, and the histogram-based PCMF median filter \cite{6288187}, which was, to our knowledge, the fastest median filter implementation.

The use of a GPU as a general-purpose computing processor raises the issue of data transfers, especially when kernel runtime is short and/or when large data sets are processed. In certain cases, data transfers between GPU and CPU take longer than the actual computation on the GPU, even though the overall GPU processing can still prove faster than its CPU equivalent.
In the following section, we present the overall code structure to be used with our median kernels.
For concision and readability, our code is restricted to 8- or 16-bit gray-level input images whose height ($H$) and width ($W$) are both multiples of 512 pixels.
Let us also point out that the following implementation, which targets Nvidia Tesla GPUs (Fermi architecture, compute capability 2.x), may easily be adapted to other models, e.g., those of compute capability 1.3.

\section{General structure}
Algorithm \ref{algo:memcopy} describes how data is handled in our code.
Input image data is stored in the GPU's texture memory so as to benefit from its 2-D caching mechanism. After kernel execution, the output image is copied back to CPU memory through pinned memory, which drastically accelerates the data transfer.

\begin{algorithm}
%\SetNlSty{textbf}{}{:}
\footnotesize
 allocate and populate CPU memory \textbf{h\_in}\;
 allocate CPU pinned-memory \textbf{h\_out}\;
 allocate GPU global memory \textbf{d\_out}\;
 declare GPU texture reference \textbf{tex\_img\_in}\;
 allocate GPU array \textbf{array\_img\_in}\;
 bind \textbf{array\_img\_in} to texture \textbf{tex\_img\_in}\;
 copy data from \textbf{h\_in} to \textbf{array\_img\_in}\label{algo:memcopy:H2D}\;
 kernel\kl gridDim,blockDim\kr\tcc*[f]{to d\_out}\label{algo:memcopy:kernel}\;
 copy data from \textbf{d\_out} to \textbf{h\_out} \label{algo:memcopy:D2H}\;
\caption{Global memory management on CPU and GPU sides.}
\label{algo:memcopy}
\end{algorithm}
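
As an illustration only, the steps of Algorithm \ref{algo:memcopy} may be sketched in CUDA as follows. This is a minimal sketch under stated assumptions, not the code used for the results reported in this paper. The names \texttt{CUDA\_TRY}, \texttt{copy\_kernel} and \texttt{run\_pipeline} are hypothetical, pixels are assumed to be 8-bit, and the kernel merely copies each pixel as a stand-in for a median kernel. Since texture references were removed from the CUDA toolkit in version 12, the sketch swaps in a texture object, which requires compute capability 3.0 or higher and therefore does not run on the Fermi cards targeted here.

{\footnotesize
\begin{verbatim}
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <cstring>

// Hypothetical helper: report the error and give up.
// Early returns leak resources; acceptable in a sketch.
#define CUDA_TRY(call)                            \
  do {                                            \
    cudaError_t e_ = (call);                      \
    if (e_ != cudaSuccess) {                      \
      fprintf(stderr, "CUDA error: %s\n",         \
              cudaGetErrorString(e_));            \
      return false;                               \
    }                                             \
  } while (0)

// Placeholder kernel (assumption): copies one pixel
// per thread from the texture to global memory.
__global__ void copy_kernel(cudaTextureObject_t tex,
                            unsigned char *d_out,
                            int W, int H)
{
  int x = blockIdx.x * blockDim.x + threadIdx.x;
  int y = blockIdx.y * blockDim.y + threadIdx.y;
  if (x < W && y < H)
    d_out[y * W + x] =
      tex2D<unsigned char>(tex, x + 0.5f, y + 0.5f);
}

// Runs the steps of the algorithm on a W x H test
// image; returns true if h_out matches h_in.
bool run_pipeline(int W, int H)
{
  size_t n = (size_t)W * H;

  // allocate and populate CPU memory h_in
  unsigned char *h_in = (unsigned char *)malloc(n);
  if (!h_in) return false;
  for (size_t i = 0; i < n; ++i)
    h_in[i] = (unsigned char)(i % 251);

  // allocate CPU pinned memory h_out
  unsigned char *h_out = NULL;
  CUDA_TRY(cudaMallocHost((void **)&h_out, n));

  // allocate GPU global memory d_out
  unsigned char *d_out = NULL;
  CUDA_TRY(cudaMalloc((void **)&d_out, n));

  // allocate GPU array array_img_in
  cudaChannelFormatDesc desc =
    cudaCreateChannelDesc<unsigned char>();
  cudaArray_t array_img_in;
  CUDA_TRY(cudaMallocArray(&array_img_in, &desc, W, H));

  // texture object in place of the texture reference
  cudaResourceDesc res;
  memset(&res, 0, sizeof(res));
  res.resType = cudaResourceTypeArray;
  res.res.array.array = array_img_in;
  cudaTextureDesc td;
  memset(&td, 0, sizeof(td));
  td.addressMode[0] = cudaAddressModeClamp;
  td.addressMode[1] = cudaAddressModeClamp;
  td.filterMode = cudaFilterModePoint;
  td.readMode = cudaReadModeElementType;
  td.normalizedCoords = 0;
  cudaTextureObject_t tex_img_in = 0;
  CUDA_TRY(cudaCreateTextureObject(&tex_img_in, &res,
                                   &td, NULL));

  // copy data from h_in to array_img_in
  CUDA_TRY(cudaMemcpy2DToArray(array_img_in, 0, 0,
                               h_in, W, W, H,
                               cudaMemcpyHostToDevice));

  // kernel launch, writing to d_out
  dim3 block(16, 16);
  dim3 grid((W + 15) / 16, (H + 15) / 16);
  copy_kernel<<<grid, block>>>(tex_img_in, d_out, W, H);
  CUDA_TRY(cudaGetLastError());

  // copy data from d_out to pinned h_out
  CUDA_TRY(cudaMemcpy(h_out, d_out, n,
                      cudaMemcpyDeviceToHost));

  bool same = (memcmp(h_in, h_out, n) == 0);

  cudaDestroyTextureObject(tex_img_in);
  cudaFreeArray(array_img_in);
  cudaFree(d_out);
  cudaFreeHost(h_out);
  free(h_in);
  return same;
}
\end{verbatim}
}

On a working device, a call such as \texttt{run\_pipeline(512, 512)} is expected to return \texttt{true}. Replacing \texttt{copy\_kernel} with a median kernel would leave the host-side steps unchanged.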

% references section
\bibliographystyle{spbasic}

\bibliography{biblio3}



% that's all folks
\end{document}

%doi = {10.5201/ipol.2011.bcm_nlm},