paper_fast_median/paper_fast_median_springer.tex~

   1
   2 \documentclass[twocolumn, final]{svjour2}
   3
   4 \usepackage[square, numbers, sort]{natbib}
   5
   6 \usepackage{transparent}
   7 %\usepackage[pdftex]{graphicx,color}
   8 %   \graphicspath{{imgfs/}}
   9 %   \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
  10
  11   %\usepackage[dvips]{graphicx}
  12   \usepackage{graphicx}
  13   %\graphicspath{{imgfs/}}
  14   \DeclareGraphicsExtensions{.jpg}
  15
  16 % *** MATH PACKAGES ***
  17 %
  18 \usepackage[cmex10]{amsmath}
  19
  20 % *** SPECIALIZED LIST PACKAGES ***
  21 %
  22 \usepackage[ruled,lined,linesnumbered]{algorithm2e}
  23
  24 % *** ALIGNMENT PACKAGES ***
  25 %
  26 \usepackage{array}
  27 \usepackage{mdwmath}
  28 \usepackage{mdwtab}
  29
  30 % *** SUBFIGURE PACKAGES ***
  31 \usepackage[caption=false,font=footnotesize]{subfig}
  32
  33 % *** FLOAT PACKAGES ***
  34 %
  35 \usepackage{fixltx2e}
  36
  37 \journalname{Real Time Image Processing}
  38
  39 \begin{document}
  40 %
  41 % paper title
  42 % can use linebreaks \\ within to get better formatting as desired
  43 \title{Fast GPU-based denoising filter using isoline levels}
  44
  45
  46 % author names and affiliations
  47 % use a multiple column layout for up to two different
  48 % affiliations
  49
  50 \author{
  51 Gilles Perrot$^1$ \and
  52 St\'{e}phane Domas$^1$ \and
  53 Rapha\"{e}l Couturier$^1$ \and
  54 \\Nicolas Bertaux$^2$}
  55
  56 \institute{
  57 $^1$ \at FEMTO-ST institute\\
  58 Rue Engel Gros, 90000 Belfort, France.\\
  59 forename.name@univ-fcomte.fr
  60 \and $^2$ \at Institut Fresnel, CNRS, Aix-Marseille Universit\'e, Ecole Centrale Marseille,\\
  61 Campus de Saint-J\'er\^ome, 13013 Marseille, France.\\
  62 nicolas.bertaux@ec-marseille.fr
  63 }
  64
  65 \date{Received: date / Revised: date}
  66
  67 % make the title area
  68 \maketitle
  69
  70 \begin{abstract}
  71 In this study, we propose to address the issue of image denoising by means of a GPU-based filter, able to achieve high-speed processing by taking advantage of the parallel computation capabilities of modern GPUs. Our approach is based on the level sets theory first introduced by \citep{matheron75} in 1975 but little implemented because of its high computation costs. What we actually do is try to guess the best isoline shapes inside the noisy image. At first, our method involved the polyline modelling of isolines; then we found an optimization heuristics which very closely fits the capabilities of GPUs. So far, though our proposed hybrid PI-PD filter has not achieved the best denoising levels, it is nonetheless able to process a 512x512 image in about 11~ms.
  72 \end{abstract}
  73
  74 \section{Introduction}
  75 Denoising has been a much studied research issue since electronic transmission was first used. The wide range of applications that involve denoising makes it uneasy to propose a universal filtering method. Among them, digital image processing is a major field of interest as the number of digital devices able to take pictures or make movies is growing fast and shooting is rarely done in optimal conditions. Moreover, the increase in pixel density of the CCD or CMOS sensors used to measure light intensity leads to higher noise effects and imposes high output flow rates to the various processing algorithms.
  76
  77 In addition, it is difficult to quantify the quality of an image processing algorithm, as visual perception is subject to high variation from one human to another.
  78 So far, the advent of GPUs has brought high speedups to a lot of algorithms, and many researchers and developpers have successfully adressed the issue of implementing existing algorithms on such devices. For example in \citep{mcguire2008median},\citep{chen09} and \citep{sanchezICASSP12}, authors managed to design quite fast median filters. Bilateral filtering has also been successfully proposed in \citep{YangTA09}.
  79 Still, most high quality algorithms, like NL-means \citep{ipol.2011.bcm_nlm} or BM3D \citep{Dabov09bm3dimage} make use of non-local similarities and/or frequency domain transforms. However, speedups achieved by their current GPU implementations, though quite sigificant (as shown for example with NL-means in \citep{PALHANOXAVIERDEFONTES}), do not come near those achieved by local methods such as gaussian, median or neighborhood filters, as they have not originally been designed against GPU architecture.
  80 In order to fully benefit from the capabilities of GPUs, it is important that the approach to designing algorithms be more hardware-oriented, keeping in mind, from the very beginning, the intrinsic constraints of the device which is actually going to run those algorithms. Consequently, this often results in unusual options and even apparently sub-optimal solutions, but the considerable speed benefits obtained would possibly make it at least a good compromise or even the only current way to real-time high-definition image processing.
  81
  82
  83
  84 \section{\label{contrib}Contribution}
  85 As early as 1975 \citep{matheron75}, it was found that, under the conditions mentioned in section \ref{isolines}, an image can be decomposed into a set of level lines. Accordingly, real-life images fulfill the above conditions and since then, with the increase of computing capabilities, researchers have succeded in implementing such level-lines based algorithms as in \citep{caselles97} and \citep{springerlink:10.1007/3-540-48236-9_16}.
  86 A few years ago, in \citep{bertaux:04}, authors  proposed an original method which significantly reduces speckle noise inside coherent images, using the level lines in the image to constrain the minimization process.
  87 Those level lines are actually \textit{iso-gray-level} lines, which are called \textit{isolines}. In \citep{bertaux:04}, isolines consist in neighborhoods of polyline shapes determined by maximum likelihood optimization. This method proved not only to bring good enhancement but also to preserve edges between regions. Nevertheless, the costs in computation time, though not prohibitive, did not allow real-time image processing; as an example, the authors of \citep{bertaux:04} managed to process an almost 2Mpixel image within a minute on an old PIII-1GHz.
  88
  89 Our work started by designing a set of GPU implementations with various optimization heuristics, in order to find out which tracks could be followed  towards minimizing loss in quality and preserve admissible execution times. Those algorithms have been tested with reference images taken from \citep{denoiselab} for which various processing results have been published. Some of the more interesting ones are listed and compared in  \citep{denoisereview}.
  90 Statistical observations (to be detailed below) made on the output images produced by the method proposed in \citep{bertaux:04}, led us to propose a very fast and simple parallel denoising method which gives good results in terms of average gray-level error, but also avoids the blurring of edges.
  91
  92 On the basis of the BM3D timings listed in \citep{Dabov09bm3dimage} and with our own measurements, our proposed GPU-based filter runs around 350 times faster and thus is able to process high definition images at over 16fps. It also achieves good denoising quality.
  93
  94 \section{Plan}
  95 In the following, section \ref{GPUgeneralites} briefly focuses on recent Nvidia GPU characteristics. Section \ref{isolines} will introduce the theory and notations used to define isolines. Then, in section \ref{lniv0}, we will describe the two isoline based models that led to the final hybrid model, while section \ref{lniv} details the parallel implementation of the proposed algorithm. Finally, we present our results in section \ref{results} before drawing our conclusions and outlining our future works in section \ref{conclusion}.
  96
  97
  98 \section{\label{GPUgeneralites}NVidia's GPU architecture}
  99 GPUs are multi-core, multi-threaded processors, optimized for highly parallel computation. Their design focuses on a Single Instruction Multiple Threads (SIMT) model that devotes more transistors to data processing rather than data-caching and flow control (see \citep{CUDAPG} for more details).
 100 For example, a C2070 card features 6~GBytes global memory and a total of 448 cores bundled in several Streaming Multiprocessors (SM). An amount of shared memory, much faster than global memory, is avalaible on each SM (up to 48~KB for a C20xx card)
 101
 102 Writing efficient code for such architectures is not obvious, as re-serialization must be avoided as much as possible. Thus, code design requires one pays attention to a number of points, among which:
 103 \begin{itemize}
 104  \item the CUDA model organizes threads by a) thread blocks in which synchronization is possible, b) a grid of blocks with no possible synchronization between them.
 105   \item there is no way to know how blocks are scheduled during one single kernel execution.
 106   \item data must be kept in GPU memory, to reduce the overhead generated by copying between CPU and GPU.
 107   \item the total amount of threads running the same computation must be as large as possible.
 108   \item the number of execution branches inside one block should be as small as possible.
 109   \item global memory accesses should be coalescent, \emph{i.e.} memory accesses done by physically parallel threads (2 x 16 at a time) must be consecutive and contained in a 128 Bytes range.
 110   \item shared memory is organized in 32x32 bit-wide banks. To avoid bank conflicts, each parallel thread (2 x 16 at a time) must access a different bank.
 111 \end{itemize}
 112
 113 All the above characteristics always make designing efficient GPU code all the more constraining as non-suited code would probably run even slower on GPU than on CPU.
 114
 115 \section{\label{isolines}Isolines}
 116 In the following, let $I$ be the reference noiseless image (assuming we have one), $I'$ the noisy acquired image corrupted by Independent and Identically Distributed (IID) additive white gaussian noise of zero mean value and standard deviation $\sigma$. Let $\widehat{I}$ be the denoised image. Each pixel of $I'$ of coordinates $(i,j)$ has its own gray level $z(i,j)$.
 117
 118 As introduced above and since most common images are continuous and contain few edges, they can be decomposed into a set of constant gray level lines called \textit{isolines}. Then our goal is to find, for each single pixel of a noisy image, the isoline it belongs to.
 119 The generalized likelihood criterion (GL) is used to select the best isoline among all the considered ones, all of which must have the same number of pixels in order to be compared.
 120
 121 \subsection{Fixed-length isolines}
 122 For each pixel $(i,j)$ of the corrupted image, we look for the gray level of the isoline it belongs to, inside a rectangular window $\omega$ centered on $(i,j)$. Inside $\omega$, let $S^n$ be the isoline part which the center pixel belongs to. $S^n$ is a set of $n$ pixel positions $(i_q,j_q)$ ($q \in [0;n[$).\\
 123 The gray levels $z$ along $S^n$ follow a gaussian probability density function whose parameters $\mu_{S^n}$ (mean value of isoline part) and $\sigma$ (standard deviation brought by gaussian noise ) are unknown.\\
 124 Let $\overline{S^n}$ be defined by $\omega = S^n \cup \overline{S^n}$.\\
 125 For each pixel, the mean values $\mu_{ij}$ of gray levels $z$ over $\overline{S^n}$ are unknown and supposed independant .\\
 126 Let $Z$ be the gray levels of pixels in $\omega$ and $\left\{\mu_{ij}\right\}_{\overline{S^n}}$ the mean values of pixels in $\overline{S^n}$.
 127 The likelihood is given by:
 128 $$
 129 \displaystyle P \left[Z | S^n, \mu_{S^n}, \left\{\mu_{ij}\right\}_{\overline{S^n}}, \sigma \right]
 130 $$
 131 When separating contributions from regions $S^n$ and $\overline{S^n}$, it becomes:
 132 \begin{eqnarray}
 133 \displaystyle \prod_{(i,j)\in S^n}{P\left[z(i,j) | \mu_{S^n}, \sigma \right]} . \displaystyle\prod_{(i,j)\in \overline{S^n}}{P\left[z(i,j) | \left\{\mu_{ij}\right\}_{\overline{S^n}}, \sigma \right]}
 134 \label{LL2}
 135 \end{eqnarray}
 136 The goal is then to estimate the value of the above expression, in order to find the boundaries of $S^n$ that maximize expression \eqref{LL2}.\\
 137 Let us consider that, on $\overline{S^n}$, the values $z(i,j)$ are the likelihood estimations $\widehat{\mu_{ij}}$ for $\mu_{ij}$. The second term of expression \eqref{LL2} becomes:
 138 \begin{eqnarray}
 139 \displaystyle\prod_{(i,j)\in \overline{S^n}}{P\left[z(i,j) | \left\{\widehat{\mu_{ij}}\right\}_{\overline{S^n}}, \sigma \right]=1}
 140 \end{eqnarray}
 141 which leads to the generalized likelihood expression:
 142 \begin{eqnarray}
 143 \displaystyle \prod_{(i,j)\in S^n}{P\left[z(i,j) | \mu_{S^n}, \sigma \right]}
 144 \label{GL}
 145 \end{eqnarray}
 146 As we know the probability density function on $S^n$, \eqref{GL} can then be developped as
 147 \begin{eqnarray}
 148 \displaystyle \prod_{(i,j)\in S^n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{\left(z(i,j)-\mu_{S^n}\right)^2}{2\sigma^2}}
 149 \label{GL2}
 150 \end{eqnarray}
 151
 152 The log-likelihood is then given by:
 153 \begin{eqnarray}
 154 \displaystyle -\frac{n}{2}log\left(2\pi\right) - \frac{n}{2}log\left(\sigma^2\right) - \frac{n}{2}
 155 \label{LL1}
 156 \end{eqnarray}
 157 inside which the vector of parameters $(\mu_{S^n}, \sigma)$ is determined by maximum likelihood estimation
 158 $$
 159 \left(
 160 \begin{array}{l}
 161 \widehat{\mu_{S^n}} = \displaystyle\frac{1}{n} \sum_{(i,j)\in S^n} z(i,j) \\
 162 \widehat{\sigma^2} = \displaystyle\frac{1}{n} \sum_{(i,j)\in S^n} \left(z(i,j) - \widehat{\mu_{S^n}}\right)^2 \\
 163 \end{array}
 164 \right.
 165 $$
 166 The selection of the best isoline is done by searching which one maximizes the expression of equation \eqref{LL1}.
 167
 168 \begin{figure}[h]
 169 \centering \includegraphics[width=0.75\linewidth]{isolines_1.jpg}
 170 \caption{Determination and lengthening of an isoline: The gray level $z$ of each pixel is seen as an elevation value. $S^n$ is the $n$ pixel length isoline for pixel of coordinates $(i, j)$. The elongation of $S^n$ by $S^p$ ($p$ pixel length) is submitted to the GLRT condition (see eq. \eqref{GLRT}).}
 171 \label{si3}
 172 \end{figure}
 173
 174
 175 \subsection{Lengthenable  isolines}
 176 Searching for larger isolines should lead to better filtering as a larger number of pixels would be involved.
 177 However, processing all possible isolines starting from each pixel would be too costly in computing time, even in the case of a small GPU-processed 512x512 pixel image. Therefore, we chose to build large isolines inside an iterative process including a mandatory validation stage between each lengthening iteration, so as to reduce the number of pixel combinations to be examined and keep the estimation of deviation $\sigma$ within a satisfactory range of values.
 178
 179 Let $S^n$ be a previously selected isoline part and $S^p$ connected to $S^n$ in such a way that $S^p$ could be seen as an addition to $S^n$ so as to define a possible valid isoline $S^{n+p}$.
 180 Figure \ref{si3} illustrates this situation with a very simple example image. In this figure, the gray level of each pixel is used as its corresponding height ($z$) in order to visualize isolines easily. Some of the orthogonal isoline projections have been drawn in dotted line in the $(\vec{i},\vec{j})$ plane. Both labeled parts $S^p$ and $S^n$ are represented in the $(\vec{i},\vec{j})$ plane and in the 3D associated plot.
 181
 182 In order to decide whether $S^{n+p}$ can be considered as an actual isoline, we compare the log-likelihood of both hypothesis below by using GLRT (Generalized Likelihood Ratio Test):
 183
 184 First, assuming that $S^{n+p}$ is an isoline, the gray levels of its pixels share the same mean value $\mu_{n+p}$. According to \eqref{LL1}, its log-likelihood is
 185 \begin{eqnarray}
 186 \displaystyle -\frac{(n+p)}{2}\left(log\left(2\pi\right)+1\right) - \frac{(n+p)}{2}log\left(\widehat{\sigma_1}^2\right)
 187 \label{LLNP}
 188 \end{eqnarray}
 189 where $\widehat{\sigma_1}$ is the estimation of the standard deviation along $S^n$.
 190
 191 Second, considering $S^n$ and $S^p$ as two separate isoline parts connected together, the gray levels of their pixels have two different mean values $\mu_n$ and $\mu_p$. The log-likelihood is the sum of both log-likelihoods, given by
 192 \begin{eqnarray}
 193 \displaystyle -\frac{(n+p)}{2}\left(log\left(2\pi\right)+1\right) - \frac{n}{2}log\left(\widehat{\sigma_2}^2\right) - \frac{p}{2}log\left(\widehat{\sigma_2}^2\right)
 194 \label{LLNP2}
 195 \end{eqnarray}
 196 where $\widehat{\sigma_2}$ is the estimation of the standard deviation along $S^n$ and $S^p$.
 197
 198 The difference between \eqref{LLNP} and \eqref{LLNP2} leads to the expression of $GLRT(S^{n+p},S^n, S^p, T_{max})$:
 199 \begin{eqnarray}
 200 T_{max}- (n+p).\left[log\left(\widehat{\sigma_1}^2\right) - log\left(\widehat{\sigma_2}^2\right) \right]
 201 \label{GLRT}
 202 \end{eqnarray}
 203
 204 The decision to validate lengthening from $S^n$ to $S^{n+p}$ depends whether $GLRT(S^{n+p},S^n, S^p, T_{max})$ is higher or lower than $0$. Value $T_{max}$ is the GLRT threshold.
 205
 206 \section{\label{lniv0}Isoline models}
 207 The most obvious model considers isolines as polylines. Each isoline can then be curved by allowing a direction change at the end of each segment; we shall call such isolines \textit{poly-isolines}.
 208
 209 In order to keep the number of candidate isolines within reasonable range, we chose to build them by combinating segments described by simple pre-computed patterns. Each pattern $p_{l,d}$ describes a segment of length $l$ and direction $d$. For one given $l$ value, all $p_{l,d}$ patterns are grouped into a matrix denoted $P_l$. Figure \ref{p5q1} shows an example of such a pattern matrix for $l=5$.
 210
 211 To fit the GPU-specific architecture, we define regularly distributed $D$ primary directions ($D=32$ in our examples).
 212
 213
 214 \subsection{\label{lniv2}Poly-isolines with limited deviation angle (PI-LD)}
 215 At one stage we implemented an algorithm parsing the tree of all possible polyline configurations, but the process proved far too slow regarding our goal, even on GPU, because of the amount of memory  involved (and consequent memory accesses) and because of the necessary reduction stage for which GPU efficiency is not maximum.
 216 So we focused on a variant inspired by \citep{bertaux:04} in which the selected direction of the next segment depends on the whole of the previously built and validated poly-isoline.
 217
 218 Let us consider a poly-isoline $S^n$ under construction, starting from pixel $(i,j)$ and made of $K$ validated segments $s_k~(k \in [1;K])$ of length $l$, each of them having its own direction $d_k$. The coordinates of the ending pixel of each segment $s_k$ are denoted $(i_k, j_k)$.
 219 Both of the following sums
 220 \begin{eqnarray}
 221 C_x\left(Z(S^n)\right) &=& \displaystyle\sum_{(i,j)\in S^n}z(i,j)\label{cx}\\
 222 and~~C_{x^2}\left(Z(S^n)\right)&=& \displaystyle\sum_{(i,j)\in S^n}z(i,j)^2\label{cx2}
 223 \end{eqnarray}
 224 have been obtained during the previous lengthening steps.
 225
 226 Let us examine now how to decide wether to add a new segment to $S^n$ or to stop the lenghtening process.
 227 The main idea is to apply each pattern $p_{l,d}$ to the ending pixel $(i_k,j_k)$, on the condition that its direction is contained within the limits of maximum deviation $\Delta d_{max}$.
 228 Maximum deviation $\Delta d_{max}$ prevents poly-isolines from beeing of circular shape (or backward-oriented) which would possibly generate supplementary artefacts in the output image. Another of its benefits is to reduce the number of combinations to be evaluated.
 229
 230 For each allowed pattern, GLRT is performed in order to decide if the corresponding segment could likely be added to the end of the poly-isoline $S^n$.
 231 If none is validated by GLRT, the poly-isoline $S^n$ is stopped.
 232
 233 If at least one segment has been accepted by GLRT, the one that leads to the maximum likelihood (ML) value of the lengthened poly-isoline $S^{n+l}$ is selected and integrated to $S^{n+l}$ as $s_{K+1}$.
 234
 235 In order to avoid critical situations where the first selected segment would not share the primary direction of the actual poly-isoline,
 236 no selection is performed on the level of the first segment; $D$ poly-isolines are kept and submitted to the lengthening process. To ensure isotropy, each of them shares the direction of one pattern $p_{l,d}~ (d \in[0;D])$.
 237
 238 Eventually, the poly-isoline with the maximum likelihood value is selected among the longest ones.
 239
 240 Figure \ref{pild} illustrates one stage of the lengthening process with the example of a two-segment poly-isoline  at the beginning of stage ($l=5$ and $\Delta d_{max}=2$).
 241
 242
 243 \begin{figure}[h]
 244  \centering
 245 \subfloat[Isoline with two validated segments $s_1$ and $s_2$.]{\label{pild:debut} \includegraphics{PI-LD_detail_3.jpg}}\qquad
 246 \subfloat[First evaluated segment, corresponding to pattern $p_{5,0}$.]{\label{pild:sub1} \includegraphics{PI-LD_detail_sub1.jpg}}\\
 247 \subfloat[Second evaluated segment, corresponding to pattern $p_{5,1}$.]{\label{pild:sub2} \includegraphics{PI-LD_detail_sub2.jpg}}\qquad
 248 \subfloat[Third evaluated segment, corresponding to pattern $p_{5,2}$.]{\label{pild:sub3} \includegraphics{PI-LD_detail_sub3.jpg}}\\
 249 \subfloat[Fourth evaluated segment, corresponding to pattern $p_{5,3}$.]{\label{pild:sub4} \includegraphics{PI-LD_detail_sub4.jpg}}\qquad
 250 \subfloat[Fifth evaluated segment, corresponding to pattern $p_{5,4}$.]{\label{pild:sub5} \includegraphics{PI-LD_detail_sub5.jpg}}\\
 251 \caption{Example of lengthening process starting with a two-segment poly-isoline ($l=5$, $\Delta d_{max}=2$).
 252 The initial situation is shown in \ref{pild:debut}, while \ref{pild:sub1} to \ref{pild:sub5} represent the successive candidate segments.
 253 The direction index of the last validated segment is $d_2=2$ (\ref{pild:debut}). It implies that direction indexes allowed for the third segment range from $d_2-\Delta d{max}=0$ to $d_2+\Delta d{max}=4$ (\ref{pild:sub1} to \ref{pild:sub5}).
 254 The lengthening of the poly-isoline is accepted if at least one segment has a positive GLRT. If there are several, the one which minimizes the standard deviation of the whole poly-isoline is selected.
 255 }
 256 \label{pild}
 257 \end{figure}
 258
 259 \renewcommand{\labelenumi}{\alph{enumi})}
 260 \renewcommand{\theenumi}{\alph{enumi})}
 261
 262 \subsection{\label{lniv3}Poly-isolines with precomputed directions (PI-PD)}
 263 Though much faster, the PI-LD-based filter may be considered a bit weak compared to \textit{state-of-the-art} filters like BM3D family algorithms \citep{Dabov09bm3dimage}. Furthermore, we saw that this way of building poly-isolines  requires the alternate use of two different types of validation at each lengthening stage: GLRT and maximum likelihood minimization. In order to be performed, each of them generates numerous branches during kernel execution, which does not fit GPU architecture well and leads to execution times that we hoped would be more impressive.
 264
 265 Within the PI-LD model, at each pixel $(i,j)$, as no selection is done at the first stage, $D$ poly-isolines are computed and kept as candidate though, obviously, only one follows the actual isoline at $(i,j)$. So, if we assume we can achieve a robust determination of the direction at any given pixel of this isoline, it becomes unnecessary to perform the selection at each lenghtening step.
 266 Thus, at each pixel $(i,j)$, only the first segment has to be determined in order to obtain the local direction of the isoline. This leads to an important reduction of the work complexity: the above PI-LD model needs to evaluate $D.\left(2.\Delta_{dmax}+1\right)^{K-1}$ segments at each pixel position, while only $D.K$ evaluations are needed in the second case. For example, with a maximum of $K=5$ segments and a maximum deviation of $\Delta_{dmax}=2$, the PI-LD needs to evaluate up to 20000 segments per pixel where only 160 should be enough.
 267
 268
 269 On the basis of these observations, we propose a new model that we shall call PI-PD, that completely separates the validation stages performed in the PI-LD model implementation mentioned above:
 270
 271 A first computation stage selects the best first segment $s_1$ starting at each pixel $(i,j)$ of the input image. Its direction index $d_1(i,j)$ is then stored in a reference matrix denoted $I_\Theta$; sums $C_x$ and $C_{x2}$ along $s_1(i,j)$ are also computed and stored in a dedicated matrix $I_\Sigma$. It can be noticed that this selection method of $s_1$ segments is a degraded version of PI-LD constrained by $K=1$.
 272
 273 A second stage manages the now independant lengthening process. For one given state of a poly-isoline where the last added segment has been $s_K$, the pattern whose direction index is given by $d=I_\Theta(i_K,j_K)$ defines the only segment to be evaluated.
 274 Both corresponding sums  $C_x$ and $C_{x2}$ are read from matrix $I_\Sigma$ and used in GLRT evaluation.
 275 The last point is to prevent poly-isolines from turning back.
 276
 277 Figure \ref{pipd} details this process, starting from the same initial state as in figure \ref{pild} with the noticeable difference that no deviation limit is needed.
 278
 279 Thus, as introduced above, work complexity is considerably reduced, as each pattern is only applied once at one given pixel $(i,j)$, and   associated values are computed only once; they are re-used every time one poly-isoline's segment ends at pixel $(i,j)$. Also, this fits GPU constraints better, as it avoids multiple branches during kernel execution.
 280 It remains that, the building of poly-isolines is done without global likelihood optimization.
 281
 282 Eventually, the model has been improved by adding to it the ability to thicken poly-isolines from one pixel up to three which allows to achieve higher PSNR values by increasing the number of pixels of poly-isolines in addition to the lengthening process.
 283 This may apply to large images which do not contain small relevant details, as it may blur small significant details or objects present in the noisy image.
 284 Still, this feature makes PI-PD more versatile than our reference BM3D, which has prohibitive computation times when processing large images (over 5 minutes for a 4096x4096 pixel image) and thus should require a slicing stage prior to processing them, causing some overhead.
 285
 286 \begin{figure}[h]
 287  \centering
 288 \subfloat[Poly-isoline with two validated segments.]{\label{pipd:pipd1} \includegraphics[width=2.3cm]{PI-PD_detail_sub1.jpg}}\qquad
 289 \subfloat[Next direction is read from element $(i_2,j_2)$ of $I_{\Theta}$.]{\label{pipd:pipd2} \includegraphics[width=5.0cm]{PI-PD_detail_sub2.jpg}}\\
 290 \subfloat[Pattern $p_{l,d_3}$ is then applied at $(i_2,j_2)$ and GLRT is performed. Both sums needed to perform GLRT are read from element $(i_2,j_2)$ of $I_{\Sigma}$.]{\label{pipd:pipd3} \includegraphics[width=4cm]{PI-PD_detail_sub3.jpg}}\qquad
 291 \subfloat[If accepted by GLRT, segment $s_3$ is added to poly-isoline.]{\label{pipd:pipd4} \includegraphics[width=2.7cm]{PI-PD_detail_sub4.jpg}}\\
 292 \caption{Example of PI-PD lengthening process starting with a two-segment poly-isoline ($l=5$).
 293 The initial situation is represented in \ref{pipd:pipd1}, while \ref{pipd:pipd1} to \ref{pipd:pipd4} represent the successive processing steps.
 294 The end pixel of the last validated segment is $(i_2,j_2)$ (\ref{pipd:pipd1}). Reference matrices \(I_{\Theta}\) and \(I_{\Sigma}\) provide the values needed to select the pattern to be applied on \((i_2,j_2)\) (\ref{pipd:pipd2} and \ref{pipd:pipd3}). GLRT is performed to validated lengthening or not. This process goes on until one submitted segment does not comply with GLRT.
 295 }
 296 \label{pipd}
 297 \end{figure}
 298
 299
 300 \subsection{\label{pipd_plan}Hybrid PI-PD}
 301 As the determination of each segment's direction only involves a few pixels,
 302 the PI-PD model may not be robust enough in regions where the surface associated with $Z$ has a low local slope value
 303 regarding power of noise $\sigma^2$.
 304 We shall call those regions Low Slope Regions (LSR).
 305 Figure \ref{img_plans} shows this lack of robustness with an example of two drawings of additive white gaussian noise applied on the same reference image (Figure \ref{img_window}). Within this image, we focused on a small 11x11 pixel window containing two LSR with one sharp edge between them.
 306
 307 Figures \ref{fig:dir1} and \ref{fig:dir2} show that the directions computed by PI-PD are identical from one drawing to the other near the edge (lines 5-7), while they vary in LSR (lines 1-4, 8-11).
 308
 309 \begin{figure}[h]
 310  \centering
 311 \subfloat[Reference image]{\label{fig:ref} \includegraphics{./zoom_edge_ref2.jpg}}\\
 312 \subfloat[Image corrupted by random drawing $n^{\circ}1$]{\label{fig:noisy1} \includegraphics{./zoom_edge_bruit.jpg}}\qquad
 313 \subfloat[Image corrupted by random drawing $n^{\circ}2$]{\label{fig:noisy2} \includegraphics{./zoom_edge2_bruit.jpg}}\\
 314 \subfloat[Isoline directions for random drawing $n^{\circ}1$]{\label{fig:dir1} \includegraphics{./zoom_edge1_2D_superpose3.jpg}}\qquad
 315 \subfloat[Isoline directions for random drawing $n^{\circ}2$]{\label{fig:dir2} \includegraphics{./zoom_edge2_2D_superpose2.jpg}}\\
 316  \caption{Zoom on a small square window of the airplane image. \ref{fig:ref} reproduce the zoom on the window, taken from the reference image of Figure \ref{img_window}. \ref{fig:noisy1}, \ref{fig:noisy2} and \ref{fig:ref} and are 3D views where each bar represents a pixel whose gray-level corresponds to the height of the bar. Figures \ref{fig:dir1} and \ref{fig:dir2} are 2D top views of the window. The chosen window shows an edge between two regions of low slope. The images \ref{fig:noisy1} and \ref{fig:noisy2} are corrupted with two different random drawings of the same additive  white gaussian noise (AWGN) of power $\sigma^2$ and mean value $0$. \ref{fig:dir1} and \ref{fig:dir2} show, for each pixel of the window, the direction of the isoline found by PI-PD. In regions of low slope (the two regions at the top and the bottom), the determination of the direction is not robust. But near the edge, directions do not vary from one drawing to another.}
 317   \label{img_plans}
 318 \end{figure}
 319
 320 Within such regions, our speed goals forbid us to compute isoline directions with the PI-LD model, more robust
 321 but far too slow.
 322 Instead we propose a fast solution which implies designing an edge detector whose
 323 principle is to re-use the segment patterns defined in section \ref{lniv0} and to combine them by pairs in order to detect any possible LSR around the center pixel.
 324 If a LSR is detected, the output gray-level value is the average value computed on the current square window, otherwise, the PI-PD output value is used.
 325
 326 In order to further simplify computation, only the patterns that do not share any pixel are used.
 327 These patterns have a direction which is a multiple of $45^{\circ}$.
 328
 329 Each base direction $(\Theta_i)$ and its opposite $(\Theta_i + \pi) \left[2\pi\right]$ define a line that separates the square
 330 window in two regions (top and and bottom regions, denoted T and B).
 331 We assume that segments on the limit belong to the T region which includes pixels of orientation from $\Theta_i$ to $\Theta_i+\pi$. This region comprises three more segments of directions $(\Theta_i+\frac{\pi}{4})$, $(\Theta_i+\frac{2\pi}{4})$ and $(\Theta_i+\frac{3\pi}{4})$.
 332 The other region (B) only includes three segments of directions $(\Theta_i+\frac{5\pi}{4})$, $(\Theta_i+\frac{6\pi}{4})$ and $(\Theta_i+\frac{7\pi}{4})$.
 333
 334 Figure \ref{detect_plans} illustrates this organization for $\Theta_i=\Theta_4=45^{\circ}$. Each bar represents a pixel in the detector's window. Pixels with null height are not involved in the GLRT. Pixels represented by higher bars define the T region and those represented by shorter bars define the B region.
 335
 336 \begin{figure}[h]
 337 \begin{center}
 338  \includegraphics{./pattern_detecteur.jpg}
 339  \caption{\label{detect_plans}Edge detector. 3D view representing an example square 11x11 pixel window ($l=5$) used in the edge detector for $\Theta_4=45^{\circ}$ around a center pixel colored in black. Each pixel is represented by a bar. Bars of height value 0 are for pixels that are not involved in the detector. Top region is defined by five pattern segments and includes the center pixel. Bottomp region only includes three pattern segments. The different height values are meant to distinguish between each of the three different sets of pixels and their role. }
 340 \end{center}
 341 \end{figure}
 342
 343 For each $\Theta_i$, one GLRT is performed in order to decide whether the two regions T and B defined above are likely to be seen as a single region or as two different ones, separated by an edge as shown in figure \ref{detect_plans}. The center pixel is located on the edge.
 344 Equations \eqref{LLNP}, \eqref{LLNP2} and \eqref{GLRT} lead to a similar GLRT expression:
 345 \begin{eqnarray}
 346 T2_{max}- (8.l+1).\left[log\left(\widehat{\sigma_3}^2\right) - log\left(\widehat{\sigma_4}^2\right) \right]
 347 \label{GLRT2}
 348 \end{eqnarray}
 349 where $\sigma_3$ is the standard deviation considering that the two regions are likely to define a single one and $\sigma_4$ the standard deviation if an edge is more likely to separate the two regions. $T2_{max}$ is the decision threshold.
 350 With equation \eqref{GLRT2}, a negative result leads to an edge detection, oriented towards direction $\Theta_i$.
 351 When GLRT is known for each $\Theta_i$, we apply the following hybridation policy:
 352 \begin{enumerate}
 353 \item more than one negative GLRT: the PI-PD output value is used.
 354 \item only one negative GLRT: the center pixel is likely to be on a well-defined edge, and only the region it belongs to is considered. The average value of its pixel gray levels is then used.\label{halfplane}
 355 \item no negative GLRT: the window around the center pixel is likely to be a LSR. The average value on the whole square window is used (11x11 pixels in the example of Figure \ref{detect_plans}).
 356 \end{enumerate}
 357
 358 \begin{figure}[h]
 359  \centering
 360 \subfloat[Reference noiseless airplane image]{\label{img_window:ref} \includegraphics{./airplane.jpg}}\qquad
 361 \subfloat[Location of the example window in the reference image.]{\label{img_window:win} \includegraphics{./zoom_windows_A.jpg}}\\
 362  \caption{Location of the example window inside the reference image. Figure \ref{img_window:ref} shows the whole reference image and \ref{img_window:win} zooms on the part where the example 11x11 pixel window is.}
 363   \label{img_window}
 364 \end{figure}
 365
 366 It must be noticed that point \ref{halfplane} has been introduced in order to achieve smoother transitions between regions to which  PI-PD is applied and those in which the plain average value is used.
 367 Figure \ref{exbords} shows an example of such a classification achieved by the edge detector. The detector has been applied on the top noisy airplane image with a GLRT threshold value $T2_{max}=2$. Black pixels represent pixel classified as \textit{on an edge}, while white ones are those which belong to LSR.
 368
 369 \begin{figure}[h]
 370  \centering
 371 \subfloat[Noisy airplane image]{\label{exbords:noisy} \includegraphics{./airplane_noisy_small.jpg}}\qquad
 372 \subfloat[Pixel classification performed by the edge detector. ]{\label{exbords:bords} \includegraphics{./img_bords_T2_small.jpg}}\\
 373  \caption{Pixel classification inside the noisy image. Figure \ref{exbords:noisy} shows the noisy input image and \ref{exbords:bords} reproduces the output classification of pixels, as a black and white image, obtained with threshold value  $T2_{max}=2$. Black pixels are supposed to be near an edge, while white pixels belong to Low Slope Regions.}
 374   \label{exbords}
 375 \end{figure}
 376
 377 \section{\label{lniv} Hybrid PI-PD filter Implementation: details}
 378 All implementation details that will be given here are relative to the proposed PI-PD models and Nvidia$^\copyright~$ GPU devices.
 379
 380 \subsection{\label{genPaths}Segment patterns}
 381 The first kernel to be run is \texttt{kernel\_genPaths()} which generates matrix $P_l$. Its elements $(\Delta i; \Delta j)$ are the relative coordinates of the pixels which define segment patterns $p_{l,d}$. The dimensions of matrix $P_l$ are $D$ rows $\times$  $l$ columns.
 382 To fit GPU architecture as closely as possible, we chose $D=32$ patterns.
 383 Each segment $s_k$ of a poly-isoline can then be seen as a pattern $p_{l,d}$ applied on the starting pixel $(i,j)$ of this segment, denoted $p_{l,d}(i,j)$.
 384
 385 The example in figure \ref{p5q1} shows the first quarter of $P_5$ and the corresponding eight discrete segment patterns in the first quadrant. The three remaining quarters of the matrix are easily deduced by applying successive rotations of angle $\frac{\pi}{2}$ to the above elements.
 386
 387 \begin{figure}[h]
 388 \begin{center}
 389 \includegraphics[width=0.65\linewidth]{./P5Q1.jpg}
 390 \end{center}
 391 \tiny{
 392 $$
 393 P_5 =
 394 \begin{bmatrix}
 395 (0,1)&(0,2)&(0,3)&(0,4)&(0,5)\\
 396 (0,1)&(0,2)&(-1,3)&(-1,4)&(-1,5)\\
 397 (0,1)&(-1,2)&(-1,3)&(-2,4)&(-2,5)\\
 398 (-1,1)&(-1,2)&(-2,3)&(-3,4)&(-3,5)\\
 399 (-1,1)&(-2,2)&(-3,3)&(-4,4)&(-5,5)\\
 400 (-1,1)&(-2,1)&(-3,2)&(-4,3)&(-5,3)\\
 401 (-1,0)&(-2,1)&(-3,1)&(-4,2)&(-5,2)\\
 402 (-1,0)&(-2,0)&(-3,1)&(-4,1)&(-5,1)\\
 403 \ldots&\ldots&\ldots&\ldots&\ldots\\
 404 \end{bmatrix}
 405 $$
 406 }
 407 \caption{\label{p5q1}Top: example segment patterns $p_{5,d}$ for $d\in[0;7]$; the black pixel represents the center pixel $(i,j)$, which does not belong to the pattern. The gray ones define the actual pattern segments. Bottom: the first 8 lines of corresponding matrix $P_5$ whose elements are the positions of segment pixels with respect to the center pixel.}
 408 \end{figure}
 409
 410 \subsection{\label{sipd}Generation of reference matrices $I_{\Sigma}$ and $I_{\Theta}$}
 411 In order to generate both matrices, a GPU kernel named \texttt{kernel\_precomp()} computes, in parallel for each pixel $(i,j)$:
 412 \begin{itemize}
 413 \item the direction $\delta$ of the most likely segment $s_1 = p_{l,\delta}(i,j)$ among the $D$ possible ones. This value is stored in matrix $I_{\Theta}$ at position $(i,j)$.
 414 \item values $C_x(s_1)$ and $C_{x^2}(s_1)$ defined in equations \eqref{cx} and \eqref{cx2}. This vector of values is stored in matrix $I_{\Sigma}$ at position $(i,j)$.
 415 \end{itemize}
 416
 417 In order to reduce processing time, the input image is first copied into texture memory (see algorithm \ref{algoinit} for initializations and memory transfer details), thus taking advantage of the 2D optimized caching mechanism.
 418
 419 This kernel follows the \textit{one thread per pixel} rule. Consequently, each value of $P_l$ has to be accessed by every thread of a block. That led us to load it from texture memory first, then copy it into all shared memory blocks. This has proved to be the fastest scheme.
 420
 421 Algorithm \ref{algoprecomp} summarizes the computations achieved by \texttt{kernel\_precomp()}.
 422 Vector $(C_x, C_{x2})$ stores the values of $C_x(s_1)$ and $C_{x^2}(s_1)$ associated with the current tested pattern.
 423 Vector $(C_{x-best}, C_{x2-best})$ stores the values of $C_x(s_1)$ and $C_{x^2}(s_1)$ associated with the best previously tested pattern.
 424
 425 In the same manner, $\sigma$ and $\sigma_{best}$ are deviation values for current and best tested patterns.
 426
 427  The selection of the best pattern is driven by the value of the standard deviation of candidate isolines.
 428 Lines 2 and 3 compute both sums for the first pattern to be evaluated. Line 4 computes its standard deviation.
 429 Then, lines 5 to 14 loop on each pattern and keep values associated with the best pattern found. These values are eventually stored in matrices $I_{\Theta}$ and $I_{\Sigma}$ on lines 16 and 17.
 430
 431 \begin{algorithm}[htb]
 432   \SetNlSty{textbf}{}{:}
 433 \caption{Initializations in GPU memory}
 434 \label{algoinit}
 435 $l \leftarrow$ step size\;
 436 $D \leftarrow$ number of primary directions\;
 437 $I_n \leftarrow$ noisy image\;
 438 $I_{n tex} \leftarrow I_n $\tcc*[r]{copy to texture mem.}
 439 $P_l \leftarrow$ kernel\_genPaths \tcc*[r]{pattern matrix}
 440 $P_{l tex} \leftarrow P_l $\tcc*[r]{copy to texture mem.}
 441 $T_{max} \leftarrow$ GLRT threshold (lengthening)\;
 442 $T2_{max} \leftarrow$ GLRT threshold (edge detection)\;
 443 \end{algorithm}
 444
 445 \begin{algorithm}
 446   \SetNlSty{textbf}{}{:}
 447   \SetKwComment{Videcomment}{}{}
 448 \caption{generation of reference matrices, kernel \texttt{kernel\_precomp()}}
 449 \label{algoprecomp}
 450 \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{
 451   $C_{x-best} \leftarrow  \displaystyle\sum_{(y,x)\in p_{l,0}(i,j)} I_{n tex}(i+y,j+x)$ \;
 452   $C_{x2-best} \leftarrow  \displaystyle\sum_{(y,x)\in p_{l,0}(i,j)} I_{n tex}^2(i+y,j+x)$ \;
 453   $\sigma_{best} \leftarrow$ standard deviation along $p_{l,0}(i,j)$ \;
 454   %\Videcomment{}
 455   \tcc{loop on each pattern}
 456   \ForEach{$d \in [1;D-1]$}{
 457     $C_x \leftarrow \displaystyle\sum_{(y,x)\in p_{l,d}(i,j)} I_{n tex}(i+y,j+x)$\;
 458     $C_{x2} \leftarrow \displaystyle\sum_{(y,x)\in p_{l,d}(i,j)} I_{n tex}^2(i+y,j+x)$\;
 459     $\sigma \leftarrow$ standard deviation along $p_{l,d}(i,j)$\;
 460     \If(\tcc*[f]{keep the best}){$\sigma_d < \sigma_{best}$}{
 461       $C_{x-best} \leftarrow C_x$ \;
 462       $C_{x2-best} \leftarrow C_{x2}$ \;
 463       $\Theta_{best} \leftarrow d$ \;
 464     }
 465   }
 466   $I_{\Sigma}(i,j) \leftarrow \left[ C_{x-best}, C_{x2-best}\right]$ \tcc*[r]{stores}
 467   $I_{\Theta}(i,j) \leftarrow \Theta_{best}$ \tcc*[r]{in matrices}
 468 }
 469 \end{algorithm}
 470
 471 \begin{algorithm}[ht]
 472 \SetNlSty{textbf}{}{:}
 473 \caption{PI-PD lengthening process \texttt{kernel\_PIPD()}}
 474 \label{algoPIPD}
 475 \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{
 476  % \tcc{in parallel}
 477  % \tcc{starting pixel $(i,j)$ and first segment without GLRT}
 478   $(C_x^1, C_{x2}^1) \leftarrow z(i,j)$ \tcc*[r]{starting pixel}
 479   $(i_1, j_1) \leftarrow (i, j)$ \tcc*[r]{first segment}
 480   $(C_x^1, C_{x2}^1) \leftarrow I_{\Sigma}(i_1,j_1)$ \tcc*[r]{read matrix}
 481   $d_1 \leftarrow I_{\Theta}(i,j)$ \tcc*[r]{read matrix}
 482   $l_1 \leftarrow l$ \tcc*[r]{isoline length}
 483   $\sigma_1 \leftarrow (C_{x2}^1/l_1 - C_x^1)/l_1$\;
 484   $(i_2, j_2) \leftarrow end~of~first~segment$\;
 485   $(C_{x}^2, C_{x2}^2) \leftarrow I_{\Sigma}(i_2,j_2) $ \tcc*[r]{2$^{nd}$ segment}
 486   $d_2 \leftarrow I_{\Theta}(i_2,j_2)$\;
 487   $\sigma_2 \leftarrow (C_{x2}^2/l - C_x^2)/l$ \;
 488   %
 489   \While{$GLRT(\sigma_1, \sigma_2, l_1, l) < T_{max}$}{
 490     $l_1 \leftarrow l_1 + l$ \tcc*[r]{lengthening}
 491     $(C_x^1, C_{x2}^1) \leftarrow (C_x^1, C_{x2}^1)+(C_x^2, C_{x2}^2)$\;
 492     $\sigma_1 \leftarrow (C_{x2}^1/l_1 - C_x^1)/l_1$ \tcc*[r]{update}
 493     $(i_1,j_1) \leftarrow (i_2, j_2)$ \tcc*[r]{step forward}
 494     $d_1 \leftarrow d_2$\;
 495     $(i_2, j_2) \leftarrow end~of~next~segment$\;
 496     \tcc*[f]{next segment}
 497     $(C_{x}^2, C_{x2}^2) \leftarrow I_{\Sigma}(i_2,j_2) $\;
 498     $d_2 \leftarrow I_{\Theta}(i_2,j_2)$\;
 499     $\sigma_2 \leftarrow (C_{s2}^2/l - C_s^2)/l$ \;
 500     }
 501   }
 502   $\widehat{I}(i, j) \leftarrow C_x^1/l_1$ \tcc*[r]{isoline value}
 503 \end{algorithm}
 504
 505
 506 \subsection{\label{sipdl}PI-PD lengthening process:  \texttt{kernel\_PIPD()} }
 507 This parallel kernel is run in order to obtain the image of the \emph{isolines}. It is detailed in algorithm \ref{algoPIPD}, (see section \ref{lniv3} for process description).
 508
 509 Lines from 2 to 11 perform allocations for the first lengthening to evaluate.
 510 More precisely, $(i_1, j_1)$ represents the starting pixel of the current segment; $(i_2, j_2)$ is both its ending pixel and the starting pixel of the next segment; $d_1$ and $d_2$ are their directions, read from precomputed matrix $I_{\Theta}$.
 511 $C_x^1$ and $C_{x2}^1$ are the gray-level sums along the current poly-isoline; $C_x^2$ and $C_{x2}^2$ are the gray-level sums of the candidate segment.
 512 The current poly-isoline ends at $(i_1, j_1)$ and is made of $l_1$ pixels (already accepted segments); its standard deviation is $\sigma_1$.
 513 The loop extending from lines 12 to 21 performs the allocations needed to proceed one segment forward, as long as GLRT is true.
 514 If the lengthening has been accepted, the length of the poly-isoline is updated in line 13, and the same is done with $C_x$ and $C_{x2}$ which are read from precomputed matrix $I_{\Sigma}$ (see equations \eqref{cx} and \eqref{cx2} for definition).
 515 Finally, using direction value $d_2$, it translates the coordinates $(i_1, j_1)$ to the end of the newly elongated poly-isoline, and $(i_2, j_2)$ to the end of the next segment to be tested.
 516 As soon as the GLRT condition becomes false, line 23 eventually produces the output value of the denoised image at pixel $(i,j)$, that is, the average gray-level value along the poly-isoline.
 517
 518 \subsection{\label{sipdd}Hybrid PI-PD :  \texttt{kernel\_edge\_detector()} }
 519 As introduced in section \ref{pipd_plan}, the aim of the kernel named \texttt{kernel\_edge\_detector()} is to divide pixels into two classes according to their belonging to a LSR or not. Algorithm \ref{algoDetect} explains the detailled procedure. Lines 2 to 6 initialize values of the direction index ($\Theta$), the number of edges detected ($edgeCount$), the gray-level sum along the pixels that defines the H half-plane ($sumEdge$) and the number of pixels that defines both half-planes H and L ($nH$, $nL$). Then the loop starting at line 7 performs the GLRT for every considered direction index $\Theta$.
 520 Values $sumH$ and $sumL$ are vectors of two parameters $x$ and $y$, parameter $x$ being the sum of gray-level values and $y$ the sum of square gray-level values. Value $sumH$ is computed along the pixels of half-plane H and is obtained by loop at lines 10 to 14; Value $sumL$ is computed along the pixels of half-plane L and is obtained by loop at lines 15 to 19. Value $I_{ntex}(i,j)$ refers to the gray-level value at pixel (i,j) previously stored in texture memory. Eventually, the isoline level value is output at line 27, 30 or 33 depending on the situation (see \ref{pipd_plan} for details about the decision process).
 521
 522
 523 \begin{algorithm}[ht]
 524 \SetNlSty{textbf}{}{:}
 525 \caption{edge detector and pixel classifier \texttt{kernel\_edge\_detector()}}
 526 \label{algoDetect}
 527 \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{
 528   $\Theta \leftarrow 0$\tcc*[r]{direction index}
 529   $edgeCount \leftarrow 0$\;
 530   $sumEdge \leftarrow 0$\;
 531   $nH \leftarrow 5l+1$\;
 532   $nL \leftarrow 3l$\;
 533   \While{($\Theta < 32$) }{
 534     $sumH \leftarrow (I_{ntex}(i,j), I_{ntex}^2(i,j))$\;
 535     $sumL \leftarrow (0, 0)$\;
 536     \For{($\alpha=\Theta$ to $\alpha=\Theta+16$ by step $4$)}{
 537       $sPat \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I_{n tex}(i+y,j+x)$\;
 538       $sPat2 \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I^2_{n tex}(i+y,j+x)$\;
 539       $sumH \leftarrow sumH + (sPat, sPat2)$\;
 540     }
 541     \For{($\alpha=\Theta+20$ to $\alpha=\Theta+28$ by step $4$)}{
 542       $sPat \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I_{n tex}(i+y,j+x)$\;
 543       $sPat2 \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I^2_{n tex}(i+y,j+x)$\;
 544       $sumL \leftarrow sumL + (sPat, sPat2)$\;
 545     }
 546     \If{($GLRT(sumH, nH, sumL, nL) > T2_{max}$)}{
 547       $edgeCount \leftarrow edgeCount + 1$\;
 548       $sumEdge \leftarrow sumH.x$\;
 549     }
 550     $\Theta \leftarrow \Theta + 4$\;
 551     }
 552     \tcc{outputs isoline value}
 553     \If{($edgeCount == 0$)}{
 554       $\widehat{I}(i,j) \leftarrow \dfrac{(sumH.x + sumL.x)}{nH+nL}$ \tcc*[r]{LSR}
 555     }
 556     \If{($edgeCount == 1$)}{
 557       $\widehat{I}(i,j) \leftarrow \dfrac{(sumEdge)}{nH}$
 558     }
 559     \If{($edgeCount > 1$)}{
 560       $\widehat{I}(i,j) \leftarrow \widehat{I_{PIPD}}(i,j)$\tcc*[r]{PI-PD}
 561     }
 562   }
 563 \end{algorithm}
 564
 565 \section{\label{results}Results}
 566  The proposed hybrid PI-PD model has been evaluated with the 512x512 pixel sample images used by \citep{denoiselab} in order to make relevant comparisons with other filtering techniques. As we aim to address image processing in very noisy conditions (as in \citep{6036776}), we focused on the noisiest versions, degraded by AWGN of standard deviation $\sigma=25$.
 567
 568 Quality measurements of the denoised images in comparison with reference images have been obtained by the
 569 evaluation of:
 570 \begin{enumerate}
 571 \item Peak Signal to Noise Ratio (PSNR) that quantifies the mean square error between denoised and reference images:
 572  $MSE(I,\widehat{I})$. We used the following expression: $$PSNR =10.log_{10}\left(\frac{max(\widehat{I})}{MSE(I,\widehat{I})}\right)$$
 573  PSNR values are given in dB and highest values mean best PSNR.
 574 \item The Mean Structure Similarity Index (MSSIM, defined in \citep{Wang04imagequality}), which quantifies local similarities between denoised and reference images inside a sliding window. MSSIM values belong to an interval $[0; 1]$; the closer to 1 the better.
 575 \end{enumerate}
 576 PSNR is widely used to measure image quality but can be misleading when used by itself: as demonstrated in \citep{Wang04imagequality}, the processing of noisy images can bring a high PSNR value but very bad visual quality. This is avoided by the use of the MSSIM index along with the PSNR value: when both of them show high values, the overall quality can be considered high.
 577
 578 Result figure \ref{tablePI} provides the PSNR and MSSIM of every image, denoised with three different filters: average 5x5, hybrid PI-PD and BM3D. The \emph{noisy} column shows the values for each image before denoising. BM3D (\citep{Dabov09bm3dimage}) is taken as a reference in terms of denoising quality, while the average filter is taken as a reference in terms of processing time. The window size of 5x5 pixels has been choosen to achieve PSNR values similar to those obtained by PI-PD.
 579
 580 BM3D code is run on a quad-core Xeon E31245 at 3.3GHz and 8GByte RAM under linux kernel 3.2 (64bits), while PI-PD as well as average filter codes is run on a Nvidia C2070 GPU hosted by a PC running linux kernel 2.6.18 (64bits).
 581 The average filter used is an efficient parallel GPU implementation that we developped. It is a generic and versatile separable convolution kernel that outputs more than 700MPixels per second in the 5x5 averaging configuration.
 582
 583 Hybrid PI-PD measurements were performed with $n=25$, $l=5$, $T_{max}=1$ and $T2_{max}=2$. BM3D measurements have been performed with the freely available BM3D software proposed in \citep{Dabov09bm3dimage}.
 584
 585 The hybrid PI-PD model proves much faster than BM3D and better than the average 5x5 filter.
 586 Processing the thirteen images of the database reveals that hybrid PI-PD brings an average improvement of 1.5dB (PSNR) and 7.2\% (MSSIM) against the average filter at the cost of 35 times its computational duration. Against hybrid PI-PD, BM3D achieves an average improvement of 2.4dB and 4.6\%  at the cost of 350 times as much duration.
 587 Actually, the 5x5 average filter takes around \textbf{0.35~ms} to process an image while hybrid PI-PD needs around \textbf{11~ms} and BM3D around \textbf{4.3~s}.
 588
 589 It must be noticed that experimental optimization show that the vector of parameter values $T_{max}=1$ and $T2_{max}=2$ is optimal for 11 of the 13 images of the database. Better results are obtained with a slightly different value of $T2_{max}$ for \textit{peppers} or \textit{zelda} whose denoised images can obtain a MSSIM index of 0.90.
 590 Most of the computational time of hybrid PI-PD is spent by the edge detector, which clearly does not fit GPU requirements to achieve good performance. For information, the simple PI-PD model runs in less than 4~ms in the same conditions.
 591
 592 Figure \ref{comparimg} shows denoised images produced by hybrid PI-PD model compared with the output of the BM3D and the average 5x5 filters. The figure illustrates the merits and drawbacks of each model: edges are well preserved by hybrid PI-PD, but a \textit{staircase} effect is visible, a well-known artefact inherent to this type of neighborhood filters. Our recent GPU-implementation of the regression method proposed in \citep{BuadesCM06} brings a mean improvement of 1dB at the cost of 0.4~ms.
 593
 594
 595
 596 \begin{figure}
 597 %\begin{table}[h]
 598 \footnotesize
 599  \begin{center}
 600 \begin{tabular}{|c|r|r|r|r|}\hline
 601 \bf Image&\bf Noisy  &\bf average &\bf hybrid&\bf BM3D \\
 602          &           &\bf 5x5     &\bf PI-PD &         \\\hline\hline
 603 airplane &   19.49dB &   26.39dB  &  28.46dB& 30.88dB  \\
 604          &    0.58   &    0.84    &  0.88  &   0.93    \\\hline
 605 barbara  &   20.04dB &   22.76dB  & 24.26dB&  30.60dB  \\
 606          &    0.70   &    0.76    &  0.83  &   0.94    \\\hline
 607 boat     &   20.33dB &   25.58dB  & 27.54dB&  30.02dB  \\
 608          &    0.66   &    0.81    &  0.87  &   0.91    \\\hline
 609 couple   &   20.28dB &   25.25dB  & 27.33dB&  29.77dB  \\
 610          &    0.69   &    0.79    &  0.87  &   0.91    \\\hline
 611 elaine   &   19.85dB &   28.71dB  & 28.94dB&  30.60dB  \\
 612          &    0.59   &    0.86    &  0.87  &   0.91    \\\hline
 613 fingerprint &20.34dB &   23.33dB  & 26.07dB&  27.93dB  \\
 614          &     0.93  &    0.87    &  0.95  &   0.96    \\\hline
 615 goldhill &    19.59dB&   26.47dB  & 27.43dB&  29.22dB  \\
 616          &     0.67  &    0.82    &  0.87  &   0.88    \\\hline
 617 lena     &    19.92dB&   27.99dB  & 29.14dB&  31.80dB  \\
 618          &     0.60  &    0.84    &  0.88  &   0.93    \\\hline
 619 man      &    20.38dB&   24.74dB  & 26.74dB&  28.14dB  \\
 620          &     0.71  &    0.80    &  0.86  &   0.87    \\\hline
 621 mandrill &    19.34dB&   20.34dB  & 22.38dB&  24.75dB  \\
 622          &     0.77  &    0.69    &  0.83  &   0.88    \\\hline
 623 peppers  &    19.53dB&   27.30dB  & 28.68dB&  30.87dB  \\
 624          &     0.61  &    0.86    &  0.87  &   0.92    \\\hline
 625 stream   &    20.35dB&   23.23dB  & 25.35dB&  26.34dB  \\
 626          &     0.80  &    0.78    &  0.87  &   0.88    \\\hline
 627 zelda    &    17.71dB&   23.13dB  & 27.71dB&  30.49dB  \\
 628          &     0.58  &    0.87    &  0.88  &   0.93    \\\hline
 629  \end{tabular}
 630  \end{center}
 631 \caption{Comparison between hybrid PI-PD, average and BM3D filters. PI-PD parameter values: $n=25$, $l=5$, $T_{max}=1$ and $T2_{max}=2$. The \emph{noisy} column correspond to the noisy input images, before denoising. \newline Timings: average filter in around 0.35~ms hybrid PI-PD in around 11.0~ms and BM3D in around 4.3~s}
 632 \label{tablePI}
 633 %\end{table}
 634 \end{figure}
 635
 636
 637 \begin{figure}[h]
 638 \centering
 639 \subfloat[Noisy image $\sigma=25$]{\label{fig:noisy} \includegraphics{./airplane_25_noisy_zoom.jpg}}\qquad
 640 \subfloat[Average 5x5 filter, in $0.35~ms$]{\label{fig:pipd} \includegraphics{./airplane_25_mean5_zoom.jpg}}\\
 641 \subfloat[PI-PD hybrid filter, $n=25$, $l=5$, $T_{max}=1$, $T2_{max}=2$, in $11~ms$ ]{\label{fig:hpipd} \includegraphics{./airplane_zoom_hybrid_6_r50_T10_P2.jpg}}\qquad
 642 \subfloat[BM3D filter, in $4.3s$]{\label{fig:bm3d} \includegraphics{./airplane_bm3d_zoom.jpg}}
 643 \caption{Comparison of 512x512 images denoised from noisy airplane image (\ref{fig:noisy}) with a PI-PD filter (\ref{fig:pipd}), PI-PD hybrid filter (\ref{fig:hpipd}) and BM3D filter (\ref{fig:bm3d}). Only zoomed parts of images are shown in order to ensure better viewing.}
 644 \label{comparimg}
 645 \end{figure}
 646
 647
 648 \section{\label{conclusion}Conclusion, future work}
 649 From the start, our approach, unlike quite a few others, has been to base this study on the conception and characteristics of the targeted hardware (Nvidia Graphic cards).
 650
 651 So as to get high execution speeds, we chose, for example, to find a method that remains local (concentrating on the immediate neighborhood of the center pixel), but still provides very significant benefits, using our technique of progressive lengthening.
 652
 653 Nevertheless, our method has proved slightly sub-optimal and lacking robustness in \textit{flat} regions (see above, Low Slope Regions), even if the actual visual effect may be considered quite satisfactory.
 654
 655 As a first step to address the above drawbacks, we have devised a hybrid method that detects and applies distinct processing to LSR regions (see above). Processing speeds remain fast, and much higher than the BM3D implementation taken as quality reference. This is very promising, and opens the perspective of real-time high definition  image sequence processing at 25 fps, provided we improve the edge detector, which currently limits the HD frame rate at 16fps (High Definition: 1920x1080 pixels).
 656
 657 To further improve the quality of output images, we also implemented a efficient parallel implementation of the staircase effect reduction technique presented in \citep{BuadesCM06}. With this method, searching for best improvement factors leads to different parameters values for each image processed, which prompts to studying some way of overriding such parameters.
 658
 659 Our study so far has been based on additive noise; we are currently working on transposing criteria to various multiplicative noise types. We also extended the process to color images with very interesting visual results to be confirmed by the experimental measurement currently in progress.
 660
 661 % references section
 662 \bibliographystyle{spbasic}
 663
 664 %\bibliography{bibliosv}
 665
 666 \begin{thebibliography}{17}
 667 \providecommand{\natexlab}[1]{#1}
 668 \providecommand{\url}[1]{{#1}}
 669 \providecommand{\urlprefix}{URL }
 670 \expandafter\ifx\csname urlstyle\endcsname\relax
 671   \providecommand{\doi}[1]{DOI~\discretionary{}{}{}#1}\else
 672   \providecommand{\doi}{DOI~\discretionary{}{}{}\begingroup
 673   \urlstyle{rm}\Url}\fi
 674 \providecommand{\eprint}[2][]{\url{#2}}
 675
 676 \bibitem[{den(2007)}]{denoiselab}
 677  (2007) DenoiseLab Philosophy: A Standard Test Set and Evaluation Method to
 678   Compare Denoising Algorithms
 679
 680 \bibitem[{CUD(2010)}]{CUDAPG}
 681  (2010) NVIDIA CUDA C Programming Guide v3.1.1. NVIDIA Corporation
 682
 683 \bibitem[{Bertaux et~al(2004)Bertaux, Frauel, R{\'e}fr{\'e}gier, and
 684   Javidi}]{bertaux:04}
 685 Bertaux N, Frauel Y, R{\'e}fr{\'e}gier P, Javidi B (2004) Speckle removal using
 686   a maximum-likelihood technique with isoline gray-level regularization. J Opt
 687   Soc Am A 21(12):2283--2291, \doi{10.1364/JOSAA.21.002283},
 688   \urlprefix\url{http://josaa.osa.org/abstract.cfm?URI=josaa-21-12-2283}
 689
 690 \bibitem[{Buades et~al(2005)Buades, Coll, and Morel}]{denoisereview}
 691 Buades A, Coll B, Morel J (2005) A review of image denoising algorithms, with a
 692   new one. Multiscale Modeling and Simulation 4(2):490--530
 693
 694 \bibitem[{Buades et~al(2006)Buades, Coll, and Morel}]{BuadesCM06}
 695 Buades A, Coll B, Morel JM (2006) The staircasing effect in neighborhood
 696   filters and its solution. IEEE Transactions on Image Processing
 697   15(6):1499--1505
 698
 699 \bibitem[{Caselles et~al(1997)Caselles, Coll, and Morel}]{caselles97}
 700 Caselles V, Coll B, Morel JM (1997) Scale space versus topographic map for
 701   natural images. Springer, pp 29--49
 702
 703 \bibitem[{Chen et~al(2009)Chen, Beister, Kyriakou, and Kachelries}]{chen09}
 704 Chen W, Beister M, Kyriakou Y, Kachelries M (2009) High performance median
 705   filtering using commodity graphics hardware. In: Nuclear Science Symposium
 706   Conference Record (NSS/MIC), 2009 IEEE, pp 4142--4147,
 707   \doi{10.1109/NSSMIC.2009.5402323}
 708
 709 \bibitem[{Coll et~al(2011)Coll, Morel, and Buades}]{ipol.2011.bcm_nlm}
 710 Coll B, Morel JM, Buades A (2011) Non-local means denoising. Image Processing
 711   On Line
 712
 713 \bibitem[{Dabov et~al(2009)Dabov, Foi, Katkovnik, and
 714   Egiazarian}]{Dabov09bm3dimage}
 715 Dabov K, Foi R, Katkovnik V, Egiazarian K (2009) Bm3d image denoising with
 716   shape-adaptive principal component analysis. In: Proc. Workshop on Signal
 717   Processing with Adaptive Sparse Structured Representations (SPARS{\rq}09
 718
 719 \bibitem[{Matheron(1975)}]{matheron75}
 720 Matheron G (1975) Random sets and integral geometry. Wiley
 721
 722 \bibitem[{Mc{G}uire(2008)}]{mcguire2008median}
 723 Mc{G}uire M (2008) A fast, small-radius gpu median filter. In: ShaderX6,
 724   \urlprefix\url{http://graphics.cs.williams.edu/papers/\\ MedianShaderX6}
 725
 726 \bibitem[{Monasse and Guichard(1999)}]{springerlink:10.1007/3-540-48236-9_16}
 727 Monasse P, Guichard F (1999) Scale-space from a level lines tree. In: Nielsen
 728   M, Johansen P, Olsen O, Weickert J (eds) Scale-Space Theories in Computer
 729   Vision, Lecture Notes in Computer Science, vol 1682, Springer Berlin /
 730   Heidelberg, pp 175--186,
 731   \urlprefix\url{http://dx.doi.org/10.1007/3-540-48236-9\_16},
 732   10.1007/3-540-48236-9\_16
 733
 734 \bibitem[{Palhano Xavier De~Fontes et~al(2010)Palhano Xavier De~Fontes,
 735   Andrade~Barroso, Coup{\'e}, and Hellier}]{PALHANOXAVIERDEFONTES}
 736 Palhano Xavier De~Fontes F, Andrade~Barroso G, Coup{\'e} P, Hellier P (2010)
 737   {Real time ultrasound image denoising}. Journal of Real-Time Image Processing
 738   \doi{10.1007/s11554-010-0158-5},
 739   \urlprefix\url{http://hal.inria.fr/inria-00476122}
 740
 741 \bibitem[{Perrot et~al(2011)Perrot, Domas, Couturier, and Bertaux}]{6036776}
 742 Perrot G, Domas S, Couturier R, Bertaux N (2011) Gpu implementation of a region
 743   based algorithm for large images segmentation. In: Computer and Information
 744   Technology (CIT), 2011 IEEE 11th International Conference on, pp 291 --298,
 745   \doi{10.1109/CIT.2011.60}
 746
 747 \bibitem[{Sanchez and Rodriguez(2012)}]{sanchezICASSP12}
 748 Sanchez RM, Rodriguez PA (2012) Bidimensional median filter for parallel
 749   computing architectures. In: Acoustics, Speech and Signal Processing
 750   (ICASSP), 2012 IEEE International Conference on, pp 1549--1552,
 751   \doi{10.1109/ICASSP.2012.6288187}
 752
 753 \bibitem[{Wang et~al(2004)Wang, Bovik, Sheikh, Member, Simoncelli, and
 754   Member}]{Wang04imagequality}
 755 Wang Z, Bovik AC, Sheikh HR, Member S, Simoncelli EP, Member S (2004) Image
 756   quality assessment: From error visibility to structural similarity. IEEE
 757   Transactions on Image Processing 13:600--612
 758
 759 \bibitem[{Yang et~al(2009)Yang, Tan, and Ahuja}]{YangTA09}
 760 Yang Q, Tan KH, Ahuja N (2009) Real-time o(1) bilateral filtering. In: CVPR,
 761   IEEE, pp 557--564,
 762   \urlprefix\url{http://doi.ieeecomputersociety.org/10.1109/\\
 763   CVPRW.2009.5206542}
 764
 765 \end{thebibliography}
 766
 767
 768 % that's all folks
 769 \end{document}
 770
 771 %doi = {10.5201/ipol.2011.bcm_nlm},