paper_lniv_gpu/paper_lniv_gpu_springer.tex~

   1
   2 \documentclass[twocolumn, final, natbib]{svjour2}
   3
   4 \usepackage{cite}
   5
   6 \usepackage{transparent}
   7 %\usepackage[pdftex]{graphicx,color}
   8 %   \graphicspath{{imgfs/}}
   9 %   \DeclareGraphicsExtensions{.pdf,.jpeg,.png}
  10
  11   %\usepackage[dvips]{graphicx}
  12   \usepackage{graphicx}
  13   %\graphicspath{{imgfs/}}
  14   %\DeclareGraphicsExtensions{.png,.jpg,.ps}
  15
  16 % *** MATH PACKAGES ***
  17 %
  18 \usepackage[cmex10]{amsmath}
  19
  20 % *** SPECIALIZED LIST PACKAGES ***
  21 %
  22 \usepackage[ruled,lined,linesnumbered]{algorithm2e}
  23
  24 % *** ALIGNMENT PACKAGES ***
  25 %
  26 \usepackage{array}
  27 \usepackage{mdwmath}
  28 \usepackage{mdwtab}
  29
  30 % *** SUBFIGURE PACKAGES ***
  31 \usepackage[caption=false,font=footnotesize]{subfig}
  32
  33 % *** FLOAT PACKAGES ***
  34 %
  35 \usepackage{fixltx2e}
  36
  37 \journalname{Real Time Image Processing}
  38
  39 \begin{document}
  40 %
  41 % paper title
  42 % can use linebreaks \\ within to get better formatting as desired
  43 \title{Fast GPU-based denoising filter using isoline levels}
  44
  45
  46 % author names and affiliations
  47 % use a multiple column layout for up to two different
  48 % affiliations
  49
  50 \author{
  51 Gilles Perrot$^1$ \and
  52 St\'{e}phane Domas$^1$ \and
  53 Rapha\"{e}l Couturier$^1$ \and
  54 Nicolas Bertaux$^2$}
  55
  56 \institute{
  57 $^1$ \at FEMTO-ST institute\\
  58 Rue Engel Gros, 90000 Belfort, France.\\
  59 forename.name@univ-fcomte.fr
  60 \and $^2$ \at Institut Fresnel, CNRS, Aix-Marseille Universit\'e, Ecole Centrale Marseille,\\
  61 Campus de Saint-J\'er\^ome, 13013 Marseille, France.\\
  62 nicolas.bertaux@ec-marseille.fr
  63 }
  64
  65 \date{Received: date / Revised: date}
  66
  67 % make the title area
  68 \maketitle
  69
  70 \begin{abstract}
  71 In this study, we propose to address the issue of image denoising by means of a GPU-based filter, able to achieve high-speed processing by taking advantage of the parallel computation capabilities of modern GPUs. Our approach is based on the level sets theory first introduced by \citet{matheron75} in 1975 but little implemented because of its high computation costs. What we actually do is try to guess the best isoline shapes inside the noisy image. At first, our method involved the polyline modelling of isolines; then we found an optimization heuristics which very closely fits the capabilities of GPUs. So far, though our proposed hybrid PI-PD filter has not achieved the best denoising levels, it is nonetheless able to process a 512x512 image in about 11~ms.
  72 \end{abstract}
  73
  74 \section{Introduction}
  75 Denoising has been a much studied research issue since electronic transmission was first used. The wide range of applications that involve denoising makes it uneasy to propose a universal filtering method. Among them, digital image processing is a major field of interest as the number of digital devices able to take pictures or make movies is growing fast and shooting is rarely done in optimal conditions. Moreover, the increase in pixel density of the CCD or CMOS sensors used to measure light intensity leads to higher noise effects and imposes high output flow rates to the various processing algorithms.
  76
  77 In addition, it is difficult to quantify the quality of an image processing algorithm, as visual perception is subject to high variation from one human to another.
  78 So far, the advent of GPUs has brought high speedups to a lot of algorithms, and many researchers and developpers have successfully adressed the issue of implementing existing algorithms on such devices. For example in \citet{mcguire2008median},\citet{chen09} and \citet{sanchezICASSP12}, authors managed to design quite fast median filters. Bilateral filtering has also been successfully proposed in \citet{YangTA09}.
  79 Still, most high quality algorithms, like NL-means \citet{ipol.2011.bcm_nlm} or BM3D \citet{Dabov09bm3dimage} make use of non-local similarities and/or frequency domain transforms. However, speedups achieved by their current GPU implementations, though quite sigificant (as shown for example with NL-means in \citet{PALHANOXAVIERDEFONTES}), do not come near those achieved by local methods such as gaussian, median or neighborhood filters, as they have not originally been designed against GPU architecture.
  80 In order to fully benefit from the capabilities of GPUs, it is important that the approach to designing algorithms be more hardware-oriented, keeping in mind, from the very beginning, the intrinsic constraints of the device which is actually going to run those algorithms. Consequently, this often results in unusual options and even apparently sub-optimal solutions, but the considerable speed benefits obtained would possibly make it at least a good compromise or even the only current way to real-time high-definition image processing.
  81
  82
  83
  84 \section{\label{contrib}Contribution}
  85 As early as 1975 \citet{matheron75}, it was found that, under the conditions mentioned in section \ref{isolines}, an image can be decomposed into a set of level lines. Accordingly, real-life images fulfill the above conditions and since then, with the increase of computing capabilities, researchers have succeded in implementing such level-lines based algorithms as in \citet{caselles97} and \citet{springerlink:10.1007/3-540-48236-9_16}.
  86 A few years ago, in \citet{bertaux:04}, authors  proposed an original method which significantly reduces speckle noise inside coherent images, using the level lines in the image to constrain the minimization process.
  87 Those level lines are actually \textit{iso-gray-level} lines, which are called \textit{isolines}. In \citet{bertaux:04}, isolines consist in neighborhoods of polyline shapes determined by maximum likelihood optimization. This method proved not only to bring good enhancement but also to preserve edges between regions. Nevertheless, the costs in computation time, though not prohibitive, did not allow real-time image processing; as an example, the authors of \citet{bertaux:04} managed to process an almost 2Mpixel image within a minute on an old PIII-1GHz.
  88
  89 Our work started by designing a set of GPU implementations with various optimization heuristics, in order to find out which tracks could be followed  towards minimizing loss in quality and preserve admissible execution times. Those algorithms have been tested with reference images taken from \citet{denoiselab} for which various processing results have been published. Some of the more interesting ones are listed and compared in  \citet{denoisereview}.
  90 Statistical observations (to be detailed below) made on the output images produced by the method proposed in \citet{bertaux:04}, led us to propose a very fast and simple parallel denoising method which gives good results in terms of average gray-level error, but also avoids the blurring of edges.
  91
  92 On the basis of the BM3D timings listed in \citet{Dabov09bm3dimage} and with our own measurements, our proposed GPU-based filter runs around 350 times faster and thus is able to process high definition images at over 16fps. It also achieves good denoising quality.
  93
  94 \section{Plan}
  95 In the following, section \ref{GPUgeneralites} briefly focuses on recent Nvidia GPU characteristics. Section \ref{isolines} will introduce the theory and notations used to define isolines. Then, in section \ref{lniv0}, we will describe the two isoline based models that led to the final hybrid model, while section \ref{lniv} details the parallel implementation of the proposed algorithm. Finally, we present our results in section \ref{results} before drawing our conclusions and outlining our future works in section \ref{conclusion}.
  96
  97
  98 \section{\label{GPUgeneralites}NVidia's GPU architecture}
  99 GPUs are multi-core, multi-threaded processors, optimized for highly parallel computation. Their design focuses on a SIMT model (Single Instruction Multiple Threads) that devotes more transistors to data processing rather than data-caching and flow control (see \citet{CUDAPG} for more details).
 100 For example, a C2070 card features 6~GBytes global memory and a total of 448 cores bundled in several Streaming Multiprocessors (SM). An amount of shared memory, much faster than global memory, is avalaible on each SM (up to 48~KB for a C20xx card)
 101
 102 Writing efficient code for such architectures is not obvious, as re-serialization must be avoided as much as possible. Thus, code design requires one pays attention to a number of points, among which:
 103 \begin{itemize}
 104  \item the CUDA model organizes threads by a) thread blocks in which synchronization is possible, b) a grid of blocks with no possible synchronization between them.
 105   \item there is no way to know how blocks are scheduled during one single kernel execution.
 106   \item data must be kept in GPU memory, to reduce the overhead generated by copying between CPU and GPU.
 107   \item the total amount of threads running the same computation must be as large as possible.
 108   \item the number of execution branches inside one block should be as small as possible.
 109   \item global memory accesses should be coalescent, \emph{i.e.} memory accesses done by physically parallel threads (2 x 16 at a time) must be consecutive and contained in a 128 Bytes range.
 110   \item shared memory is organized in 32x32 bit-wide banks. To avoid bank conflicts, each parallel thread (2 x 16 at a time) must access a different bank.
 111 \end{itemize}
 112
 113 All the above characteristics always make designing efficient GPU code all the more constraining as non-suited code would probably run even slower on GPU than on CPU.
 114
 115 \section{\label{isolines}Isolines}
 116 In the following, let $I$ be the reference noiseless image (assuming we have one), $I'$ the noisy acquired image corrupted by Independent and Identically Distributed (IID) additive white gaussian noise of zero mean value and standard deviation $\sigma$. Let $\widehat{I}$ be the denoised image. Each pixel of $I'$ of coordinates $(i,j)$ has its own gray level $z(i,j)$.
 117
 118 As introduced above and since most common images are continuous and contain few edges, they can be decomposed into a set of constant gray level lines called \textit{isolines}. Then our goal is to find, for each single pixel of a noisy image, the isoline it belongs to.
 119 The generalized likelihood criterion (GL) is used to select the best isoline among all the considered ones, all of which must have the same number of pixels in order to be compared.
 120
 121 \subsection{Fixed-length isolines}
 122 For each pixel $(i,j)$ of the corrupted image, we look for the gray level of the isoline it belongs to, inside a rectangular window $\omega$ centered on $(i,j)$. Inside $\omega$, let $S^n$ be the isoline part which the center pixel belongs to. $S^n$ is a set of $n$ pixel positions $(i_q,j_q)$ ($q \in [0;n[$).\\
 123 The gray levels $z$ along $S^n$ follow a gaussian probability density function whose parameters $\mu_{S^n}$ (mean value of isoline part) and $\sigma$ (standard deviation brought by gaussian noise ) are unknown.\\
 124 Let $\overline{S^n}$ be defined by $\omega = S^n \cup \overline{S^n}$.\\
 125 For each pixel, the mean values $\mu_{ij}$ of gray levels $z$ over $\overline{S^n}$ are unknown and supposed independant .\\
 126 Let $Z$ be the gray levels of pixels in $\omega$ and $\left\{\mu_{ij}\right\}_{\overline{S^n}}$ the mean values of pixels in $\overline{S^n}$.
 127 The likelihood is given by:
 128 $$
 129 \displaystyle P \left[Z | S^n, \mu_{S^n}, \left\{\mu_{ij}\right\}_{\overline{S^n}}, \sigma \right]
 130 $$
 131 When separating contributions from regions $S^n$ and $\overline{S^n}$, it becomes:
 132 \begin{eqnarray}
 133 \displaystyle \prod_{(i,j)\in S^n}{P\left[z(i,j) | \mu_{S^n}, \sigma \right]} . \displaystyle\prod_{(i,j)\in \overline{S^n}}{P\left[z(i,j) | \left\{\mu_{ij}\right\}_{\overline{S^n}}, \sigma \right]}
 134 \label{LL2}
 135 \end{eqnarray}
 136 The goal is then to estimate the value of the above expression, in order to find the boundaries of $S^n$ that maximize expression \eqref{LL2}.\\
 137 Let us consider that, on $\overline{S^n}$, the values $z(i,j)$ are the likelihood estimations $\widehat{\mu_{ij}}$ for $\mu_{ij}$. The second term of expression \eqref{LL2} becomes:
 138 \begin{eqnarray}
 139 \displaystyle\prod_{(i,j)\in \overline{S^n}}{P\left[z(i,j) | \left\{\widehat{\mu_{ij}}\right\}_{\overline{S^n}}, \sigma \right]=1}
 140 \end{eqnarray}
 141 which leads to the generalized likelihood expression:
 142 \begin{eqnarray}
 143 \displaystyle \prod_{(i,j)\in S^n}{P\left[z(i,j) | \mu_{S^n}, \sigma \right]}
 144 \label{GL}
 145 \end{eqnarray}
 146 As we know the probability density function on $S^n$, \eqref{GL} can then be developped as
 147 \begin{eqnarray}
 148 \displaystyle \prod_{(i,j)\in S^n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{\left(z(i,j)-\mu_{S^n}\right)^2}{2\sigma^2}}
 149 \label{GL2}
 150 \end{eqnarray}
 151
 152 The log-likelihood is then given by:
 153 \begin{eqnarray}
 154 \displaystyle -\frac{n}{2}log\left(2\pi\right) - \frac{n}{2}log\left(\sigma^2\right) - \frac{n}{2}
 155 \label{LL1}
 156 \end{eqnarray}
 157 inside which the vector of parameters $(\mu_{S^n}, \sigma)$ is determined by maximum likelihood estimation
 158 $$
 159 \left(
 160 \begin{array}{l}
 161 \widehat{\mu_{S^n}} = \displaystyle\frac{1}{n} \sum_{(i,j)\in S^n} z(i,j) \\
 162 \widehat{\sigma^2} = \displaystyle\frac{1}{n} \sum_{(i,j)\in S^n} \left(z(i,j) - \widehat{\mu_{S^n}}\right)^2 \\
 163 \end{array}
 164 \right.
 165 $$
 166 The selection of the best isoline is done by searching which one maximizes the expression of equation \eqref{LL1}.
 167
 168 \begin{figure}[h]
 169 \includegraphics[width=\linewidth]{imgfs/isolines_1.jpg}
 170 \caption{Determination and lengthening of an isoline: The gray level $z$ of each pixel is seen as an elevation value. $S^n$ is the $n$ pixel length isoline for pixel of coordinates $(i, j)$. The elongation of $S^n$ by $S^p$ ($p$ pixel length) is submitted to the GLRT condition (see eq. \eqref{GLRT}).}
 171 \label{si3}
 172 \end{figure}
 173
 174
 175 \subsection{Lengthenable  isolines}
 176 Searching for larger isolines should lead to better filtering as a larger number of pixels would be involved.
 177 However, processing all possible isolines starting from each pixel would be too costly in computing time, even in the case of a small GPU-processed 512x512 pixel image. Therefore, we chose to build large isolines inside an iterative process including a mandatory validation stage between each lengthening iteration, so as to reduce the number of pixel combinations to be examined and keep the estimation of deviation $\sigma$ within a satisfactory range of values.
 178
 179 Let $S^n$ be a previously selected isoline part and $S^p$ connected to $S^n$ in such a way that $S^p$ could be seen as an addition to $S^n$ so as to define a possible valid isoline $S^{n+p}$.
 180 Figure \ref{si3} illustrates this situation with a very simple example image. In this figure, the gray level of each pixel is used as its corresponding height ($z$) in order to visualize isolines easily. Some of the orthogonal isoline projections have been drawn in dotted line in the $(\vec{i},\vec{j})$ plane. Both labeled parts $S^p$ and $S^n$ are represented in the $(\vec{i},\vec{j})$ plane and in the 3D associated plot.
 181
 182 In order to decide whether $S^{n+p}$ can be considered as an actual isoline, we compare the log-likelihood of both hypothesis below by using GLRT (Generalized Likelihood Ratio Test):
 183
 184 First, assuming that $S^{n+p}$ is an isoline, the gray levels of its pixels share the same mean value $\mu_{n+p}$. According to \eqref{LL1}, its log-likelihood is
 185 \begin{eqnarray}
 186 \displaystyle -\frac{(n+p)}{2}\left(log\left(2\pi\right)+1\right) - \frac{(n+p)}{2}log\left(\widehat{\sigma_1}^2\right)
 187 \label{LLNP}
 188 \end{eqnarray}
 189 where $\widehat{\sigma_1}$ is the estimation of the standard deviation along $S^n$.
 190
 191 Second, considering $S^n$ and $S^p$ as two separate isoline parts connected together, the gray levels of their pixels have two different mean values $\mu_n$ and $\mu_p$. The log-likelihood is the sum of both log-likelihoods, given by
 192 \begin{eqnarray}
 193 \displaystyle -\frac{(n+p)}{2}\left(log\left(2\pi\right)+1\right) - \frac{n}{2}log\left(\widehat{\sigma_2}^2\right) - \frac{p}{2}log\left(\widehat{\sigma_2}^2\right)
 194 \label{LLNP2}
 195 \end{eqnarray}
 196 where $\widehat{\sigma_2}$ is the estimation of the standard deviation along $S^n$ and $S^p$.
 197
 198 The difference between \eqref{LLNP} and \eqref{LLNP2} leads to the expression of $GLRT(S^{n+p},S^n, S^p, T_{max})$:
 199 \begin{eqnarray}
 200 T_{max}- (n+p).\left[log\left(\widehat{\sigma_1}^2\right) - log\left(\widehat{\sigma_2}^2\right) \right]
 201 \label{GLRT}
 202 \end{eqnarray}
 203
 204 The decision to validate lengthening from $S^n$ to $S^{n+p}$ depends whether $GLRT(S^{n+p},S^n, S^p, T_{max})$ is higher or lower than $0$. Value $T_{max}$ is the GLRT threshold.
 205
 206 \section{\label{lniv0}Isoline models}
 207 The most obvious model considers isolines as polylines. Each isoline can then be curved by allowing a direction change at the end of each segment; we shall call such isolines \textit{poly-isolines}.
 208
 209 In order to keep the number of candidate isolines within reasonable range, we chose to build them by combinating segments described by simple pre-computed patterns. Each pattern $p_{l,d}$ describes a segment of length $l$ and direction $d$. For one given $l$ value, all $p_{l,d}$ patterns are grouped into a matrix denoted $P_l$. Figure \ref{p5q1} shows an example of such a pattern matrix for $l=5$.
 210
 211 To fit the GPU-specific architecture, we define regularly distributed $D$ primary directions ($D=32$ in our examples).
 212
 213
 214 \subsection{\label{lniv2}Poly-isolines with limited deviation angle (PI-LD)}
 215 At one stage we implemented an algorithm parsing the tree of all possible polyline configurations, but the process proved far too slow regarding our goal, even on GPU, because of the amount of memory  involved (and consequent memory accesses) and because of the necessary reduction stage for which GPU efficiency is not maximum.
 216 So we focused on a variant inspired by \citet{bertaux:04} in which the selected direction of the next segment depends on the whole of the previously built and validated poly-isoline.
 217
 218 Let us consider a poly-isoline $S^n$ under construction, starting from pixel $(i,j)$ and made of $K$ validated segments $s_k~(k \in [1;K])$ of length $l$, each of them having its own direction $d_k$. The coordinates of the ending pixel of each segment $s_k$ are denoted $(i_k, j_k)$.
 219 Both of the following sums
 220 \begin{eqnarray}
 221 C_x\left(Z(S^n)\right) &=& \displaystyle\sum_{(i,j)\in S^n}z(i,j)\label{cx}\\
 222 and~~C_{x^2}\left(Z(S^n)\right)&=& \displaystyle\sum_{(i,j)\in S^n}z(i,j)^2\label{cx2}
 223 \end{eqnarray}
 224 have been obtained during the previous lengthening steps.
 225
 226 Let us examine now how to decide wether to add a new segment to $S^n$ or to stop the lenghtening process.
 227 The main idea is to apply each pattern $p_{l,d}$ to the ending pixel $(i_k,j_k)$, on the condition that its direction is contained within the limits of maximum deviation $\Delta d_{max}$.
 228 Maximum deviation $\Delta d_{max}$ prevents poly-isolines from beeing of circular shape (or backward-oriented) which would possibly generate supplementary artefacts in the output image. Another of its benefits is to reduce the number of combinations to be evaluated.
 229
 230 For each allowed pattern, GLRT is performed in order to decide if the corresponding segment could likely be added to the end of the poly-isoline $S^n$.
 231 If none is validated by GLRT, the poly-isoline $S^n$ is stopped.
 232
 233 If at least one segment has been accepted by GLRT, the one that leads to the maximum likelihood (ML) value of the lengthened poly-isoline $S^{n+l}$ is selected and integrated to $S^{n+l}$ as $s_{K+1}$.
 234
 235 In order to avoid critical situations where the first selected segment would not share the primary direction of the actual poly-isoline,
 236 no selection is performed on the level of the first segment; $D$ poly-isolines are kept and submitted to the lengthening process. To ensure isotropy, each of them shares the direction of one pattern $p_{l,d}~ (d \in[0;D])$.
 237
 238 Eventually, the poly-isoline with the maximum likelihood value is selected among the longest ones.
 239
 240 Figure \ref{pild} illustrates one stage of the lengthening process with the example of a two-segment poly-isoline  at the beginning of stage ($l=5$ and $\Delta d_{max}=2$).
 241
 242
 243 \begin{figure}[h]
 244  \centering
 245 \subfloat[Isoline with two validated segments $s_1$ and $s_2$.]{\label{pild:debut} \includegraphics{imgfs/PI-LD_detail_3.jpg}}\qquad
 246 \subfloat[First evaluated segment, corresponding to pattern $p_{5,0}$.]{\label{pild:sub1} \includegraphics{imgfs/PI-LD_detail_sub1.jpg}}\\
 247 \subfloat[Second evaluated segment, corresponding to pattern $p_{5,1}$.]{\label{pild:sub2} \includegraphics{imgfs/PI-LD_detail_sub2.jpg}}\qquad
 248 \subfloat[Third evaluated segment, corresponding to pattern $p_{5,2}$.]{\label{pild:sub3} \includegraphics{imgfs/PI-LD_detail_sub3.jpg}}\\
 249 \subfloat[Fourth evaluated segment, corresponding to pattern $p_{5,3}$.]{\label{pild:sub4} \includegraphics{imgfs/PI-LD_detail_sub4.jpg}}\qquad
 250 \subfloat[Fifth evaluated segment, corresponding to pattern $p_{5,4}$.]{\label{pild:sub5} \includegraphics{imgfs/PI-LD_detail_sub5.jpg}}\\
 251 \caption{Example of lengthening process starting with a two-segment poly-isoline ($l=5$, $\Delta d_{max}=2$).
 252 The initial situation is shown in \ref{pild:debut}, while \ref{pild:sub1} to \ref{pild:sub5} represent the successive candidate segments.
 253 The direction index of the last validated segment is $d_2=2$ (\ref{pild:debut}). It implies that direction indexes allowed for the third segment range from $d_2-\Delta d{max}=0$ to $d_2+\Delta d{max}=4$ (\ref{pild:sub1} to \ref{pild:sub5}).
 254 The lengthening of the poly-isoline is accepted if at least one segment has a positive GLRT. If there are several, the one which minimizes the standard deviation of the whole poly-isoline is selected.
 255 }
 256 \label{pild}
 257 \end{figure}
 258
 259
 260 \renewcommand{\labelenumi}{\alph{enumi})}
 261 \renewcommand{\theenumi}{\alph{enumi})}
 262
 263 \subsection{\label{lniv3}Poly-isolines with precomputed directions (PI-PD)}
 264 Though much faster, the PI-LD-based filter may be considered a bit weak compared to \textit{state-of-the-art} filters like BM3D family algorithms \citet{Dabov09bm3dimage}. Furthermore, we saw that this way of building poly-isolines  requires the alternate use of two different types of validation at each lengthening stage: GLRT and maximum likelihood minimization. In order to be performed, each of them generates numerous branches during kernel execution, which does not fit GPU architecture well and leads to execution times that we hoped would be more impressive.
 265
 266 Within the PI-LD model, at each pixel $(i,j)$, as no selection is done at the first stage, $D$ poly-isolines are computed and kept as candidate though, obviously, only one follows the actual isoline at $(i,j)$. So, if we assume we can achieve a robust determination of the direction at any given pixel of this isoline, it becomes unnecessary to perform the selection at each lenghtening step.
 267 Thus, at each pixel $(i,j)$, only the first segment has to be determined in order to obtain the local direction of the isoline. This leads to an important reduction of the work complexity: the above PI-LD model needs to evaluate $D.\left(2.\Delta_{dmax}+1\right)^{K-1}$ segments at each pixel position, while only $D.K$ evaluations are needed in the second case. For example, with a maximum of $K=5$ segments and a maximum deviation of $\Delta_{dmax}=2$, the PI-LD needs to evaluate up to 20000 segments per pixel where only 160 should be enough.
 268
 269
 270 On the basis of these observations, we propose a new model that we shall call PI-PD, that completely separates the validation stages performed in the PI-LD model implementation mentioned above:
 271
 272 A first computation stage selects the best first segment $s_1$ starting at each pixel $(i,j)$ of the input image. Its direction index $d_1(i,j)$ is then stored in a reference matrix denoted $I_\Theta$; sums $C_x$ and $C_{x2}$ along $s_1(i,j)$ are also computed and stored in a dedicated matrix $I_\Sigma$. It can be noticed that this selection method of $s_1$ segments is a degraded version of PI-LD constrained by $K=1$.
 273
 274 A second stage manages the now independant lengthening process. For one given state of a poly-isoline where the last added segment has been $s_K$, the pattern whose direction index is given by $d=I_\Theta(i_K,j_K)$ defines the only segment to be evaluated.
 275 Both corresponding sums  $C_x$ and $C_{x2}$ are read from matrix $I_\Sigma$ and used in GLRT evaluation.
 276 The last point is to prevent poly-isolines from turning back.
 277
 278 Figure \ref{pipd} details this process, starting from the same initial state as in figure \ref{pild} with the noticeable difference that no deviation limit is needed.
 279
 280 Thus, as introduced above, work complexity is considerably reduced, as each pattern is only applied once at one given pixel $(i,j)$, and   associated values are computed only once; they are re-used every time one poly-isoline's segment ends at pixel $(i,j)$. Also, this fits GPU constraints better, as it avoids multiple branches during kernel execution.
 281 It remains that, the building of poly-isolines is done without global likelihood optimization.
 282
 283 Eventually, the model has been improved by adding to it the ability to thicken poly-isolines from one pixel up to three which allows to achieve higher PSNR values by increasing the number of pixels of poly-isolines in addition to the lengthening process.
 284 This may apply to large images which do not contain small relevant details, as it may blur small significant details or objects present in the noisy image.
 285 Still, this feature makes PI-PD more versatile than our reference BM3D, which has prohibitive computation times when processing large images (over 5 minutes for a 4096x4096 pixel image) and thus should require a slicing stage prior to processing them, causing some overhead.
 286
 287 \begin{figure}[h]
 288  \centering
 289 \subfloat[Poly-isoline with two validated segments.]{\label{pipd:pipd1} \includegraphics[width=2.3cm]{imgfs/PI-PD_detail_sub1.jpg}}\qquad
 290 \subfloat[Next direction is read from element $(i_2,j_2)$ of $I_{\Theta}$.]{\label{pipd:pipd2} \includegraphics[width=5.0cm]{imgfs/PI-PD_detail_sub2.jpg}}\\
 291 \subfloat[Pattern $p_{l,d_3}$ is then applied at $(i_2,j_2)$ and GLRT is performed. Both sums needed to perform GLRT are read from element $(i_2,j_2)$ of $I_{\Sigma}$.]{\label{pipd:pipd3} \includegraphics[width=4cm]{imgfs/PI-PD_detail_sub3.jpg}}\qquad
 292 \subfloat[If accepted by GLRT, segment $s_3$ is added to poly-isoline.]{\label{pipd:pipd4} \includegraphics[width=2.7cm]{imgfs/PI-PD_detail_sub4.jpg}}\\
 293 \caption{Example of PI-PD lengthening process starting with a two-segment poly-isoline ($l=5$).
 294 The initial situation is represented in \ref{pipd:pipd1}, while \ref{pipd:pipd1} to \ref{pipd:pipd4} represent the successive processing steps.
 295 The end pixel of the last validated segment is $(i_2,j_2)$ (\ref{pipd:pipd1}). Reference matrices \(I_{\Theta}\) and \(I_{\Sigma}\) provide the values needed to select the pattern to be applied on \((i_2,j_2)\) (\ref{pipd:pipd2} and \ref{pipd:pipd3}). GLRT is performed to validated lengthening or not. This process goes on until one submitted segment does not comply with GLRT.
 296 }
 297 \label{pipd}
 298 \end{figure}
 299
 300 \subsection{\label{pipd_plan}Hybrid PI-PD}
 301 As the determination of each segment's direction only involves a few pixels,
 302 the PI-PD model may not be robust enough in regions where the surface associated with $Z$ has a low local slope value
 303 regarding power of noise $\sigma^2$.
 304 We shall call those regions Low Slope Regions (LSR).
 305 Figure \ref{img_plans} shows this lack of robustness with an example of two drawings of additive white gaussian noise applied on the same reference image (Figure \ref{img_window}). Within this image, we focused on a small 11x11 pixel window containing two LSR with one sharp edge between them.
 306
 307 Figures \ref{fig:dir1} and \ref{fig:dir2} show that the directions computed by PI-PD are identical from one drawing to the other near the edge (lines 5-7), while they vary in LSR (lines 1-4, 8-11).
 308
 309 \begin{figure}[h]
 310  \centering
 311 \subfloat[Reference image]{\label{fig:ref} \includegraphics{./imgfs/zoom_edge_ref2.ps}}\\
 312 \subfloat[Image corrupted by random drawing $n^{\circ}1$]{\label{fig:noisy1} \includegraphics{./imgfs/zoom_edge_bruit.ps}}\qquad
 313 \subfloat[Image corrupted by random drawing $n^{\circ}2$]{\label{fig:noisy2} \includegraphics{./imgfs/zoom_edge2_bruit.ps}}\\
 314 \subfloat[Isoline directions for random drawing $n^{\circ}1$]{\label{fig:dir1} \includegraphics{./imgfs/zoom_edge1_2D_superpose3.ps}}\qquad
 315 \subfloat[Isoline directions for random drawing $n^{\circ}2$]{\label{fig:dir2} \includegraphics{./imgfs/zoom_edge2_2D_superpose2.ps}}\\
 316  \caption{Zoom on a small square window of the airplane image. \ref{fig:ref} reproduce the zoom on the window, taken from the reference image of Figure \ref{img_window}. \ref{fig:noisy1}, \ref{fig:noisy2} and \ref{fig:ref} and are 3D views where each bar represents a pixel whose gray-level corresponds to the height of the bar. Figures \ref{fig:dir1} and \ref{fig:dir2} are 2D top views of the window. The chosen window shows an edge between two regions of low slope. The images \ref{fig:noisy1} and \ref{fig:noisy2} are corrupted with two different random drawings of the same additive  white gaussian noise (AWGN) of power $\sigma^2$ and mean value $0$. \ref{fig:dir1} and \ref{fig:dir2} show, for each pixel of the window, the direction of the isoline found by PI-PD. In regions of low slope (the two regions at the top and the bottom), the determination of the direction is not robust. But near the edge, directions do not vary from one drawing to another.}
 317   \label{img_plans}
 318 \end{figure}
 319
 320 Within such regions, our speed goals forbid us to compute isoline directions with the PI-LD model, more robust
 321 but far too slow.
 322 Instead we propose a fast solution which implies designing an edge detector whose
 323 principle is to re-use the segment patterns defined in section \ref{lniv0} and to combine them by pairs in order to detect any possible LSR around the center pixel.
 324 If a LSR is detected, the output gray-level value is the average value computed on the current square window, otherwise, the PI-PD output value is used.
 325
 326 In order to further simplify computation, only the patterns that do not share any pixel are used.
 327 These patterns have a direction which is a multiple of $45^{\circ}$.
 328
 329 Each base direction $(\Theta_i)$ and its opposite $(\Theta_i + \pi) \left[2\pi\right]$ define a line that separates the square
 330 window in two regions (top and and bottom regions, denoted T and B).
 331 We assume that segments on the limit belong to the T region which includes pixels of orientation from $\Theta_i$ to $\Theta_i+\pi$. This region comprises three more segments of directions $(\Theta_i+\frac{\pi}{4})$, $(\Theta_i+\frac{2\pi}{4})$ and $(\Theta_i+\frac{3\pi}{4})$.
 332 The other region (B) only includes three segments of directions $(\Theta_i+\frac{5\pi}{4})$, $(\Theta_i+\frac{6\pi}{4})$ and $(\Theta_i+\frac{7\pi}{4})$.
 333
 334 Figure \ref{detect_plans} illustrates this organization for $\Theta_i=\Theta_4=45^{\circ}$. Each bar represents a pixel in the detector's window. Pixels with null height are not involved in the GLRT. Pixels represented by higher bars define the T region and those represented by shorter bars define the B region.
 335
 336 \begin{figure}[h]
 337 \begin{center}
 338  \includegraphics{./imgfs/pattern_detecteur.ps}
 339  \caption{\label{detect_plans}Edge detector. 3D view representing an example square 11x11 pixel window ($l=5$) used in the edge detector for $\Theta_4=45^{\circ}$ around a center pixel colored in black. Each pixel is represented by a bar. Bars of height value 0 are for pixels that are not involved in the detector. Top region is defined by five pattern segments and includes the center pixel. Bottomp region only includes three pattern segments. The different height values are meant to distinguish between each of the three different sets of pixels and their role. }
 340 \end{center}
 341
 342 \end{figure}
 343
 344 For each $\Theta_i$, one GLRT is performed in order to decide whether the two regions T and B defined above are likely to be seen as a single region or as two different ones, separated by an edge as shown in figure \ref{detect_plans}. The center pixel is located on the edge.
 345 Equations \eqref{LLNP}, \eqref{LLNP2} and \eqref{GLRT} lead to a similar GLRT expression:
 346 \begin{eqnarray}
 347 T2_{max}- (8.l+1).\left[log\left(\widehat{\sigma_3}^2\right) - log\left(\widehat{\sigma_4}^2\right) \right]
 348 \label{GLRT2}
 349 \end{eqnarray}
 350 where $\sigma_3$ is the standard deviation considering that the two regions are likely to define a single one and $\sigma_4$ the standard deviation if an edge is more likely to separate the two regions. $T2_{max}$ is the decision threshold.
 351 With equation \eqref{GLRT2}, a negative result leads to an edge detection, oriented towards direction $\Theta_i$.
 352 When GLRT is known for each $\Theta_i$, we apply the following hybridation policy:
 353 \begin{enumerate}
 354 \item more than one negative GLRT: the PI-PD output value is used.
 355 \item only one negative GLRT: the center pixel is likely to be on a well-defined edge, and only the region it belongs to is considered. The average value of its pixel gray levels is then used.\label{halfplane}
 356 \item no negative GLRT: the window around the center pixel is likely to be a LSR. The average value on the whole square window is used (11x11 pixels in the example of Figure \ref{detect_plans}).
 357 \end{enumerate}
 358
 359 \begin{figure}[h]
 360  \centering
 361 \subfloat[Reference noiseless airplane image]{\label{img_window:ref} \includegraphics{./imgfs/airplane.ps}}\qquad
 362 \subfloat[Location of the example window in the reference image.]{\label{img_window:win} \includegraphics{./imgfs/zoom_windows_A.ps}}\\
 363  \caption{Location of the example window inside the reference image. Figure \ref{img_window:ref} shows the whole reference image and \ref{img_window:win} zooms on the part where the example 11x11 pixel window is.}
 364   \label{img_window}
 365 \end{figure}
 366
 367 It must be noticed that point \ref{halfplane} has been introduced in order to achieve smoother transitions between regions to which  PI-PD is applied and those in which the plain average value is used.
 368 Figure \ref{exbords} shows an example of such a classification achieved by the edge detector. The detector has been applied on the top noisy airplane image with a GLRT threshold value $T2_{max}=2$. Black pixels represent pixel classified as \textit{on an edge}, while white ones are those which belong to LSR.
 369
 370 \begin{figure}[h]
 371  \centering
 372 \subfloat[Noisy airplane image]{\label{exbords:noisy} \includegraphics{./imgfs/airplane_noisy_small.ps}}\qquad
 373 \subfloat[Pixel classification performed by the edge detector. ]{\label{exbords:bords} \includegraphics{./imgfs/img_bords_T2_small.ps}}\\
 374  \caption{Pixel classification inside the noisy image. Figure \ref{exbords:noisy} shows the noisy input image and \ref{exbords:bords} reproduces the output classification of pixels, as a black and white image, obtained with threshold value  $T2_{max}=2$. Black pixels are supposed to be near an edge, while white pixels belong to Low Slope Regions.}
 375   \label{exbords}
 376 \end{figure}
 377
 378
 379 \section{\label{lniv} Hybrid PI-PD filter Implementation: details}
 380 All implementation details that will be given here are relative to the proposed PI-PD models and Nvidia$^\copyright~$ GPU devices.
 381
 382 \subsection{\label{genPaths}Segment patterns}
 383 The first kernel to be run is \texttt{kernel\_genPaths()} which generates matrix $P_l$. Its elements $(\Delta i; \Delta j)$ are the relative coordinates of the pixels which define segment patterns $p_{l,d}$. The dimensions of matrix $P_l$ are $D$ rows $\times$  $l$ columns.
 384 To fit GPU architecture as closely as possible, we chose $D=32$ patterns.
 385 Each segment $s_k$ of a poly-isoline can then be seen as a pattern $p_{l,d}$ applied on the starting pixel $(i,j)$ of this segment, denoted $p_{l,d}(i,j)$.
 386
 387 The example in figure \ref{p5q1} shows the first quarter of $P_5$ and the corresponding eight discrete segment patterns in the first quadrant. The three remaining quarters of the matrix are easily deduced by applying successive rotations of angle $\frac{\pi}{2}$ to the above elements.
 388
 389 \begin{figure}[h]
 390 \begin{center}
 391 \includegraphics[width=0.65\linewidth]{./imgfs/P5Q1.ps}
 392 \end{center}
 393 \tiny{
 394 $$
 395 P_5 =
 396 \begin{bmatrix}
 397 (0,1)&(0,2)&(0,3)&(0,4)&(0,5)\\
 398 (0,1)&(0,2)&(-1,3)&(-1,4)&(-1,5)\\
 399 (0,1)&(-1,2)&(-1,3)&(-2,4)&(-2,5)\\
 400 (-1,1)&(-1,2)&(-2,3)&(-3,4)&(-3,5)\\
 401 (-1,1)&(-2,2)&(-3,3)&(-4,4)&(-5,5)\\
 402 (-1,1)&(-2,1)&(-3,2)&(-4,3)&(-5,3)\\
 403 (-1,0)&(-2,1)&(-3,1)&(-4,2)&(-5,2)\\
 404 (-1,0)&(-2,0)&(-3,1)&(-4,1)&(-5,1)\\
 405 \ldots&\ldots&\ldots&\ldots&\ldots\\
 406 \end{bmatrix}
 407 $$
 408 }
 409 \caption{\label{p5q1}Top: example segment patterns $p_{5,d}$ for $d\in[0;7]$; the black pixel represents the center pixel $(i,j)$, which does not belong to the pattern. The gray ones define the actual pattern segments. Bottom: the first 8 lines of corresponding matrix $P_5$ whose elements are the positions of segment pixels with respect to the center pixel.}
 410 \end{figure}
 411
 412 \subsection{\label{sipd}Generation of reference matrices $I_{\Sigma}$ and $I_{\Theta}$}
 413 In order to generate both matrices, a GPU kernel \texttt{kernel\_precomp()} computes, in parallel for each pixel $(i,j)$:
 414 \begin{itemize}
 415 \item the direction $\delta$ of the most likely segment $s_1 = p_{l,\delta}(i,j)$ among the $D$ possible ones. This value is stored in matrix $I_{\Theta}$ at position $(i,j)$.
 416 \item values $C_x(s_1)$ and $C_{x^2}(s_1)$ defined in equations \eqref{cx} and \eqref{cx2}. This vector of values is stored in matrix $I_{\Sigma}$ at position $(i,j)$.
 417 \end{itemize}
 418
 419 In order to reduce processing time, the input image is first copied into texture memory (see algorithm \ref{algoinit} for initializations and memory transfer details), thus taking advantage of the 2D optimized caching mechanism.
 420
 421 This kernel follows the \textit{one thread per pixel} rule. Consequently, each value of $P_l$ has to be accessed by every thread of a block. That led us to load it from texture memory first, then copy it into all shared memory blocks. This has proved to be the fastest scheme.
 422
 423 Algorithm \ref{algoprecomp} summarizes the computations achieved by \texttt{kernel\_precomp()}.
 424 Vector $(C_x, C_{x2})$ stores the values of $C_x(s_1)$ and $C_{x^2}(s_1)$ associated with the current tested pattern.
 425 Vector $(C_{x-best}, C_{x2-best})$ stores the values of $C_x(s_1)$ and $C_{x^2}(s_1)$ associated with the best previously tested pattern.
 426
 427 In the same manner, $\sigma$ and $\sigma_{best}$ are deviation values for current and best tested patterns.
 428
 429  The selection of the best pattern is driven by the value of the standard deviation of candidate isolines.
 430 Lines 2 and 3 compute both sums for the first pattern to be evaluated. Line 4 computes its standard deviation.
 431 Then, lines 5 to 14 loop on each pattern and keep values associated with the best pattern found. These values are eventually stored in matrices $I_{\Theta}$ and $I_{\Sigma}$ on lines 16 and 17.
 432
 433 \begin{algorithm}[htb]
 434   \SetNlSty{textbf}{}{:}
 435 \caption{Initializations in GPU memory}
 436 \label{algoinit}
 437 $l \leftarrow$ step size\;
 438 $D \leftarrow$ number of primary directions\;
 439 $I_n \leftarrow$ noisy image\;
 440 $I_{n tex} \leftarrow I_n $\tcc*[r]{copy to texture mem.}
 441 $P_l \leftarrow$ kernel\_genPaths \tcc*[r]{pattern matrix}
 442 $P_{l tex} \leftarrow P_l $\tcc*[r]{copy to texture mem.}
 443 $T_{max} \leftarrow$ GLRT threshold (lengthening)\;
 444 $T2_{max} \leftarrow$ GLRT threshold (edge detection)\;
 445 \end{algorithm}
 446
 447 \begin{algorithm}
 448   \SetNlSty{textbf}{}{:}
 449   \SetKwComment{Videcomment}{}{}
 450 \caption{generation of reference matrices, kernel \texttt{kernel\_precomp()}}
 451 \label{algoprecomp}
 452 \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{
 453   $C_{x-best} \leftarrow  \displaystyle\sum_{(y,x)\in p_{l,0}(i,j)} I_{n tex}(i+y,j+x)$ \;
 454   $C_{x2-best} \leftarrow  \displaystyle\sum_{(y,x)\in p_{l,0}(i,j)} I_{n tex}^2(i+y,j+x)$ \;
 455   $\sigma_{best} \leftarrow$ standard deviation along $p_{l,0}(i,j)$ \;
 456   %\Videcomment{}
 457   \tcc{loop on each pattern}
 458   \ForEach{$d \in [1;D-1]$}{
 459     $C_x \leftarrow \displaystyle\sum_{(y,x)\in p_{l,d}(i,j)} I_{n tex}(i+y,j+x)$\;
 460     $C_{x2} \leftarrow \displaystyle\sum_{(y,x)\in p_{l,d}(i,j)} I_{n tex}^2(i+y,j+x)$\;
 461     $\sigma \leftarrow$ standard deviation along $p_{l,d}(i,j)$\;
 462     \If(\tcc*[f]{keep the best}){$\sigma_d < \sigma_{best}$}{
 463       $C_{x-best} \leftarrow C_x$ \;
 464       $C_{x2-best} \leftarrow C_{x2}$ \;
 465       $\Theta_{best} \leftarrow d$ \;
 466     }
 467   }
 468   $I_{\Sigma}(i,j) \leftarrow \left[ C_{x-best}, C_{x2-best}\right]$ \tcc*[r]{stores}
 469   $I_{\Theta}(i,j) \leftarrow \Theta_{best}$ \tcc*[r]{in matrices}
 470 }
 471 \end{algorithm}
 472
 473 \begin{algorithm}[ht]
 474 \SetNlSty{textbf}{}{:}
 475 \caption{PI-PD lengthening process \texttt{kernel\_PIPD()}}
 476 \label{algoPIPD}
 477 \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{
 478  % \tcc{in parallel}
 479  % \tcc{starting pixel $(i,j)$ and first segment without GLRT}
 480   $(C_x^1, C_{x2}^1) \leftarrow z(i,j)$ \tcc*[r]{starting pixel}
 481   $(i_1, j_1) \leftarrow (i, j)$ \tcc*[r]{first segment}
 482   $(C_x^1, C_{x2}^1) \leftarrow I_{\Sigma}(i_1,j_1)$ \tcc*[r]{read matrix}
 483   $d_1 \leftarrow I_{\Theta}(i,j)$ \tcc*[r]{read matrix}
 484   $l_1 \leftarrow l$ \tcc*[r]{isoline length}
 485   $\sigma_1 \leftarrow (C_{x2}^1/l_1 - C_x^1)/l_1$\;
 486   $(i_2, j_2) \leftarrow end~of~first~segment$\;
 487   $(C_{x}^2, C_{x2}^2) \leftarrow I_{\Sigma}(i_2,j_2) $ \tcc*[r]{2$^{nd}$ segment}
 488   $d_2 \leftarrow I_{\Theta}(i_2,j_2)$\;
 489   $\sigma_2 \leftarrow (C_{x2}^2/l - C_x^2)/l$ \;
 490   %
 491   \While{$GLRT(\sigma_1, \sigma_2, l_1, l) < T_{max}$}{
 492     $l_1 \leftarrow l_1 + l$ \tcc*[r]{lengthening}
 493     $(C_x^1, C_{x2}^1) \leftarrow (C_x^1, C_{x2}^1)+(C_x^2, C_{x2}^2)$\;
 494     $\sigma_1 \leftarrow (C_{x2}^1/l_1 - C_x^1)/l_1$ \tcc*[r]{update}
 495     $(i_1,j_1) \leftarrow (i_2, j_2)$ \tcc*[r]{step forward}
 496     $d_1 \leftarrow d_2$\;
 497     $(i_2, j_2) \leftarrow end~of~next~segment$\;
 498     \tcc*[f]{next segment}
 499     $(C_{x}^2, C_{x2}^2) \leftarrow I_{\Sigma}(i_2,j_2) $\;
 500     $d_2 \leftarrow I_{\Theta}(i_2,j_2)$\;
 501     $\sigma_2 \leftarrow (C_{s2}^2/l - C_s^2)/l$ \;
 502     }
 503   }
 504   $\widehat{I}(i, j) \leftarrow C_x^1/l_1$ \tcc*[r]{isoline value}
 505 \end{algorithm}
 506
 507
 508 \subsection{\label{sipdl}PI-PD lengthening process:  \texttt{kernel\_PIPD()} }
 509 This parallel kernel is run in order to obtain the image of the \emph{isolines}. It is detailed in algorithm \ref{algoPIPD}, (see section \ref{lniv3} for process description).
 510
 511 Lines from 2 to 11 perform allocations for the first lengthening to evaluate.
 512 More precisely, $(i_1, j_1)$ represents the starting pixel of the current segment; $(i_2, j_2)$ is both its ending pixel and the starting pixel of the next segment; $d_1$ and $d_2$ are their directions, read from precomputed matrix $I_{\Theta}$.
 513 $C_x^1$ and $C_{x2}^1$ are the gray-level sums along the current poly-isoline; $C_x^2$ and $C_{x2}^2$ are the gray-level sums of the candidate segment.
 514 The current poly-isoline ends at $(i_1, j_1)$ and is made of $l_1$ pixels (already accepted segments); its standard deviation is $\sigma_1$.
 515 The loop extending from lines 12 to 21 performs the allocations needed to proceed one segment forward, as long as GLRT is true.
 516 If the lengthening has been accepted, the length of the poly-isoline is updated in line 13, and the same is done with $C_x$ and $C_{x2}$ which are read from precomputed matrix $I_{\Sigma}$ (see equations \eqref{cx} and \eqref{cx2} for definition).
 517 Finally, using direction value $d_2$, it translates the coordinates $(i_1, j_1)$ to the end of the newly elongated poly-isoline, and $(i_2, j_2)$ to the end of the next segment to be tested.
 518 As soon as the GLRT condition becomes false, line 23 eventually produces the output value of the denoised image at pixel $(i,j)$, that is, the average gray-level value along the poly-isoline.
 519
 520
 521
 522
 523 \subsection{\label{sipdd}Hybrid PI-PD :  \texttt{kernel\_edge\_detector()} }
 524 As introduced in section \ref{pipd_plan}, the aim of kernel \texttt{kernel\_edge\_detector()} is to divide pixels into two classes according to their belonging to a LSR or not. Algorithm \ref{algoDetect} explains the detailled procedure. Lines 2 to 6 initialize values of the direction index ($\Theta$), the number of edges detected ($edgeCount$), the gray-level sum along the pixels that defines the H half-plane ($sumEdge$) and the number of pixels that defines both half-planes H and L ($nH$, $nL$). Then the loop starting at line 7 performs the GLRT for every considered direction index $\Theta$.
 525 Values $sumH$ and $sumL$ are vectors of two parameters $x$ and $y$, parameter $x$ being the sum of gray-level values and $y$ the sum of square gray-level values. Value $sumH$ is computed along the pixels of half-plane H and is obtained by loop at lines 10 to 14; Value $sumL$ is computed along the pixels of half-plane L and is obtained by loop at lines 15 to 19. Value $I_{ntex}(i,j)$ refers to the gray-level value at pixel (i,j) previously stored in texture memory. Eventually, the isoline level value is output at line 27, 30 or 33 depending on the situation (see \ref{pipd_plan} for details about the decision process).
 526
 527 \begin{algorithm}[ht]
 528 \SetNlSty{textbf}{}{:}
 529 \caption{edge detector and pixel classifier \texttt{kernel\_edge\_detector()}}
 530 \label{algoDetect}
 531 \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{
 532   $\Theta \leftarrow 0$\tcc*[r]{direction index}
 533   $edgeCount \leftarrow 0$\;
 534   $sumEdge \leftarrow 0$\;
 535   $nH \leftarrow 5l+1$\;
 536   $nL \leftarrow 3l$\;
 537   \While{($\Theta < 32$) }{
 538     $sumH \leftarrow (I_{ntex}(i,j), I_{ntex}^2(i,j))$\;
 539     $sumL \leftarrow (0, 0)$\;
 540     \For{($\alpha=\Theta$ to $\alpha=\Theta+16$ by step $4$)}{
 541       $sPat \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I_{n tex}(i+y,j+x)$\;
 542       $sPat2 \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I^2_{n tex}(i+y,j+x)$\;
 543       $sumH \leftarrow sumH + (sPat, sPat2)$\;
 544     }
 545     \For{($\alpha=\Theta+20$ to $\alpha=\Theta+28$ by step $4$)}{
 546       $sPat \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I_{n tex}(i+y,j+x)$\;
 547       $sPat2 \leftarrow \displaystyle\sum_{(y,x)\in P_{l,\alpha}(i,j)} I^2_{n tex}(i+y,j+x)$\;
 548       $sumL \leftarrow sumL + (sPat, sPat2)$\;
 549     }
 550     \If{($GLRT(sumH, nH, sumL, nL) > T2_{max}$)}{
 551       $edgeCount \leftarrow edgeCount + 1$\;
 552       $sumEdge \leftarrow sumH.x$\;
 553     }
 554     $\Theta \leftarrow \Theta + 4$\;
 555     }
 556     \tcc{outputs isoline value}
 557     \If{($edgeCount == 0$)}{
 558       $\widehat{I}(i,j) \leftarrow \dfrac{(sumH.x + sumL.x)}{nH+nL}$ \tcc*[r]{LSR}
 559     }
 560     \If{($edgeCount == 1$)}{
 561       $\widehat{I}(i,j) \leftarrow \dfrac{(sumEdge)}{nH}$
 562     }
 563     \If{($edgeCount > 1$)}{
 564       $\widehat{I}(i,j) \leftarrow \widehat{I_{PIPD}}(i,j)$\tcc*[r]{PI-PD}
 565     }
 566   }
 567 \end{algorithm}
 568
 569 \section{\label{results}Results}
 570  The proposed hybrid PI-PD model has been evaluated with the 512x512 pixel sample images used by \citet{denoiselab} in order to make relevant comparisons with other filtering techniques. As we aim to address image processing in very noisy conditions (as in \citet{6036776}), we focused on the noisiest versions, degraded by AWGN of standard deviation $\sigma=25$.
 571
 572 Quality measurements of the denoised images in comparison with reference images have been obtained by the
 573 evaluation of:
 574 \begin{enumerate}
 575 \item Peak Signal to Noise Ratio (PSNR) that quantifies the mean square error between denoised and reference images:
 576  $MSE(I,\widehat{I})$. We used the following expression: $$PSNR =10.log_{10}\left(\frac{max(\widehat{I})}{MSE(I,\widehat{I})}\right)$$
 577  PSNR values are given in dB and highest values mean best PSNR.
 578 \item The Mean Structure Similarity Index (MSSIM, defined in \citet{Wang04imagequality}), which quantifies local similarities between denoised and reference images inside a sliding window. MSSIM values belong to an interval $[0; 1]$; the closer to 1 the better.
 579 \end{enumerate}
 580 PSNR is widely used to measure image quality but can be misleading when used by itself: as demonstrated in \citet{Wang04imagequality}, the processing of noisy images can bring a high PSNR value but very bad visual quality. This is avoided by the use of the MSSIM index along with the PSNR value: when both of them show high values, the overall quality can be considered high.
 581
 582 Result figure \ref{tablePI} provides the PSNR and MSSIM of every image, denoised with three different filters: average 5x5, hybrid PI-PD and BM3D. The \emph{noisy} column shows the values for each image before denoising. BM3D (\citet{Dabov09bm3dimage}) is taken as a reference in terms of denoising quality, while the average filter is taken as a reference in terms of processing time. The window size of 5x5 pixels has been choosen to achieve PSNR values similar to those obtained by PI-PD.
 583
 584 BM3D code is run on a quad-core Xeon E31245 at 3.3GHz and 8GByte RAM under linux kernel 3.2 (64bits), while PI-PD as well as average filter codes is run on a Nvidia C2070 GPU hosted by a PC running linux kernel 2.6.18 (64bits).
 585 The average filter used is an efficient parallel GPU implementation that we developped. It is a generic and versatile separable convolution kernel that outputs more than 700MPixels per second in the 5x5 averaging configuration.
 586
 587 Hybrid PI-PD measurements were performed with $n=25$, $l=5$, $T_{max}=1$ and $T2_{max}=2$. BM3D measurements have been performed with the freely available BM3D software proposed in \citet{Dabov09bm3dimage}.
 588
 589 The hybrid PI-PD model proves much faster than BM3D and better than the average 5x5 filter.
 590 Processing the thirteen images of the database reveals that hybrid PI-PD brings an average improvement of 1.5dB (PSNR) and 7.2\% (MSSIM) against the average filter at the cost of 35 times its computational duration. Against hybrid PI-PD, BM3D achieves an average improvement of 2.4dB and 4.6\%  at the cost of 350 times as much duration.
 591 Actually, the 5x5 average filter takes around \textbf{0.35~ms} to process an image while hybrid PI-PD needs around \textbf{11~ms} and BM3D around \textbf{4.3~s}.
 592
 593 It must be noticed that experimental optimization show that the vector of parameter values $T_{max}=1$ and $T2_{max}=2$ is optimal for 11 of the 13 images of the database. Better results are obtained with a slightly different value of $T2_{max}$ for \textit{peppers} or \textit{zelda} whose denoised images can obtain a MSSIM index of 0.90.
 594 Most of the computational time of hybrid PI-PD is spent by the edge detector, which clearly does not fit GPU requirements to achieve good performance. For information, the simple PI-PD model runs in less than 4~ms in the same conditions.
 595
 596 Figure \ref{comparimg} shows denoised images produced by hybrid PI-PD model compared with the output of the BM3D and the average 5x5 filters. The figure illustrates the merits and drawbacks of each model: edges are well preserved by hybrid PI-PD, but a \textit{staircase} effect is visible, a well-known artefact inherent to this type of neighborhood filters. Our recent GPU-implementation of the regression method proposed in \citet{BuadesCM06} brings a mean improvement of 1dB at the cost of 0.4~ms.
 597
 598
 599
 600 \begin{figure}
 601 %\begin{table}[h]
 602 \footnotesize
 603  \begin{center}
 604 \begin{tabular}{|c|r|r|r|r|}\hline
 605 \bf Image&\bf Noisy  &\bf average &\bf hybrid&\bf BM3D \\
 606          &           &\bf 5x5     &\bf PI-PD &         \\\hline\hline
 607 airplane &   19.49dB &   26.39dB  &  28.46dB& 30.88dB  \\
 608          &    0.58   &    0.84    &  0.88  &   0.93    \\\hline
 609 barbara  &   20.04dB &   22.76dB  & 24.26dB&  30.60dB  \\
 610          &    0.70   &    0.76    &  0.83  &   0.94    \\\hline
 611 boat     &   20.33dB &   25.58dB  & 27.54dB&  30.02dB  \\
 612          &    0.66   &    0.81    &  0.87  &   0.91    \\\hline
 613 couple   &   20.28dB &   25.25dB  & 27.33dB&  29.77dB  \\
 614          &    0.69   &    0.79    &  0.87  &   0.91    \\\hline
 615 elaine   &   19.85dB &   28.71dB  & 28.94dB&  30.60dB  \\
 616          &    0.59   &    0.86    &  0.87  &   0.91    \\\hline
 617 fingerprint &20.34dB &   23.33dB  & 26.07dB&  27.93dB  \\
 618          &     0.93  &    0.87    &  0.95  &   0.96    \\\hline
 619 goldhill &    19.59dB&   26.47dB  & 27.43dB&  29.22dB  \\
 620          &     0.67  &    0.82    &  0.87  &   0.88    \\\hline
 621 lena     &    19.92dB&   27.99dB  & 29.14dB&  31.80dB  \\
 622          &     0.60  &    0.84    &  0.88  &   0.93    \\\hline
 623 man      &    20.38dB&   24.74dB  & 26.74dB&  28.14dB  \\
 624          &     0.71  &    0.80    &  0.86  &   0.87    \\\hline
 625 mandrill &    19.34dB&   20.34dB  & 22.38dB&  24.75dB  \\
 626          &     0.77  &    0.69    &  0.83  &   0.88    \\\hline
 627 peppers  &    19.53dB&   27.30dB  & 28.68dB&  30.87dB  \\
 628          &     0.61  &    0.86    &  0.87  &   0.92    \\\hline
 629 stream   &    20.35dB&   23.23dB  & 25.35dB&  26.34dB  \\
 630          &     0.80  &    0.78    &  0.87  &   0.88    \\\hline
 631 zelda    &    17.71dB&   23.13dB  & 27.71dB&  30.49dB  \\
 632          &     0.58  &    0.87    &  0.88  &   0.93    \\\hline
 633  \end{tabular}
 634  \end{center}
 635 \caption{Comparison between hybrid PI-PD, average and BM3D filters. PI-PD parameter values: $n=25$, $l=5$, $T_{max}=1$ and $T2_{max}=2$. The \emph{noisy} column correspond to the noisy input images, before denoising. \newline Timings: average filter in around 0.35~ms hybrid PI-PD in around 11.0~ms and BM3D in around 4.3~s}
 636 \label{tablePI}
 637 %\end{table}
 638 \end{figure}
 639
 640
 641 \begin{figure}[h]
 642 \centering
 643 \subfloat[Noisy image $\sigma=25$]{\label{fig:noisy} \includegraphics{./imgfs/airplane_25_noisy_zoom.ps}}\qquad
 644 \subfloat[Average 5x5 filter, in $0.35~ms$]{\label{fig:pipd} \includegraphics{./imgfs/airplane_25_mean5_zoom.ps}}\\
 645 \subfloat[PI-PD hybrid filter, $n=25$, $l=5$, $T_{max}=1$, $T2_{max}=2$, in $11~ms$ ]{\label{fig:hpipd} \includegraphics{./imgfs/airplane_zoom_hybrid_6_r50_T10_P2.ps}}\qquad
 646 \subfloat[BM3D filter, in $4.3s$]{\label{fig:bm3d} \includegraphics{./imgfs/airplane_bm3d_zoom.ps}}
 647 \caption{Comparison of 512x512 images denoised from noisy airplane image (\ref{fig:noisy}) with a PI-PD filter (\ref{fig:pipd}), PI-PD hybrid filter (\ref{fig:hpipd}) and BM3D filter (\ref{fig:bm3d}). Only zoomed parts of images are shown in order to ensure better viewing.}
 648 \label{comparimg}
 649 \end{figure}
 650
 651
 652 \section{\label{conclusion}Conclusion, future work}
 653 From the start, our approach, unlike quite a few others, has been to base this study on the conception and characteristics of the targeted hardware (Nvidia Graphic cards).
 654
 655 So as to get high execution speeds, we chose, for example, to find a method that remains local (concentrating on the immediate neighborhood of the center pixel), but still provides very significant benefits, using our technique of progressive lengthening.
 656
 657 Nevertheless, our method has proved slightly sub-optimal and lacking robustness in \textit{flat} regions (see above, Low Slope Regions), even if the actual visual effect may be considered quite satisfactory.
 658
 659 As a first step to address the above drawbacks, we have devised a hybrid method that detects and applies distinct processing to LSR regions (see above). Processing speeds remain fast, and much higher than the BM3D implementation taken as quality reference. This is very promising, and opens the perspective of real-time high definition  image sequence processing at 25 fps, provided we improve the edge detector, which currently limits the HD frame rate at 16fps (High Definition: 1920x1080 pixels).
 660
 661 To further improve the quality of output images, we also implemented a efficient parallel implementation of the staircase effect reduction technique presented in \citet{BuadesCM06}. With this method, searching for best improvement factors leads to different parameters values for each image processed, which prompts to studying some way of overriding such parameters.
 662
 663 Our study so far has been based on additive noise; we are currently working on transposing criteria to various multiplicative noise types. We also extended the process to color images with very interesting visual results to be confirmed by the experimental measurement currently in progress.
 664
 665 % references section
 666 \bibliographystyle{spbasic}
 667
 668 \bibliography{bibliosv}
 669
 670 % that's all folks
 671 \end{document}
 672
 673 %doi = {10.5201/ipol.2011.bcm_nlm},