\documentclass[twocolumn, final, natbib]{svjour2} \usepackage{cite} \usepackage{transparent} %\usepackage[pdftex]{graphicx,color} % \graphicspath{{imgfs/}} % \DeclareGraphicsExtensions{.pdf,.jpeg,.png} %\usepackage[dvips]{graphicx} \usepackage{graphicx} %\graphicspath{{imgfs/}} %\DeclareGraphicsExtensions{.png,.jpg,.ps} % *** MATH PACKAGES *** % \usepackage[cmex10]{amsmath} % *** SPECIALIZED LIST PACKAGES *** % \usepackage[ruled,lined,linesnumbered]{algorithm2e} % *** ALIGNMENT PACKAGES *** % \usepackage{array} \usepackage{mdwmath} \usepackage{mdwtab} % *** SUBFIGURE PACKAGES *** \usepackage[caption=false,font=footnotesize]{subfig} % *** FLOAT PACKAGES *** % \usepackage{fixltx2e} \journalname{Real Time Image Processing} \begin{document} % % paper title % can use linebreaks \\ within to get better formatting as desired \title{Fast GPU-based denoising filter using isoline levels} % author names and affiliations % use a multiple column layout for up to two different % affiliations \author{ Gilles Perrot$^1$ \and St\'{e}phane Domas$^1$ \and Rapha\"{e}l Couturier$^1$ \and Nicolas Bertaux$^2$} \institute{ $^1$ \at FEMTO-ST institute\\ Rue Engel Gros, 90000 Belfort, France.\\ forename.name@univ-fcomte.fr \and $^2$ \at Institut Fresnel, CNRS, Aix-Marseille Universit\'e, Ecole Centrale Marseille,\\ Campus de Saint-J\'er\^ome, 13013 Marseille, France.\\ nicolas.bertaux@ec-marseille.fr } \date{Received: date / Revised: date} % make the title area \maketitle \begin{abstract} In this study, we propose to address the issue of image denoising by means of a GPU-based filter, able to achieve high-speed processing by taking advantage of the parallel computation capabilities of modern GPUs. Our approach is based on the level sets theory first introduced by \citet{matheron75} in 1975 but little implemented because of its high computation costs. What we actually do is try to guess the best isoline shapes inside the noisy image. 
At first, our method involved the polyline modelling of isolines; then we found an optimization heuristic which very closely fits the capabilities of GPUs. So far, though our proposed hybrid PI-PD filter has not achieved the best denoising levels, it is nonetheless able to process a 512x512 image in about 11~ms. \end{abstract} \section{Introduction} Denoising has been a much studied research issue since electronic transmission was first used. The wide range of applications that involve denoising makes it difficult to propose a universal filtering method. Among them, digital image processing is a major field of interest as the number of digital devices able to take pictures or make movies is growing fast and shooting is rarely done in optimal conditions. Moreover, the increase in pixel density of the CCD or CMOS sensors used to measure light intensity leads to higher noise effects and imposes high output flow rates on the various processing algorithms. In addition, it is difficult to quantify the quality of an image processing algorithm, as visual perception is subject to high variation from one human to another. So far, the advent of GPUs has brought high speedups to a lot of algorithms, and many researchers and developers have successfully addressed the issue of implementing existing algorithms on such devices. For example, in \citet{mcguire2008median}, \citet{chen09} and \citet{sanchezICASSP12}, authors managed to design quite fast median filters. Bilateral filtering has also been successfully proposed in \citet{YangTA09}. Still, most high quality algorithms, like NL-means \citet{ipol.2011.bcm_nlm} or BM3D \citet{Dabov09bm3dimage}, make use of non-local similarities and/or frequency domain transforms.
However, speedups achieved by their current GPU implementations, though quite significant (as shown for example with NL-means in \citet{PALHANOXAVIERDEFONTES}), do not come near those achieved by local methods such as gaussian, median or neighborhood filters, as they have not originally been designed for GPU architectures. In order to fully benefit from the capabilities of GPUs, it is important that the approach to designing algorithms be more hardware-oriented, keeping in mind, from the very beginning, the intrinsic constraints of the device which is actually going to run those algorithms. Consequently, this often results in unusual options and even apparently sub-optimal solutions, but the considerable speed benefits obtained would possibly make it at least a good compromise or even the only current way to real-time high-definition image processing. \section{\label{contrib}Contribution} As early as 1975, \citet{matheron75} found that, under the conditions mentioned in section \ref{isolines}, an image can be decomposed into a set of level lines. Real-life images fulfill these conditions and, since then, with the increase of computing capabilities, researchers have succeeded in implementing such level-lines based algorithms as in \citet{caselles97} and \citet{springerlink:10.1007/3-540-48236-9_16}. A few years ago, in \citet{bertaux:04}, the authors proposed an original method which significantly reduces speckle noise inside coherent images, using the level lines in the image to constrain the minimization process. Those level lines are actually \textit{iso-gray-level} lines, which are called \textit{isolines}. In \citet{bertaux:04}, isolines consist of neighborhoods of polyline shapes determined by maximum likelihood optimization. This method proved not only to bring good enhancement but also to preserve edges between regions.
Nevertheless, the costs in computation time, though not prohibitive, did not allow real-time image processing; as an example, the authors of \citet{bertaux:04} managed to process an almost 2Mpixel image within a minute on an old PIII-1GHz. Our work started by designing a set of GPU implementations with various optimization heuristics, in order to find out which tracks could be followed towards minimizing loss in quality while preserving admissible execution times. Those algorithms have been tested with reference images taken from \citet{denoiselab} for which various processing results have been published. Some of the more interesting ones are listed and compared in \citet{denoisereview}. Statistical observations (to be detailed below) made on the output images produced by the method proposed in \citet{bertaux:04} led us to propose a very fast and simple parallel denoising method which gives good results in terms of average gray-level error, but also avoids the blurring of edges. On the basis of the BM3D timings listed in \citet{Dabov09bm3dimage} and with our own measurements, our proposed GPU-based filter runs around 350 times faster and thus is able to process high definition images at over 16~fps. It also achieves good denoising quality. \section{Plan} In the following, section \ref{GPUgeneralites} briefly focuses on recent Nvidia GPU characteristics. Section \ref{isolines} will introduce the theory and notations used to define isolines. Then, in section \ref{lniv0}, we will describe the two isoline based models that led to the final hybrid model, while section \ref{lniv} details the parallel implementation of the proposed algorithm. Finally, we present our results in section \ref{results} before drawing our conclusions and outlining our future work in section \ref{conclusion}. \section{\label{GPUgeneralites}NVidia's GPU architecture} GPUs are multi-core, multi-threaded processors, optimized for highly parallel computation.
Their design focuses on the SIMT model (Single Instruction Multiple Threads), which devotes more transistors to data processing than to data caching and flow control (see \citet{CUDAPG} for more details). For example, a C2070 card features 6~GBytes of global memory and a total of 448 cores bundled in several Streaming Multiprocessors (SM). An amount of shared memory, much faster than global memory, is available on each SM (up to 48~KB for a C20xx card). Writing efficient code for such architectures is not obvious, as re-serialization must be avoided as much as possible. Thus, code design requires that one pay attention to a number of points, among which: \begin{itemize} \item the CUDA model organizes threads by a) thread blocks in which synchronization is possible, b) a grid of blocks with no possible synchronization between them. \item there is no way to know how blocks are scheduled during one single kernel execution. \item data must be kept in GPU memory, to reduce the overhead generated by copying between CPU and GPU. \item the total number of threads running the same computation must be as large as possible. \item the number of execution branches inside one block should be as small as possible. \item global memory accesses should be coalescent, \emph{i.e.} memory accesses done by physically parallel threads (2 x 16 at a time) must be consecutive and contained in a 128-byte range. \item shared memory is organized in 32 banks of 32-bit words. To avoid bank conflicts, each parallel thread (2 x 16 at a time) must access a different bank. \end{itemize} All the above characteristics make designing efficient GPU code quite constraining, as poorly suited code may even run slower on a GPU than on a CPU.
\section{\label{isolines}Isolines} In the following, let $I$ be the reference noiseless image (assuming we have one), and let $I'$ be the noisy acquired image corrupted by Independent and Identically Distributed (IID) additive white gaussian noise of zero mean value and standard deviation $\sigma$. Let $\widehat{I}$ be the denoised image. Each pixel of $I'$ of coordinates $(i,j)$ has its own gray level $z(i,j)$. As introduced above, since most common images are continuous and contain few edges, they can be decomposed into a set of constant gray-level lines called \textit{isolines}. Our goal is then to find, for each single pixel of a noisy image, the isoline it belongs to. The generalized likelihood criterion (GL) is used to select the best isoline among all the considered ones, all of which must have the same number of pixels in order to be compared. \subsection{Fixed-length isolines} For each pixel $(i,j)$ of the corrupted image, we look for the gray level of the isoline it belongs to, inside a rectangular window $\omega$ centered on $(i,j)$. Inside $\omega$, let $S^n$ be the isoline part which the center pixel belongs to. $S^n$ is a set of $n$ pixel positions $(i_q,j_q)$ ($q \in [0;n[$).\\ The gray levels $z$ along $S^n$ follow a gaussian probability density function whose parameters $\mu_{S^n}$ (mean value of the isoline part) and $\sigma$ (standard deviation of the gaussian noise) are unknown.\\ Let $\overline{S^n}$ be defined by $\omega = S^n \cup \overline{S^n}$.\\ For each pixel, the mean values $\mu_{ij}$ of gray levels $z$ over $\overline{S^n}$ are unknown and supposed independent.\\ Let $Z$ be the gray levels of pixels in $\omega$ and $\left\{\mu_{ij}\right\}_{\overline{S^n}}$ the mean values of pixels in $\overline{S^n}$.
The likelihood is given by: $$ \displaystyle P \left[Z | S^n, \mu_{S^n}, \left\{\mu_{ij}\right\}_{\overline{S^n}}, \sigma \right] $$ When separating contributions from regions $S^n$ and $\overline{S^n}$, it becomes: \begin{eqnarray} \displaystyle \prod_{(i,j)\in S^n}{P\left[z(i,j) | \mu_{S^n}, \sigma \right]} . \displaystyle\prod_{(i,j)\in \overline{S^n}}{P\left[z(i,j) | \left\{\mu_{ij}\right\}_{\overline{S^n}}, \sigma \right]} \label{LL2} \end{eqnarray} The goal is then to estimate the value of the above expression, in order to find the boundaries of $S^n$ that maximize expression \eqref{LL2}.\\ Let us consider that, on $\overline{S^n}$, the values $z(i,j)$ are the maximum likelihood estimates $\widehat{\mu_{ij}}$ of $\mu_{ij}$. The second term of expression \eqref{LL2} becomes: \begin{eqnarray} \displaystyle\prod_{(i,j)\in \overline{S^n}}{P\left[z(i,j) | \left\{\widehat{\mu_{ij}}\right\}_{\overline{S^n}}, \sigma \right]}=1 \end{eqnarray} which leads to the generalized likelihood expression: \begin{eqnarray} \displaystyle \prod_{(i,j)\in S^n}{P\left[z(i,j) | \mu_{S^n}, \sigma \right]} \label{GL} \end{eqnarray} As we know the probability density function on $S^n$, \eqref{GL} can then be developed as \begin{eqnarray} \displaystyle \prod_{(i,j)\in S^n}\frac{1}{\sqrt{2\pi\sigma^2}}e^{-\frac{\left(z(i,j)-\mu_{S^n}\right)^2}{2\sigma^2}} \label{GL2} \end{eqnarray} The log-likelihood is then given by: \begin{eqnarray} \displaystyle -\frac{n}{2}\log\left(2\pi\right) - \frac{n}{2}\log\left(\sigma^2\right) - \frac{n}{2} \label{LL1} \end{eqnarray} inside which the vector of parameters $(\mu_{S^n}, \sigma)$ is determined by maximum likelihood estimation $$ \left( \begin{array}{l} \widehat{\mu_{S^n}} = \displaystyle\frac{1}{n} \sum_{(i,j)\in S^n} z(i,j) \\ \widehat{\sigma^2} = \displaystyle\frac{1}{n} \sum_{(i,j)\in S^n} \left(z(i,j) - \widehat{\mu_{S^n}}\right)^2 \\ \end{array} \right.
$$ The selection of the best isoline is done by searching which one maximizes the expression of equation \eqref{LL1}. \begin{figure}[h] \includegraphics[width=\linewidth]{imgfs/isolines_1.jpg} \caption{Determination and lengthening of an isoline: The gray level $z$ of each pixel is seen as an elevation value. $S^n$ is the $n$ pixel length isoline for the pixel of coordinates $(i, j)$. The elongation of $S^n$ by $S^p$ ($p$ pixel length) is submitted to the GLRT condition (see eq. \eqref{GLRT}).} \label{si3} \end{figure} \subsection{Lengthenable isolines} Searching for larger isolines should lead to better filtering, as a larger number of pixels would be involved. However, processing all possible isolines starting from each pixel would be too costly in computing time, even in the case of a small GPU-processed 512x512 pixel image. Therefore, we chose to build large isolines through an iterative process including a mandatory validation stage between lengthening iterations, so as to reduce the number of pixel combinations to be examined and to keep the estimation of deviation $\sigma$ within a satisfactory range of values. Let $S^n$ be a previously selected isoline part and $S^p$ be connected to $S^n$ in such a way that $S^p$ could be seen as an addition to $S^n$ so as to define a possible valid isoline $S^{n+p}$. Figure \ref{si3} illustrates this situation with a very simple example image. In this figure, the gray level of each pixel is used as its corresponding height ($z$) in order to visualize isolines easily. Some of the orthogonal isoline projections have been drawn as dotted lines in the $(\vec{i},\vec{j})$ plane. Both labeled parts $S^p$ and $S^n$ are represented in the $(\vec{i},\vec{j})$ plane and in the associated 3D plot.
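To make the selection criterion concrete, the following sketch (in plain Python with hypothetical names; our actual implementation runs on GPU) evaluates expression \eqref{LL1} for one candidate isoline, with the maximum likelihood estimates of $\mu_{S^n}$ and $\sigma^2$ plugged in:

```python
import numpy as np

def isoline_log_likelihood(z):
    """Log-likelihood of one candidate isoline (eq. LL1), evaluated at the
    ML estimates of the mean and of the variance along the isoline."""
    z = np.asarray(z, dtype=np.float64)
    n = z.size
    mu_hat = z.mean()                     # ML estimate of mu_{S^n}
    var_hat = np.mean((z - mu_hat) ** 2)  # ML (biased) estimate of sigma^2
    return -0.5 * n * (np.log(2.0 * np.pi) + np.log(var_hat) + 1.0)
```

Among candidates with the same number of pixels $n$, the one with the smallest estimated variance gets the largest log-likelihood, which is why the criterion favors paths that actually follow an iso-gray level.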
In order to decide whether $S^{n+p}$ can be considered as an actual isoline, we compare the log-likelihoods of both hypotheses below by using the GLRT (Generalized Likelihood Ratio Test): First, assuming that $S^{n+p}$ is an isoline, the gray levels of its pixels share the same mean value $\mu_{n+p}$. According to \eqref{LL1}, its log-likelihood is \begin{eqnarray} \displaystyle -\frac{(n+p)}{2}\left(\log\left(2\pi\right)+1\right) - \frac{(n+p)}{2}\log\left(\widehat{\sigma_1}^2\right) \label{LLNP} \end{eqnarray} where $\widehat{\sigma_1}$ is the estimation of the standard deviation along $S^{n+p}$. Second, considering $S^n$ and $S^p$ as two separate isoline parts connected together, the gray levels of their pixels have two different mean values $\mu_n$ and $\mu_p$. The log-likelihood is the sum of both log-likelihoods, given by \begin{eqnarray} \displaystyle -\frac{(n+p)}{2}\left(\log\left(2\pi\right)+1\right) - \frac{n}{2}\log\left(\widehat{\sigma_2}^2\right) - \frac{p}{2}\log\left(\widehat{\sigma_2}^2\right) \label{LLNP2} \end{eqnarray} where $\widehat{\sigma_2}$ is the estimation of the standard deviation along $S^n$ and $S^p$. The difference between \eqref{LLNP} and \eqref{LLNP2} leads to the expression of $GLRT(S^{n+p},S^n, S^p, T_{max})$: \begin{eqnarray} T_{max}- (n+p).\left[\log\left(\widehat{\sigma_1}^2\right) - \log\left(\widehat{\sigma_2}^2\right) \right] \label{GLRT} \end{eqnarray} The decision to validate the lengthening from $S^n$ to $S^{n+p}$ depends on whether $GLRT(S^{n+p},S^n, S^p, T_{max})$ is higher or lower than $0$. Value $T_{max}$ is the GLRT threshold. \section{\label{lniv0}Isoline models} The most obvious model considers isolines as polylines. Each isoline can then be curved by allowing a direction change at the end of each segment; we shall call such isolines \textit{poly-isolines}. In order to keep the number of candidate isolines within a reasonable range, we chose to build them by combining segments described by simple pre-computed patterns.
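The lengthening test of eq. \eqref{GLRT} can be sketched as follows (a minimal Python model with names of our own choosing; $\widehat{\sigma_2}^2$ is computed as the variance pooled over the two parts, each keeping its own mean, as in eq. \eqref{LLNP2}):

```python
import numpy as np

def glrt_lengthening(z_n, z_p, t_max):
    """GLRT for lengthening an isoline part S^n by a candidate part S^p
    (eq. GLRT). A positive return value validates S^{n+p}."""
    z_n = np.asarray(z_n, dtype=np.float64)
    z_p = np.asarray(z_p, dtype=np.float64)
    n, p = z_n.size, z_p.size
    # hypothesis 1: S^{n+p} is a single isoline (one common mean)
    var1 = np.concatenate((z_n, z_p)).var()
    # hypothesis 2: two separate parts, each with its own mean
    var2 = (n * z_n.var() + p * z_p.var()) / (n + p)
    return t_max - (n + p) * (np.log(var1) - np.log(var2))
```

On nearly flat data the two variance estimates coincide and the test stays close to $T_{max}$ (lengthening accepted); when $S^p$ lies on a different gray level, $\widehat{\sigma_1}^2$ explodes and the test becomes negative.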
Each pattern $p_{l,d}$ describes a segment of length $l$ and direction $d$. For one given $l$ value, all $p_{l,d}$ patterns are grouped into a matrix denoted $P_l$. Figure \ref{p5q1} shows an example of such a pattern matrix for $l=5$. To fit the GPU-specific architecture, we define $D$ regularly distributed primary directions ($D=32$ in our examples). \subsection{\label{lniv2}Poly-isolines with limited deviation angle (PI-LD)} At one stage we implemented an algorithm parsing the tree of all possible polyline configurations, but the process proved far too slow regarding our goal, even on GPU, because of the amount of memory involved (and consequent memory accesses) and because of the necessary reduction stage, for which GPU efficiency is not maximum. So we focused on a variant inspired by \citet{bertaux:04} in which the selected direction of the next segment depends on the whole of the previously built and validated poly-isoline. Let us consider a poly-isoline $S^n$ under construction, starting from pixel $(i,j)$ and made of $K$ validated segments $s_k~(k \in [1;K])$ of length $l$, each of them having its own direction $d_k$. The coordinates of the ending pixel of each segment $s_k$ are denoted $(i_k, j_k)$. Both of the following sums \begin{eqnarray} C_x\left(Z(S^n)\right) &=& \displaystyle\sum_{(i,j)\in S^n}z(i,j)\label{cx}\\ \mbox{and}~~C_{x^2}\left(Z(S^n)\right)&=& \displaystyle\sum_{(i,j)\in S^n}z(i,j)^2\label{cx2} \end{eqnarray} have been obtained during the previous lengthening steps. Let us examine now how to decide whether to add a new segment to $S^n$ or to stop the lengthening process. The main idea is to apply each pattern $p_{l,d}$ to the ending pixel $(i_k,j_k)$, on the condition that its direction is contained within the limits of the maximum deviation $\Delta d_{max}$. The maximum deviation $\Delta d_{max}$ prevents poly-isolines from being circular (or backward-oriented), which would possibly generate supplementary artefacts in the output image.
Another of its benefits is to reduce the number of combinations to be evaluated. For each allowed pattern, the GLRT is performed in order to decide if the corresponding segment could likely be added to the end of the poly-isoline $S^n$. If none is validated by the GLRT, the poly-isoline $S^n$ is stopped. If at least one segment has been accepted by the GLRT, the one that leads to the maximum likelihood (ML) value of the lengthened poly-isoline $S^{n+l}$ is selected and integrated into $S^{n+l}$ as $s_{K+1}$. In order to avoid critical situations where the first selected segment would not share the primary direction of the actual poly-isoline, no selection is performed at the level of the first segment; $D$ poly-isolines are kept and submitted to the lengthening process. To ensure isotropy, each of them shares the direction of one pattern $p_{l,d}~ (d \in[0;D[)$. Eventually, the poly-isoline with the maximum likelihood value is selected among the longest ones. Figure \ref{pild} illustrates one stage of the lengthening process with the example of a two-segment poly-isoline at the beginning of the stage ($l=5$ and $\Delta d_{max}=2$).
\begin{figure}[h] \centering \subfloat[Isoline with two validated segments $s_1$ and $s_2$.]{\label{pild:debut} \includegraphics{imgfs/PI-LD_detail_3.jpg}}\qquad \subfloat[First evaluated segment, corresponding to pattern $p_{5,0}$.]{\label{pild:sub1} \includegraphics{imgfs/PI-LD_detail_sub1.jpg}}\\ \subfloat[Second evaluated segment, corresponding to pattern $p_{5,1}$.]{\label{pild:sub2} \includegraphics{imgfs/PI-LD_detail_sub2.jpg}}\qquad \subfloat[Third evaluated segment, corresponding to pattern $p_{5,2}$.]{\label{pild:sub3} \includegraphics{imgfs/PI-LD_detail_sub3.jpg}}\\ \subfloat[Fourth evaluated segment, corresponding to pattern $p_{5,3}$.]{\label{pild:sub4} \includegraphics{imgfs/PI-LD_detail_sub4.jpg}}\qquad \subfloat[Fifth evaluated segment, corresponding to pattern $p_{5,4}$.]{\label{pild:sub5} \includegraphics{imgfs/PI-LD_detail_sub5.jpg}}\\ \caption{Example of the lengthening process starting with a two-segment poly-isoline ($l=5$, $\Delta d_{max}=2$). The initial situation is shown in \ref{pild:debut}, while \ref{pild:sub1} to \ref{pild:sub5} represent the successive candidate segments. The direction index of the last validated segment is $d_2=2$ (\ref{pild:debut}). It implies that the direction indexes allowed for the third segment range from $d_2-\Delta d_{max}=0$ to $d_2+\Delta d_{max}=4$ (\ref{pild:sub1} to \ref{pild:sub5}). The lengthening of the poly-isoline is accepted if at least one segment has a positive GLRT. If there are several, the one which minimizes the standard deviation of the whole poly-isoline is selected. } \label{pild} \end{figure} \renewcommand{\labelenumi}{\alph{enumi})} \renewcommand{\theenumi}{\alph{enumi})} \subsection{\label{lniv3}Poly-isolines with precomputed directions (PI-PD)} Though much faster, the PI-LD-based filter may be considered a bit weak compared to \textit{state-of-the-art} filters like BM3D-family algorithms \citet{Dabov09bm3dimage}.
Furthermore, we saw that this way of building poly-isolines requires the alternate use of two different types of validation at each lengthening stage: the GLRT and maximum likelihood minimization. Each of them generates numerous branches during kernel execution, which does not fit GPU architecture well and leads to execution times less impressive than we had hoped. Within the PI-LD model, at each pixel $(i,j)$, as no selection is done at the first stage, $D$ poly-isolines are computed and kept as candidates even though, obviously, only one follows the actual isoline at $(i,j)$. So, if we assume we can achieve a robust determination of the direction at any given pixel of this isoline, it becomes unnecessary to perform the selection at each lengthening step. Thus, at each pixel $(i,j)$, only the first segment has to be determined in order to obtain the local direction of the isoline. This leads to an important reduction of the work complexity: the above PI-LD model needs to evaluate $D.\left(2.\Delta d_{max}+1\right)^{K-1}$ segments at each pixel position, while only $D.K$ evaluations are needed in the second case. For example, with a maximum of $K=5$ segments and a maximum deviation of $\Delta d_{max}=2$, PI-LD needs to evaluate up to 20000 segments per pixel where only 160 should be enough. On the basis of these observations, we propose a new model, that we shall call PI-PD, which completely separates the validation stages performed in the PI-LD model implementation mentioned above: A first computation stage selects the best first segment $s_1$ starting at each pixel $(i,j)$ of the input image. Its direction index $d_1(i,j)$ is then stored in a reference matrix denoted $I_\Theta$; the sums $C_x$ and $C_{x^2}$ along $s_1(i,j)$ are also computed and stored in a dedicated matrix $I_\Sigma$. It can be noticed that this selection method of $s_1$ segments is a degraded version of PI-LD constrained by $K=1$.
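The work complexity figures above can be checked with a few lines (a plain Python sketch mirroring the two formulas; the function names are ours):

```python
def pi_ld_evals(D, K, delta_max):
    """Worst-case segments evaluated per pixel by PI-LD: D first segments,
    then up to (2*delta_max + 1) candidate directions at each of the
    K-1 following lengthening steps, per the formula D.(2.dmax+1)^(K-1)."""
    return D * (2 * delta_max + 1) ** (K - 1)

def pi_pd_evals(D, K):
    """Segments evaluated per pixel by PI-PD: each of the D directions and
    K steps uses one precomputed segment, hence D.K evaluations."""
    return D * K
```

With $D=32$, $K=5$ and $\Delta d_{max}=2$, this gives 20000 versus 160, matching the figures quoted above.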
A second stage manages the now independent lengthening process. For one given state of a poly-isoline where the last added segment is $s_K$, the pattern whose direction index is given by $d=I_\Theta(i_K,j_K)$ defines the only segment to be evaluated. Both corresponding sums $C_x$ and $C_{x^2}$ are read from matrix $I_\Sigma$ and used in the GLRT evaluation. The last point is to prevent poly-isolines from turning back. Figure \ref{pipd} details this process, starting from the same initial state as in figure \ref{pild}, with the noticeable difference that no deviation limit is needed. Thus, as introduced above, work complexity is considerably reduced, as each pattern is only applied once at one given pixel $(i,j)$, and the associated values are computed only once; they are re-used every time a poly-isoline's segment ends at pixel $(i,j)$. This also fits GPU constraints better, as it avoids multiple branches during kernel execution. It remains that the building of poly-isolines is done without global likelihood optimization. Eventually, the model has been improved by adding to it the ability to thicken poly-isolines from one pixel up to three, which makes it possible to achieve higher PSNR values by increasing the number of pixels of poly-isolines in addition to the lengthening process. This option suits large images which do not contain small relevant details, as thickening may blur small significant details or objects present in the noisy image. Still, this feature makes PI-PD more versatile than our reference BM3D, which has prohibitive computation times when processing large images (over 5 minutes for a 4096x4096 pixel image) and thus would require a slicing stage prior to processing them, causing some overhead.
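The second stage can be sketched as follows; this is a sequential Python model of the lookup-driven loop, under an assumed data layout (per-pixel tuples instead of our actual GPU matrices, and an `endpoints` table giving the end-pixel offset of each pattern $p_{l,d}$), not the kernel itself:

```python
import numpy as np

def var_from_sums(cx, cx2, n):
    """Variance recovered from the running sums C_x and C_x2 over n pixels."""
    return cx2 / n - (cx / n) ** 2

def pipd_lengthen(start, I_theta, I_sigma, endpoints, l, t_max, k_max):
    """One poly-isoline built the PI-PD way (hypothetical layout):
    I_theta[i][j] is the precomputed direction index of the best first
    segment at (i, j); I_sigma[i][j] holds its sums (C_x, C_x2);
    endpoints[d] is the end-pixel offset of pattern p_{l,d}.
    Returns the denoised gray level: the mean along the poly-isoline."""
    i, j = start
    cx, cx2 = I_sigma[i][j]              # sums along the first segment
    n = l
    di, dj = endpoints[I_theta[i][j]]
    i, j = i + di, j + dj                # move to its end pixel
    for _ in range(k_max - 1):
        s_cx, s_cx2 = I_sigma[i][j]      # candidate segment: one lookup
        var1 = var_from_sums(cx + s_cx, cx2 + s_cx2, n + l)
        var2 = (n * var_from_sums(cx, cx2, n)
                + l * var_from_sums(s_cx, s_cx2, l)) / (n + l)
        if t_max - (n + l) * (np.log(var1) - np.log(var2)) <= 0:
            break                        # GLRT rejects the lengthening
        cx, cx2, n = cx + s_cx, cx2 + s_cx2, n + l
        di, dj = endpoints[I_theta[i][j]]
        i, j = i + di, j + dj            # jump to the next end pixel
    return cx / n
```

Every candidate segment costs one read of $I_\Theta$ and one read of $I_\Sigma$, which is the reason for the $D.K$ work complexity and the branch-free inner loop.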
\begin{figure}[h] \centering \subfloat[Poly-isoline with two validated segments.]{\label{pipd:pipd1} \includegraphics[width=2.3cm]{imgfs/PI-PD_detail_sub1.jpg}}\qquad \subfloat[Next direction is read from element $(i_2,j_2)$ of $I_{\Theta}$.]{\label{pipd:pipd2} \includegraphics[width=5.0cm]{imgfs/PI-PD_detail_sub2.jpg}}\\ \subfloat[Pattern $p_{l,d_3}$ is then applied at $(i_2,j_2)$ and the GLRT is performed. Both sums needed to perform the GLRT are read from element $(i_2,j_2)$ of $I_{\Sigma}$.]{\label{pipd:pipd3} \includegraphics[width=4cm]{imgfs/PI-PD_detail_sub3.jpg}}\qquad \subfloat[If accepted by the GLRT, segment $s_3$ is added to the poly-isoline.]{\label{pipd:pipd4} \includegraphics[width=2.7cm]{imgfs/PI-PD_detail_sub4.jpg}}\\ \caption{Example of the PI-PD lengthening process starting with a two-segment poly-isoline ($l=5$). The initial situation is represented in \ref{pipd:pipd1}, while \ref{pipd:pipd2} to \ref{pipd:pipd4} represent the successive processing steps. The end pixel of the last validated segment is $(i_2,j_2)$ (\ref{pipd:pipd1}). Reference matrices \(I_{\Theta}\) and \(I_{\Sigma}\) provide the values needed to select the pattern to be applied on \((i_2,j_2)\) (\ref{pipd:pipd2} and \ref{pipd:pipd3}). The GLRT is performed to validate the lengthening or not. This process goes on until one submitted segment does not comply with the GLRT. } \label{pipd} \end{figure} \subsection{\label{pipd_plan}Hybrid PI-PD} As the determination of each segment's direction only involves a few pixels, the PI-PD model may not be robust enough in regions where the surface associated with $Z$ has a low local slope with respect to the noise power $\sigma^2$. We shall call those regions Low Slope Regions (LSR). Figure \ref{img_plans} shows this lack of robustness with an example of two drawings of additive white gaussian noise applied on the same reference image (Figure \ref{img_window}). Within this image, we focused on a small 11x11 pixel window containing two LSR with one sharp edge between them.
Figures \ref{fig:dir1} and \ref{fig:dir2} show that the directions computed by PI-PD are identical from one drawing to the other near the edge (lines 5-7), while they vary in LSR (lines 1-4, 8-11). \begin{figure}[h] \centering \subfloat[Reference image]{\label{fig:ref} \includegraphics{./imgfs/zoom_edge_ref2.ps}}\\ \subfloat[Image corrupted by random drawing $n^{\circ}1$]{\label{fig:noisy1} \includegraphics{./imgfs/zoom_edge_bruit.ps}}\qquad \subfloat[Image corrupted by random drawing $n^{\circ}2$]{\label{fig:noisy2} \includegraphics{./imgfs/zoom_edge2_bruit.ps}}\\ \subfloat[Isoline directions for random drawing $n^{\circ}1$]{\label{fig:dir1} \includegraphics{./imgfs/zoom_edge1_2D_superpose3.ps}}\qquad \subfloat[Isoline directions for random drawing $n^{\circ}2$]{\label{fig:dir2} \includegraphics{./imgfs/zoom_edge2_2D_superpose2.ps}}\\ \caption{Zoom on a small square window of the airplane image. \ref{fig:ref} reproduces the zoom on the window, taken from the reference image of Figure \ref{img_window}. \ref{fig:ref}, \ref{fig:noisy1} and \ref{fig:noisy2} are 3D views where each bar represents a pixel whose gray level corresponds to the height of the bar. Figures \ref{fig:dir1} and \ref{fig:dir2} are 2D top views of the window. The chosen window shows an edge between two regions of low slope. The images \ref{fig:noisy1} and \ref{fig:noisy2} are corrupted with two different random drawings of the same additive white gaussian noise (AWGN) of power $\sigma^2$ and mean value $0$. \ref{fig:dir1} and \ref{fig:dir2} show, for each pixel of the window, the direction of the isoline found by PI-PD. In regions of low slope (the two regions at the top and the bottom), the determination of the direction is not robust. But near the edge, directions do not vary from one drawing to another.} \label{img_plans} \end{figure} Within such regions, our speed goals forbid us to compute isoline directions with the PI-LD model, which is more robust but far too slow.
Instead, we propose a fast solution which implies designing an edge detector whose principle is to re-use the segment patterns defined in section \ref{lniv0} and to combine them by pairs in order to detect any possible LSR around the center pixel. If a LSR is detected, the output gray-level value is the average value computed on the current square window; otherwise, the PI-PD output value is used. In order to further simplify computation, only the patterns that do not share any pixel are used. These patterns have a direction which is a multiple of $45^{\circ}$. Each base direction $(\Theta_i)$ and its opposite $(\Theta_i + \pi) \left[2\pi\right]$ define a line that separates the square window into two regions (top and bottom regions, denoted T and B). We assume that segments on the limit belong to the T region, which includes pixels of orientation from $\Theta_i$ to $\Theta_i+\pi$. This region comprises three more segments of directions $(\Theta_i+\frac{\pi}{4})$, $(\Theta_i+\frac{2\pi}{4})$ and $(\Theta_i+\frac{3\pi}{4})$. The other region (B) only includes three segments of directions $(\Theta_i+\frac{5\pi}{4})$, $(\Theta_i+\frac{6\pi}{4})$ and $(\Theta_i+\frac{7\pi}{4})$. Figure \ref{detect_plans} illustrates this organization for $\Theta_i=\Theta_4=45^{\circ}$. Each bar represents a pixel in the detector's window. Pixels with null height are not involved in the GLRT. Pixels represented by higher bars define the T region and those represented by shorter bars define the B region. \begin{figure}[h] \begin{center} \includegraphics{./imgfs/pattern_detecteur.ps} \caption{\label{detect_plans}Edge detector. 3D view representing an example square 11x11 pixel window ($l=5$) used in the edge detector for $\Theta_4=45^{\circ}$ around a center pixel colored in black. Each pixel is represented by a bar. Bars of height value 0 are for pixels that are not involved in the detector. The top region is defined by five pattern segments and includes the center pixel.
The bottom region only includes three pattern segments. The different height values are meant to distinguish between each of the three different sets of pixels and their roles. } \end{center} \end{figure} For each $\Theta_i$, one GLRT is performed in order to decide whether the two regions T and B defined above are likely to be seen as a single region or as two different ones, separated by an edge as shown in figure \ref{detect_plans}. The center pixel is located on the edge. Equations \eqref{LLNP}, \eqref{LLNP2} and \eqref{GLRT} lead to a similar GLRT expression: \begin{eqnarray} T2_{max}- (8.l+1).\left[\log\left(\widehat{\sigma_3}^2\right) - \log\left(\widehat{\sigma_4}^2\right) \right] \label{GLRT2} \end{eqnarray} where $\widehat{\sigma_3}$ is the estimated standard deviation under the hypothesis that the two regions form a single one, and $\widehat{\sigma_4}$ the estimated standard deviation if an edge is more likely to separate the two regions. $T2_{max}$ is the decision threshold. With equation \eqref{GLRT2}, a negative result leads to an edge detection, oriented towards direction $\Theta_i$. When the GLRT is known for each $\Theta_i$, we apply the following hybridization policy: \begin{enumerate} \item more than one negative GLRT: the PI-PD output value is used. \item only one negative GLRT: the center pixel is likely to be on a well-defined edge, and only the region it belongs to is considered. The average value of its pixel gray levels is then used.\label{halfplane} \item no negative GLRT: the window around the center pixel is likely to be a LSR. The average value on the whole square window is used (11x11 pixels in the example of Figure \ref{detect_plans}).
\end{enumerate} \begin{figure}[h] \centering \subfloat[Reference noiseless airplane image]{\label{img_window:ref} \includegraphics{./imgfs/airplane.ps}}\qquad \subfloat[Location of the example window in the reference image.]{\label{img_window:win} \includegraphics{./imgfs/zoom_windows_A.ps}}\\ \caption{Location of the example window inside the reference image. Figure \ref{img_window:ref} shows the whole reference image and \ref{img_window:win} zooms in on the part where the example 11x11 pixel window lies.} \label{img_window} \end{figure} It must be noticed that point \ref{halfplane} has been introduced in order to achieve smoother transitions between regions to which PI-PD is applied and those in which the plain average value is used. Figure \ref{exbords} shows an example of such a classification achieved by the edge detector. The detector has been applied to the noisy airplane image with a GLRT threshold value $T2_{max}=2$. Black pixels represent pixels classified as \textit{on an edge}, while white ones are those which belong to an LSR. \begin{figure}[h] \centering \subfloat[Noisy airplane image]{\label{exbords:noisy} \includegraphics{./imgfs/airplane_noisy_small.ps}}\qquad \subfloat[Pixel classification performed by the edge detector.]{\label{exbords:bords} \includegraphics{./imgfs/img_bords_T2_small.ps}}\\ \caption{Pixel classification inside the noisy image. Figure \ref{exbords:noisy} shows the noisy input image and \ref{exbords:bords} reproduces the output classification of pixels, as a black and white image, obtained with threshold value $T2_{max}=2$. Black pixels are supposed to be near an edge, while white pixels belong to Low Slope Regions.} \label{exbords} \end{figure} \section{\label{lniv} Hybrid PI-PD filter implementation: details} All the implementation details given here are relative to the proposed PI-PD models and to Nvidia$^\copyright~$ GPU devices.
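To make the decision process above concrete, the following scalar sketch (plain Python with hypothetical helper names; the actual implementation is a CUDA kernel) evaluates the GLRT statistic of equation \eqref{GLRT2} from the gray-level sums of the T and B regions, assuming $\widehat{\sigma_4}^2$ is the pooled two-region variance, and then applies the three-case hybridization policy:

```python
import math

def glrt_edge(sum_t, sum2_t, n_t, sum_b, sum2_b, n_b, t2_max):
    """GLRT statistic of Eq. (GLRT2); a negative value means an edge
    is detected between the T and B regions (illustrative sketch)."""
    n = n_t + n_b                              # n = 8l + 1 pixels in total
    s, s2 = sum_t + sum_b, sum2_t + sum2_b
    var3 = s2 / n - (s / n) ** 2               # single-region variance
    var_t = sum2_t / n_t - (sum_t / n_t) ** 2
    var_b = sum2_b / n_b - (sum_b / n_b) ** 2
    var4 = (n_t * var_t + n_b * var_b) / n     # pooled two-region variance
    return t2_max - n * (math.log(var3) - math.log(var4))

def hybrid_policy(glrt_values):
    """Three-case policy: count negative GLRTs over all base directions."""
    negatives = sum(1 for g in glrt_values if g < 0)
    if negatives > 1:
        return "PI-PD"        # use the PI-PD output value
    if negatives == 1:
        return "half-plane"   # average over the center pixel's region
    return "LSR"              # average over the whole square window
```

When the T and B samples are identical, the two variances coincide and the statistic reduces to $T2_{max}$ (no edge); strongly different region means drive it negative.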
\subsection{\label{genPaths}Segment patterns} The first kernel to be run is \texttt{kernel\_genPaths()} which generates matrix $P_l$. Its elements $(\Delta i; \Delta j)$ are the relative coordinates of the pixels which define segment patterns $p_{l,d}$. The dimensions of matrix $P_l$ are $D$ rows $\times$ $l$ columns. To fit GPU architecture as closely as possible, we chose $D=32$ patterns. Each segment $s_k$ of a poly-isoline can then be seen as a pattern $p_{l,d}$ applied on the starting pixel $(i,j)$ of this segment, denoted $p_{l,d}(i,j)$. The example in figure \ref{p5q1} shows the first quarter of $P_5$ and the corresponding eight discrete segment patterns in the first quadrant. The three remaining quarters of the matrix are easily deduced by applying successive rotations of angle $\frac{\pi}{2}$ to the above elements. \begin{figure}[h] \begin{center} \includegraphics[width=0.65\linewidth]{./imgfs/P5Q1.ps} \end{center} \tiny{ $$ P_5 = \begin{bmatrix} (0,1)&(0,2)&(0,3)&(0,4)&(0,5)\\ (0,1)&(0,2)&(-1,3)&(-1,4)&(-1,5)\\ (0,1)&(-1,2)&(-1,3)&(-2,4)&(-2,5)\\ (-1,1)&(-1,2)&(-2,3)&(-3,4)&(-3,5)\\ (-1,1)&(-2,2)&(-3,3)&(-4,4)&(-5,5)\\ (-1,1)&(-2,1)&(-3,2)&(-4,3)&(-5,3)\\ (-1,0)&(-2,1)&(-3,1)&(-4,2)&(-5,2)\\ (-1,0)&(-2,0)&(-3,1)&(-4,1)&(-5,1)\\ \ldots&\ldots&\ldots&\ldots&\ldots\\ \end{bmatrix} $$ } \caption{\label{p5q1}Top: example segment patterns $p_{5,d}$ for $d\in[0;7]$; the black pixel represents the center pixel $(i,j)$, which does not belong to the pattern. The gray ones define the actual pattern segments. 
Bottom: the first 8 lines of the corresponding matrix $P_5$, whose elements are the positions of the segment pixels with respect to the center pixel.} \end{figure} \subsection{\label{sipd}Generation of reference matrices $I_{\Sigma}$ and $I_{\Theta}$} In order to generate both matrices, a GPU kernel \texttt{kernel\_precomp()} computes, in parallel for each pixel $(i,j)$: \begin{itemize} \item the direction $\delta$ of the most likely segment $s_1 = p_{l,\delta}(i,j)$ among the $D$ possible ones. This value is stored in matrix $I_{\Theta}$ at position $(i,j)$. \item the values $C_x(s_1)$ and $C_{x^2}(s_1)$ defined in equations \eqref{cx} and \eqref{cx2}. This vector of values is stored in matrix $I_{\Sigma}$ at position $(i,j)$. \end{itemize} In order to reduce processing time, the input image is first copied into texture memory (see algorithm \ref{algoinit} for initialization and memory transfer details), thus taking advantage of the optimized 2D caching mechanism. This kernel follows the \textit{one thread per pixel} rule. Consequently, each value of $P_l$ has to be accessed by every thread of a block. This led us to load it from texture memory first, then copy it into the shared memory of each thread block, which has proved to be the fastest scheme. Algorithm \ref{algoprecomp} summarizes the computations achieved by \texttt{kernel\_precomp()}. Vector $(C_x, C_{x2})$ stores the values of $C_x(s_1)$ and $C_{x^2}(s_1)$ associated with the currently tested pattern. Vector $(C_{x-best}, C_{x2-best})$ stores the values of $C_x(s_1)$ and $C_{x^2}(s_1)$ associated with the best previously tested pattern. In the same manner, $\sigma$ and $\sigma_{best}$ are the deviation values for the current and best tested patterns. The selection of the best pattern is driven by the value of the standard deviation of the candidate isolines. Lines 2 and 3 compute both sums for the first pattern to be evaluated. Line 4 computes its standard deviation.
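As an illustration, the sum-based selection performed by \texttt{kernel\_precomp()} can be sketched in plain Python (hypothetical helper names; the real code is a CUDA kernel reading pixel values from texture memory):

```python
def pattern_stats(values):
    """Sums C_x and C_{x^2} over the l pixels of one segment pattern,
    and the variance derived from those two sums."""
    n = len(values)
    cx = sum(values)
    cx2 = sum(v * v for v in values)
    var = cx2 / n - (cx / n) ** 2   # biased variance estimator
    return cx, cx2, var

def best_pattern(segments):
    """Loop over the D candidate patterns and keep the one with the
    smallest deviation; 'segments' maps a direction index d to the
    pixel values of pattern p_{l,d}."""
    best_d, best = None, None
    for d, vals in sorted(segments.items()):
        stats = pattern_stats(vals)
        if best is None or stats[2] < best[2]:
            best_d, best = d, stats
    return best_d, best   # direction for I_Theta, sums for I_Sigma
```

The winning direction is written to $I_{\Theta}$ and the associated pair of sums to $I_{\Sigma}$, so that later kernels never have to re-scan pattern pixels.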
Then, lines 5 to 14 loop on each pattern and keep the values associated with the best pattern found. These values are eventually stored in matrices $I_{\Theta}$ and $I_{\Sigma}$ on lines 16 and 17. \begin{algorithm}[htb] \SetNlSty{textbf}{}{:} \caption{Initializations in GPU memory} \label{algoinit} $l \leftarrow$ step size\; $D \leftarrow$ number of primary directions\; $I_n \leftarrow$ noisy image\; $I_{n tex} \leftarrow I_n $\tcc*[r]{copy to texture mem.} $P_l \leftarrow$ kernel\_genPaths \tcc*[r]{pattern matrix} $P_{l tex} \leftarrow P_l $\tcc*[r]{copy to texture mem.} $T_{max} \leftarrow$ GLRT threshold (lengthening)\; $T2_{max} \leftarrow$ GLRT threshold (edge detection)\; \end{algorithm} \begin{algorithm} \SetNlSty{textbf}{}{:} \SetKwComment{Videcomment}{}{} \caption{generation of reference matrices, kernel \texttt{kernel\_precomp()}} \label{algoprecomp} \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{ $C_{x-best} \leftarrow \displaystyle\sum_{(y,x)\in p_{l,0}(i,j)} I_{n tex}(i+y,j+x)$ \; $C_{x2-best} \leftarrow \displaystyle\sum_{(y,x)\in p_{l,0}(i,j)} I_{n tex}^2(i+y,j+x)$ \; $\sigma_{best} \leftarrow$ standard deviation along $p_{l,0}(i,j)$ \; %\Videcomment{} \tcc{loop on each pattern} \ForEach{$d \in [1;D-1]$}{ $C_x \leftarrow \displaystyle\sum_{(y,x)\in p_{l,d}(i,j)} I_{n tex}(i+y,j+x)$\; $C_{x2} \leftarrow \displaystyle\sum_{(y,x)\in p_{l,d}(i,j)} I_{n tex}^2(i+y,j+x)$\; $\sigma \leftarrow$ standard deviation along $p_{l,d}(i,j)$\; \If(\tcc*[f]{keep the best}){$\sigma < \sigma_{best}$}{ $C_{x-best} \leftarrow C_x$ \; $C_{x2-best} \leftarrow C_{x2}$ \; $\sigma_{best} \leftarrow \sigma$ \; $\Theta_{best} \leftarrow d$ \; } } $I_{\Sigma}(i,j) \leftarrow \left[ C_{x-best}, C_{x2-best}\right]$ \tcc*[r]{stores} $I_{\Theta}(i,j) \leftarrow \Theta_{best}$ \tcc*[r]{in matrices} } \end{algorithm} \begin{algorithm}[ht] \SetNlSty{textbf}{}{:} \caption{PI-PD lengthening process \texttt{kernel\_PIPD()}} \label{algoPIPD} \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{ % \tcc{in parallel}
% \tcc{starting pixel $(i,j)$ and first segment without GLRT} $(C_x^1, C_{x2}^1) \leftarrow z(i,j)$ \tcc*[r]{starting pixel} $(i_1, j_1) \leftarrow (i, j)$ \tcc*[r]{first segment} $(C_x^1, C_{x2}^1) \leftarrow I_{\Sigma}(i_1,j_1)$ \tcc*[r]{read matrix} $d_1 \leftarrow I_{\Theta}(i,j)$ \tcc*[r]{read matrix} $l_1 \leftarrow l$ \tcc*[r]{isoline length} $\sigma_1 \leftarrow C_{x2}^1/l_1 - \left(C_x^1/l_1\right)^2$\; $(i_2, j_2) \leftarrow end~of~first~segment$\; $(C_{x}^2, C_{x2}^2) \leftarrow I_{\Sigma}(i_2,j_2) $ \tcc*[r]{2$^{nd}$ segment} $d_2 \leftarrow I_{\Theta}(i_2,j_2)$\; $\sigma_2 \leftarrow C_{x2}^2/l - \left(C_x^2/l\right)^2$ \; % \While{$GLRT(\sigma_1, \sigma_2, l_1, l) < T_{max}$}{ $l_1 \leftarrow l_1 + l$ \tcc*[r]{lengthening} $(C_x^1, C_{x2}^1) \leftarrow (C_x^1, C_{x2}^1)+(C_x^2, C_{x2}^2)$\; $\sigma_1 \leftarrow C_{x2}^1/l_1 - \left(C_x^1/l_1\right)^2$ \tcc*[r]{update} $(i_1,j_1) \leftarrow (i_2, j_2)$ \tcc*[r]{step forward} $d_1 \leftarrow d_2$\; $(i_2, j_2) \leftarrow end~of~next~segment$\; \tcc*[f]{next segment} $(C_{x}^2, C_{x2}^2) \leftarrow I_{\Sigma}(i_2,j_2) $\; $d_2 \leftarrow I_{\Theta}(i_2,j_2)$\; $\sigma_2 \leftarrow C_{x2}^2/l - \left(C_x^2/l\right)^2$ \; } $\widehat{I}(i, j) \leftarrow C_x^1/l_1$ \tcc*[r]{isoline value} } \end{algorithm} \subsection{\label{sipdl}PI-PD lengthening process: \texttt{kernel\_PIPD()} } This parallel kernel is run in order to obtain the image of the \emph{isolines}. It is detailed in algorithm \ref{algoPIPD} (see section \ref{lniv3} for a description of the process). Lines 2 to 11 perform the initializations for the first lengthening to evaluate. More precisely, $(i_1, j_1)$ represents the starting pixel of the current segment; $(i_2, j_2)$ is both its ending pixel and the starting pixel of the next segment; $d_1$ and $d_2$ are their directions, read from the precomputed matrix $I_{\Theta}$. $C_x^1$ and $C_{x2}^1$ are the gray-level sums along the current poly-isoline; $C_x^2$ and $C_{x2}^2$ are the gray-level sums of the candidate segment.
The current poly-isoline ends at $(i_1, j_1)$ and is made of $l_1$ pixels (the already accepted segments); its standard deviation is $\sigma_1$. The loop extending from lines 12 to 21 performs the updates needed to proceed one segment forward, as long as the GLRT condition holds. If the lengthening has been accepted, the length of the poly-isoline is updated in line 13, and the same is done with $C_x$ and $C_{x2}$, which are read from the precomputed matrix $I_{\Sigma}$ (see equations \eqref{cx} and \eqref{cx2} for their definition). Finally, using the direction value $d_2$, the kernel translates the coordinates $(i_1, j_1)$ to the end of the newly elongated poly-isoline, and $(i_2, j_2)$ to the end of the next segment to be tested. As soon as the GLRT condition becomes false, line 23 eventually produces the output value of the denoised image at pixel $(i,j)$, that is, the average gray-level value along the poly-isoline. \subsection{\label{sipdd}Hybrid PI-PD : \texttt{kernel\_edge\_detector()} } As introduced in section \ref{pipd_plan}, the aim of kernel \texttt{kernel\_edge\_detector()} is to divide pixels into two classes according to whether or not they belong to an LSR. Algorithm \ref{algoDetect} details the procedure. Lines 2 to 6 initialize the values of the direction index ($\Theta$), the number of edges detected ($edgeCount$), the gray-level sum along the pixels that define the H half-plane ($sumEdge$) and the numbers of pixels that define the two half-planes H and L ($nH$, $nL$). Then the loop starting at line 7 performs the GLRT for every considered direction index $\Theta$. Values $sumH$ and $sumL$ are vectors of two parameters $x$ and $y$, parameter $x$ being the sum of gray-level values and $y$ the sum of squared gray-level values. Value $sumH$ is computed along the pixels of half-plane H and is obtained by the loop at lines 10 to 14; value $sumL$ is computed along the pixels of half-plane L and is obtained by the loop at lines 15 to 19.
Value $I_{ntex}(i,j)$ refers to the gray-level value at pixel $(i,j)$, previously stored in texture memory. Eventually, the isoline level value is output at line 27, 30 or 33 depending on the situation (see section \ref{pipd_plan} for details about the decision process). \begin{algorithm}[ht] \SetNlSty{textbf}{}{:} \caption{edge detector and pixel classifier \texttt{kernel\_edge\_detector()}} \label{algoDetect} \ForEach(\tcc*[f]{\textbf{in parallel}}){pixel $(i,j)$}{ $\Theta \leftarrow 0$\tcc*[r]{direction index} $edgeCount \leftarrow 0$\; $sumEdge \leftarrow 0$\; $nH \leftarrow 5l+1$\; $nL \leftarrow 3l$\; \While{($\Theta < 32$) }{ $sumH \leftarrow (I_{ntex}(i,j), I_{ntex}^2(i,j))$\; $sumL \leftarrow (0, 0)$\; \For{($\alpha=\Theta$ to $\alpha=\Theta+16$ by step $4$)}{ $sPat \leftarrow \displaystyle\sum_{(y,x)\in p_{l,\alpha}(i,j)} I_{n tex}(i+y,j+x)$\; $sPat2 \leftarrow \displaystyle\sum_{(y,x)\in p_{l,\alpha}(i,j)} I^2_{n tex}(i+y,j+x)$\; $sumH \leftarrow sumH + (sPat, sPat2)$\; } \For{($\alpha=\Theta+20$ to $\alpha=\Theta+28$ by step $4$)}{ $sPat \leftarrow \displaystyle\sum_{(y,x)\in p_{l,\alpha}(i,j)} I_{n tex}(i+y,j+x)$\; $sPat2 \leftarrow \displaystyle\sum_{(y,x)\in p_{l,\alpha}(i,j)} I^2_{n tex}(i+y,j+x)$\; $sumL \leftarrow sumL + (sPat, sPat2)$\; } \If{($GLRT(sumH, nH, sumL, nL) > T2_{max}$)}{ $edgeCount \leftarrow edgeCount + 1$\; $sumEdge \leftarrow sumH.x$\; } $\Theta \leftarrow \Theta + 4$\; } \tcc{outputs isoline value} \If{($edgeCount == 0$)}{ $\widehat{I}(i,j) \leftarrow \dfrac{(sumH.x + sumL.x)}{nH+nL}$ \tcc*[r]{LSR} } \If{($edgeCount == 1$)}{ $\widehat{I}(i,j) \leftarrow \dfrac{sumEdge}{nH}$ } \If{($edgeCount > 1$)}{ $\widehat{I}(i,j) \leftarrow \widehat{I_{PIPD}}(i,j)$\tcc*[r]{PI-PD} } } \end{algorithm} \section{\label{results}Results} The proposed hybrid PI-PD model has been evaluated with the 512x512 pixel sample images used by \citet{denoiselab}, in order to make relevant comparisons with other filtering techniques.
As we aim to address image processing in very noisy conditions (as in \citet{6036776}), we focused on the noisiest versions, degraded by AWGN of standard deviation $\sigma=25$. Quality measurements of the denoised images, in comparison with the reference images, have been obtained by the evaluation of: \begin{enumerate} \item the Peak Signal to Noise Ratio (PSNR), which quantifies the mean square error between denoised and reference images, $MSE(I,\widehat{I})$. We used the following expression: $$PSNR = 10\cdot\log_{10}\left(\frac{\max(\widehat{I})^2}{MSE(I,\widehat{I})}\right)$$ PSNR values are given in dB and higher values indicate better quality. \item the Mean Structure Similarity Index (MSSIM, defined in \citet{Wang04imagequality}), which quantifies local similarities between denoised and reference images inside a sliding window. MSSIM values belong to the interval $[0; 1]$; the closer to 1, the better. \end{enumerate} PSNR is widely used to measure image quality but can be misleading when used by itself: as demonstrated in \citet{Wang04imagequality}, the processing of noisy images can yield a high PSNR value but very bad visual quality. This is avoided by using the MSSIM index along with the PSNR value: when both of them show high values, the overall quality can be considered high. Figure \ref{tablePI} provides the PSNR and MSSIM values of every image, denoised with three different filters: average 5x5, hybrid PI-PD and BM3D. The \emph{noisy} column shows the values for each image before denoising. BM3D (\citet{Dabov09bm3dimage}) is taken as a reference in terms of denoising quality, while the average filter is taken as a reference in terms of processing time. The window size of 5x5 pixels has been chosen to achieve PSNR values similar to those obtained by PI-PD.
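For reference, the PSNR computation described above can be sketched as follows (plain Python on flattened pixel lists; taking the peak value as 255 for 8-bit images is an assumption of this sketch):

```python
import math

def mse(ref, den):
    """Mean square error between reference and denoised images."""
    return sum((a - b) ** 2 for a, b in zip(ref, den)) / len(ref)

def psnr(ref, den, peak=255.0):
    """PSNR in dB: 10 * log10(peak^2 / MSE); higher is better."""
    return 10.0 * math.log10(peak ** 2 / mse(ref, den))
```

For example, a uniform error of 5 gray levels on an 8-bit image gives $MSE=25$ and a PSNR of about 34.15~dB.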
The BM3D code is run on a quad-core Xeon E31245 at 3.3GHz with 8GByte RAM under linux kernel 3.2 (64~bits), while the PI-PD and average filter codes are run on a Nvidia C2070 GPU hosted by a PC running linux kernel 2.6.18 (64~bits). The average filter used is an efficient parallel GPU implementation that we developed. It is a generic and versatile separable convolution kernel that outputs more than 700~MPixels per second in the 5x5 averaging configuration. Hybrid PI-PD measurements were performed with $n=25$, $l=5$, $T_{max}=1$ and $T2_{max}=2$. BM3D measurements were performed with the freely available BM3D software proposed in \citet{Dabov09bm3dimage}. The hybrid PI-PD model proves much faster than BM3D and better than the average 5x5 filter. Processing the thirteen images of the database reveals that hybrid PI-PD brings an average improvement of 1.5dB (PSNR) and 7.2\% (MSSIM) over the average filter, at the cost of 35 times its processing time. Compared with hybrid PI-PD, BM3D achieves an average improvement of 2.4dB and 4.6\% at the cost of 350 times as much processing time. In practice, the 5x5 average filter takes around \textbf{0.35~ms} to process an image, while hybrid PI-PD needs around \textbf{11~ms} and BM3D around \textbf{4.3~s}. It must be noticed that experimental optimization shows that the parameter vector $T_{max}=1$ and $T2_{max}=2$ is optimal for 11 of the 13 images of the database. Better results are obtained with a slightly different value of $T2_{max}$ for \textit{peppers} and \textit{zelda}, whose denoised images can reach a MSSIM index of 0.90. Most of the computational time of hybrid PI-PD is spent in the edge detector, which clearly does not fit the GPU architecture well enough to achieve good performance. For information, the simple PI-PD model runs in less than 4~ms under the same conditions. Figure \ref{comparimg} shows denoised images produced by the hybrid PI-PD model compared with the output of the BM3D and average 5x5 filters.
The figure illustrates the merits and drawbacks of each model: edges are well preserved by hybrid PI-PD, but a \textit{staircase} effect is visible, a well-known artefact inherent to this type of neighborhood filter. Our recent GPU implementation of the regression method proposed in \citet{BuadesCM06} brings a mean improvement of 1dB at the cost of 0.4~ms. \begin{figure} %\begin{table}[h] \footnotesize \begin{center} \begin{tabular}{|c|r|r|r|r|}\hline \bf Image&\bf Noisy &\bf average &\bf hybrid&\bf BM3D \\ & &\bf 5x5 &\bf PI-PD & \\\hline\hline airplane & 19.49dB & 26.39dB & 28.46dB& 30.88dB \\ & 0.58 & 0.84 & 0.88 & 0.93 \\\hline barbara & 20.04dB & 22.76dB & 24.26dB& 30.60dB \\ & 0.70 & 0.76 & 0.83 & 0.94 \\\hline boat & 20.33dB & 25.58dB & 27.54dB& 30.02dB \\ & 0.66 & 0.81 & 0.87 & 0.91 \\\hline couple & 20.28dB & 25.25dB & 27.33dB& 29.77dB \\ & 0.69 & 0.79 & 0.87 & 0.91 \\\hline elaine & 19.85dB & 28.71dB & 28.94dB& 30.60dB \\ & 0.59 & 0.86 & 0.87 & 0.91 \\\hline fingerprint &20.34dB & 23.33dB & 26.07dB& 27.93dB \\ & 0.93 & 0.87 & 0.95 & 0.96 \\\hline goldhill & 19.59dB& 26.47dB & 27.43dB& 29.22dB \\ & 0.67 & 0.82 & 0.87 & 0.88 \\\hline lena & 19.92dB& 27.99dB & 29.14dB& 31.80dB \\ & 0.60 & 0.84 & 0.88 & 0.93 \\\hline man & 20.38dB& 24.74dB & 26.74dB& 28.14dB \\ & 0.71 & 0.80 & 0.86 & 0.87 \\\hline mandrill & 19.34dB& 20.34dB & 22.38dB& 24.75dB \\ & 0.77 & 0.69 & 0.83 & 0.88 \\\hline peppers & 19.53dB& 27.30dB & 28.68dB& 30.87dB \\ & 0.61 & 0.86 & 0.87 & 0.92 \\\hline stream & 20.35dB& 23.23dB & 25.35dB& 26.34dB \\ & 0.80 & 0.78 & 0.87 & 0.88 \\\hline zelda & 17.71dB& 23.13dB & 27.71dB& 30.49dB \\ & 0.58 & 0.87 & 0.88 & 0.93 \\\hline \end{tabular} \end{center} \caption{Comparison between hybrid PI-PD, average and BM3D filters. PI-PD parameter values: $n=25$, $l=5$, $T_{max}=1$ and $T2_{max}=2$. The \emph{noisy} column corresponds to the noisy input images, before denoising.
\newline Timings: average filter in around 0.35~ms, hybrid PI-PD in around 11.0~ms and BM3D in around 4.3~s.} \label{tablePI} %\end{table} \end{figure} \begin{figure}[h] \centering \subfloat[Noisy image $\sigma=25$]{\label{fig:noisy} \includegraphics{./imgfs/airplane_25_noisy_zoom.ps}}\qquad \subfloat[Average 5x5 filter, in $0.35~ms$]{\label{fig:pipd} \includegraphics{./imgfs/airplane_25_mean5_zoom.ps}}\\ \subfloat[PI-PD hybrid filter, $n=25$, $l=5$, $T_{max}=1$, $T2_{max}=2$, in $11~ms$ ]{\label{fig:hpipd} \includegraphics{./imgfs/airplane_zoom_hybrid_6_r50_T10_P2.ps}}\qquad \subfloat[BM3D filter, in $4.3s$]{\label{fig:bm3d} \includegraphics{./imgfs/airplane_bm3d_zoom.ps}} \caption{Comparison of 512x512 images denoised from the noisy airplane image (\ref{fig:noisy}) with the average 5x5 filter (\ref{fig:pipd}), the hybrid PI-PD filter (\ref{fig:hpipd}) and the BM3D filter (\ref{fig:bm3d}). Only zoomed parts of the images are shown to ensure better viewing.} \label{comparimg} \end{figure} \section{\label{conclusion}Conclusion, future work} From the start, our approach, unlike quite a few others, has been to base this study on the design and characteristics of the targeted hardware (Nvidia graphics cards). To obtain high execution speeds, we chose, for example, a method that remains local (concentrating on the immediate neighborhood of the center pixel) but still provides very significant benefits, using our technique of progressive lengthening. Nevertheless, our method has proved slightly sub-optimal and to lack robustness in \textit{flat} regions (see above, Low Slope Regions), even if the actual visual effect may be considered quite satisfactory. As a first step to address the above drawbacks, we have devised a hybrid method that detects LSR regions and applies distinct processing to them (see above). Processing speeds remain fast, and much higher than those of the BM3D implementation taken as quality reference.
This is very promising and opens the perspective of real-time high-definition image sequence processing at 25~fps, provided we improve the edge detector, which currently limits the HD frame rate to 16~fps (High Definition: 1920x1080 pixels). To further improve the quality of the output images, we also developed an efficient parallel implementation of the staircase effect reduction technique presented in \citet{BuadesCM06}. With this method, searching for the best improvement factors leads to different parameter values for each image processed, which prompts us to study ways of selecting such parameters automatically. Our study so far has been based on additive noise; we are currently working on transposing our criteria to various multiplicative noise types. We have also extended the process to color images, with very interesting visual results that remain to be confirmed by the experimental measurements currently in progress. % references section \bibliographystyle{spbasic} \bibliography{bibliosv} % that's all folks \end{document}