\chapter{Implementing an efficient convolution operation on GPU}
-%\newcommand{\kl}{\includegraphics[scale=0.6]{Chapters/chapter4/img/kernLeft.png}~}
-%\newcommand{\kr}{\includegraphics[scale=0.6]{Chapters/chapter4/img/kernRight.png}}
-
-%% \lstset{
-%% language=C,
-%% columns=fixed,
-%% basicstyle=\footnotesize\ttfamily,
-%% numbers=left,
-%% firstnumber=1,
-%% numberstyle=\tiny,
-%% stepnumber=5,
-%% numbersep=5pt,
-%% tabsize=3,
-%% extendedchars=true,
-%% breaklines=true,
-%% keywordstyle=\textbf,
-%% frame=single,
-%% % keywordstyle=[1]\textbf,
-%% %identifierstyle=\textbf,
-%% commentstyle=\color{white}\textbf,
-%% stringstyle=\color{white}\ttfamily,
-%% % xleftmargin=17pt,
-%% % framexleftmargin=17pt,
-%% % framexrightmargin=5pt,
-%% % framexbottommargin=4pt,
-%% backgroundcolor=\color{lightgray},
-%% }
before adressing the matter of separable convolutions. We shall refer
to $I$ as an H x L pixel gray-level image, and to $I(x,y)$ as the gray-level
value of each pixel of coordinates $(x,y)$.
-%dire qqs mots sur le filtrage IIR/FIR ?
+
\section{Definition}
GPU functions referred to as kernels, we shall use\emph{ convolution
mask} instead of \emph{convolution kernel}) is defined by:
\begin{equation}
-I'(x, y) = \left(I * h\right) = \sum_{i < H} \sum_{j < L}I(x-j, y-j).h(j,i)
+I'(x, y) = \left(I * h\right) = \sum_{(i < H)} \sum_{(j < L)}I(x-j, y-j).h(j,i)
\label{convoDef}
\end{equation}
-While processing an image, function \emph{h} is bounded by a square
-window of size \emph{k = 2r + 1}, i.e an uneven number, to ensure
+While processing an image, function \emph{h} is often bounded by a square
+window of size \emph{k = 2r + 1}, \textit{i.e} an uneven number, to ensure
there is a center. We shall also point out that, as stated earlier,
the square shape is no limiting factor to the process, as any shape
can be inscribed into a square. In the case of a more complex shape,
\caption{generic convolution}
\label{algo_genconv}
\ForEach{pixel at position $(x, y)$}{
- Read all gray-level values $I(x, y)$ in the neighborhood\\\;
- Compute the weighted sum \( I_\Omega = \sum_{(j,i) \in \Omega}I(x-j, y-j).h(j,i) \)\\\;
- Normalize $I'(x, y)$ value\\\;
+ Read all gray-level values $I(x, y)$ in the neighborhood\;
+ Compute the weighted sum \( I_\Omega = \sum_{(j,i) \in \Omega}I(x-j, y-j).h(j,i) \)\;
+ Normalize $I'(x, y)$ value\;
Outputs the new gray-level value
}
\end{algorithm}
previous chapter, as a reminder : texture memory used with incoming
data, pinned memory with output data, optimized use of registers
while processing data and multiple output per thread\index{Multiple output per thread}.
-One signifcant difference lies in the fact
+One significant difference lies in the fact
that the median filter uses only one parameter, the size of the window mask,
which can be hard-coded, while a convolution mask requires referring to several; hard-coding
its elements would lead to severe lack of flexibility (one function
point in our approach.
Let us assume that we are planning to implement the convolution defined by the following $3\times 3$ mask (low-pass filter or averaging filter):
-$$h=\begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}$$
+$$h=\frac{1}{9}\begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}$$
The kernel code presented in Listing \ref{lst:convoGene3Reg8} implements the convolution operation and applies all above optimizations except, for clarity reasons, multiple output per thread.
In the particular case of a generic convolution, it is important to note how mask coefficients are applied to image pixels in order to fit the definition of equation \ref{convoDef}: if the coordinates of the center pixel had been set to (0,0), then the pixel of coordinates $(i,j)$ would have been multiplied by the element $(-i,-j)$ of the mask, which, transposed in our kernel code, leads to multiply the $p^{th}$ pixel of the window by the $(n-p)^{th}$ element of the convolution mask.
\end{table}
Table \ref{tab:convoNonSepReg3} shows timings and global throughput values achieved by those convolution masks on an Nvidia GT200 Tesla architecture (GTX480 card) with $16x8$ thread blocks. This measurement has been done in order to make a relevant comparison with a reference given by Nvidia in \cite{convolutionsoup} where they state that their fastest kernel achieves a $5\times5$ convolution of an 8-bit $2048\times 2048$ pixelimage in $1.4~ms$, which lead to a throughput value of 945~Mpix/s.
-Our current value of 802~Mpix/s, though not unsatisfactory, remains lower to the one reached by the manufacturer'own coding.
+Our current value of 802~Mpix/s, though not unsatisfactory, remains lower to the one reached by the manufacturer's own coding.
Tested in the same conditions, the newer Fermi architecture of
Nvidia's GPUs proved slower (3.3 ms, see Table \ref{tab:convoNonSepReg1}) due to the lower maximum
register count allowed (63, against 128 for Tesla GT200).
or more pixels per thread while keeping safely within the 63-per-thread
rule.
-However, when doing so, \textit{eg} processing what we shall call a \textit{packet} of pixels, window mask overlapping has to be taken into account
+However, when doing so, \textit{e.g} processing what we shall call a \textit{packet} of pixels, window mask overlapping has to be taken into account
to avoid multiple texture fetches of each pixel's gray-level value, while benefiting from the 2-D cache.
In that case, both mask size and pixel packet shape determine the number of texture fetches to be performed for each pixel value.
Figure \ref{fig:convoOverlap1} illustrates two different situations: on top, a mask of radius 1 ($3\times 3$) applied to a packet of 8 pixels in row; at bottom, a mask of radius 2 ($5\times 5$).
-1&2&-1
\end{bmatrix}$$
Such a mask allows to replace a generic 2-D convolution operation by two consecutive stages of a 1-D convolution operation: a vertical of mask $h_v$ and a horizontal of mask $h_h$.
-This saves a lot of arithmetic operations, as a generic $n\times n$ convolution applied on a $H\times L$ image basically represents $H.L.n^2$ multiplications and as many additions, while two consecutive $n\times 1$ convolutions only represents $2.H.L.n$ of each, \textit{eg} 60\% operations are saved per pixel of the image for a $5\times 5$ mask.\\
+This saves a lot of arithmetic operations, as a generic $n\times n$ convolution applied on a $H\times L$ image basically represents $H.L.n^2$ multiplications and as many additions, while two consecutive $n\times 1$ convolutions only represents $2.H.L.n$ of each, \textit{e.g} 60\% operations are saved per pixel of the image for a $5\times 5$ mask.\\
However, beside reducing the operation count, performing a separable convolution also means writing an intermediate image into global memory.
CPU implementations of separable convolutions often use a single function to perform both 1-D convolution stages. To do so, this function reads the input image and actually ouputs the transposed filtered image.
Applying that principle to GPUs is not efficient, as outputting the transposed image means non-coalescent writes into global memory, generating severe performance loss. Hence the idea of developing two different kernels, one for each of both vertical and horizontal convolutions.