BookGPU/Chapters/chapter4/ch4.tex

   1 \chapterauthor{Gilles Perrot}{Femto-ST Institute, University of Franche-Comte, France}
   2
   3 \chapter{Implementing an efficient convolution operation on GPU}
   4
   5
   6 %\newcommand{\kl}{\includegraphics[scale=0.6]{Chapters/chapter4/img/kernLeft.png}~}
   7 %\newcommand{\kr}{\includegraphics[scale=0.6]{Chapters/chapter4/img/kernRight.png}}
   8
   9 %% \lstset{
  10 %%   language=C,
  11 %%   columns=fixed,
  12 %%   basicstyle=\footnotesize\ttfamily,
  13 %%   numbers=left,
  14 %%   firstnumber=1,
  15 %%   numberstyle=\tiny,
  16 %%   stepnumber=5,
  17 %%   numbersep=5pt,
  18 %%   tabsize=3,
  19 %%   extendedchars=true,
  20 %%   breaklines=true,
  21 %%   keywordstyle=\textbf,
  22 %%   frame=single,
  23 %%   % keywordstyle=[1]\textbf,
  24 %%   %identifierstyle=\textbf,
  25 %%   commentstyle=\color{white}\textbf,
  26 %%   stringstyle=\color{white}\ttfamily,
  27 %%   % xleftmargin=17pt,
  28 %%   % framexleftmargin=17pt,
  29 %%   % framexrightmargin=5pt,
  30 %%   % framexbottommargin=4pt,
  31 %%   backgroundcolor=\color{lightgray},
  32 %%   }
  33
  34
  35
  36 \section{Overview}
  37 In this chapter, after dealing with GPU median filter implementations,
  38 we propose to explore how convolutions\index{Convolution}  can be implemented on modern
  39 GPUs. Widely used in digital image processing filters, the \emph{convolution
  40 operation} basically consists in taking the sum of products of elements
  41 from two 2-D functions, letting one of the two functions move over
  42 every element of the other, producing a third function that is typically
  43 viewed as a modified version of one of the original functions. To
  44 begin with, we shall examine non-separable or generic convolutions,
  45 before adressing the matter of separable convolutions. We shall refer
  46 to $I$ as an H x L pixel gray-level image, and to $I(x,y)$ as the gray-level
  47 value of each pixel of coordinates $(x,y)$.
  48 %dire qqs mots sur le filtrage IIR/FIR ?
  49
  50
  51 \section{Definition}
  52 Within a digital image $I$, the convolution operation is performed between
  53 image $I$ and convolution mask \emph{h} (To avoid confusion with other
  54 GPU functions referred to as kernels, we shall use\emph{ convolution
  55 mask} instead of \emph{convolution kernel}) is defined by:
  56 \begin{equation}
  57 I'(x, y) = \left(I * h\right) = \sum_{i < H} \sum_{j < L}I(x-j, y-j).h(j,i)
  58 \label{convoDef}
  59 \end{equation}
  60 While processing an image, function \emph{h} is bounded by a square
  61 window of size \emph{k = 2r + 1}, i.e an uneven number, to ensure
  62 there is a center. We shall also point out that, as stated earlier,
  63 the square shape is no limiting factor to the process, as any shape
  64 can be inscribed into a square. In the case of a more complex shape,
  65 the remaining space is filled by null values (padding).
  66
  67
  68 \section{Implementation}
  69 The basic principle of computing a convolution between one $I$ picture
  70 and one \emph{h} convolution mask defined on domain $\Omega$ is given
  71 by algorithm \ref{algo_genconv} and illustrated by Figure \ref{fig:convoPrinciple}, which mainly shows how gray-level values of the center pixel's neighborhood are combined with the convolution mask values to compute the output value.
  72 For more readability, only part of the connecting lines are shown.
  73  \begin{figure}
  74 \centering
  75    \includegraphics[width=11cm]{Chapters/chapter4/img/convo1.png}
  76    \caption[Principle of a generic convolution implementation.]{Principle of a generic convolution implementation. The center pixel is represented with a black background and the pixels of its neighborhood are denoted $I_{p,q}$ where $(p,q)$ is the relative position of the neighbor pixel. Elements $h_{t,u}$ are the values of the convolution mask.}
  77    \label{fig:convoPrinciple}
  78 \end{figure}
  79 \begin{algorithm}
  80 \caption{generic convolution}
  81 \label{algo_genconv}
  82   \ForEach{pixel at position $(x, y)$}{
  83     Read all gray-level values $I(x, y)$ in the neighborhood\\\;
  84     Compute the weighted sum \( I_\Omega = \sum_{(j,i) \in \Omega}I(x-j, y-j).h(j,i) \)\\\;
  85     Normalize $I'(x, y)$ value\\\;
  86     Outputs the new gray-level value
  87   }
  88 \end{algorithm}
  89
  90 The gray-level value of each pixel of output image $I'$ is the weighted
  91 sum of pixels included in the neighborhood defined by $\Omega$ around
  92 the corresponding pixel in the input image. It has to be noted that,
  93 in case the sum $S$ of all coefficients in the mask is not 1, the original
  94 brightness of the image will be altered and a normalization stage
  95 has to take place, as, for example, in the case of an 8-bit coded
  96 image:
  97 \begin{enumerate}
  98 \item if $S \ge 0$ then $I' = I_\Omega / S$
  99 \item if $S = 0$ then $I' = I_\Omega + 128$
 100 \item if $S < 0$ then $I' = I_\Omega + 255$
 101 \end{enumerate}
 102 In case one, normalizing means performing a division operation for
 103 each pixel, which will be quite time-costly when performed on a GPU. A simple work-around is to normalize mask values before using them in GPU kernels.
 104
 105
 106 \subsection{First test implementation}
 107 This first implementation consists of a rather naive application to
 108 convolutions of the tuning recipes applied to median filters in the
 109 previous chapter, as a reminder : texture memory used with incoming
 110 data, pinned memory with output data, optimized use of registers
 111 while processing data and multiple output per thread\index{Multiple output per thread}.
 112 One signifcant difference lies in the fact
 113 that the median filter uses only one parameter, the size of the window mask,
 114 which can be hard-coded, while a convolution mask requires referring to several; hard-coding
 115 its elements would lead to severe lack of flexibility (one function
 116 per filter, no external settings) so we will just use it as a starting
 117 point in our approach.
 118
 119 Let us assume that we are planning to implement the convolution defined by the following $3\times 3$ mask (low-pass filter or averaging filter):
 120 $$h=\begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}$$
 121 The kernel code presented in Listing \ref{lst:convoGene3Reg8} implements the convolution operation and applies all above optimizations except, for clarity reasons, multiple output per thread.
 122 In the particular case of a generic convolution, it is important to note how mask coefficients are applied to image pixels in order to fit the definition of equation \ref{convoDef}: if the coordinates of the center pixel had been set to (0,0), then the pixel of coordinates $(i,j)$ would have been multiplied by the element $(-i,-j)$ of the mask, which, transposed in our kernel code, leads to multiply the $p^{th}$ pixel of the window by the $(n-p)^{th}$ element of the convolution mask.
 123
 124 \lstinputlisting[label={lst:convoGene3Reg8},caption=Generic CUDA kernel achieving a convolution operation with hard-coded mask values]{Chapters/chapter4/code/convoGene3Reg8.cu}
 125
 126 Table \ref{tab:convoNonSepReg1} shows kernel timings and throughput values for such a low-pass filter extended to $5\times 5$ and $7\times 7$ masks applied on 8-bit coded gray-level
 127 images of sizes $512\times 512$, $1024\times 1024$, $2048\times 2048$, $4096\times 4096$ and run on a C2070 card with $32\times 8$ thread blocks.
 128 As a reminder, Table \ref{tab:memcpy1} details the data transfer costs that helped computing throughput values.
 129
 130
 131 \begin{table}[h]
 132 \centering
 133 {\normalsize
 134 \begin{tabular}{|c||r|r|r|r|r|r|}
 135 \hline
 136 \textbf{Mask size$\rightarrow$}&\multicolumn{2}{|c|}{\textbf{3x3}}&\multicolumn{2}{|c|}{\textbf{5x5}}&\multicolumn{2}{|c|}{\textbf{7x7}}\\
 137 \textbf{Image size$\downarrow$}&time (ms)&TP&time (ms)&TP&time (ms)&TP\\\hline\hline
 138 $\mathbf{512\times 512}$  &0.077&1165 &0.209&559  &0.407   &472 \\\hline
 139 $\mathbf{1024\times 1024}$&0.297&1432 &0.820&836  &1.603   &515 \\\hline
 140 $\mathbf{2048\times 2048}$&1.178&1549 &\bf 3.265&\bf 875 &6.398&529 \\\hline
 141 $\mathbf{4096\times 4096}$&4.700&1585 &13.05&533     &25.56&533 \\\hline
 142 \end{tabular}
 143 }
 144 \caption[Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a C2070 card.]{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a C2070 card (fermi architecture). Data transfer duration are those of Table \ref{tab:memcpy1}.}
 145 \label{tab:convoNonSepReg1}
 146 \end{table}
 147
 148 \begin{table}[h]
 149 \centering
 150 {\normalsize
 151 \begin{tabular}{|c||r|r|r|r|r|r|}
 152 \hline
 153 \textbf{Mask size$\rightarrow$}&\multicolumn{2}{|c|}{\textbf{3x3}}&\multicolumn{2}{|c|}{\textbf{5x5}}&\multicolumn{2}{|c|}{\textbf{7x7}}\\
 154 \textbf{Image size$\downarrow$}&time (ms)&TP&time (ms)&TP&time(ms)&TP\\\hline\hline
 155 $\mathbf{512\times 512}$  &0.060&1186 &0.148&848 &0.280&594 \\\hline
 156 $\mathbf{1024\times 1024}$&0.209&1407 &0.556&960 &1.080&649 \\\hline
 157 $\mathbf{2048\times 2048}$&0.801&1092 &\bf 2.189&\bf 802 &4.278&573 \\\hline
 158 $\mathbf{4096\times 4096}$&3.171&1075 &8.720&793 &17.076&569 \\\hline
 159 \end{tabular}
 160 }
 161 \caption[Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a GTX280.]{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer duration are those of Table \ref{tab:memcpy1}.}
 162 \label{tab:convoNonSepReg3}
 163 \end{table}
 164
 165 \begin{table}[h]
 166 \centering
 167 {\normalsize
 168 \begin{tabular}{|c||r|r|}
 169 \hline
 170 \shortstack{\textbf{GPU card$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{C2070}&\textbf{GTX280}\\\hline\hline
 171 $\mathbf{512\times 512}$  &0.148 &0.161 \\\hline
 172 $\mathbf{1024\times 1024}$&0.435 &0.536 \\\hline
 173 $\mathbf{2048\times 2048}$&1.530 &3.039 \\\hline
 174 $\mathbf{4096\times 4096}$&5.882 &12.431 \\\hline
 175 \end{tabular}
 176 }
 177 \caption{Time cost of data transfers between CPU and GPU memories, on C2070 and GTX280 cards (in milliseconds).}
 178 \label{tab:memcpy1}
 179 \end{table}
 180
 181 Table \ref{tab:convoNonSepReg3} shows timings and global throughput values achieved by those convolution masks on an Nvidia GT200 Tesla architecture (GTX480 card) with $16x8$ thread blocks. This measurement has been done in order to make a relevant comparison with a reference given by Nvidia in \cite{convolutionsoup} where they state that their fastest kernel achieves a $5\times5$ convolution of an 8-bit  $2048\times 2048$ pixelimage in $1.4~ms$, which lead to a throughput value of 945~Mpix/s.
 182 Our current value of 802~Mpix/s, though not unsatisfactory, remains lower to the one reached by the manufacturer'own coding.
 183 Tested in the same conditions, the newer Fermi architecture of
 184 Nvidia's GPUs proved slower (3.3 ms, see Table \ref{tab:convoNonSepReg1}) due to the lower maximum
 185 register count allowed (63, against 128 for Tesla GT200).
 186
 187 It is interesting to note that, as long as each thread processes one single pixel, kernel execution time is ruled in proportion
 188 with the number of pixels in the image multiplied by that of the mask.
 189 The slope in this first implementaion is $3.14.10^{-8}~ms/pix$ on C2070.
 190
 191 \subsection{Using parameterizable masks}
 192
 193 To further improve the above implementation, it becomes necessary
 194 to free ourselves from the hard-coding constraint. To achieve this,
 195 as was the case with input image storing, several memory options are
 196 available, but, since the amount of data involved in processing a
 197 mask is quite small and constant, we considered it relevant to copy data
 198 into \emph{symbol memory}. Listing \ref{lst:symbolmem} details the process, involving
 199 the Cuda function \emph{CudaMemCopyToSymbol()}.
 200
 201 \lstinputlisting[label={lst:symbolmem},caption=code snippet showing how to setup a mask in GPU symbol memory]{Chapters/chapter4/code/maskInSymbol.cu}
 202
 203 In parallel, giving up the register-only constraint allows a more
 204 conventional coding practice (loops). Listing \ref{lst:convoGene8r} presents
 205 a generic convolution kernel, whose code immediately
 206 appears both simple and concise. Its global time
 207 performance, however, is comparatively lower than the register-only
 208 process, due to the use of constant memory and of the \emph{r} parameter
 209 (radius of the mask). The average slope amounts to $3.81~ms/pix$ on C2070,
 210 which means a time-cost increase of around $20~\%$.
 211
 212 \lstinputlisting[label={lst:convoGene8r},caption=Generic CUDA kernel achieving a convolution operation with the mask in symbol memory and its radius passed as a parameter]{Chapters/chapter4/code/convoGene8r.cu}
 213
 214 \subsection{Increasing the number of pixels processed by each thread}
 215 Much in the same way as we did with the Median Filter, we shall now
 216 attempt to reduce the average latency due to writes into global memory
 217 by having each thread process more than one output value. As the basic
 218 structure of the above GPU kernel uses only 14 registers per thread, regardless
 219 of the size of the convolution mask, one can envisage processing 2
 220 or more pixels per thread while keeping safely within the 63-per-thread
 221 rule.
 222
 223 However, when doing so, \textit{eg} processing what we shall call a \textit{packet} of pixels, window mask overlapping has to be taken into account
 224 to avoid multiple texture fetches of each pixel's gray-level value, while benefiting from the 2-D cache.
 225 In that case, both mask size and pixel packet shape determine the number of texture fetches to be performed for each pixel value.
 226 Figure \ref{fig:convoOverlap1} illustrates two different situations: on top, a mask of radius 1 ($3\times 3$) applied to a packet of 8 pixels in row; at bottom, a mask of radius 2 ($5\times 5$).
 227 The dark gray pixels are the center pixels (pixels of the packet), while light gray pixels belong to the halo around the packet. The number in each pixel box corresponds to the convolution count in which it is involved.
 228 There would be little interest in using different \textit{packet} shapes, as the final global memory writes would not be coalescent; generating multiple latencies.
 229  \begin{figure}
 230 \centering
 231    \subfigure[$3\times 3$ mask: there are 18 center pixels (out of 30) involved in 3 computations.]{ \includegraphics[width=5.8cm]{Chapters/chapter4/img/convoOverlap1.png}}\\
 232    \subfigure[$5\times 5$ mask: only 20 center pixels (out of 60), involved in 5 computations.]{ \includegraphics[width=7cm]{Chapters/chapter4/img/convoOverlap2.png}}
 233    \caption{Mask window overlapping when processing 8 pixels per thread. Top: $3\times 3$ mask. Bottom: $5\times 5$ mask.}
 234    \label{fig:convoOverlap1}
 235 \end{figure}
 236
 237 Altough we actually wrote GPU kernels able to process 2, 4, 8 and 16 pixels per thread, only the one that processes 8 pixels per thread is presented below, as it proved to be the fastest one. Listing \ref{lst:convoGene8x8pL3} reproduce the source code of the kernel for $3\times 3$ masks.
 238 The bottom line is that each thread is associated with one base pixel of coordinates $(x,y)$ which is the first of the packet to be processed, the last one being $(x+7,y)$.
 239 \lstinputlisting[label={lst:convoGene8x8pL3},caption=CUDA kernel achieving a $3\times 3$ convolution operation with the mask in symbol memory and direct data fetches in texture memory]{Chapters/chapter4/code/convoGene8x8pL3.cu}
 240
 241 In this particular case of a $3\times 3$ mask, each pixel value is used in 3 different convolution sums, except pixels located near both ends of the packet, whose values are used in fewer sums.
 242 The general rule, when performing a $n\times n$ convolution (radius $k$) by 8-pixel packets is that each of the $(8-2k).(2k+1)$ \textit{center} pixels of the halo is used in $k$ sums, while the $4k.(2k+1)$ remaining pixels, located around the ends of the packet are used in fewer sums, from $k-1$ to $1$ ($2.(2k+1)$ pixels each).
 243 \begin{table}[h]
 244 \centering
 245 {\normalsize
 246 \begin{tabular}{|c||r|r|r|r|r|r|}
 247 \hline
 248 \textbf{Mask size$\rightarrow$}&\multicolumn{2}{|c|}{\textbf{3x3}}&\multicolumn{2}{|c|}{\textbf{5x5}}&\multicolumn{2}{|c|}{\textbf{7x7}}\\
 249 \textbf{Image size$\downarrow$}&time (ms)&TP&time (ms)&TP&time (ms)&TP\\\hline\hline
 250 $\mathbf{512\times 512}$  &0.036&1425 &0.069&1208 &0.110&1016 \\\hline
 251 $\mathbf{1024\times 1024}$&0.128&1862 &0.253&1524 &0.413&1237 \\\hline
 252 $\mathbf{2048\times 2048}$&0.495&2071 &\bf 0.987&1666 &1.615&1334 \\\hline
 253 $\mathbf{4096\times 4096}$&1.964&2138 &3.926&1711 &6.416&1364 \\\hline
 254 \end{tabular}
 255 }
 256 \caption[Timings ($time$) and throughput values ($TP$ in Mpix/s) of our generic fixed mask size convolution kernel run on a C2070 card.]{Timings ($time$) and throughput values ($TP$ in Mpix/s) of our generic fixed mask size convolution kernel run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
 257 \label{tab:convoGene8x8p}
 258 \end{table}
 259
 260 Timing results and throughput values are shown in Table \ref{tab:convoGene8x8p}, and show that this solution now outperforms Nvidia references.
 261 It is important to remember that the above kernels have been optimized for the Fermi architecture, unlike those mentioned earlier, which were more efficient on the GT200 architecture.
 262 However, our technique requires to write one kernel per mask size, which can be seen as a major constraint. To make it easier to use this method, we shall propose a kernel code generator that will be available in the near future.
 263
 264 \subsection{Using shared memory to store prefetched data\index{Prefetching}.}
 265  \index{memory~hierarchy!shared~memory}
 266 A more convenient way of coding a convolution kernel is to use shared memory to perform a prefetching stage of the whole halo before computing the convolution sums.
 267 This proves to be quite efficient and more versatile, but it obviously generates some overhead as:
 268 \begin{itemize}
 269 \item Each pixel value has to be read at least twice, first from texture memory into shared memory and then one or several more times from shared memory to be used in convolution computations.
 270 \item Reducing the number of times a single pixel value is read from shared memory is bound to generate bank conflicts, hence once again performance loss.
 271 \end{itemize}
 272  \begin{figure}
 273 \centering
 274    \includegraphics[width=12cm]{Chapters/chapter4/img/convoShMem.png}
 275    \caption[Organization of the prefetching stage of data, for a $5\times 5$ mask and a thread block size of $8\times 4$.]{Organization of the prefetching stage of data, for a $5\times 5$ mask and a thread block size of $8\times 4$. Threads in both top corners of the top figure are identified either by a circle or by a star symbol. The image tile, loaded into shared memory includes the pixels to be updated by the threads of the block, as well as its 2-pixel wide halo. Here, circle and star symbols in the image tile show which pixels are actually loaded into one shared memory vector by its corresponding thread. }
 276    \label{fig:ShMem1}
 277 \end{figure}
 278 Still, we also implemented this method, in a similar manner as Nvidia did in its SDK sample code.
 279 Some improvement has been obtained by increasing the number of pixels processed by each thread, to an optimum 8 pixels per thread.
 280 The principle is to prefetch all pixel values involved in the computations performed by all threads of a block, including 8 pixels per thread plus the halo of radius $r$ (the radius of the convolution mask). As this obviously represents more values than the thread count in one block, some threads have to load more than one value.
 281 The general organization is reproduced in Figure \ref{fig:ShMem1} for $5\times 5$ mask and a $8\times 4$ thread block, while Listing \ref{lst:convoGeneSh1} gives the details of the implementation with its two distinct code blocks: preload in shared memory (Lines 20 to 42) and convolution computations (Lines 45 to 57).
 282 Table \ref{tab:convoGeneSh1} details timing results of this implementation ($16\times 8$ threads/block), up to $13\times 13$ masks, that will serve as a reference in the next section, devoted to separable convolution.
 283 \begin{table}[h]
 284 \centering
 285 {\normalsize
 286 \begin{tabular}{|c||r|r|r|r|r|r|}
 287 \hline
 288 \shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
 289 $\mathbf{512\times 512}$  &0.040 &0.075 &0.141    &0.243&0.314&0.402\\\hline
 290 $\mathbf{1024\times 1024}$&0.141 &0.307 &0.524    &0.917&1.192&1.535\\\hline
 291 $\mathbf{2048\times 2048}$&0.543 &\bf 1.115&2.048 &3.598&4.678&6.037\\\hline
 292 $\mathbf{4096\times 4096}$&2.146 &4.364 &8.156    &14.341&18.652&24.020\\\hline
 293 \end{tabular}
 294 }
 295 \caption{Performances, in milliseconds, of our generic 8 pixels per thread kernel using shared memory, run  on a C2070 card.}
 296 \label{tab:convoGeneSh1}
 297 \end{table}
 298 \begin{table}[h]
 299 \centering
 300 {\normalsize
 301 \begin{tabular}{|c||r|r|r|r|r|r|}
 302 \hline
 303 \shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
 304 $\mathbf{512\times 512}$  &1394 &1176 &907      &670&567&477\\\hline
 305 $\mathbf{1024\times 1024}$&1820 &1413 &1093     &776&644&532\\\hline
 306 $\mathbf{2048\times 2048}$&2023 &\bf 1586 &1172 &818&676&554\\\hline
 307 $\mathbf{4096\times 4096}$&2090 &1637 &1195     &830&684&561\\\hline
 308 \end{tabular}
 309 }
 310 \caption[Throughput values, in MegaPixel per second, of our generic 8 pixels per thread kernel using shared memory, run on a C2070 card.]{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread kernel using shared memory, run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
 311 \label{tab:convoGeneSh2}
 312 \end{table}
 313 \lstinputlisting[label={lst:convoGeneSh1},caption=CUDA kernel achieving a generic convolution operation after a preloading of data in shared memory.]{Chapters/chapter4/code/convoGeneSh1.cu}
 314
 315 \section{Separable convolution}
 316 A convolution operation is said separable when its masks $h$ is the product of 2 vectors $h_v$ and $h_h$, as is the case in the following example:
 317 $$h = h_v \times h_h = \begin{bmatrix}1\\2\\1\end{bmatrix} \times \begin{bmatrix}-1&2&-1\end{bmatrix} = \begin{bmatrix}
 318 -1&2&-1\\
 319 -2&4&-2\\
 320 -1&2&-1
 321 \end{bmatrix}$$
 322 Such a mask allows to replace a generic 2-D convolution operation by two consecutive stages of a 1-D convolution operation: a vertical of mask $h_v$ and a horizontal of mask $h_h$.
 323 This saves a lot of arithmetic operations, as a generic $n\times n$ convolution applied on a $H\times L$ image basically represents $H.L.n^2$ multiplications and as many additions, while two consecutive $n\times 1$ convolutions only represents $2.H.L.n$ of each, \textit{eg} 60\% operations are saved per pixel of the image for a $5\times 5$ mask.\\
 324 However, beside reducing the operation count, performing a separable convolution also means writing an intermediate image into global memory.
 325 CPU implementations of separable convolutions often use a single function to perform both 1-D convolution stages. To do so, this function reads the input image and actually ouputs the transposed filtered image.
 326 Applying that principle to GPUs is not efficient, as outputting the transposed image means non-coalescent writes into global memory, generating severe performance loss. Hence the idea of developing two different kernels, one for each of both vertical and horizontal convolutions.
 327
 328 Here, the use of Shared memory is the best choice, as there is no overlapping between neighbor windows and thus no possible optimization.
 329 Moreover, to ensure efficiency, it is important to read the input image from texture memory, which implies an internal GPU data copy between both 1-D convolution stages.
 330 Which, even if it is faster than CPU/GPU data transfer, makes separable convolutions slower than generic convolutions for small mask sizes. On C2070, the lower limit is $7\times 7$ pixels ($9\times 9$ for $512\times 512$ images).
 331
 332 Both vertical and horizontal kernels feature similar runtimes: Table \ref{tab:convoSepSh1} only contains their average execution time, including the internal data copy stage, while Table \ref{tab:convoSepSh2} shows the  achieved global throughput values. Timings of the data copy stage are given in Table \ref{tab:cpyToArray}.
 333 Listings \ref{lst:convoSepShV} and \ref{lst:convoSepShH} detail the implementation of both 1-D kernels, while Listing \ref{lst:convoSepSh} shows how to use them in addition with the data copy function in order to achieve a whole separable convolution. The shared memory size is dynamically passed as a parameter at kernel call time. Its expression is given in the comment line before its declaration.
 334 \begin{table}[h]
 335 \centering
 336 {\normalsize
 337 \begin{tabular}{|c||r|}
 338 \hline
 339 \textbf{Image size}&\textbf{C2070}\\\hline\hline
 340 $\mathbf{512\times 512}$  &0.029 \\\hline
 341 $\mathbf{1024\times 1024}$&0.101 \\\hline
 342 $\mathbf{2048\times 2048}$&0.387 \\\hline
 343 $\mathbf{4096\times 4096}$&1.533 \\\hline
 344 \end{tabular}
 345 }
 346 \caption{Time cost of data copy between the vertical and the horizontal 1-D convolution stages, on a C2070 cards (in milliseconds).}
 347 \label{tab:cpyToArray}
 348 \end{table}
 349 \begin{table}[h]
 350 \centering
 351 {\normalsize
 352 \begin{tabular}{|c||r|r|r|r|r|r|}
 353 \hline
 354 \shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
 355 $\mathbf{512\times 512}$  &0.080 &0.087 &0.095 &\bf 0.108&\bf 0.115&\bf 0.126\\\hline
 356 $\mathbf{1024\times 1024}$&0.306 &0.333 &\bf 0.333 &\bf 0.378&\bf 0.404&\bf 0.468\\\hline
 357 $\mathbf{2048\times 2048}$&1.094 &1.191 &\bf 1.260 &\bf 1.444&\bf 1.545&\bf 1.722\\\hline
 358 $\mathbf{4096\times 4096}$&4.262 &4.631 &\bf 5.000 &\bf 5.676&\bf 6.105&\bf 6.736\\\hline
 359 \end{tabular}}
 360 \caption[Performances, in milliseconds, of our generic 8 pixels per thread 1-D convolution kernels using shared memory, run  on a C2070 card.]{Performances, in milliseconds, of our generic 8 pixels per thread 1-D convolution kernels using shared memory, run  on a C2070 card. Timings include data copy. Bold values correspond to situations where separable-convolution kernels run faster than non separable ones.}
 361 \label{tab:convoSepSh1}
 362 \end{table}
 363 \begin{table}[h]
 364 \centering
 365 {\normalsize
 366 \begin{tabular}{|c||r|r|r|r|r|r|}
 367 \hline
 368 \shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
 369 $\mathbf{512\times 512}$  &1150 &1116 &1079 &\bf 1024&\bf 997 &\bf 957\\\hline
 370 $\mathbf{1024\times 1024}$&1415 &1365 &\bf 1365 &\bf 1290&\bf 1250&\bf 1169\\\hline
 371 $\mathbf{2048\times 2048}$&1598 &1541 &\bf 1503 &\bf 1410&\bf 1364&\bf 1290\\\hline
 372 $\mathbf{4096\times 4096}$&1654 &1596 &\bf 1542 &\bf 1452&\bf 1400&\bf 1330\\\hline
 373 \end{tabular}
 374 }
 375 \caption[Throughput values, in MegaPixel per second, of our generic 8 pixels per thread 1-D convolution kernel using shared memory, run on a C2070 card.]{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread 1-D convolution kernel using shared memory, run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
 376 \label{tab:convoSepSh2}
 377 \end{table}
 378
 379 \lstinputlisting[label={lst:convoSepSh},caption=data copy between the calls to 1-D convolution kernels achieving a 2-D separable convolution operation.]{Chapters/chapter4/code/convoSepSh.cu}
 380 \lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a horizontal 1-D convolution operation after a preloading \index{Prefetching} of data in shared memory.]{Chapters/chapter4/code/convoSepShV.cu}
 381 \lstinputlisting[label={lst:convoSepShH},caption=CUDA kernel achieving a vertical 1-D convolution operation after a preloading of data in shared memory.]{Chapters/chapter4/code/convoSepShH.cu}
 382
 383 \section{Conclusion}
 384 Extensively detailing the various techniques that may be applied when designing a median or a convolution operation on GPU has enabled us determine that:
 385 \begin{itemize}
 386 \item the use of registers with direct data fetching from texture often allows kernels to run faster than those which use the more conventionnal way of prefetching data from texture memory and storing them into shared memory.
 387 \item increasing the pixel count processed by each thread brings important speedups. In this case, if neighboring windows overlap, optimized direct data fetching from texture will likely outperform the shared memory prefetching technique. That is the case for generic convolution kernels.
 388 \item coding such optimized data fetching is not straightforward. Consequently, we are planning to provide a kernel code generator that will make our kernels more accessible by GPU users.
 389 \end{itemize}
 390 The presented kernels, optimized for a C2070 card, achieve up to 2138~Mpix/s including data transfers, which comes close to the absolute maximum throughput value allowed by the Fermi architecture. The next GPU generation (Kepler) may allow us not only to benefit from  new dynamic parallelism capability to increase kernel paralelism level, but also to take advantage of an increase of the register count allowed per thread block which would allow us, for example, to extend our register-only median filter technique to larger mask sizes.
 391
 392 \putbib[Chapters/chapter4/biblio4]