\section{Overview}
In this chapter, after dealing with GPU median filter implementations,
-we propose to explore how convolutions\index{Convolution} can be implemented on modern
+we propose to explore how convolutions\index{convolution} can be implemented on modern
GPUs. Widely used in digital image processing filters, the \emph{convolution
operation} basically consists of taking the sum of products of elements
from two 2D functions, letting one of the two functions move over
value of each pixel of coordinates $(x,y)$.
-
+\clearpage
\section{Definition}
Within a digital image $I$, the convolution operation is performed between
image $I$ and convolution mask \emph{h} (To avoid confusion with other
convolutions of the techniques applied to median filters in the
previous chapter, as a reminder: texture memory used with incoming
data, pinned memory with output data, optimized use of registers
-while processing data and multiple output per thread\index{Multiple output per thread}.
+while processing data and multiple output per thread\index{multiple output per thread}.
One significant difference lies in the fact
that the median filter uses only one parameter, the size of the window mask,
which can be hard-coded, while a convolution mask requires referring to several parameters; hard-coding
$\mathbf{4096\times 4096}$&4.700&1585 &13.05&533 &25.56&533 \\\hline
\end{tabular}
}
-\caption[Timings (time) and throughput values (TP in MP/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card.]{Timings (time) and throughput values (TP in MPx/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card (fermi architecture). Data transfer duration are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.}
+\caption[Timings (time) and throughput values (TP in MP/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card.]{Timings (time) and throughput values (TP in MP/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card (Fermi architecture). Data transfer durations are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.}
\label{tab:convoNonSepReg1}
\end{table}
$\mathbf{4096\times 4096}$&3.171&1075 &8.720&793 &17.076&569 \\\hline
\end{tabular}
}
-\caption[Timings (time) and throughput values (TP in MP/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280.]{Timings (time) and throughput values (TP in MP/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer duration are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.}
+\caption[Timings (time) and throughput values (TP in MP/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280.]{Timings (time) and throughput values (TP in MP/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer durations are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.}
\label{tab:convoNonSepReg3}
\end{table}
\lstinputlisting[label={lst:convoGene8x8pL3},caption=CUDA kernel achieving a $3\times 3$ convolution operation with the mask in symbol memory and direct data fetches in texture memory]{Chapters/chapter4/code/convoGene8x8pL3.cu}
-\subsection{Using shared memory to store prefetched data\index{Prefetching}.}
- \index{memory~hierarchy!shared~memory}
+\subsection{Using shared memory to store prefetched data\index{prefetching}}
+ \index{memory hierarchy!shared memory}
A more convenient way of coding a convolution kernel is to use shared memory to perform a prefetching stage of the whole halo before computing the convolution sums.
This proves to be quite efficient and more versatile, but it obviously generates some overhead because
\begin{itemize}
\label{tab:cpyToArray}
\end{table}
\lstinputlisting[label={lst:convoSepSh},caption=data copy between the calls to 1D convolution kernels achieving a 2D separable convolution operation]{Chapters/chapter4/code/convoSepSh.cu}
-\lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a horizontal 1D convolution operation after a preloading \index{Prefetching} of data into shared memory]{Chapters/chapter4/code/convoSepShV.cu}
+\lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a horizontal 1D convolution operation after a preloading \index{prefetching} of data into shared memory]{Chapters/chapter4/code/convoSepShV.cu}
\lstinputlisting[label={lst:convoSepShH},caption=CUDA kernel achieving a vertical 1D convolution operation after a preloading of data into shared memory]{Chapters/chapter4/code/convoSepShH.cu}
\section{Conclusion}