X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/17d1891ee5feec4f52ef9c51bfa60b78f0bd14c2..a2aa3f0f91a668ee6e799bad0f4de90b7b2be452:/BookGPU/Chapters/chapter4/ch4.tex diff --git a/BookGPU/Chapters/chapter4/ch4.tex b/BookGPU/Chapters/chapter4/ch4.tex index 0a0d6cb..ed3f531 100644 --- a/BookGPU/Chapters/chapter4/ch4.tex +++ b/BookGPU/Chapters/chapter4/ch4.tex @@ -8,7 +8,7 @@ \section{Overview} In this chapter, after dealing with GPU median filter implementations, -we propose to explore how convolutions\index{Convolution} can be implemented on modern +we propose to explore how convolutions\index{convolution} can be implemented on modern GPUs. Widely used in digital image processing filters, the \emph{convolution operation} basically consists of taking the sum of products of elements from two 2D functions, letting one of the two functions move over @@ -20,7 +20,7 @@ to $I$ as an $H\times L$ pixel gray-level image and to $I(x,y)$ as the gray-leve value of each pixel of coordinates $(x,y)$. - +\clearpage \section{Definition} Within a digital image $I$, the convolution operation is performed between image $I$ and convolution mask \emph{h} (To avoid confusion with other @@ -81,7 +81,7 @@ This first implementation consists of a rather naive application to convolutions of the techniques applied to median filters in the previous chapter, as a reminder: texture memory used with incoming data, pinned memory with output data, optimized use of registers -while processing data and multiple output per thread\index{Multiple output per thread}. +while processing data and multiple output per thread\index{multiple output per thread}. One significant difference lies in the fact that the median filter uses only one parameter, the size of the window mask, which can be hard-coded, while a convolution mask requires referring to several parameters; hard-coding @@ -113,7 +113,7 @@ $\mathbf{2048\times 2048}$&1.178&1549 &\bf 3.265&\bf 875 &6.398&529 \\\hline $\mathbf{4096\times 4096}$&4.700&1585 &13.05&533 &25.56&533 \\\hline \end{tabular} } -\caption[Timings (time) and throughput values (TP in MP/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card.]{Timings (time) and throughput values (TP in MPx/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card (fermi architecture). Data transfer duration are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.} +\caption[Timings (time) and throughput values (TP in MP/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card.]{Timings (time) and throughput values (TP in MPx/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a C2070 card (fermi architecture). Data transfer duration are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.} \label{tab:convoNonSepReg1} \end{table} @@ -142,7 +142,7 @@ $\mathbf{2048\times 2048}$&0.801&1092 &\bf 2.189&\bf 802 &4.278&573 \\\hline $\mathbf{4096\times 4096}$&3.171&1075 &8.720&793 &17.076&569 \\\hline \end{tabular} } -\caption[Timings (time) and throughput values (TP in MP/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280.]{Timings (time) and throughput values (TP in MP/s) of one register-only non-separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer duration are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.} +\caption[Timings (time) and throughput values (TP in MP/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280.]{Timings (time) and throughput values (TP in MP/s) of one register-only nonseparable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$, and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer duration are those of Table \ref{tab:memcpy1}. The bold value points out the result obtained in the reference situation.} \label{tab:convoNonSepReg3} \end{table} @@ -239,8 +239,8 @@ However, our technique requires writing one kernel per mask size, which can be s \lstinputlisting[label={lst:convoGene8x8pL3},caption=CUDA kernel achieving a $3\times 3$ convolution operation with the mask in symbol memory and direct data fetches in texture memory]{Chapters/chapter4/code/convoGene8x8pL3.cu} -\subsection{Using shared memory to store prefetched data\index{Prefetching}.} - \index{memory~hierarchy!shared~memory} +\subsection{Using shared memory to store prefetched data\index{prefetching}.} + \index{memory hierarchy!shared memory} A more convenient way of coding a convolution kernel is to use shared memory to perform a prefetching stage of the whole halo before computing the convolution sums. This proves to be quite efficient and more versatile, but it obviously generates some overhead because \begin{itemize} @@ -356,7 +356,7 @@ $\mathbf{4096\times 4096}$&1.533 \\\hline \label{tab:cpyToArray} \end{table} \lstinputlisting[label={lst:convoSepSh},caption=data copy between the calls to 1D convolution kernels achieving a 2D separable convolution operation]{Chapters/chapter4/code/convoSepSh.cu} -\lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a horizontal 1D convolution operation after a preloading \index{Prefetching} of data into shared memory]{Chapters/chapter4/code/convoSepShV.cu} +\lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a horizontal 1D convolution operation after a preloading \index{prefetching} of data into shared memory]{Chapters/chapter4/code/convoSepShV.cu} \lstinputlisting[label={lst:convoSepShH},caption=CUDA kernel achieving a vertical 1D convolution operation after a preloading of data into shared memory]{Chapters/chapter4/code/convoSepShH.cu} \section{Conclusion}