From: couturie Date: Thu, 25 Apr 2013 15:03:38 +0000 (+0200) Subject: Merge branch 'master' of ssh://info.iut-bm.univ-fcomte.fr/book_gpu X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/32bc153a6a82be882b13679314a6f1e8021de074?hp=c56162cd0df6cb604942d765ca5430092d4199dd Merge branch 'master' of ssh://info.iut-bm.univ-fcomte.fr/book_gpu Conflicts: BookGPU/Chapters/chapter12/ch12.aux BookGPU/Chapters/chapter16/ch16.aux BookGPU/Chapters/chapter17/ch17.aux BookGPU/Chapters/chapter18/ch18.aux BookGPU/Chapters/chapter6/ch6.aux --- diff --git a/BookGPU/BookGPU.tex b/BookGPU/BookGPU.tex index eaa9c44..e236ce8 100755 --- a/BookGPU/BookGPU.tex +++ b/BookGPU/BookGPU.tex @@ -182,6 +182,7 @@ \include{Chapters/chapter2/ch2} \part{Image processing} \include{Chapters/chapter3/ch3} +\include{Chapters/chapter4/ch4} \part{Software development} \include{Chapters/chapter5/ch5}% lib HeatSolution0.049307.pdf HeatSolution0.pdf ... \include{Chapters/chapter6/ch6} diff --git a/BookGPU/Chapters/chapter4/biblio4.bib b/BookGPU/Chapters/chapter4/biblio4.bib new file mode 100644 index 0000000..14e5ef7 --- /dev/null +++ b/BookGPU/Chapters/chapter4/biblio4.bib @@ -0,0 +1,8 @@ +@unpublished{convolutionsoup, + title = {Convolution Soup}, + author = {Stam, Joe}, + abstract = {Graphics processors can be easily programmed to provide significant acceleration in many common parallel tasks. However, with additional architecture knowledge and understanding of optimization strategies, a savvy programmer can unleash the full potential of the GPU's massive memory bandwidth and ensure the processing resources are utilized to their fullest extent. In this talk, we'll explore several different approaches to a very simple but ubiquitous image processing algorithm, the convolution. 
A naive approach shows the detrimental impact of poorly written code, a simple approach achieves decent results with little effort or code complexity, and a few highly optimized techniques realize the GPUs full power for the most demanding tasks. The techniques explored in this simple but illustrative example will serve as a base for understanding the optimization strategies to apply towards more complex algorithms.}, + year = {2010}, + month ={8}, + pdf = {http://fr.slideshare.net/NVIDIA/1412-gtc09}, +} \ No newline at end of file diff --git a/BookGPU/Chapters/chapter4/ch4.tex b/BookGPU/Chapters/chapter4/ch4.tex new file mode 100644 index 0000000..ea473a1 --- /dev/null +++ b/BookGPU/Chapters/chapter4/ch4.tex @@ -0,0 +1,389 @@ +\chapterauthor{Gilles Perrot}{FEMTO-ST Institute} + +%\newcommand{\kl}{\includegraphics[scale=0.6]{Chapters/chapter4/img/kernLeft.png}~} +%\newcommand{\kr}{\includegraphics[scale=0.6]{Chapters/chapter4/img/kernRight.png}} + +%% \lstset{ +%% language=C, +%% columns=fixed, +%% basicstyle=\footnotesize\ttfamily, +%% numbers=left, +%% firstnumber=1, +%% numberstyle=\tiny, +%% stepnumber=5, +%% numbersep=5pt, +%% tabsize=3, +%% extendedchars=true, +%% breaklines=true, +%% keywordstyle=\textbf, +%% frame=single, +%% % keywordstyle=[1]\textbf, +%% %identifierstyle=\textbf, +%% commentstyle=\color{white}\textbf, +%% stringstyle=\color{white}\ttfamily, +%% % xleftmargin=17pt, +%% % framexleftmargin=17pt, +%% % framexrightmargin=5pt, +%% % framexbottommargin=4pt, +%% backgroundcolor=\color{lightgray}, +%% } + + +\chapter{Implementing an efficient convolution \index{Convolution} operation on GPU} +\section{Overview} +In this chapter, after dealing with GPU median filter implementations, +we propose to explore how convolutions can be implemented on modern +GPUs. 
Widely used in digital image processing filters, the \emph{convolution
+operation} basically consists in taking the sum of products of elements
+from two 2-D functions, letting one of the two functions move over
+every element of the other, thus producing a third function that is typically
+viewed as a modified version of one of the original functions. To
+begin with, we shall examine non-separable, or generic, convolutions,
+before addressing the matter of separable convolutions. We shall refer
+to $I$ as an $H\times L$ pixel gray-level image, and to $I(x,y)$ as the gray-level
+value of the pixel of coordinates $(x,y)$.
+%dire qqs mots sur le filtrage IIR/FIR ?
+
+
+\section{Definition}
+Within a digital image $I$, the convolution operation between
+image $I$ and convolution mask \emph{h} (to avoid confusion with other
+GPU functions referred to as kernels, we shall use \emph{convolution
+mask} instead of \emph{convolution kernel}) is defined by:
+\begin{equation}
+I'(x, y) = \left(I * h\right) = \sum_{i < H} \sum_{j < L}I(x-j, y-i).h(j,i)
+\label{convoDef}
+\end{equation}
+While processing an image, function \emph{h} is bounded by a square
+window of size $k = 2r + 1$, i.e. an odd number, to ensure
+there is a center pixel. We shall also point out that, as stated earlier,
+the square shape is not a limiting factor to the process, as any shape
+can be inscribed into a square. In the case of a more complex shape,
+the remaining space is filled with null values (padding).
+
+
+\section{Implementation}
+The basic principle of computing a convolution between one $I$ picture
+and one \emph{h} convolution mask defined on domain $\Omega$ is given
+by Algorithm \ref{algo_genconv} and illustrated by Figure \ref{fig:convoPrinciple}, which mainly shows how gray-level values of the center pixel's neighborhood are combined with the convolution mask values to compute the output value.
+For better readability, only some of the connecting lines are shown.
+
+ \begin{figure}
+\centering
+ \includegraphics[width=11cm]{Chapters/chapter4/img/convo1.png}
+ \caption{Principle of a generic convolution implementation. The center pixel is represented with a black background and the pixels of its neighborhood are denoted $I_{p,q}$ where $(p,q)$ is the relative position of the neighbor pixel. Elements $h_{t,u}$ are the values of the convolution mask.}
+ \label{fig:convoPrinciple}
+\end{figure}
+\begin{algorithm}
+\caption{generic convolution}
+\label{algo_genconv}
+ \ForEach{pixel at position $(x, y)$}{
+ Read the gray-level values of all pixels in the neighborhood\\\;
+ Compute the weighted sum \( I_\Omega = \sum_{(j,i) \in \Omega}I(x-j, y-i).h(j,i) \)\\\;
+ Normalize $I'(x, y)$ value\\\;
+ Output the new gray-level value
+ }
+\end{algorithm}
+
+The gray-level value of each pixel of output image $I'$ is the weighted
+sum of pixels included in the neighborhood defined by $\Omega$ around
+the corresponding pixel in the input image. It has to be noted that,
+in case the sum $S$ of all coefficients in the mask is not 1, the original
+brightness of the image will be altered and a normalization stage
+has to take place, as, for example, in the case of an 8-bit coded
+image:
+\begin{enumerate}
+\item if $S > 0$ then $I' = I_\Omega / S$
+\item if $S = 0$ then $I' = I_\Omega + 128$
+\item if $S < 0$ then $I' = I_\Omega + 255$
+\end{enumerate}
+In case one, normalizing means performing a division operation for
+each pixel, which will be quite time-costly when performed on a GPU. A simple work-around is to normalize mask values before using them in GPU kernels.
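As a cross-check of equation \ref{convoDef} and of the normalization rules above, here is a minimal CPU reference in plain C (a hypothetical helper, not one of the chapter's CUDA listings); border pixels are simply clamped, which is only one of several possible border policies:

```c
#include <assert.h>

/* CPU reference for one output pixel of equation (convoDef):
 * pixel (x-j, y-i) is weighted by h(j,i), i.e. the mask is applied
 * "flipped". h is a k*k array with k = 2r+1; the image has H rows
 * of L pixels; out-of-image neighbors are clamped to the border.
 * The normalization stage follows the three cases listed above. */
static int convolve_pixel(const int *img, int H, int L,
                          const float *h, int r, int x, int y)
{
    int k = 2 * r + 1;
    float sum = 0.0f, S = 0.0f;
    for (int i = -r; i <= r; i++)
        for (int j = -r; j <= r; j++) {
            int yy = y - i, xx = x - j;        /* flipped indices */
            if (yy < 0) yy = 0;  if (yy >= H) yy = H - 1;
            if (xx < 0) xx = 0;  if (xx >= L) xx = L - 1;
            sum += img[yy * L + xx] * h[(i + r) * k + (j + r)];
            S   += h[(i + r) * k + (j + r)];
        }
    if (S > 0.0f)  return (int)(sum / S);      /* case 1 */
    if (S == 0.0f) return (int)(sum + 128);    /* case 2 */
    return (int)(sum + 255);                   /* case 3 */
}
```

With the all-ones averaging mask used later in this chapter, $S = 9$, so a flat image is left unchanged by the normalized sum.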
+
+
+\subsection{First test implementation}
+This first implementation consists of a rather naive application to
+convolutions of the tuning recipes applied to median filters in the
+previous chapter, as a reminder: texture memory used for incoming
+data, pinned memory for output data, optimized use of registers
+while processing data and multiple output per thread\index{Multiple output per thread}.
+One significant difference lies in the fact
+that the median filter uses only one parameter, the size of the window mask,
+which can be hard-coded, while a convolution mask involves several coefficients: hard-coding
+its elements would lead to a severe lack of flexibility (one function
+per filter, no external settings), so we will just use it as a starting
+point in our approach.
+
+Let us assume that we are planning to implement the convolution defined by the following $3\times 3$ mask (low-pass filter or averaging filter):
+$$h=\begin{bmatrix}1&1&1\\1&1&1\\1&1&1\end{bmatrix}$$
+The kernel code presented in Listing \ref{lst:convoGene3Reg8} implements the convolution operation and applies all of the above optimizations except, for clarity reasons, multiple output per thread.
+In the particular case of a generic convolution, it is important to note how mask coefficients are applied to image pixels in order to fit the definition of equation \ref{convoDef}: if the coordinates of the center pixel had been set to $(0,0)$, then the pixel of coordinates $(i,j)$ would have been multiplied by the element $(-i,-j)$ of the mask, which, transposed into our kernel code, leads to multiplying the $p^{th}$ pixel of the window by the $(n-p)^{th}$ element of the convolution mask.
+
+\lstinputlisting[label={lst:convoGene3Reg8},caption=Generic CUDA kernel achieving a convolution operation with hard-coded mask values]{Chapters/chapter4/code/convoGene3Reg8.cu}
+
+Table \ref{tab:convoNonSepReg1} shows kernel timings and throughput values for such a low-pass filter, extended to $5\times 5$ and $7\times 7$ masks, applied to 8-bit coded gray-level
+images of sizes $512\times 512$, $1024\times 1024$, $2048\times 2048$ and $4096\times 4096$, and run on a C2070 card with $32\times 8$ thread blocks.
+As a reminder, Table \ref{tab:memcpy1} details the data transfer costs used to compute throughput values.
+
+
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|r|r|r|r|}
+\hline
+\textbf{Mask size$\rightarrow$}&\multicolumn{2}{|c|}{\textbf{3x3}}&\multicolumn{2}{|c|}{\textbf{5x5}}&\multicolumn{2}{|c|}{\textbf{7x7}}\\
+\textbf{Image size$\downarrow$}&time (ms)&TP&time (ms)&TP&time (ms)&TP\\\hline\hline
+$\mathbf{512\times 512}$ &0.077&1165 &0.209&559 &0.407 &472 \\\hline
+$\mathbf{1024\times 1024}$&0.297&1432 &0.820&836 &1.603 &515 \\\hline
+$\mathbf{2048\times 2048}$&1.178&1549 &\bf 3.265&\bf 875 &6.398&529 \\\hline
+$\mathbf{4096\times 4096}$&4.700&1585 &13.05&533 &25.56&533 \\\hline
+\end{tabular}
+}
+\caption{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a C2070 card (Fermi architecture).
Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\label{tab:convoNonSepReg1}
+\end{table}
+
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|r|r|r|r|}
+\hline
+\textbf{Mask size$\rightarrow$}&\multicolumn{2}{|c|}{\textbf{3x3}}&\multicolumn{2}{|c|}{\textbf{5x5}}&\multicolumn{2}{|c|}{\textbf{7x7}}\\
+\textbf{Image size$\downarrow$}&time (ms)&TP&time (ms)&TP&time (ms)&TP\\\hline\hline
+$\mathbf{512\times 512}$ &0.060&1186 &0.148&848 &0.280&594 \\\hline
+$\mathbf{1024\times 1024}$&0.209&1407 &0.556&960 &1.080&649 \\\hline
+$\mathbf{2048\times 2048}$&0.801&1092 &\bf 2.189&\bf 802 &4.278&573 \\\hline
+$\mathbf{4096\times 4096}$&3.171&1075 &8.720&793 &17.076&569 \\\hline
+\end{tabular}
+}
+\caption{Timings ($time$) and throughput values ($TP$ in Mpix/s) of one register-only non separable convolution kernel, for small mask sizes of $3\times 3$, $5\times 5$ and $7\times 7$ pixels, on a GTX280 (GT200 architecture). Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\label{tab:convoNonSepReg3}
+\end{table}
+
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|}
+\hline
+\shortstack{\textbf{GPU card$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{C2070}&\textbf{GTX280}\\\hline\hline
+$\mathbf{512\times 512}$ &0.148 &0.161 \\\hline
+$\mathbf{1024\times 1024}$&0.435 &0.536 \\\hline
+$\mathbf{2048\times 2048}$&1.530 &3.039 \\\hline
+$\mathbf{4096\times 4096}$&5.882 &12.431 \\\hline
+\end{tabular}
+}
+\caption{Time cost of data transfers between CPU and GPU memories, on C2070 and GTX280 cards (in milliseconds).}
+\label{tab:memcpy1}
+\end{table}
+
+Table \ref{tab:convoNonSepReg3} shows timings and global throughput values achieved by those convolution masks on an Nvidia GT200 Tesla architecture (GTX280 card) with $16\times 8$ thread blocks.
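Throughput values in the above tables are global ones: they include the data transfer times of Table \ref{tab:memcpy1}. A small helper (hypothetical, shown only to make the computation explicit) reproduces them; for instance, the $3\times 3$ mask on a $4096\times 4096$ image on C2070 processes $16.78$~Mpix in $4.700 + 5.882$~ms:

```c
#include <assert.h>

/* Global throughput in Mpix/s: pixel count divided by kernel time
 * plus CPU<->GPU transfer time (both given in milliseconds). */
static double throughput_mpix(double width, double height,
                              double t_kernel_ms, double t_transfer_ms)
{
    double mpix      = width * height / 1.0e6;
    double t_seconds = (t_kernel_ms + t_transfer_ms) / 1.0e3;
    return mpix / t_seconds;
}
```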
This measurement has been done in order to make a relevant comparison with a reference given by Nvidia in \cite{convolutionsoup}, where they state that their fastest kernel achieves a $5\times5$ convolution of an 8-bit $2048\times 2048$ pixel image in $1.4~ms$, which leads to a throughput value of 945~Mpix/s.
+Our current value of 802~Mpix/s, though not unsatisfactory, remains lower than the one reached by the manufacturer's own coding.
+Tested in the same conditions, the newer Fermi architecture of
+Nvidia's GPUs proved slower (3.3~ms, see Table \ref{tab:convoNonSepReg1}) due to the lower maximum
+register count allowed (63, against 128 for Tesla GT200).
+
+It is interesting to note that, as long as each thread processes one single pixel, kernel execution time grows in proportion
+to the number of pixels in the image multiplied by the number of elements in the mask.
+The slope of this first implementation is $3.14\times 10^{-8}~ms/pix$ on C2070.
+
+\subsection{Using parameterizable masks}
+
+To further improve the above implementation, it becomes necessary
+to free ourselves from the hard-coding constraint. To achieve this,
+as was the case with input image storing, several memory options are
+available but, since the amount of data involved in processing a
+mask is quite small and constant, we considered it relevant to copy the data
+into \emph{symbol memory}. Listing \ref{lst:symbolmem} details the process, involving
+the CUDA function \emph{cudaMemcpyToSymbol()}.
+
+\lstinputlisting[label={lst:symbolmem},caption=code snippet showing how to setup a mask in GPU symbol memory]{Chapters/chapter4/code/maskInSymbol.cu}
+
+In parallel, giving up the register-only constraint allows a more
+conventional coding practice (loops). Listing \ref{lst:convoGene8r} presents
+a generic convolution kernel, whose code immediately
+appears both simple and concise.
Its global time
+performance, however, is comparatively lower than that of the register-only
+version, due to the use of constant memory and of the \emph{r} parameter
+(radius of the mask). The average slope amounts to $3.81\times 10^{-8}~ms/pix$ on C2070,
+which means a time-cost increase of around $20~\%$.
+
+\lstinputlisting[label={lst:convoGene8r},caption=Generic CUDA kernel achieving a convolution operation with the mask in symbol memory and its radius passed as a parameter]{Chapters/chapter4/code/convoGene8r.cu}
+
+\subsection{Increasing the number of pixels processed by each thread}
+Much in the same way as we did with the median filter, we shall now
+attempt to reduce the average latency due to writes into global memory
+by having each thread process more than one output value. As the basic
+structure of the above GPU kernel uses only 14 registers per thread, regardless
+of the size of the convolution mask, one can envisage processing 2
+or more pixels per thread while keeping safely within the 63-register-per-thread
+limit.
+
+However, when doing so, \textit{i.e.} processing what we shall call a \textit{packet} of pixels, window mask overlapping has to be taken into account
+to avoid multiple texture fetches of each pixel's gray-level value, while benefiting from the 2-D cache.
+In that case, both mask size and pixel packet shape determine the number of texture fetches to be performed for each pixel value.
+Figure \ref{fig:convoOverlap1} illustrates two different situations: on top, a mask of radius 1 ($3\times 3$) applied to a packet of 8 pixels in a row; at the bottom, a mask of radius 2 ($5\times 5$).
+The dark gray pixels are the center pixels (pixels of the packet), while light gray pixels belong to the halo around the packet. The number in each pixel box corresponds to the number of convolution sums in which it is involved.
+There would be little interest in using different \textit{packet} shapes, as the final global memory writes would not be coalescent, generating multiple latencies.
+ \begin{figure}
+\centering
+ \subfigure[$3\times 3$ mask: there are 18 center pixels (out of 30) involved in 3 computations.]{ \includegraphics[width=5.8cm]{Chapters/chapter4/img/convoOverlap1.png}}\\
+ \subfigure[$5\times 5$ mask: only 20 center pixels (out of 60) are involved in 5 computations.]{ \includegraphics[width=7cm]{Chapters/chapter4/img/convoOverlap2.png}}
+ \caption{Mask window overlapping when processing 8 pixels per thread. Top: $3\times 3$ mask. Bottom: $5\times 5$ mask.}
+ \label{fig:convoOverlap1}
+\end{figure}
+
+Although we actually wrote GPU kernels able to process 2, 4, 8 and 16 pixels per thread, only the one that processes 8 pixels per thread is presented below, as it proved to be the fastest one. Listing \ref{lst:convoGene8x8pL3} reproduces the source code of the kernel for $3\times 3$ masks.
+The bottom line is that each thread is associated with one base pixel of coordinates $(x,y)$, which is the first of the packet to be processed, the last one being $(x+7,y)$.
+\lstinputlisting[label={lst:convoGene8x8pL3},caption=CUDA kernel achieving a $3\times 3$ convolution operation with the mask in symbol memory and direct data fetches in texture memory]{Chapters/chapter4/code/convoGene8x8pL3.cu}
+
+In this particular case of a $3\times 3$ mask, each pixel value is used in 3 different convolution sums, except for pixels located near both ends of the packet, whose values are used in fewer sums.
+The general rule, when performing an $n\times n$ convolution (radius $k$) by 8-pixel packets, is that each of the $(8-2k).(2k+1)$ \textit{center} pixels of the window is used in $2k+1$ sums, while the $4k.(2k+1)$ remaining pixels, located around the ends of the packet, are used in fewer sums, from $2k$ down to $1$ ($2.(2k+1)$ pixels each).
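The counting rule above can be checked against the numbers of Figure \ref{fig:convoOverlap1} with a few lines of plain C (hypothetical helpers, shown only to verify the arithmetic):

```c
#include <assert.h>

/* For one row-packet of 8 pixels and a mask of radius k:
 * window_total: pixels covered by the packet's 8 windows,
 * full_use:     pixels used in the maximum number (2k+1) of sums,
 * partial_use:  pixels near the packet ends, used in 2k..1 sums. */
static int window_total(int k) { return (8 + 2*k) * (2*k + 1); }
static int full_use(int k)     { return (8 - 2*k) * (2*k + 1); }
static int partial_use(int k)  { return 4*k * (2*k + 1); }
```

For $k=1$ this gives 18 fully used pixels out of 30, and for $k=2$, 20 out of 60, matching the figure.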
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|r|r|r|r|}
+\hline
+\textbf{Mask size$\rightarrow$}&\multicolumn{2}{|c|}{\textbf{3x3}}&\multicolumn{2}{|c|}{\textbf{5x5}}&\multicolumn{2}{|c|}{\textbf{7x7}}\\
+\textbf{Image size$\downarrow$}&time (ms)&TP&time (ms)&TP&time (ms)&TP\\\hline\hline
+$\mathbf{512\times 512}$ &0.036&1425 &0.069&1208 &0.110&1016 \\\hline
+$\mathbf{1024\times 1024}$&0.128&1862 &0.253&1524 &0.413&1237 \\\hline
+$\mathbf{2048\times 2048}$&0.495&2071 &\bf 0.987&1666 &1.615&1334 \\\hline
+$\mathbf{4096\times 4096}$&1.964&2138 &3.926&1711 &6.416&1364 \\\hline
+\end{tabular}
+}
+\caption{Timings ($time$) and throughput values ($TP$ in Mpix/s) of our generic fixed-mask-size convolution kernel run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\label{tab:convoGene8x8p}
+\end{table}
+
+Timing results and throughput values are shown in Table \ref{tab:convoGene8x8p}, and show that this solution now outperforms Nvidia's reference.
+It is important to remember that the above kernels have been optimized for the Fermi architecture, unlike those mentioned earlier, which were more efficient on the GT200 architecture.
+However, our technique requires writing one kernel per mask size, which can be seen as a major constraint. To make this method easier to use, we shall propose a kernel code generator that will be available in the near future.
+
+\subsection{Using shared memory to store prefetched data\index{Prefetching}}
+ \index{memory~hierarchy!shared~memory}
+A more convenient way of coding a convolution kernel is to use shared memory to perform a prefetching stage of the whole tile of needed data, halo included, before computing the convolution sums.
+This proves to be quite efficient and more versatile, but it obviously generates some overhead, as:
+\begin{itemize}
+\item Each pixel value has to be read at least twice, first from texture memory into shared memory and then one or several more times from shared memory to be used in convolution computations.
+\item Reducing the number of times a single pixel value is read from shared memory is bound to generate bank conflicts, hence, once again, performance loss.
+\end{itemize}
+ \begin{figure}
+\centering
+ \includegraphics[width=12cm]{Chapters/chapter4/img/convoShMem.png}
+ \caption{Organization of the prefetching stage of data, for a $5\times 5$ mask and a thread block size of $8\times 4$. Threads in both top corners of the top figure are identified either by a circle or by a star symbol. The image tile, loaded into shared memory, includes the pixels to be updated by the threads of the block, as well as its 2-pixel wide halo. Here, circle and star symbols in the image tile show which pixels are actually loaded into one shared memory vector by its corresponding thread.}
+ \label{fig:ShMem1}
+\end{figure}
+Still, we also implemented this method, in a manner similar to that of the Nvidia SDK sample code.
+Some improvement has been obtained by increasing the number of pixels processed by each thread, to an optimum of 8 pixels per thread.
+The principle is to prefetch all pixel values involved in the computations performed by all threads of a block, i.e. 8 pixels per thread plus the halo of radius $r$ (the radius of the convolution mask). As this obviously represents more values than the thread count in one block, some threads have to load more than one value.
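To see why some threads must load more than one value, consider the tile-size arithmetic (a sketch under the assumption, taken from Figure \ref{fig:ShMem1}, that each thread's 8 pixels lie in a row; the actual load pattern is the one of the chapter's listing):

```c
#include <assert.h>

/* Shared-memory loads per thread (rounded up) for a bw x bh thread
 * block in which every thread processes ppt pixels in a row, with a
 * convolution mask of radius r.  Illustrative helper only. */
static int loads_per_thread(int bw, int bh, int ppt, int r)
{
    int tile    = (bw * ppt + 2*r) * (bh + 2*r); /* pixels, halo included */
    int threads = bw * bh;
    return (tile + threads - 1) / threads;       /* ceiling division */
}
```

For the $8\times 4$ block and $5\times 5$ mask of Figure \ref{fig:ShMem1}, the tile holds $68\times 8 = 544$ values for only 32 threads, hence up to 17 loads per thread.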
The general organization is reproduced in Figure \ref{fig:ShMem1} for a $5\times 5$ mask and an $8\times 4$ thread block, while Listing \ref{lst:convoGeneSh1} gives the details of the implementation, with its two distinct code blocks: preload into shared memory (Lines 20 to 42) and convolution computations (Lines 45 to 57).
+Table \ref{tab:convoGeneSh1} details timing results of this implementation ($16\times 8$ threads/block), up to $13\times 13$ masks, which will serve as a reference in the next section, devoted to separable convolution.
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|r|r|r|r|}
+\hline
+\shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
+$\mathbf{512\times 512}$ &0.040 &0.075 &0.141 &0.243&0.314&0.402\\\hline
+$\mathbf{1024\times 1024}$&0.141 &0.307 &0.524 &0.917&1.192&1.535\\\hline
+$\mathbf{2048\times 2048}$&0.543 &\bf 1.115&2.048 &3.598&4.678&6.037\\\hline
+$\mathbf{4096\times 4096}$&2.146 &4.364 &8.156 &14.341&18.652&24.020\\\hline
+\end{tabular}
+}
+\caption{Performances, in milliseconds, of our generic 8 pixels per thread kernel using shared memory, run on a C2070 card.}
+\label{tab:convoGeneSh1}
+\end{table}
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|r|r|r|r|}
+\hline
+\shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
+$\mathbf{512\times 512}$ &1394 &1176 &907 &670&567&477\\\hline
+$\mathbf{1024\times 1024}$&1820 &1413 &1093 &776&644&532\\\hline
+$\mathbf{2048\times 2048}$&2023 &\bf 1586 &1172 &818&676&554\\\hline
+$\mathbf{4096\times 4096}$&2090 &1637 &1195 &830&684&561\\\hline
+\end{tabular}
+}
+\caption{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread kernel using shared memory, run on a C2070 card.
Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\label{tab:convoGeneSh2}
+\end{table}
+\lstinputlisting[label={lst:convoGeneSh1},caption=CUDA kernel achieving a generic convolution operation after a preloading of data in shared memory.]{Chapters/chapter4/code/convoGeneSh1.cu}
+
+\section{Separable convolution}
+A convolution operation is said to be separable when its mask $h$ is the product of two vectors $h_v$ and $h_h$, as is the case in the following example:
+$$h = h_v \times h_h = \begin{bmatrix}1\\2\\1\end{bmatrix} \times \begin{bmatrix}-1&2&-1\end{bmatrix} = \begin{bmatrix}
+-1&2&-1\\
+-2&4&-2\\
+-1&2&-1
+\end{bmatrix}$$
+Such a mask makes it possible to replace a generic 2-D convolution operation by two consecutive stages of 1-D convolutions: a vertical one with mask $h_v$ and a horizontal one with mask $h_h$.
+This saves a lot of arithmetic operations, as a generic $n\times n$ convolution applied to an $H\times L$ image basically represents $H.L.n^2$ multiplications and as many additions, while two consecutive $n\times 1$ convolutions only represent $2.H.L.n$ of each, \textit{i.e.} 60\% of the operations are saved per pixel of the image for a $5\times 5$ mask.\\
+However, besides reducing the operation count, performing a separable convolution also means writing an intermediate image into global memory.
+CPU implementations of separable convolutions often use a single function to perform both 1-D convolution stages. To do so, this function reads the input image and actually outputs the transposed filtered image.
+Applying that principle to GPUs is not efficient, as outputting the transposed image means non-coalescent writes into global memory, generating severe performance loss. Hence the idea of developing two different kernels, one for each of the vertical and horizontal convolution stages.
+
+Here, the use of shared memory is the best choice, as there is no overlapping between neighboring windows and thus no possible optimization.
+Moreover, to ensure efficiency, it is important to read the input image from texture memory, which implies an internal GPU data copy between both 1-D convolution stages.
+This copy, even though it is faster than a CPU/GPU data transfer, makes separable convolutions slower than generic ones for small mask sizes. On C2070, the lower limit is $7\times 7$ pixels ($9\times 9$ for $512\times 512$ images).
+
+Both vertical and horizontal kernels feature similar runtimes: Table \ref{tab:convoSepSh1} only contains their average execution time, including the internal data copy stage, while Table \ref{tab:convoSepSh2} shows the achieved global throughput values. Timings of the data copy stage are given in Table \ref{tab:cpyToArray}.
+Listings \ref{lst:convoSepShV} and \ref{lst:convoSepShH} detail the implementation of both 1-D kernels, while Listing \ref{lst:convoSepSh} shows how to use them together with the data copy function in order to achieve a complete separable convolution. The shared memory size is dynamically passed as a parameter at kernel call time. Its expression is given in the comment line before its declaration.
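The separability property and the operation-count savings discussed above can be verified numerically; the snippet below (illustrative C, not one of the chapter's kernels) rebuilds $h$ as the outer product of $h_v$ and $h_h$ and compares multiplication counts:

```c
#include <assert.h>

/* Rebuild the 3x3 mask h as the outer product of hv (column vector)
 * and hh (row vector), and count multiplications for an H x L image:
 * a generic n x n convolution costs H*L*n*n of them, while the
 * vertical + horizontal 1-D stages cost 2*H*L*n. */
static void outer_product_3(const int *hv, const int *hh, int *h)
{
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++)
            h[i*3 + j] = hv[i] * hh[j];
}

static long ops_2d(long H, long L, long n)  { return H * L * n * n; }
static long ops_sep(long H, long L, long n) { return 2 * H * L * n; }
```

For a $5\times 5$ mask, the ratio is $2n/n^2 = 2/5$, hence the 60\% saving per pixel mentioned above.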
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|}
+\hline
+\textbf{Image size}&\textbf{C2070}\\\hline\hline
+$\mathbf{512\times 512}$ &0.029 \\\hline
+$\mathbf{1024\times 1024}$&0.101 \\\hline
+$\mathbf{2048\times 2048}$&0.387 \\\hline
+$\mathbf{4096\times 4096}$&1.533 \\\hline
+\end{tabular}
+}
+\caption{Time cost of the data copy between the vertical and the horizontal 1-D convolution stages, on a C2070 card (in milliseconds).}
+\label{tab:cpyToArray}
+\end{table}
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|r|r|r|r|}
+\hline
+\shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
+$\mathbf{512\times 512}$ &0.080 &0.087 &0.095 &\bf 0.108&\bf 0.115&\bf 0.126\\\hline
+$\mathbf{1024\times 1024}$&0.306 &0.333 &\bf 0.333 &\bf 0.378&\bf 0.404&\bf 0.468\\\hline
+$\mathbf{2048\times 2048}$&1.094 &1.191 &\bf 1.260 &\bf 1.444&\bf 1.545&\bf 1.722\\\hline
+$\mathbf{4096\times 4096}$&4.262 &4.631 &\bf 5.000 &\bf 5.676&\bf 6.105&\bf 6.736\\\hline
+\end{tabular}}
+\caption{Performances, in milliseconds, of our generic 8 pixels per thread 1-D convolution kernels using shared memory, run on a C2070 card. Timings include the data copy.
Bold values correspond to situations where separable-convolution kernels run faster than non separable ones.}
+\label{tab:convoSepSh1}
+\end{table}
+\begin{table}[h]
+\centering
+{\normalsize
+\begin{tabular}{|c||r|r|r|r|r|r|}
+\hline
+\shortstack{\textbf{Mask size$\rightarrow$}\\\textbf{Image size$\downarrow$}}&\textbf{3x3}&\textbf{5x5}&\textbf{7x7}&\textbf{9x9}&\textbf{11x11}&\textbf{13x13}\\\hline\hline
+$\mathbf{512\times 512}$ &1150 &1116 &1079 &\bf 1024&\bf 997 &\bf 957\\\hline
+$\mathbf{1024\times 1024}$&1415 &1365 &\bf 1365 &\bf 1290&\bf 1250&\bf 1169\\\hline
+$\mathbf{2048\times 2048}$&1598 &1541 &\bf 1503 &\bf 1410&\bf 1364&\bf 1290\\\hline
+$\mathbf{4096\times 4096}$&1654 &1596 &\bf 1542 &\bf 1452&\bf 1400&\bf 1330\\\hline
+\end{tabular}
+}
+\caption{Throughput values, in MegaPixel per second, of our generic 8 pixels per thread 1-D convolution kernel using shared memory, run on a C2070 card. Data transfer durations are those of Table \ref{tab:memcpy1}.}
+\label{tab:convoSepSh2}
+\end{table}
+
+\lstinputlisting[label={lst:convoSepSh},caption=data copy between the calls to 1-D convolution kernels achieving a 2-D separable convolution operation.]{Chapters/chapter4/code/convoSepSh.cu}
+\lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a vertical 1-D convolution operation after a preloading \index{Prefetching} of data in shared memory.]{Chapters/chapter4/code/convoSepShV.cu}
+\lstinputlisting[label={lst:convoSepShH},caption=CUDA kernel achieving a horizontal 1-D convolution operation after a preloading of data in shared memory.]{Chapters/chapter4/code/convoSepShH.cu}
+
+\section{Conclusion}
+Extensively detailing the various techniques that may be applied when designing a median or a convolution operation on GPU has enabled us to determine that:
+\begin{itemize}
+\item the use of registers with direct data fetching from texture often allows kernels to run faster than those which use the more conventional way of prefetching data from texture
memory and storing them into shared memory.
+\item increasing the pixel count processed by each thread brings important speedups. In this case, if neighboring windows overlap, optimized direct data fetching from texture will likely outperform the shared memory prefetching technique. That is the case for generic convolution kernels.
+\item coding such optimized data fetching is not straightforward. Consequently, we are planning to provide a kernel code generator that will make our kernels more accessible to GPU users.
+\end{itemize}
+The presented kernels, optimized for a C2070 card, achieve up to 2138~Mpix/s including data transfers, which comes close to the absolute maximum throughput value allowed by the Fermi architecture. The next GPU generation (Kepler) may allow us not only to benefit from its new dynamic parallelism capability to increase the parallelism level of our kernels, but also to take advantage of the increased register count allowed per thread, which would allow us, for example, to extend our register-only median filter technique to larger mask sizes.
+ +\putbib[Chapters/chapter4/biblio4] diff --git a/BookGPU/Chapters/chapter4/code/convoGene3Reg8.cu b/BookGPU/Chapters/chapter4/code/convoGene3Reg8.cu new file mode 100755 index 0000000..ba5ea2e --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoGene3Reg8.cu @@ -0,0 +1,31 @@ +__global__ void kernel_convoGene3Reg8( unsigned char *output, int j_dim) +{ + float outval0=0.0 ; + float n0,n1,n2,n3,n4,n5,n6,n7,n8 ; + // convolution mask values + n0 = (1.0/9) ; + n1 = (1.0/9) ; + n2 = (1.0/9) ; + n3 = (1.0/9) ; + n4 = (1.0/9) ; + n5 = (1.0/9) ; + n6 = (1.0/9) ; + n7 = (1.0/9) ; + n8 = (1.0/9) ; + + // absolute base point coordinates + int j = __mul24(blockIdx.x, blockDim.x) + threadIdx.x ; + int i = __mul24(blockIdx.y, blockDim.y) + threadIdx.y ; + // weighted sum + outval0 = n8*tex2D(tex_img_inc, j-1, i-1 ) + + n7*tex2D(tex_img_inc, j , i-1 ) + + n6*tex2D(tex_img_inc, j+1, i-1 ) + + n5*tex2D(tex_img_inc, j-1, i ) + + n4*tex2D(tex_img_inc, j , i ) + + n3*tex2D(tex_img_inc, j+1, i ) + + n2*tex2D(tex_img_inc, j-1, i+1 ) + + n1*tex2D(tex_img_inc, j , i+1 ) + + n0*tex2D(tex_img_inc, j+1, i+1 ) ; + + output[ __mul24(i, j_dim) + j ] = (unsigned char) outval0 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoGene3Reg8.cu~ b/BookGPU/Chapters/chapter4/code/convoGene3Reg8.cu~ new file mode 100755 index 0000000..255feb5 --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoGene3Reg8.cu~ @@ -0,0 +1,31 @@ +__global__ void kernel_convoGene3Reg8( unsigned char *output, int i_dim, int j_dim) +{ + float outval0=0.0 ; + float n0,n1,n2,n3,n4,n5,n6,n7,n8 ; + // convolution mask values + n0 = (1.0/9) ; + n1 = (1.0/9) ; + n2 = (1.0/9) ; + n3 = (1.0/9) ; + n4 = (1.0/9) ; + n5 = (1.0/9) ; + n6 = (1.0/9) ; + n7 = (1.0/9) ; + n8 = (1.0/9) ; + + // absolute base point coordinates + int j = __mul24(blockIdx.x, blockDim.x) + threadIdx.x ; + int i = __mul24(blockIdx.y, blockDim.y) + threadIdx.y ; + // weighted sum + outval0 = n8*tex2D(tex_img_inc, j-1, i-1 ) + + n7*tex2D(tex_img_inc, j , i-1 
) + + n6*tex2D(tex_img_inc, j+1, i-1 ) + + n5*tex2D(tex_img_inc, j-1, i ) + + n4*tex2D(tex_img_inc, j , i ) + + n3*tex2D(tex_img_inc, j+1, i ) + + n2*tex2D(tex_img_inc, j-1, i+1 ) + + n1*tex2D(tex_img_inc, j , i+1 ) + + n0*tex2D(tex_img_inc, j+1, i+1 ) ; + + output[ __mul24(i, j_dim) + j ] = (unsigned char) outval0 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoGene8r.cu b/BookGPU/Chapters/chapter4/code/convoGene8r.cu new file mode 100644 index 0000000..349a9eb --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoGene8r.cu @@ -0,0 +1,18 @@ +__global__ void kernel_convoGene8r( unsigned char *output, int j_dim, int r) +{ + int ic, jc ; + int k=2*r+1 ; + float outval0=0.0 ; + + // absolute coordinates of base point + int j = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + + // convolution computation + for (ic=-r ; ic<=r ; ic++) + for(jc=-r ; jc<=r ; jc++) + outval0 += mask[ __umul24(ic+r,k)+jc+r ] + *tex2D(tex_img_inc, j+jc, i+ic) ; + + output[ __umul24(i, j_dim) + j ] = (unsigned char) outval0 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoGene8r.cu~ b/BookGPU/Chapters/chapter4/code/convoGene8r.cu~ new file mode 100644 index 0000000..612e09f --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoGene8r.cu~ @@ -0,0 +1,18 @@ +__global__ void kernel_convoGene8r( unsigned char *output, int j_dim, int r) +{ + int ic, jc ; + int L=2*r+1 ; + float outval0=0.0 ; + + // absolute coordinates of base point + int j = __umul24( blockIdx.x, blockDim.x ) + threadIdx.x ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + + // convolution computation + for (ic=-r ; ic<=r ; ic++) + for(jc=-r ; jc<=r ; jc++) + outval0 += masque[ __umul24(ic+r,L)+jc+r ] + *tex2D(tex_img_inc, j+jc, i+ic) ; + + output[ __umul24(i, j_dim) + j ] = (unsigned char) outval0 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoGene8x8pL3.cu b/BookGPU/Chapters/chapter4/code/convoGene8x8pL3.cu new file mode 100644 index 0000000..ce619d7 --- /dev/null +++ 
b/BookGPU/Chapters/chapter4/code/convoGene8x8pL3.cu @@ -0,0 +1,62 @@ +__global__ void kernel_convoGene8x8pL3( unsigned char *output, int j_dim ) +{ + int ic, jc ; + const int k=3 ; + unsigned char pix ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0 ; + float outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + + // coordonnees absolues du point de base en haut a gauche + int j = ( __umul24( blockIdx.x, blockDim.x) + threadIdx.x)<< 3 ; + int i = ( __umul24( blockIdx.y, blockDim.y) + threadIdx.y) ; + + // center pixels + for (ic=0 ; ic ((2*r+1)*(2*r+1))>>1 ) break ; + } + output[ __mul24(i, j_dim) +j ] = ic ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoGeneSh1.cu b/BookGPU/Chapters/chapter4/code/convoGeneSh1.cu new file mode 100644 index 0000000..53d9ba4 --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoGeneSh1.cu @@ -0,0 +1,69 @@ +__global__ void kernel_convoNonSepSh_8p(unsigned char *output, int j_dim, int r) +{ + int ic, jc; + int k = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; + int tidX = threadIdx.x<<3 ; + + // absolute coordinates of packet's base point + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x) << 3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<3 ; // block's base point + int idx = __umul24(i,j_dim) + j ; // absolute index + int idrow = threadIdx.y*(bdimX+k-1) ; // line's offset in sh mem + + // shared memory declaration + extern __shared__ unsigned char roi8p[]; + + // top left + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i-r) ; + + // top right + if ( threadIdx.x < r ) + { + roi8p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r ) ; + roi8p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i-r ) ; + } + // bottom left + if ( threadIdx.y < k-1 ) + { + 
idrow = (threadIdx.y+blockDim.y)*(bdimX+k-1) ; + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D( tex_img_inc, j-r+p , i+blockDim.y-r ) ; + // bottom right + if ( threadIdx.x < r ) + { + roi8p[ idrow + bdimX +threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX +threadIdx.x, i+blockDim.y-r ) ; + roi8p[ idrow + bdimX +threadIdx.x +r] = tex2D( tex_img_inc, j0 +bdimX +threadIdx.x, i+blockDim.y-r ) ; + } + } + __syncthreads(); + + // computations + for (ic=0 ; ic global mem + output[ idx++ ] = outval0 ; + output[ idx++ ] = outval1 ; + output[ idx++ ] = outval2 ; + output[ idx++ ] = outval3 ; + output[ idx++ ] = outval4 ; + output[ idx++ ] = outval5 ; + output[ idx++ ] = outval6 ; + output[ idx++ ] = outval7 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoGeneSh1.cu~ b/BookGPU/Chapters/chapter4/code/convoGeneSh1.cu~ new file mode 100644 index 0000000..46e2120 --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoGeneSh1.cu~ @@ -0,0 +1,69 @@ +__global__ void kernel_convoNonSepSh_8p(unsigned char *output, int j_dim, int r) +{ + int ic, jc; + int k = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; + int tidX = threadIdx.x<<3 ; + + // absolute coordinates of packet's base point + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x) << 3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<3 ; // block's base point + int idx = __umul24(i,j_dim) + j ; // absolute index + int idrow = threadIdx.y*(bdimX+k-1) ; // line's offset in sh mem + + // shared memory declaration + extern __shared__ unsigned char roi8p[]; + + // top left + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i-r) ; + + // top right + if ( threadIdx.x < r ) + { + roi8p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r ) ; + roi8p[ idrow + bdimX + threadIdx.x +r ] = tex2D( 
tex_img_inc, j0 +bdimX+threadIdx.x , i-r ) ; + } + // bottom left + if ( threadIdx.y < k-1 ) + { + idrow = (threadIdx.y+blockDim.y)*(bdimX+k-1) ; + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D( tex_img_inc, j-r+p , i+blockDim.y-r ) ; + // bottom right + if ( threadIdx.x < r ) + { + roi8p[ idrow + bdimX +threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX +threadIdx.x, i+blockDim.y-r ) ; + roi8p[ idrow + bdimX +threadIdx.x +r] = tex2D( tex_img_inc, j0 +bdimX +threadIdx.x, i+blockDim.y-r ) ; + } + } + __syncthreads(); + + // computations + for (ic=0 ; ic global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + output[ idx+2 ] = outval2 ; + output[ idx+3 ] = outval3 ; + output[ idx+4 ] = outval4 ; + output[ idx+5 ] = outval5 ; + output[ idx+6 ] = outval6 ; + output[ idx+7 ] = outval7 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoGeneSh1~ b/BookGPU/Chapters/chapter4/code/convoGeneSh1~ new file mode 100644 index 0000000..386b586 --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoGeneSh1~ @@ -0,0 +1,73 @@ +__global__ void kernel_convoNonSepSh_8p(unsigned char *output, int i_dim, int j_dim, int r) +{ + int idb, ic, jc; + int L = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; + int tidX = threadIdx.x<<3 ; + + // coordonnees absolues du point de base + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<3 ; + + int idx = __umul24(i,j_dim) + j ; // indice dans l'image + + + // chargement en smem + int idrow = threadIdx.y*(bdimX+L-1) ; + + extern __shared__ unsigned char roi8p[]; + + // bloc 0 (en haut à gauche) + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i-r) ; + + // bloc 1 (en haut à droite)... 
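The 8-pixel-per-thread shared-memory kernels stage a region of interest (ROI) of (8*blockDim.x + 2r) x (blockDim.y + 2r) pixels around each tile, and that footprint is what must be passed as the dynamic shared-memory size at launch. A minimal CPU-side sketch of the computation (plain C; the helper name is hypothetical, not taken from the chapter's host code):

```c
#include <assert.h>

/* Sketch: shared-memory footprint of one thread block for an
   8-pixel-per-thread kernel with mask radius r. The name roi_bytes is
   illustrative, not from the chapter's host code. */
static unsigned int roi_bytes(unsigned int blockDimX,
                              unsigned int blockDimY,
                              unsigned int r)
{
    unsigned int k    = 2 * r + 1;               /* mask side            */
    unsigned int roiW = 8 * blockDimX + (k - 1); /* tile width + apron   */
    unsigned int roiH = blockDimY + (k - 1);     /* tile height + apron  */
    return roiW * roiH * (unsigned int)sizeof(unsigned char);
}
```

For an 8x8 block and r = 1 this gives a 66x10 byte ROI, matching the row stride `bdimX + k - 1` used when the kernel indexes `roi8p`.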
+ if ( threadIdx.x < r ) //...or rather the missing part of it + { + roi8p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r ) ; + roi8p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i-r ) ; + } + // block 2 (bottom left) + if ( threadIdx.y < L-1 ) + { + idrow = (threadIdx.y+blockDim.y)*(bdimX+L-1) ; + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D( tex_img_inc, j-r+p , i+blockDim.y-r ) ; + + // block 4 (bottom right)... + if ( threadIdx.x < r ) //...or the missing part of it + { + roi8p[ idrow + bdimX +threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX +threadIdx.x, i+blockDim.y-r ) ; + roi8p[ idrow + bdimX +threadIdx.x +r] = tex2D( tex_img_inc, j0 +bdimX +threadIdx.x, i+blockDim.y-r ) ; + } + } + __syncthreads(); + + // convolution computations + for (ic=0 ; ic global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + output[ idx+2 ] = outval2 ; + output[ idx+3 ] = outval3 ; + output[ idx+4 ] = outval4 ; + output[ idx+5 ] = outval5 ; + output[ idx+6 ] = outval6 ; + output[ idx+7 ] = outval7 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoSepSh.cu b/BookGPU/Chapters/chapter4/code/convoSepSh.cu new file mode 100644 index 0000000..2688ecd --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoSepSh.cu @@ -0,0 +1,6 @@ +dimGrid = dim3( (L/dimBlock.x)/8, H/dimBlock.y, 1 ) ; + +kernel_convoSepShx8pV<<< dimGrid, dimBlock, 8*dimBlock.x*(dimBlock.y+2*r)*sizeof(char) >>>(d_outc, L, r ); +cudaMemcpyToArray( array_img_inc, 0, 0, d_outc, H*L*sizeof(char) , cudaMemcpyDeviceToDevice) ; +kernel_convoSepShx8pH<<< dimGrid, dimBlock, 8*(dimBlock.x+2*r)*dimBlock.y*sizeof(char) >>>(d_outc, L, r ); + \ No newline at end of file diff --git a/BookGPU/Chapters/chapter4/code/convoSepSh.cu~ b/BookGPU/Chapters/chapter4/code/convoSepSh.cu~ new file mode 100644 index 0000000..1d87e39 --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoSepSh.cu~ @@ -0,0 +1,6 @@ +dimGrid = dim3( (L/dimBlock.x)/8,
H/dimBlock.y, 1 ) ; + +kernel_convoSepShx8pV<<< dimGrid, dimBlock, 8*dimBlock.x*(dimBlock.y+2*r)*sizeof(char) >>>(d_outc, H, L, r ); +cudaMemcpyToArray( array_img_inc, 0, 0, d_outc, H*L*sizeof(char) , cudaMemcpyDeviceToDevice) ; +kernel_convoSepShx8pH<<< dimGrid, dimBlock, 8*(dimBlock.x+2*r)*dimBLock.y*sizeof(char) >>>(d_outc, L, H, r ); + \ No newline at end of file diff --git a/BookGPU/Chapters/chapter4/code/convoSepShH.cu b/BookGPU/Chapters/chapter4/code/convoSepShH.cu new file mode 100644 index 0000000..19f533e --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoSepShH.cu @@ -0,0 +1,58 @@ +__global__ void kernel_convoSepShx8pH(unsigned char *output, int j_dim, int r) +{ + int ic, jc, p; + int k = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0 ; + float outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; // all packets width + int tidX = threadIdx.x<<3 ; // one packet offset + + // absolute coordinates of one packet base point + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<3 ; + // absolute index in the image + int idx = __umul24(i,j_dim) + j ; + + // offset of one ROI row in shared memory + int idrow = threadIdx.y*(bdimX+k-1) ; + + extern __shared__ unsigned char roi8p[]; + + // top left block + for (p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i) ; + // top right block + if ( threadIdx.x < r ) + { + roi8p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i ) ; + roi8p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i ) ; + } + + __syncthreads(); + + // horizontal convolution + for (jc=0 ; jc global mem + output[ idx++ ] = outval0 ; + output[ idx++ ] = outval1 ; + output[ idx++ ] = outval2 ; + output[ idx++ ] = outval3 ; + output[ idx++ ] = outval4 ; + output[ idx++ ] = outval5 ; + output[ idx++ ] = 
outval6 ; + output[ idx++ ] = outval7 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoSepShH.cu~ b/BookGPU/Chapters/chapter4/code/convoSepShH.cu~ new file mode 100644 index 0000000..8f3e73d --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoSepShH.cu~ @@ -0,0 +1,59 @@ +__global__ void kernel_convoSepShx8pH(unsigned char *output, int j_dim, int r) +{ + int ic, jc, p; + int k = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; + int tidX = threadIdx.x<<3 ; + + // coordonnees absolues du point de base + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<3 ; + int idx = __umul24(i,j_dim) + j ; // indice dans l'image + + + // chargement en smem + int idrow = threadIdx.y*(bdimX+k-1) ; + + extern __shared__ unsigned char roi8p[]; + + // bloc 0 (a gauche) + for (p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i) ; + + // a droite + if ( threadIdx.x < r ) //...ou plutot ce qu'il en manque + { + roi8p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i ) ; + roi8p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i ) ; + } + + __syncthreads(); + + // calculs de convolution + // passe horizontale + for (jc=0 ; jc global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + output[ idx+2 ] = outval2 ; + output[ idx+3 ] = outval3 ; + output[ idx+4 ] = outval4 ; + output[ idx+5 ] = outval5 ; + output[ idx+6 ] = outval6 ; + output[ idx+7 ] = outval7 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoSepShV.cu b/BookGPU/Chapters/chapter4/code/convoSepShV.cu new file mode 100644 index 0000000..a8bf71e --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoSepShV.cu @@ -0,0 +1,57 @@ +__global__ void kernel_convoSepShx8pV(unsigned char *output, int j_dim, int r) +{ + 
int ic, jc, p; + int k = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0 ; + float outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; // all packets width + int tidX = threadIdx.x<<3 ; // one packet offset + + // absolute coordinates of the base point + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + // absolute index in the image + int idx = __umul24(i,j_dim) + j ; + // offset of one ROI row in shared memory + int idrow = threadIdx.y*bdimX ; + + extern __shared__ unsigned char roi8p[]; + + // top block + for (p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j+p , i-r) ; + + // bottom block + if ( threadIdx.y < k-1 ) + { + idrow = (threadIdx.y+blockDim.y)*bdimX ; + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D( tex_img_inc, j+p , i+blockDim.y-r ) ; + } + __syncthreads(); + + // vertical convolution + for (ic=0 ; ic global mem + output[ idx++ ] = outval0 ; + output[ idx++ ] = outval1 ; + output[ idx++ ] = outval2 ; + output[ idx++ ] = outval3 ; + output[ idx++ ] = outval4 ; + output[ idx++ ] = outval5 ; + output[ idx++ ] = outval6 ; + output[ idx++ ] = outval7 ; +} diff --git a/BookGPU/Chapters/chapter4/code/convoSepShV.cu~ b/BookGPU/Chapters/chapter4/code/convoSepShV.cu~ new file mode 100644 index 0000000..4437a56 --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/convoSepShV.cu~ @@ -0,0 +1,56 @@ +__global__ void kernel_convoSepShx8pV(unsigned char *output, int j_dim, int r) +{ + int ic, jc, p; + int k = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; + int tidX = threadIdx.x<<3 ; + + // absolute coordinates of the base point + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + // absolute index in the image + int idx = __umul24(i,j_dim) + j ; 
+ // offset of one ROI row in shared memory + int idrow = threadIdx.y*bdimX ; + + extern __shared__ unsigned char roi8p[]; + + // top block + for (p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j+p , i-r) ; + + // bottom block + if ( threadIdx.y < k-1 ) + { + idrow = (threadIdx.y+blockDim.y)*bdimX ; + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D( tex_img_inc, j+p , i+blockDim.y-r ) ; + } + __syncthreads(); + + // vertical convolution + for (ic=0 ; ic global mem + output[ idx++ ] = outval0 ; + output[ idx++ ] = outval1 ; + output[ idx++ ] = outval2 ; + output[ idx++ ] = outval3 ; + output[ idx++ ] = outval4 ; + output[ idx++ ] = outval5 ; + output[ idx++ ] = outval6 ; + output[ idx++ ] = outval7 ; +} diff --git a/BookGPU/Chapters/chapter4/code/levelines_kernels.cu b/BookGPU/Chapters/chapter4/code/levelines_kernels.cu new file mode 100644 index 0000000..2c80950 --- /dev/null +++ b/BookGPU/Chapters/chapter4/code/levelines_kernels.cu @@ -0,0 +1,3793 @@ +#define PI 3.1415926 + +__constant__ float masque[256] ; +/* noyau gaussien 3x3 *//* +__constant__ float noyau[] = {1.0/16,2.0/16,1.0/16, + 2.0/16,4.0/16,2.0/16, + 1.0/16,2.0/16,1.0/16} ; + __constant__ int rnoyau = 1 ;*/ + +/* noyau gaussien 5x5 *//* +__constant__ int noyau[] = {1,4,6,4,1, 4,16,24,16,4, 6,24,36,24,6, 4,16,24,16,4, 1,4,6,4,1} ; +__constant__ int sumNoyau = 256 ; +__constant__ int rnoyau = 2 ; +__constant__ int NNoyau = 25 ; + */ +/* noyau 5x5*/ +__constant__ float noyau[] = {1.0/25,1.0/25,1.0/25,1.0/25,1.0/25, + 1.0/25,1.0/25,1.0/25,1.0/25,1.0/25, + 1.0/25,1.0/25,1.0/25,1.0/25,1.0/25, + 1.0/25,1.0/25,1.0/25,1.0/25,1.0/25, + 1.0/25,1.0/25,1.0/25,1.0/25,1.0/25} ; +__constant__ int rnoyau = 2 ; + +/* noyau 7x7 *//* +__device__ __constant__ float noyau[] = {1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49, + 1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49, + 1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49, + 1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49, + 
1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49, + 1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49, + 1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49,1.0/49} ; +__device__ __constant__ int rnoyau = 3 ; + */ +//__device__ __constant__ int NNoyau = 49 ; + +__device__ __constant__ int noyauH[] = {1,1,1,1,1,1,1} ; +__device__ __constant__ int sumNoyauH = 7 ; +__device__ __constant__ int rnoyauH = 3 ; +// average 7x7 +/* +__device__ __constant__ int noyauV[] = {1,1,1,1,1,1,1} ; +__device__ __constant__ int sumNoyauV = 7 ; +__device__ __constant__ int rnoyauV = 3 ; +*/ +// gaussian 5x5 +__device__ __constant__ int noyauV[] = {1,4,6,4,1} ; +__device__ __constant__ int rnoyauV = 2 ; +__constant__ int sumKV = 16 ; +//__device__ __constant__ int noyauV[] = {1,1,1} ; +//__constant__ int sumNoyauV = 16 ; + +// declarations des textures +texture tex_img_in ; +texture tex_img_ins ; +texture tex_img_inc ; +texture tex_img_estim ; +texture tex_noyau ; +texture tex_noyauConvof ; +texture tex_noyauConvos ; +texture tex_kern ; + + +/*************************************************************** + * kernel identite pour mesures de debit max effectif possible + ***************************************************************/ +__global__ void kernel_ident( unsigned short*output, int i_dim, int j_dim) +{ + // coordonnees absolues du point de base + int j = __mul24(blockIdx.x,blockDim.x) + threadIdx.x ; + int i = __mul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int idx = __mul24(i, j_dim) + j ; // indice dans l'image + + output[ idx ] = tex2D(tex_img_ins, j, i) ; + +} + +__global__ void kernel_ident2p( unsigned short *output, int i_dim, int j_dim) +{ + // coordonnees absolues du point de base + int j = __umul24(__umul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + + output[ __umul24(i, j_dim) + j ] = tex2D(tex_img_inc, j, i) ; + output[ __umul24(i, j_dim) + j +1 ] = tex2D(tex_img_inc, j+1, i) ; + +} + + +__global__ void 
kernel_ident( unsigned char *output, int i_dim, int j_dim) +{ + // absolute coordinates of the base point + int j = __mul24(blockIdx.x,blockDim.x) + threadIdx.x ; + int i = __mul24( blockIdx.y, blockDim.y) + threadIdx.y ; + + output[ __umul24(i, j_dim) + j ] = tex2D(tex_img_inc, j, i) ; + +} + +__global__ void kernel_ident2p( unsigned char *output, int i_dim, int j_dim) +{ + // absolute coordinates of the base point + int j = __umul24(__umul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + + output[ __umul24(i, j_dim) + j ] = tex2D(tex_img_inc, j, i) ; + output[ __umul24(i, j_dim) + j +1 ] = tex2D(tex_img_inc, j+1, i) ; + +} + +/** + * + * \brief computes the initial estimate + * \author zulu - AND + * + * \param[in] i_dim image height + * \param[in] j_dim image width + * \param[in] r radius of the averaging window + * + * \param[out] output initial estimate image + * + * Texture version: the original image is assumed to be bound to a texture. + * The computed estimate corresponds to an averaging filter of radius r. + * Executed on 2D thread blocks and a 2D grid + * matching the image dimensions.
+ * + */ + +__global__ void kernel_moyenne_shtex( unsigned int *output, unsigned int i_dim, unsigned int j_dim, unsigned int r) +{ + int idl, idc ; + // coordinates within the block + int ib = threadIdx.y ; + int jb = threadIdx.x ; + int idb_row = __mul24(ib,blockDim.x) + jb ; // index within the block + + // absolute coordinates of the point + int j = __mul24(blockIdx.x,blockDim.x) + jb ; + int i = __mul24(blockIdx.y,blockDim.y) + ib ; + int idx = __mul24(i,j_dim) + j ; // index in the image + int off = __mul24(r,blockDim.x) ; // block offset in the shared-memory buffer + + + extern __shared__ unsigned int shmem[] ; + volatile unsigned int * buffer = shmem ; + volatile unsigned int * bufferV = &buffer[ blockDim.x*blockDim.y + 2*off ] ; + + + /*********************************************************************************** + * HORIZONTAL PASS + ***********************************************************************************/ + // load the pixels of the band above the block + if (ib < r) + { + buffer[ idb_row ] = 0 ; + for (idc=j-r ; idc<=j+r ; idc++ ) + buffer[ idb_row ] += tex2D(tex_img_in, idc, i-r) ; + } + // load the pixels of the band below the block + if (ib >= (blockDim.y-r)) + { + buffer[ idb_row + 2*off ] = 0 ; + for (idc=j-r ; idc<=j+r ; idc++ ) + buffer[ idb_row + 2*off ] += tex2D(tex_img_in, idc, i+r) ; + } + + // load the pixels of the block itself + buffer[ idb_row + off ] = 0 ; + for (idc=j-r ; idc<=j+r ; idc++ ) + buffer[ idb_row + off ] += tex2D(tex_img_in, idc, i) ; + + __syncthreads() ; + /***************************************************************************** + * VERTICAL PASS + *****************************************************************************/ + bufferV[ idb_row ] = 0 ; + for (idl=0 ; idl<=2*r ; idl++ ) + bufferV[ idb_row ] += buffer[ (ib+idl)*blockDim.x + jb ] ; + + __syncthreads() ; + + output[idx] = bufferV[ idb_row ]/((2*r+1)*(2*r+1)) ; // not exact at the image borders, but that is acceptable here! 
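The averaging kernel relies on the separability of the box filter: horizontal window sums are buffered first, then summed vertically. That equivalence with the direct 2D sum can be checked on the CPU (plain C; the function names and the ramp test image are illustrative):

```c
#include <assert.h>

#define W 8
#define H 8

/* Fill a small ramp test image: pixel (i,j) = i*W + j. Illustrative data. */
void ramp(unsigned int img[H][W])
{
    for (int i = 0; i < H; i++)
        for (int j = 0; j < W; j++)
            img[i][j] = (unsigned int)(i * W + j);
}

/* Direct (2r+1)x(2r+1) box sum centred on (i,j). */
unsigned int box_direct(const unsigned int img[H][W], int i, int j, int r)
{
    unsigned int s = 0;
    for (int di = -r; di <= r; di++)
        for (int dj = -r; dj <= r; dj++)
            s += img[i + di][j + dj];
    return s;
}

/* Two-pass version: horizontal window sums buffered per row (as the GPU
   kernel buffers them in shared memory), then a vertical sum of the
   partial sums. */
unsigned int box_2pass(const unsigned int img[H][W], int i, int j, int r)
{
    unsigned int hsum[H];                    /* one partial sum per row */
    for (int di = -r; di <= r; di++) {
        hsum[di + r] = 0;
        for (int dj = -r; dj <= r; dj++)     /* horizontal pass */
            hsum[di + r] += img[i + di][j + dj];
    }
    unsigned int s = 0;
    for (int k = 0; k <= 2 * r; k++)         /* vertical pass */
        s += hsum[k];
    return s;
}
```

Both functions return the same window sum for any interior pixel; the two-pass form is what lets each GPU thread reuse the horizontal sums computed by its neighbours.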
+ +} + +/** + * + * \brief calcule la convolution avec un noyau carre + * \author zulu - AND + * + * \param[in] L Largeur de l'image + * \param[in] i_dim Hauteur de l'image + * \param[in] j_dim coté de la fenêtre de moyenneage + * + * \param[out] output Image estimee initiale + * + * Version texture : l'img originale est supposée en texture. + * Execution sur des blocs de threads 2D et une grille 2D + * correspondant aux dimensions de l'image. + * + */ + +// convolution non separable +// image en texture et noyau en mem constante +__global__ void kernel_convoGene8( unsigned char *output, int i_dim, int j_dim) +{ + int ic, jc ; + int r = 2 ; + int L=5 ; + float outval0=0.0 ; + + // coordonnees absolues du point de base + int j = __mul24(blockIdx.x,blockDim.x) + threadIdx.x ; + int i = __mul24( blockIdx.y, blockDim.y) + threadIdx.y ; + + for (ic=0 ; ic masque[ __mul24(ic,L)+jc ]*tex2D(tex_img_inc, j-r+jc, i-r+ic) + for (ic=1 ; ic masque[ __mul24(ic,L)+jc ]*tex2D(tex_img_inc, j-r+jc, i-r+ic) + for (ic=1 ; ic masque[ __mul24(ic,L)+jc ]*tex2D(tex_img_inc, j-r+jc, i-r+ic) + for (ic=0 ; ic masque[ __mul24(ic,L)+jc ]*tex2D(tex_img_inc, j-r+jc, i-r+ic) + for (ic=0 ; ic masque[ __mul24(ic,L)+jc ]*tex2D(tex_img_inc, j-r+jc, i-r+ic) + for (ic=1 ; ic global mem + output[ idx ] = outval0 ; + +} + +__global__ void kernel_convoNonSepSh_2p(unsigned short *output, int i_dim, int j_dim) +{ + int idb, ic, jc ; + int L = 2*rnoyau+1 ; + float outval0=0.0, outval1=0.0 ; + int bdimX = blockDim.x<<1 ; + int tidX = threadIdx.x<<1 ; + + // coordonnees absolues du point de base + int j = __mul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __mul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int idx = __mul24(i,j_dim) + j ; // indice dans l'image + + // chargement en smem + int idrow = threadIdx.y*(bdimX+L-1) ; + + extern __shared__ int roi[]; + + // bloc 0 (en haut à gauche) + roi[ idrow + tidX ] = tex2D(tex_img_ins, j-rnoyau, i-rnoyau) ; + roi[ idrow + tidX +1] = 
tex2D(tex_img_ins, j-rnoyau+1, i-rnoyau) ; + + // bloc 1 (en haut à droite)... + if ( threadIdx.x < rnoyau ) //...ou plutot ce qu'il en manque + { + roi[ idrow + tidX+bdimX ] = tex2D( tex_img_ins, j+bdimX-rnoyau, i-rnoyau ) ; + roi[ idrow + tidX+bdimX +1] = tex2D( tex_img_ins, j+bdimX-rnoyau+1, i-rnoyau ) ; + } + + // bloc 2 ( en bas à gauche) + if ( threadIdx.y < L-1 ) + { + idrow = (threadIdx.y+blockDim.y)*(bdimX+L-1) ; + roi[ idrow + tidX ] = tex2D( tex_img_ins, j-rnoyau, i+blockDim.y-rnoyau ) ; + roi[ idrow + tidX +1] = tex2D( tex_img_ins, j-rnoyau+1, i+blockDim.y-rnoyau ) ; + //bloc 4 ( en bas à doite )... + + if ( 2*threadIdx.x < L-1 ) //...ou ce qu'il en manque + { + roi[ idrow + tidX+bdimX ] = tex2D( tex_img_ins, j+bdimX-rnoyau, i+blockDim.y-rnoyau ) ; + roi[ idrow + tidX+bdimX +1] = tex2D( tex_img_ins, j+bdimX-rnoyau+1, i+blockDim.y-rnoyau ) ; + } + } + __syncthreads(); + + // calculs de convolution + for (ic=0 ; ic global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; +} + +__global__ void kernel_convoNonSepSh_2p(unsigned char *output, int i_dim, int j_dim, int r) +{ + int idb, ic, jc ; + int L = 2*r+1 ; + float outval0=0.0, outval1=0.0 ; + int bdimX = blockDim.x<<1 ; + int tidX = threadIdx.x<<1 ; + + // coordonnees absolues du point de base + int j = __umul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int idx = __umul24(i,j_dim) + j ; // indice dans l'image + + // chargement en smem + int idrow = threadIdx.y*(bdimX+L-1) ; + + extern __shared__ int roi[]; + + // bloc 0 (en haut à gauche) + roi[ idrow + tidX ] = tex2D( tex_img_inc, j-r , i-r) ; + roi[ idrow + tidX +1 ] = tex2D( tex_img_inc, j-r+1, i-r) ; + + // bloc 1 (en haut à droite)... 
+ if ( threadIdx.x < r ) //...ou plutot ce qu'il en manque + { + roi[ idrow + bdimX + tidX ] = tex2D( tex_img_inc, j+bdimX-r , i-r ) ; + roi[ idrow + bdimX + tidX +1 ] = tex2D( tex_img_inc, j+bdimX-r+1, i-r ) ; + } + + // bloc 2 ( en bas à gauche) + if ( threadIdx.y < L-1 ) + { + idrow = (threadIdx.y+blockDim.y)*(bdimX+L-1) ; + roi[ idrow + tidX ] = tex2D( tex_img_inc, j-r, i+blockDim.y-r ) ; + roi[ idrow + tidX +1] = tex2D( tex_img_inc, j-r+1, i+blockDim.y-r ) ; + //bloc 4 ( en bas à doite )... + + if ( threadIdx.x < r ) //...ou ce qu'il en manque + { + roi[ idrow + tidX+bdimX ] = tex2D( tex_img_inc, j+bdimX-r , i+blockDim.y-r ) ; + roi[ idrow + tidX+bdimX +1] = tex2D( tex_img_inc, j+bdimX-r+1, i+blockDim.y-r ) ; + } + } + __syncthreads(); + + // calculs de convolution + for (ic=0 ; ic global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; +} + +//4 pixels en ligne par thread +__global__ void kernel_convoNonSepSh_4p(unsigned char *output, int i_dim, int j_dim, int r) +{ + int idb, ic, jc ; + int L = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0 ; + int bdimX = blockDim.x<<2 ; + int tidX = threadIdx.x<<2 ; + + // coordonnees absolues du point de base + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<2 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<2 ; + + int idx = __umul24(i,j_dim) + j ; // indice dans l'image + + + // chargement en smem + int idrow = threadIdx.y*(bdimX+L-1) ; + + extern __shared__ unsigned char roi4p[]; + + // bloc 0 (en haut à gauche) + roi4p[ idrow + tidX ] = tex2D(tex_img_inc, j-r , i-r) ; + roi4p[ idrow + tidX +1] = tex2D(tex_img_inc, j-r+1, i-r) ; + roi4p[ idrow + tidX +2] = tex2D(tex_img_inc, j-r+2, i-r) ; + roi4p[ idrow + tidX +3] = tex2D(tex_img_inc, j-r+3, i-r) ; + + // bloc 1 (en haut à droite)... 
+ if ( threadIdx.x < r ) //...ou plutot ce qu'il en manque + { + roi4p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r ) ; + roi4p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i-r ) ; + } + // bloc 2 ( en bas à gauche) + if ( threadIdx.y < L-1 ) + { + idrow = (threadIdx.y+blockDim.y)*(bdimX+L-1) ; + roi4p[ idrow + tidX ] = tex2D( tex_img_inc, j-r , i+blockDim.y-r ) ; + roi4p[ idrow + tidX +1] = tex2D( tex_img_inc, j-r+1, i+blockDim.y-r ) ; + roi4p[ idrow + tidX +2] = tex2D( tex_img_inc, j-r+2, i+blockDim.y-r ) ; + roi4p[ idrow + tidX +3] = tex2D( tex_img_inc, j-r+3, i+blockDim.y-r ) ; + + //bloc 4 ( en bas à doite )... + if ( threadIdx.x < r ) //...ou ce qu'il en manque + { + roi4p[ idrow + bdimX +threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX +threadIdx.x, i+blockDim.y-r ) ; + roi4p[ idrow + bdimX +threadIdx.x +r] = tex2D( tex_img_inc, j0 +bdimX +threadIdx.x, i+blockDim.y-r ) ; + } + } + __syncthreads(); + + // calculs de convolution + for (ic=0 ; ic global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + output[ idx+2 ] = outval2 ; + output[ idx+3 ] = outval3 ; +} + + +//4 pixels en carre par thread +__global__ void kernel_convoNonSepSh_4pcarre(unsigned char *output, int i_dim, int j_dim, int r) +{ + int idb, ic, jc ; + int L = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0 ; + int bdimX = blockDim.x<<1 ; + int tidX = threadIdx.x<<1 ; + int bdimY = blockDim.y<<1 ; + int tidY = threadIdx.y<<1 ; + + // coordonnees absolues du point de base + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<1 ; + int i = (__umul24( blockIdx.y, blockDim.y) + threadIdx.y)<<1 ; + int j0= __umul24(blockIdx.x,blockDim.x)<<1 ; + int i0= __umul24(blockIdx.y,blockDim.y)<<1 ; + + int idx = __umul24(i,j_dim) + j ; // indice dans l'image + + + // chargement en smem + int idrowBase = tidY*(bdimX+L-1) ; + int idrowSecond = (tidY+1)*(bdimX+L-1) ; + + + extern __shared__ unsigned char 
roi4p2[]; + + // bloc 0 (en haut à gauche) + roi4p2[ idrowBase + tidX ] = tex2D(tex_img_inc, j-r , i-r ) ; + roi4p2[ idrowBase + tidX +1] = tex2D(tex_img_inc, j-r+1, i-r ) ; + roi4p2[ idrowSecond+ tidX ] = tex2D(tex_img_inc, j-r , i-r+1) ; + roi4p2[ idrowSecond+ tidX +1] = tex2D(tex_img_inc, j-r+1, i-r+1) ; + + // bloc 1 (en haut à droite)... + if ( threadIdx.x < r ) //...ou plutot ce qu'il en manque + { + roi4p2[ idrowBase + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r ) ; + roi4p2[ idrowBase + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i-r ) ; + roi4p2[ idrowSecond + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r+1 ) ; + roi4p2[ idrowSecond + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i-r+1 ) ; + } + // bloc 2 ( en bas à gauche) + if ( threadIdx.y < L-1 ) + { + idrowBase = (bdimY + threadIdx.y)*(bdimX+L-1) ; + roi4p2[ idrowBase + tidX ] = tex2D( tex_img_inc, j-r , i0-r +bdimY +threadIdx.y ) ; + roi4p2[ idrowBase + tidX +1] = tex2D( tex_img_inc, j-r+1, i0-r +bdimY +threadIdx.y ) ; + + //bloc 4 ( en bas à doite )... 
+ if ( threadIdx.x < r ) //...ou ce qu'il en manque + { + roi4p2[ idrowBase + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX +threadIdx.x , i0-r +bdimY +threadIdx.y ) ; + roi4p2[ idrowBase + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX +threadIdx.x , i0-r +bdimY +threadIdx.y ) ; + } + } + __syncthreads(); + + // calculs de convolution + for (ic=0 ; ic global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + output[ idx+j_dim ] = outval2 ; + output[ idx+j_dim+1 ] = outval3 ; +} + + +//8 pixels en ligne par thread +__global__ void kernel_convoNonSepSh_8p(unsigned char *output, int i_dim, int j_dim, int r) +{ + int idb, ic, jc; + int L = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; + int tidX = threadIdx.x<<3 ; + + // coordonnees absolues du point de base + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<3 ; + + int idx = __umul24(i,j_dim) + j ; // indice dans l'image + + + // chargement en smem + int idrow = threadIdx.y*(bdimX+L-1) ; + + extern __shared__ unsigned char roi8p[]; + + // bloc 0 (en haut à gauche) + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i-r) ; + + // bloc 1 (en haut à droite)... + if ( threadIdx.x < r ) //...ou plutot ce qu'il en manque + { + roi8p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r ) ; + roi8p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i-r ) ; + } + // bloc 2 ( en bas à gauche) + if ( threadIdx.y < L-1 ) + { + idrow = (threadIdx.y+blockDim.y)*(bdimX+L-1) ; + for (int p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D( tex_img_inc, j-r+p , i+blockDim.y-r ) ; + + //bloc 4 ( en bas à doite )... 
+		if ( threadIdx.x < r ) //...or what is missing of it
+		  {
+			roi8p[ idrow + bdimX +threadIdx.x    ] = tex2D( tex_img_inc, j0-r +bdimX +threadIdx.x, i+blockDim.y-r ) ;
+			roi8p[ idrow + bdimX +threadIdx.x +r ] = tex2D( tex_img_inc, j0   +bdimX +threadIdx.x, i+blockDim.y-r ) ;
+		  }
+	  }
+	__syncthreads();
+
+	// convolution computation
+	for (ic=0 ; ic<L ; ic++)
+	  for (jc=0 ; jc<L ; jc++)
+		{
+		  idb = __umul24(threadIdx.y+ic, bdimX+L-1) + tidX +jc ;
+		  outval0 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb   ] ;
+		  outval1 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb+1 ] ;
+		  outval2 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb+2 ] ;
+		  outval3 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb+3 ] ;
+		  outval4 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb+4 ] ;
+		  outval5 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb+5 ] ;
+		  outval6 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb+6 ] ;
+		  outval7 += noyau[ __umul24(ic,L) +jc ]*roi8p[ idb+7 ] ;
+		}
+
+	// write results -> global mem
+	output[ idx   ] = outval0 ;
+	output[ idx+1 ] = outval1 ;
+	output[ idx+2 ] = outval2 ;
+	output[ idx+3 ] = outval3 ;
+	output[ idx+4 ] = outval4 ;
+	output[ idx+5 ] = outval5 ;
+	output[ idx+6 ] = outval6 ;
+	output[ idx+7 ] = outval7 ;
+}
+
+
+// 16 pixels per row per thread
+__global__ void kernel_convoNonSepSh_16p(unsigned char *output, int i_dim, int j_dim, int r)
+{
+	int idb, ic, jc;
+	int L = 2*r+1 ;
+	float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, outval6=0.0, outval7=0.0 ;
+	float outval8=0.0, outval9=0.0, outval10=0.0, outval11=0.0, outval12=0.0, outval13=0.0, outval14=0.0, outval15=0.0 ;
+	int bdimX = blockDim.x<<4 ;
+	int tidX  = threadIdx.x<<4 ;
+
+	// absolute coordinates of the base point
+	int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<4 ;
+	int i =  __umul24( blockIdx.y, blockDim.y) + threadIdx.y ;
+	int j0=  __umul24(blockIdx.x,blockDim.x)<<4 ;
+
+	int idx = __umul24(i,j_dim) + j ; // index in the image
+
+
+	// load into shared memory
+	int idrow = threadIdx.y*(bdimX+L-1) ;
+
+	extern __shared__ unsigned char roi16p[];
+
+	// block 0 (top left)
+	for (int p=0; p<16; p++)
+	  roi16p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i-r) ;
+
+	// block 1 (top right)...
+	if ( threadIdx.x < r ) //...or rather what is missing of it
+	  {
+		roi16p[ idrow + bdimX + threadIdx.x    ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i-r ) ;
+		roi16p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0   +bdimX+threadIdx.x , i-r ) ;
+	  }
+	// block 2 (bottom left)
+	if ( threadIdx.y < L-1 )
+	  {
+		idrow = (threadIdx.y+blockDim.y)*(bdimX+L-1) ;
+		for (int p=0; p<16; p++)
+		  roi16p[ idrow + tidX +p ] = tex2D( tex_img_inc, j-r+p , i+blockDim.y-r ) ;
+
+		// block 3 (bottom right)...
+		if ( threadIdx.x < r ) //...or what is missing of it
+		  {
+			roi16p[ idrow + bdimX +threadIdx.x    ] = tex2D( tex_img_inc, j0-r +bdimX +threadIdx.x, i+blockDim.y-r ) ;
+			roi16p[ idrow + bdimX +threadIdx.x +r ] = tex2D( tex_img_inc, j0   +bdimX +threadIdx.x, i+blockDim.y-r ) ;
+		  }
+	  }
+	__syncthreads();
+
+	// convolution computation
+	for (ic=0 ; ic<L ; ic++)
+	  for (jc=0 ; jc<L ; jc++)
+		{
+		  idb = __umul24(threadIdx.y+ic, bdimX+L-1) + tidX +jc ;
+		  outval0  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb    ] ;
+		  outval1  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+1  ] ;
+		  outval2  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+2  ] ;
+		  outval3  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+3  ] ;
+		  outval4  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+4  ] ;
+		  outval5  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+5  ] ;
+		  outval6  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+6  ] ;
+		  outval7  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+7  ] ;
+		  outval8  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+8  ] ;
+		  outval9  += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+9  ] ;
+		  outval10 += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+10 ] ;
+		  outval11 += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+11 ] ;
+		  outval12 += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+12 ] ;
+		  outval13 += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+13 ] ;
+		  outval14 += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+14 ] ;
+		  outval15 += noyau[ __umul24(ic,L) +jc ]*roi16p[ idb+15 ] ;
+		}
+
+	// write results -> global mem
+	output[ idx    ] = outval0 ;
+	output[ idx+1  ] = outval1 ;
+	output[ idx+2  ] = outval2 ;
+	output[ idx+3  ] = outval3 ;
+	output[ idx+4  ] = outval4 ;
+	output[ idx+5  ] = outval5 ;
+	output[ idx+6  ] = outval6 ;
+	output[ idx+7  ] = outval7 ;
+	output[ idx+8  ] = outval8 ;
+	output[ idx+9  ] = outval9 ;
+	output[ idx+10 ] = outval10 ;
+	output[ idx+11 ] = outval11 ;
+	output[ idx+12 ] = outval12 ;
+	output[ idx+13 ] = outval13 ;
+	output[ idx+14 ] = outval14 ;
+	output[ idx+15 ] = outval15 ;
+}
+
+
+
+// non-separable convolution
+// image in texture, convolution kernel in constant memory
+// direct pixel fetches
+// 2 pixels processed per thread => better throughput
+__global__ void kernel_convoNonSep_2p( int *output, int i_dim, int j_dim)
+{
+	int idb, ic, jc ;
+	int r = rnoyau ;
+	int L = 2*r+1 ;
+	int N = L*L ;
+	float outval0=0, outval1=0 ;
+
+	// absolute coordinates of the base point
+	int j = __mul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x, 2) ;
+	int i = __mul24( blockIdx.y, blockDim.y) + threadIdx.y ;
+	int idx = __mul24(i, j_dim) + j ; // index in the image
+
+	#pragma unroll
+	for (idb=0 ; idb< N ; idb++)
+	  {
+		ic = i-r +
idb/L ; + jc = j-r + idb - (idb/L)*L ; + outval0 += ( noyau[ idb ]*tex2D(tex_img_in, jc, ic)) ; + outval1 += ( noyau[ idb ]*tex2D(tex_img_in, jc+1, ic)) ; + } + + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + + +} + + +// convolution non separable +// image en texture et noyau en mem constante +// fetch direct des pixels +// 2 pixels traités par thread => meilleur débit +__global__ void kernel_convoNonSep_2p_s( unsigned short*output, int i_dim, int j_dim) +{ + int idb, ic, jc ; + int r = rnoyau ; + int L = 2*r+1 ; + int N = L*L ; + float outval0=0, outval1=0 ; + + // coordonnees absolues du point de base + int j = __mul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x, 2) ; + int i = __mul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int idx = __mul24(i, j_dim) + j ; // indice dans l'image + + #pragma unroll + for (idb=0 ; idb< N ; idb++) + { + ic = i-r + idb/L ; + jc = j-r + idb - (idb/L)*L ; + outval0 += ( noyau[ idb ]*tex2D(tex_img_ins, jc, ic)) ; + outval1 += ( noyau[ idb ]*tex2D(tex_img_ins, jc+1, ic)) ; + } + + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + +} + + + +// convolution separable +// image en texture et noyau en mem constante +__global__ void kernel_convoSep8V( unsigned char *output, int i_dim, int j_dim, int r) +{ + int ic ; + int L=2*r+1 ; + float outval0=0.0 ; + + // coordonnees absolues du point de base + int j = __mul24( blockIdx.x, blockDim.x ) + threadIdx.x ; + int i = __mul24( blockIdx.y, blockDim.y ) + threadIdx.y ; + + //vertical + for (ic=0 ; ic global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + output[ idx+2 ] = outval2 ; + output[ idx+3 ] = outval3 ; + output[ idx+4 ] = outval4 ; + output[ idx+5 ] = outval5 ; + output[ idx+6 ] = outval6 ; + output[ idx+7 ] = outval7 ; +} + +__global__ void kernel_convoSepShx8pH(unsigned char *output, int i_dim, int j_dim, int r) +{ + int idb, ic, jc, p; + int L = 2*r+1 ; + float outval0=0.0, outval1=0.0, outval2=0.0, outval3=0.0, outval4=0.0, outval5=0.0, 
outval6=0.0, outval7=0.0 ; + int bdimX = blockDim.x<<3 ; + int tidX = threadIdx.x<<3 ; + + // coordonnees absolues du point de base + int j = (__umul24(blockIdx.x,blockDim.x) + threadIdx.x)<<3 ; + int i = __umul24( blockIdx.y, blockDim.y) + threadIdx.y ; + int j0= __umul24(blockIdx.x,blockDim.x)<<3 ; + int idx = __umul24(i,j_dim) + j ; // indice dans l'image + + + // chargement en smem + int idrow = threadIdx.y*(bdimX+L-1) ; + + extern __shared__ unsigned char roi8p[]; + + // bloc 0 (a gauche) + for (p=0; p<8; p++) + roi8p[ idrow + tidX +p ] = tex2D(tex_img_inc, j-r+p , i) ; + + // a droite + if ( threadIdx.x < r ) //...ou plutot ce qu'il en manque + { + roi8p[ idrow + bdimX + threadIdx.x ] = tex2D( tex_img_inc, j0-r +bdimX+threadIdx.x , i ) ; + roi8p[ idrow + bdimX + threadIdx.x +r ] = tex2D( tex_img_inc, j0 +bdimX+threadIdx.x , i ) ; + } + + __syncthreads(); + + // calculs de convolution + // passe horizontale + for (jc=0 ; jc global mem + output[ idx ] = outval0 ; + output[ idx+1 ] = outval1 ; + output[ idx+2 ] = outval2 ; + output[ idx+3 ] = outval3 ; + output[ idx+4 ] = outval4 ; + output[ idx+5 ] = outval5 ; + output[ idx+6 ] = outval6 ; + output[ idx+7 ] = outval7 ; +} + + +/************************************************************************************************* + *********************************************************************************************** + + FIN DES kERNELS de CONVOLUTION + + *********************************************************************************************** + *************************************************************************************************/ +// kernel de la libjacket +// Exchange trick: Morgan McGuire, ShaderX 2008 +#define s2(a,b) { float tmp = a; a = min(a,b); b = max(tmp,b); } +#define mn3(a,b,c) s2(a,b); s2(a,c); +#define mx3(a,b,c) s2(b,c); s2(a,c); + +#define mnmx3(a,b,c) mx3(a,b,c); s2(a,b); // 3 exchanges +#define mnmx4(a,b,c,d) s2(a,b); s2(c,d); s2(a,c); s2(b,d); // 4 exchanges +#define 
mnmx5(a,b,c,d,e)      s2(a,b); s2(c,d); mn3(a,c,e); mx3(b,d,e);            // 6 exchanges
+#define mnmx6(a,b,c,d,e,f)  s2(a,d); s2(b,e); s2(c,f); mn3(a,b,c); mx3(d,e,f); // 7 exchanges
+
+#define IN(X,Y) ((0 <= (X) && (X) < nx && 0 <= (Y) && (Y) < ny) ? d_in[(Y)*nx+(X)] : 0)
+
+__global__ static void kernel_medjacket(int nx, int ny, unsigned short*d_out, unsigned short*d_in)
+{
+    int x = __mul24(blockIdx.x , blockDim.x) + threadIdx.x;
+    int y = __mul24(blockIdx.y , blockDim.y) + threadIdx.y;
+
+    // pull top six from image
+    float v[6] = { IN(x-1, y-1), IN(x  , y-1), IN(x+1, y-1),
+                   IN(x-1, y  ), IN(x  , y  ), IN(x+1, y  ) };
+
+    // with each pass, remove min and max values and add new value
+    mnmx6(v[0], v[1], v[2], v[3], v[4], v[5]);
+    v[5] = IN(x-1, y+1); // add new contestant
+    mnmx5(v[1], v[2], v[3], v[4], v[5]);
+    v[5] = IN(x  , y+1); // add new contestant
+    mnmx4(v[2], v[3], v[4], v[5]);
+    v[5] = IN(x+1, y+1); // add last contestant
+    mnmx3(v[3], v[4], v[5]);
+
+    // pick the middle one
+    d_out[y*nx + x] = v[4];
+}
+
+
+/***************************************************************
+ * swaps 2 integer values into order (min first)
+ ***************************************************************/
+__device__ inline void s(int* a, int* b)
+{
+	int tmp ;
+	if (*a > *b)
+	  {
+		tmp = *b ;
+		*b  = *a ;
+		*a  = tmp ;
+	  }
+}
+
+__device__ inline void s(unsigned short * a, unsigned short* b)
+{
+	unsigned short tmp ;
+	if (*a > *b)
+	  {
+		tmp = *b ;
+		*b  = *a ;
+		*a  = tmp ;
+	  }
+}
+
+__device__ inline void s(unsigned char * a, unsigned char * b)
+{
+	unsigned char tmp ;
+	if (*a > *b)
+	  {
+		tmp = *b ;
+		*b  = *a ;
+		*a  = tmp ;
+	  }
+}
+
+/***************************************************************
+ * min-max of an array of n integer values
+ ***************************************************************/
+__device__ void minmaxN(unsigned short* v, int n)
+{
+	for (int i=1; i< n; i++)
+	  s( v, v+i) ;
+	for (int i=n-2; i>0; i--)
+	  s( v+i, v+n-1) ;
+}
+
+/*************************************************************** + * fonction de tri naif d'un tableau de n valeurs entieres + ***************************************************************/ +__device__ void bubTriN(int * v, int n) +{ + for (int i=0; i< n-1; i++) + for (int j=i+1; j ((2*r+1)*(2*r+1))>>1 ) break ; + } + + output[ __mul24(i, j_dim) +j ] = ic ; +} + +__global__ void kernel_medianRSH( unsigned short * output, int i_dim, int j_dim, int r) +{ + // coordonnees absolues du point + int j = __mul24(blockIdx.x,blockDim.x) + threadIdx.x ; + int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ; + unsigned short cpt ; + unsigned short ic, jc ; + + // chargement en smem + int idrow = threadIdx.y*(blockDim.x+2*r) ; + + extern __shared__ int medRoi[]; + + // bloc 0 (en haut à gauche) + medRoi[ idrow + threadIdx.x ] = tex2D(tex_img_ins, j-r, i-r) ; + // bloc 1 (en haut à droite)... + if ( threadIdx.x < 2*r ) //...ou plutot ce qu'il en manque + medRoi[ idrow + threadIdx.x + blockDim.x ] = tex2D( tex_img_ins, j+blockDim.x-r, i-r ) ; + // bloc 2 ( en bas à gauche) + if ( threadIdx.y < 2*r ) + { + idrow = (threadIdx.y+blockDim.y)*(blockDim.x+2*r) ; + medRoi[ idrow + threadIdx.x ] = tex2D( tex_img_ins, j-r, i+blockDim.y-r ) ; + //bloc 4 ( en bas à doite )... 
+ if ( threadIdx.x < 2*r ) //...ou ce qu'il en manque + medRoi[ idrow + threadIdx.x + blockDim.x ] = tex2D( tex_img_ins, j+blockDim.x-r, i+blockDim.y-r ) ; + } + __syncthreads(); + + int hist[256] ; + for (ic =0; ic<256; ic++) hist[ic]=0; // init histogramme + + //generation histogramme + for(ic=0; ic<=2*r; ic++ ) + for(jc=0; jc<=2*r; jc++) + hist[medRoi[(threadIdx.y+ic)*(blockDim.x+2*r)+ (threadIdx.x+jc)]]++ ; + + //parcours histogramme + cpt = 0 ; + for(ic=0; ic<256; ic++) + { + cpt += hist[ic] ; + //selection de la valeur du percentile (ici 50%=SUM/2) + if ( cpt > (((2*r+1)*(2*r+1))>>1) ) break ; + } + + output[ __mul24(i, j_dim) +j ] = ic ; +} + + +__global__ void kernel_median5_2pix( unsigned short*output, int i_dim, int j_dim) +{ + + // coordonnees absolues du point + int j = __mul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ; + + /************************************************************************** + * tri(s) + **************************************************************************/ + int a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13 ; + int b7, b8, b9, b10, b11, b12, b13 ; + + /******************************************************************************** + * les 14 premieres valeurs (suffisant pour median 5x5 par forgetfull selection) + ********************************************************************************/ + //premiere ligne + a0 = tex2D(tex_img_ins, j-1, i-2) ; + a1 = tex2D(tex_img_ins, j , i-2) ; + a2 = tex2D(tex_img_ins, j+1, i-2) ; + a3 = tex2D(tex_img_ins, j+2, i-2) ; + //deuxieme ligne + a4 = tex2D(tex_img_ins, j-1, i-1) ; + a5 = tex2D(tex_img_ins, j , i-1) ; + a6 = tex2D(tex_img_ins, j+1, i-1) ; + a7 = tex2D(tex_img_ins, j+2, i-1) ; + //troisieme ligne + a8 = tex2D(tex_img_ins, j-1, i) ; + a9 = tex2D(tex_img_ins, j , i) ; + a10 = tex2D(tex_img_ins, j+1, i) ; + a11 = tex2D(tex_img_ins, j+2, i) ; + //les 2 premiers de la 4eme ligne + a12 = 
tex2D(tex_img_ins, j-1, i+1) ; + a13 = tex2D(tex_img_ins, j , i+1) ; + + //min max aux extremites + minmax14(&a0,&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeurs suivante (15) + a13 = tex2D(tex_img_ins, j+1, i+1); + //minmax aux extremites + minmax13(&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (16) + a13 = tex2D(tex_img_ins, j+2, i+1); + //minmax aux extremites + minmax12(&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (17) + a13 = tex2D(tex_img_ins, j-1, i+2); + //minmax aux extremites + minmax11(&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (18) + a13 = tex2D(tex_img_ins, j , i+2); + //minmax aux extremites + minmax10(&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (19) + a13 = tex2D(tex_img_ins, j+1, i+2); + //minmax aux extremites + minmax9(&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (20) + a13 = tex2D(tex_img_ins, j+2, i+2); + //minmax aux extmites + minmax8(&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + // fin des pixels voisins communs deux pixels centraux + b7=a7; b8=a8; b9=a9; b10=a10; b11=a11; b12=a12; b13=a13; + + //chargement valeur suivante (21) + a13 = tex2D(tex_img_ins, j-2, i-2); + b13 = tex2D(tex_img_ins, j+3, i-2); + //minmax aux extremites + minmax7(&a7,&a8,&a9,&a10,&a11,&a12,&a13); + minmax7(&b7,&b8,&b9,&b10,&b11,&b12,&b13); + + //chargement valeur suivante (22) + a13 = tex2D(tex_img_ins, j-2, i-1); + b13 = tex2D(tex_img_ins, j+3, i-1); + //minmax aux extremites + minmax6(&a8,&a9,&a10,&a11,&a12,&a13); + minmax6(&b8,&b9,&b10,&b11,&b12,&b13); + + //chargement valeur suivante (23) + a13 = tex2D(tex_img_ins, j-2, i ); + b13 = tex2D(tex_img_ins, j+3, i ); + //minmax aux extremites + minmax5(&a9,&a10,&a11,&a12,&a13); + minmax5(&b9,&b10,&b11,&b12,&b13); + + //chargement valeur suivante (24) + a13 = tex2D(tex_img_ins, j-2, i+1); + b13 = 
tex2D(tex_img_ins, j+3, i+1); + //minmax aux extremites + minmax4(&a10,&a11,&a12,&a13); + minmax4(&b10,&b11,&b12,&b13); + + //chargement valeur suivante (25) + a13 = tex2D(tex_img_ins, j-2, i+2); + b13 = tex2D(tex_img_ins, j+3, i+2); + //minmax aux extremites + minmax3(&a11,&a12,&a13); + minmax3(&b11,&b12,&b13); + + //median au milieu ! + output[ __mul24(i, j_dim) +j ] = a12 ; + output[ __mul24(i, j_dim) +j+1 ] = b12 ; + +} + + +__global__ void kernel_median5_2pix( unsigned char *output, int i_dim, int j_dim) +{ + + // coordonnees absolues du point + int j = __mul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ; + + /************************************************************************** + * tri(s) + **************************************************************************/ + int a0, a1, a2, a3, a4, a5, a6, a7, a8, a9, a10, a11, a12, a13 ; + int b7, b8, b9, b10, b11, b12, b13 ; + + /******************************************************************************** + * les 14 premieres valeurs (suffisant pour median 5x5 par forgetfull selection) + ********************************************************************************/ + //premiere ligne + a0 = tex2D(tex_img_inc, j-1, i-2) ; + a1 = tex2D(tex_img_inc, j , i-2) ; + a2 = tex2D(tex_img_inc, j+1, i-2) ; + a3 = tex2D(tex_img_inc, j+2, i-2) ; + //deuxieme ligne + a4 = tex2D(tex_img_inc, j-1, i-1) ; + a5 = tex2D(tex_img_inc, j , i-1) ; + a6 = tex2D(tex_img_inc, j+1, i-1) ; + a7 = tex2D(tex_img_inc, j+2, i-1) ; + //troisieme ligne + a8 = tex2D(tex_img_inc, j-1, i) ; + a9 = tex2D(tex_img_inc, j , i) ; + a10 = tex2D(tex_img_inc, j+1, i) ; + a11 = tex2D(tex_img_inc, j+2, i) ; + //les 2 premiers de la 4eme ligne + a12 = tex2D(tex_img_inc, j-1, i+1) ; + a13 = tex2D(tex_img_inc, j , i+1) ; + + //min max aux extremites + minmax14(&a0,&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeurs suivante (15) + a13 = tex2D(tex_img_inc, j+1, 
i+1); + //minmax aux extremites + minmax13(&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (16) + a13 = tex2D(tex_img_inc, j+2, i+1); + //minmax aux extremites + minmax12(&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (17) + a13 = tex2D(tex_img_inc, j-1, i+2); + //minmax aux extremites + minmax11(&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (18) + a13 = tex2D(tex_img_inc, j , i+2); + //minmax aux extremites + minmax10(&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (19) + a13 = tex2D(tex_img_inc, j+1, i+2); + //minmax aux extremites + minmax9(&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + //chargement valeur suivante (20) + a13 = tex2D(tex_img_inc, j+2, i+2); + //minmax aux extmites + minmax8(&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13); + + // fin des pixels voisins communs deux pixels centraux + b7=a7; b8=a8; b9=a9; b10=a10; b11=a11; b12=a12; b13=a13; + + //chargement valeur suivante (21) + a13 = tex2D(tex_img_inc, j-2, i-2); + b13 = tex2D(tex_img_inc, j+3, i-2); + //minmax aux extremites + minmax7(&a7,&a8,&a9,&a10,&a11,&a12,&a13); + minmax7(&b7,&b8,&b9,&b10,&b11,&b12,&b13); + + //chargement valeur suivante (22) + a13 = tex2D(tex_img_inc, j-2, i-1); + b13 = tex2D(tex_img_inc, j+3, i-1); + //minmax aux extremites + minmax6(&a8,&a9,&a10,&a11,&a12,&a13); + minmax6(&b8,&b9,&b10,&b11,&b12,&b13); + + //chargement valeur suivante (23) + a13 = tex2D(tex_img_inc, j-2, i ); + b13 = tex2D(tex_img_inc, j+3, i ); + //minmax aux extremites + minmax5(&a9,&a10,&a11,&a12,&a13); + minmax5(&b9,&b10,&b11,&b12,&b13); + + //chargement valeur suivante (24) + a13 = tex2D(tex_img_inc, j-2, i+1); + b13 = tex2D(tex_img_inc, j+3, i+1); + //minmax aux extremites + minmax4(&a10,&a11,&a12,&a13); + minmax4(&b10,&b11,&b12,&b13); + + //chargement valeur suivante (25) + a13 = tex2D(tex_img_inc, j-2, i+2); + b13 = tex2D(tex_img_inc, j+3, i+2); + //minmax aux 
extremites + minmax3(&a11,&a12,&a13); + minmax3(&b11,&b12,&b13); + + //median au milieu ! + output[ __mul24(i, j_dim) +j ] = a12 ; + output[ __mul24(i, j_dim) +j+1 ] = b12 ; + +} + + +__global__ void kernel_median7( unsigned short*output, int i_dim, int j_dim) +{ + + // coordonnees absolues du point + int j = __mul24(blockIdx.x,blockDim.x) + threadIdx.x ; + int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ; + + /************************************************************************** + * tri(s) + **************************************************************************/ + int a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,a21,a22,a23,a24,a25 ; + + /**************************************** + * les 26 premieres valeurs + ****************************************/ + //premiere ligne + a0 = tex2D(tex_img_ins, j-2, i-3) ; + a1 = tex2D(tex_img_ins, j-1, i-3) ; + a2 = tex2D(tex_img_ins, j , i-3) ; + a3 = tex2D(tex_img_ins, j+1, i-3) ; + a4 = tex2D(tex_img_ins, j+2, i-3) ; + a5 = tex2D(tex_img_ins, j+3, i-3) ; + //deuxieme ligne + a6 = tex2D(tex_img_ins, j-2, i-2) ; + a7 = tex2D(tex_img_ins, j-1, i-2) ; + a8 = tex2D(tex_img_ins, j , i-2) ; + a9 = tex2D(tex_img_ins, j+1, i-2) ; + a10 = tex2D(tex_img_ins, j+2, i-2) ; + a11 = tex2D(tex_img_ins, j+3, i-2) ; + //troisieme ligne + a12 = tex2D(tex_img_ins, j-2, i-1) ; + a13 = tex2D(tex_img_ins, j-1, i-1) ; + a14 = tex2D(tex_img_ins, j , i-1) ; + a15 = tex2D(tex_img_ins, j+1, i-1) ; + a16 = tex2D(tex_img_ins, j+2, i-1) ; + a17 = tex2D(tex_img_ins, j+3, i-1) ; + //quatrieme ligne + a18 = tex2D(tex_img_ins, j-2, i ) ; + a19 = tex2D(tex_img_ins, j-1, i ) ; + a20 = tex2D(tex_img_ins, j , i ) ; + a21 = tex2D(tex_img_ins, j+1, i ) ; + a22 = tex2D(tex_img_ins, j+2, i ) ; + a23 = tex2D(tex_img_ins, j+3, i ) ; + //cinquieme ligne + a24 = tex2D(tex_img_ins, j-2, i+1) ; + a25 = tex2D(tex_img_ins, j-1, i+1) ; + + //min max aux extremites + 
minmax26(&a0,&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivante (26) + a25 = tex2D(tex_img_ins, j , i+1); + //minmax aux extremites + minmax25(&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (27) + a25 = tex2D(tex_img_ins, j+1, i+1); + //minmax aux extremites + minmax24(&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (28) + a25 = tex2D(tex_img_ins, j+2, i+1); + //minmax aux extremites + minmax23(&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (29) + a25 = tex2D(tex_img_ins, j+3, i+1); + //minmax aux extremites + minmax22(&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (30) + a25 = tex2D(tex_img_ins, j-2, i+2); + //minmax aux extremites + minmax21(&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (31) + a25 = tex2D(tex_img_ins, j-1, i+2); + //minmax aux extmites + minmax20(&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (32) + a25 = tex2D(tex_img_ins, j , i+2); + //minmax aux extmites + minmax19(&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (33) + a25 = tex2D(tex_img_ins, j+1, i+2); + //minmax aux extmites + minmax18(&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (34) + a25 = tex2D(tex_img_ins, j+2, i+2); + //minmax aux extmites + 
minmax17(&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (35) + a25 = tex2D(tex_img_ins, j+3, i+2); + //minmax aux extmites + minmax16(&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (36) + a25 = tex2D(tex_img_ins, j-2, i+3); + //minmax aux extmites + minmax15(&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (37) + a25 = tex2D(tex_img_ins, j-1, i+3); + //minmax aux extmites + + minmax14(&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + //chargement valeur suivante (38) + a25 = tex2D(tex_img_ins, j , i+3); + //minmax aux extmites + minmax13(&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (39) + a25 = tex2D(tex_img_ins, j+1, i+3); + //minmax aux extmites + minmax12(&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (40) + a25 = tex2D(tex_img_ins, j+2, i+3); + //minmax aux extmites + minmax11(&a15, &a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (41) + a25 = tex2D(tex_img_ins, j+3, i+3); + //minmax aux extmites + minmax10(&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i+3); + //minmax aux extremites + minmax9(&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i+2); + //minmax aux extremites + minmax8(&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i+1); + //minmax aux extremites + minmax7(&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i); + //minmax aux extremites + minmax6(&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, 
i-1); + //minmax aux extremites + minmax5(&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i-2); + //minmax aux extremites + minmax4(&a22,&a23,&a24,&a25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i-3); + //minmax aux extremites + minmax3(&a23,&a24,&a25); + + //medians au milieu ! + output[ __mul24(i, j_dim) +j ] = a24 ; + +} + + + +__global__ void kernel_median7_2pix( unsigned short*output, int i_dim, int j_dim) +{ + + // coordonnees absolues du point + int j = __mul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ; + + /************************************************************************** + * tri(s) + **************************************************************************/ + int a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,a21,a22,a23,a24,a25 ; + int b17,b18,b19,b20,b21,b22,b23,b24,b25; + + /**************************************** + * les 26 premieres valeurs + ****************************************/ + //premiere ligne + a0 = tex2D(tex_img_ins, j-2, i-3) ; + a1 = tex2D(tex_img_ins, j-1, i-3) ; + a2 = tex2D(tex_img_ins, j , i-3) ; + a3 = tex2D(tex_img_ins, j+1, i-3) ; + a4 = tex2D(tex_img_ins, j+2, i-3) ; + a5 = tex2D(tex_img_ins, j+3, i-3) ; + //deuxieme ligne + a6 = tex2D(tex_img_ins, j-2, i-2) ; + a7 = tex2D(tex_img_ins, j-1, i-2) ; + a8 = tex2D(tex_img_ins, j , i-2) ; + a9 = tex2D(tex_img_ins, j+1, i-2) ; + a10 = tex2D(tex_img_ins, j+2, i-2) ; + a11 = tex2D(tex_img_ins, j+3, i-2) ; + //troisieme ligne + a12 = tex2D(tex_img_ins, j-2, i-1) ; + a13 = tex2D(tex_img_ins, j-1, i-1) ; + a14 = tex2D(tex_img_ins, j , i-1) ; + a15 = tex2D(tex_img_ins, j+1, i-1) ; + a16 = tex2D(tex_img_ins, j+2, i-1) ; + a17 = tex2D(tex_img_ins, j+3, i-1) ; + //quatrieme ligne + a18 = tex2D(tex_img_ins, j-2, i ) ; + a19 = tex2D(tex_img_ins, j-1, i ) ; + a20 = tex2D(tex_img_ins, j , i ) ; + a21 = tex2D(tex_img_ins, j+1, i 
) ; + a22 = tex2D(tex_img_ins, j+2, i ) ; + a23 = tex2D(tex_img_ins, j+3, i ) ; + //cinquieme ligne + a24 = tex2D(tex_img_ins, j-2, i+1) ; + a25 = tex2D(tex_img_ins, j-1, i+1) ; + + //min max aux extremites + minmax26(&a0,&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeurs suivante (26) + a25 = tex2D(tex_img_ins, j , i+1); + //minmax aux extremites + minmax25(&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (27) + a25 = tex2D(tex_img_ins, j+1, i+1); + //minmax aux extremites + minmax24(&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (28) + a25 = tex2D(tex_img_ins, j+2, i+1); + //minmax aux extremites + minmax23(&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (29) + a25 = tex2D(tex_img_ins, j+3, i+1); + //minmax aux extremites + minmax22(&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (30) + a25 = tex2D(tex_img_ins, j-2, i+2); + //minmax aux extremites + minmax21(&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (31) + a25 = tex2D(tex_img_ins, j-1, i+2); + //minmax aux extmites + minmax20(&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (32) + a25 = tex2D(tex_img_ins, j , i+2); + //minmax aux extmites + minmax19(&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (33) + a25 = tex2D(tex_img_ins, j+1, i+2); + //minmax aux extmites + 
minmax18(&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (34) + a25 = tex2D(tex_img_ins, j+2, i+2); + //minmax aux extmites + minmax17(&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (35) + a25 = tex2D(tex_img_ins, j+3, i+2); + //minmax aux extmites + minmax16(&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (36) + a25 = tex2D(tex_img_ins, j-2, i+3); + //minmax aux extmites + minmax15(&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (37) + a25 = tex2D(tex_img_ins, j-1, i+3); + //minmax aux extmites + + minmax14(&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + //chargement valeur suivante (38) + a25 = tex2D(tex_img_ins, j , i+3); + //minmax aux extmites + minmax13(&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (39) + a25 = tex2D(tex_img_ins, j+1, i+3); + //minmax aux extmites + minmax12(&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (40) + a25 = tex2D(tex_img_ins, j+2, i+3); + //minmax aux extmites + minmax11(&a15, &a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + //chargement valeur suivante (41) + a25 = tex2D(tex_img_ins, j+3, i+3); + //minmax aux extmites + minmax10(&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + + // fin des pixels voisins communs deux pixels centraux + b17=a17; b18=a18; b19=a19; b20=a20; b21=a21; b22=a22; b23=a23; b24=a24; b25=a25; + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i+3); + b25 = tex2D(tex_img_ins, j+4, i+3); + //minmax aux extremites + minmax9(&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + minmax9(&b17,&b18,&b19,&b20,&b21,&b22,&b23,&b24,&b25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, 
j-3, i+2); + b25 = tex2D(tex_img_ins, j+4, i+2); + //minmax aux extremites + minmax8(&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25); + minmax8(&b18,&b19,&b20,&b21,&b22,&b23,&b24,&b25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i+1); + b25 = tex2D(tex_img_ins, j+4, i+1); + //minmax aux extremites + minmax7(&a19,&a20,&a21,&a22,&a23,&a24,&a25); + minmax7(&b19,&b20,&b21,&b22,&b23,&b24,&b25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i); + b25 = tex2D(tex_img_ins, j+4, i); + //minmax aux extremites + minmax6(&a20,&a21,&a22,&a23,&a24,&a25); + minmax6(&b20,&b21,&b22,&b23,&b24,&b25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i-1); + b25 = tex2D(tex_img_ins, j+4, i-1); + //minmax aux extremites + minmax5(&a21,&a22,&a23,&a24,&a25); + minmax5(&b21,&b22,&b23,&b24,&b25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i-2); + b25 = tex2D(tex_img_ins, j+4, i-2); + //minmax aux extremites + minmax4(&a22,&a23,&a24,&a25); + minmax4(&b22,&b23,&b24,&b25); + + //chargement valeurs suivantes + a25 = tex2D(tex_img_ins, j-3, i-3); + b25 = tex2D(tex_img_ins, j+4, i-3); + //minmax aux extremites + minmax3(&a23,&a24,&a25); + minmax3(&b23,&b24,&b25); + + //medians au milieu ! 
+ output[ __mul24(i, j_dim) +j ] = a24 ; + output[ __mul24(i, j_dim) +j+1 ] = b24 ; + +} + + +__global__ void kernel_median7_2pix( unsigned char *output, int i_dim, int j_dim) +{ + + // coordonnees absolues du point + int j = __mul24(__mul24(blockIdx.x,blockDim.x) + threadIdx.x,2) ; + int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ; + + /************************************************************************** + * tri(s) + **************************************************************************/ + int a0,a1,a2,a3,a4,a5,a6,a7,a8,a9,a10,a11,a12,a13,a14,a15,a16,a17,a18,a19,a20,a21,a22,a23,a24,a25 ; + int b17,b18,b19,b20,b21,b22,b23,b24,b25; + + /**************************************** + * les 26 premieres valeurs + ****************************************/ + //premiere ligne + a0 = tex2D(tex_img_inc, j-2, i-3) ; + a1 = tex2D(tex_img_inc, j-1, i-3) ; + a2 = tex2D(tex_img_inc, j , i-3) ; + a3 = tex2D(tex_img_inc, j+1, i-3) ; + a4 = tex2D(tex_img_inc, j+2, i-3) ; + a5 = tex2D(tex_img_inc, j+3, i-3) ; + //deuxieme ligne + a6 = tex2D(tex_img_inc, j-2, i-2) ; + a7 = tex2D(tex_img_inc, j-1, i-2) ; + a8 = tex2D(tex_img_inc, j , i-2) ; + a9 = tex2D(tex_img_inc, j+1, i-2) ; + a10 = tex2D(tex_img_inc, j+2, i-2) ; + a11 = tex2D(tex_img_inc, j+3, i-2) ; + //troisieme ligne + a12 = tex2D(tex_img_inc, j-2, i-1) ; + a13 = tex2D(tex_img_inc, j-1, i-1) ; + a14 = tex2D(tex_img_inc, j , i-1) ; + a15 = tex2D(tex_img_inc, j+1, i-1) ; + a16 = tex2D(tex_img_inc, j+2, i-1) ; + a17 = tex2D(tex_img_inc, j+3, i-1) ; + //quatrieme ligne + a18 = tex2D(tex_img_inc, j-2, i ) ; + a19 = tex2D(tex_img_inc, j-1, i ) ; + a20 = tex2D(tex_img_inc, j , i ) ; + a21 = tex2D(tex_img_inc, j+1, i ) ; + a22 = tex2D(tex_img_inc, j+2, i ) ; + a23 = tex2D(tex_img_inc, j+3, i ) ; + //cinquieme ligne + a24 = tex2D(tex_img_inc, j-2, i+1) ; + a25 = tex2D(tex_img_inc, j-1, i+1) ; + + //min max aux extremites + 
+  minmax26(&a0,&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (26)
+  a25 = tex2D(tex_img_inc, j  , i+1);
+  //min/max at the extremities
+  minmax25(&a1,&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (27)
+  a25 = tex2D(tex_img_inc, j+1, i+1);
+  //min/max at the extremities
+  minmax24(&a2,&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (28)
+  a25 = tex2D(tex_img_inc, j+2, i+1);
+  //min/max at the extremities
+  minmax23(&a3,&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (29)
+  a25 = tex2D(tex_img_inc, j+3, i+1);
+  //min/max at the extremities
+  minmax22(&a4,&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (30)
+  a25 = tex2D(tex_img_inc, j-2, i+2);
+  //min/max at the extremities
+  minmax21(&a5,&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (31)
+  a25 = tex2D(tex_img_inc, j-1, i+2);
+  //min/max at the extremities
+  minmax20(&a6,&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (32)
+  a25 = tex2D(tex_img_inc, j  , i+2);
+  //min/max at the extremities
+  minmax19(&a7,&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (33)
+  a25 = tex2D(tex_img_inc, j+1, i+2);
+  //min/max at the extremities
+  minmax18(&a8,&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (34)
+  a25 = tex2D(tex_img_inc, j+2, i+2);
+  //min/max at the extremities
+  minmax17(&a9,&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (35)
+  a25 = tex2D(tex_img_inc, j+3, i+2);
+  //min/max at the extremities
+  minmax16(&a10,&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (36)
+  a25 = tex2D(tex_img_inc, j-2, i+3);
+  //min/max at the extremities
+  minmax15(&a11,&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (37)
+  a25 = tex2D(tex_img_inc, j-1, i+3);
+  //min/max at the extremities
+  minmax14(&a12,&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (38)
+  a25 = tex2D(tex_img_inc, j  , i+3);
+  //min/max at the extremities
+  minmax13(&a13,&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (39)
+  a25 = tex2D(tex_img_inc, j+1, i+3);
+  //min/max at the extremities
+  minmax12(&a14,&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (40)
+  a25 = tex2D(tex_img_inc, j+2, i+3);
+  //min/max at the extremities
+  minmax11(&a15,&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  //load next value (41)
+  a25 = tex2D(tex_img_inc, j+3, i+3);
+  //min/max at the extremities
+  minmax10(&a16,&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+
+  // end of the neighbor pixels shared by the two center pixels
+  b17=a17; b18=a18; b19=a19; b20=a20; b21=a21; b22=a22; b23=a23; b24=a24; b25=a25;
+
+  //load next values
+  a25 = tex2D(tex_img_inc, j-3, i+3);
+  b25 = tex2D(tex_img_inc, j+4, i+3);
+  //min/max at the extremities
+  minmax9(&a17,&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+  minmax9(&b17,&b18,&b19,&b20,&b21,&b22,&b23,&b24,&b25);
+
+  //load next values
+  a25 = tex2D(tex_img_inc, j-3, i+2);
+  b25 = tex2D(tex_img_inc, j+4, i+2);
+  //min/max at the extremities
+  minmax8(&a18,&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+  minmax8(&b18,&b19,&b20,&b21,&b22,&b23,&b24,&b25);
+
+  //load next values
+  a25 = tex2D(tex_img_inc, j-3, i+1);
+  b25 = tex2D(tex_img_inc, j+4, i+1);
+  //min/max at the extremities
+  minmax7(&a19,&a20,&a21,&a22,&a23,&a24,&a25);
+  minmax7(&b19,&b20,&b21,&b22,&b23,&b24,&b25);
+
+  //load next values
+  a25 = tex2D(tex_img_inc, j-3, i);
+  b25 = tex2D(tex_img_inc, j+4, i);
+  //min/max at the extremities
+  minmax6(&a20,&a21,&a22,&a23,&a24,&a25);
+  minmax6(&b20,&b21,&b22,&b23,&b24,&b25);
+
+  //load next values
+  a25 = tex2D(tex_img_inc, j-3, i-1);
+  b25 = tex2D(tex_img_inc, j+4, i-1);
+  //min/max at the extremities
+  minmax5(&a21,&a22,&a23,&a24,&a25);
+  minmax5(&b21,&b22,&b23,&b24,&b25);
+
+  //load next values
+  a25 = tex2D(tex_img_inc, j-3, i-2);
+  b25 = tex2D(tex_img_inc, j+4, i-2);
+  //min/max at the extremities
+  minmax4(&a22,&a23,&a24,&a25);
+  minmax4(&b22,&b23,&b24,&b25);
+
+  //load next values
+  a25 = tex2D(tex_img_inc, j-3, i-3);
+  b25 = tex2D(tex_img_inc, j+4, i-3);
+  //min/max at the extremities
+  minmax3(&a23,&a24,&a25);
+  minmax3(&b23,&b24,&b25);
+
+  //the medians are in the middle!
+  output[ __mul24(i, j_dim) +j   ] = a24 ;
+  output[ __mul24(i, j_dim) +j+1 ] = b24 ;
+
+}
+
+
+
+/*****************************************************************************
+ * generic median, shared memory + forgetful selection
+ *****************************************************************************/
+__global__ void kernel_medianForgetRSH( unsigned short * output, int i_dim, int j_dim, int r)
+{
+  // absolute coordinates of the pixel
+  int j = __mul24(blockIdx.x,blockDim.x) + threadIdx.x ;
+  int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ;
+  unsigned short ic, jc ;
+  unsigned short Nreg = ((2*r+1)*(2*r+1))/2 + 2 ;
+
+  int bdimX = blockDim.x;  //<<1 ; // for the 2-pixels-per-thread case
+  int tidX  = threadIdx.x; //<<1 ;
+
+  // load into shared memory
+  int idrow = threadIdx.y*(blockDim.x+2*r) ;
+
+  extern __shared__ int medRoi[];
+
+  // block 0 (top left)
+  medRoi[ idrow + threadIdx.x ] = tex2D(tex_img_ins, j-r, i-r) ;
+  // block 1 (top right)...
+  if ( threadIdx.x < 2*r ) //...or rather the part of it that is missing
+    medRoi[ idrow + threadIdx.x + blockDim.x ] = tex2D( tex_img_ins, j+blockDim.x-r, i-r ) ;
+  // block 2 (bottom left)
+  if ( threadIdx.y < 2*r )
+    {
+      idrow = (threadIdx.y+blockDim.y)*(blockDim.x+2*r) ;
+      medRoi[ idrow + threadIdx.x ] = tex2D( tex_img_ins, j-r, i+blockDim.y-r ) ;
+      // block 3 (bottom right)...
+      if ( threadIdx.x < 2*r ) //...or the part of it that is missing
+        medRoi[ idrow + threadIdx.x + blockDim.x ] = tex2D( tex_img_ins, j+blockDim.x-r, i+blockDim.y-r ) ;
+    }
+  __syncthreads() ;
+
+  // fill the min/max selection vector
+  unsigned short vect[8066] ;
+  int Freg=Nreg ;
+  for (ic=0; ic<2*r+1; ic++)
+    {
+      for (jc=0; jc<2*r+1; jc++)
+        {
+          if ( ic*(2*r+1)+jc < Nreg )
+            {
+              vect[ ic*(2*r+1)+jc ] = medRoi[ (threadIdx.y+ic)*(bdimX+2*r)+ (tidX+jc) ] ;
+            } else
+            {
+              minmaxN(vect, Freg--) ;
+              vect[ Nreg-1 ] = medRoi[ (threadIdx.y+ic)*(bdimX+2*r)+ (tidX+jc) ] ;
+            }
+        }
+    }
+  minmax3(&vect[Nreg-3], &vect[Nreg-2], &vect[Nreg-1]) ;
+
+  output[ __mul24(i, j_dim) +j ] = vect[ Nreg-2 ];
+}
+
+
+/*****************************************************************************
+ * generic median, direct texture fetches + forgetful selection
+ *****************************************************************************/
+__global__ void kernel_medianForgetR( unsigned short * output, int i_dim, int j_dim, int r)
+{
+  // absolute coordinates of the pixel
+  int j = __mul24(blockIdx.x,blockDim.x) + threadIdx.x ;
+  int i = __mul24(blockIdx.y,blockDim.y) + threadIdx.y ;
+  unsigned short ic, jc ;
+  unsigned short Nreg = ((2*r+1)*(2*r+1))/2 + 2 ;
+
+  // fill the min/max selection vector
+  unsigned short vect[8066] ;
+  int Freg=Nreg ;
+  for (ic=0; ic<2*r+1; ic++)
+    {
+      for (jc=0; jc<2*r+1; jc++)
+        {
+          if ( ic*(2*r+1)+jc < Nreg )
+            {
+              vect[ ic*(2*r+1)+jc ] = tex2D(tex_img_ins, j-r+jc, i-r+ic) ;
+            } else
+            {
+              minmaxN(vect, Freg--) ;
+              vect[ Nreg-1 ] = tex2D(tex_img_ins, j-r+jc, i-r+ic) ;
+            }
+        }
+    }
+  minmax3(&vect[Nreg-3], &vect[Nreg-2], &vect[Nreg-1]) ;
+
+  output[ __mul24(i, j_dim) +j ] = vect[ Nreg-2 ];
+}
+
+
+
+/**********************************************************************************
+ * PSEUDO-SEPARABLE MEDIAN FOR LARGE KERNELS
+ **********************************************************************************/
+__global__ void kernel_medianSepR( unsigned short *output, int i_dim, int j_dim, int r)
+{
+
+  int idc, val, min, max, inf, egal, sup, mxinf, minsup, estim ;
+
+  // coordinates within the block
+  int ib = threadIdx.y ;
+  int jb = threadIdx.x ;
+  //int idx_h = __mul24(ib+r,blockDim.x) + jb ; // pixel index in shmem (block+halo)
+  //int offset = __mul24(blockDim.x,r) ;
+
+  // absolute coordinates of the pixel
+  int j = __mul24(blockIdx.x,blockDim.x) + jb ;
+  int i = __mul24(blockIdx.y,blockDim.y) + ib ;
+
+  extern __shared__ int buff[] ;
+  /***********************************************************************************
+   * LOAD DATA INTO SHARED MEMORY
+   ***********************************************************************************/
+  for (idc = 0 ; idc < (2*(blockDim.y+r)-1)/blockDim.y; idc++)
+    {
+      if (idc*blockDim.y +ib < i_dim)
+        buff[ (idc*blockDim.y +ib)*blockDim.x + jb ] = tex2D(tex_img_ins, j, i-r+ idc*blockDim.y) ;
+    }
+
+  __syncthreads() ;
+  /**********************************************************************************************
+   * VERTICAL SELECTION using TORBEN MOGENSEN's algorithm
+   * (a little bit slow but saves memory => faster!)
+   **********************************************************************************************/
+  min = max = buff[ ib*blockDim.x +jb] ;
+
+  for (idc= 0 ; idc< 2*r+1 ; idc++ )
+    {
+      val = buff[ __mul24(ib+idc, blockDim.x) +jb ] ;
+      if ( val < min ) min = val ;
+      if ( val > max ) max = val ;
+    }
+
+  while (1)
+    {
+      estim = (min+max)/2 ;
+      inf = sup = egal = 0 ;
+      mxinf = min ;
+      minsup= max ;
+      for (idc =0; idc< 2*r+1 ; idc++)
+        {
+          val = buff[ __mul24(ib+idc, blockDim.x) +jb ] ;
+          if( val < estim )
+            {
+              inf++;
+              if( val > mxinf) mxinf = val ;
+            } else if (val > estim)
+            {
+              sup++;
+              if( val < minsup) minsup = val ;
+            } else egal++ ;
+        }
+      if ( (inf <= (r+1))&&(sup <=(r+1)) ) break ;
+      else if (inf>sup) max = mxinf ;
+      else min = minsup ;
+    }
+
+  if ( inf >= r+1 ) val = mxinf ;
+  else if (inf+egal >= r+1) val = estim ;
+  else val = minsup ;
+
+  output[ __mul24(j, i_dim) +i ] = val ;
+}
+
+
+/**
+ * staircase effect correction
+ */
+__global__ void kernel_staircase_reduc3(unsigned int * img_out, unsigned int L, unsigned int H)
+{
+  // coordinates of the pixel within the block
+  //unsigned int iib = threadIdx.x ;
+  //unsigned int jib = threadIdx.y ;
+  // coordinates of the pixel within the image
+  unsigned int y = blockIdx.y*blockDim.y + threadIdx.y;
+  unsigned int x = blockIdx.x*blockDim.x + threadIdx.x;
+
+  int a, b, c, d, e, f, g, h, i ;            // gray levels of the neighbors
+  float wa, wb, wc, wd, we, wf, wg, wh, wi ; // weights
+  float S1, S2, S11, S22, S12, S0, Sx, Sx1, Sx2 ;
+  float c1,c2 ;
+
+  // load the gray-level values of the neighborhood pixels
+  a = tex2D(tex_img_estim, x-1, y-1) ;
+  b = tex2D(tex_img_estim, x  , y-1) ;
+  c = tex2D(tex_img_estim, x+1, y-1) ;
+  d = tex2D(tex_img_estim, x-1, y  ) ;
+  e = tex2D(tex_img_estim, x  , y  ) ;
+  f = tex2D(tex_img_estim, x+1, y  ) ;
+  g = tex2D(tex_img_estim, x-1, y+1) ;
+  h = tex2D(tex_img_estim, x  , y+1) ;
+  i = tex2D(tex_img_estim, x+1, y+1) ;
+
+  wa = tex1D(tex_noyau, abs(a-e)) ;
+  wb = tex1D(tex_noyau, abs(b-e)) ;
+  wc = tex1D(tex_noyau, abs(c-e)) ;
+  wd = tex1D(tex_noyau, abs(d-e)) ;
+  we = 1 ;
+  wf = tex1D(tex_noyau, abs(f-e)) ;
+  wg = tex1D(tex_noyau, abs(g-e)) ;
+  wh = tex1D(tex_noyau, abs(h-e)) ;
+  wi = tex1D(tex_noyau, abs(i-e)) ;
+
+
+  // build the terms of the linear system
+  S0  = wa+wb+wc+wd+we+wf+wg+wh+wi ;
+  S1  = wc+wf+wi-wa-wd-wg ;
+  S2  = wg+wh+wi-wa-wb-wc ;
+  Sx  = wa*a + wb*b + wc*c + wd*d + we*e + wf*f + wg*g + wh*h + wi*i ;
+  Sx1 = wc*c + wf*f + wi*i - wa*a - wd*d - wg*g ;
+  Sx2 = wg*g + wh*h + wi*i - wa*a - wb*b - wc*c ;
+  S11 = wc+wf+wi+wa+wd+wg ;
+  S22 = wg+wh+wi+wa+wb+wc ;
+  S12 = wa + wi -wc -wg ;
+
+  c1 = S22*(S11*Sx-S1*Sx1) + S12*(S1*Sx2-S12*Sx) + S2*(S12*Sx1-S11*Sx2) ;
+  c2 = S22*(S11*S0-S1*S1) + S12*(S2*S1-S12*S0) + S2*(S1*S12-S2*S11) ;
+  img_out[y*L + x] = c1/c2 ;
+}
+
+
diff --git a/BookGPU/Chapters/chapter4/code/maskInSymbol.cu b/BookGPU/Chapters/chapter4/code/maskInSymbol.cu
new file mode 100644
index 0000000..02d06ee
--- /dev/null
+++ b/BookGPU/Chapters/chapter4/code/maskInSymbol.cu
@@ -0,0 +1,9 @@
+// on GPU side
+__device__ __constant__ float d_mask[256] ;
+// on CPU side
+float * h_mask = new float[(2*r+1)*(2*r+1)] ;
+for (int i=0; i<(2*r+1)*(2*r+1); i++)
+  h_mask[i]= 1.0/((2*r+1)*(2*r+1)) ;
+
+cudaMemcpyToSymbol( d_mask, h_mask,
+                    (2*r+1)*(2*r+1)*sizeof(float), 0) ;
diff --git a/BookGPU/Chapters/chapter4/code/maskInSymbol.cu~ b/BookGPU/Chapters/chapter4/code/maskInSymbol.cu~
new file mode 100644
index 0000000..19606b5
--- /dev/null
+++ b/BookGPU/Chapters/chapter4/code/maskInSymbol.cu~
@@ -0,0 +1,9 @@
+// on GPU side
+
+
+// on CPU side
+float * h_noyauConvof = new float[(2*r+1)*(2*r+1)] ;
+for (int i=0; i<(2*r+1)*(2*r+1); i++)
+  h_noyauConvof[i]= 1.0/((2*r+1)*(2*r+1)) ;
+
+cudaMemcpyToSymbol( masque, h_noyauConvof, (2*r+1)*(2*r+1)*sizeof(float), 0) ;
diff --git a/BookGPU/Chapters/chapter4/img/convo1.png b/BookGPU/Chapters/chapter4/img/convo1.png
new file mode 100644
index 0000000..7edf4d7
Binary files /dev/null and b/BookGPU/Chapters/chapter4/img/convo1.png differ
diff
--git a/BookGPU/Chapters/chapter4/img/convoOverlap1.png b/BookGPU/Chapters/chapter4/img/convoOverlap1.png
new file mode 100644
index 0000000..ac0e673
Binary files /dev/null and b/BookGPU/Chapters/chapter4/img/convoOverlap1.png differ
diff --git a/BookGPU/Chapters/chapter4/img/convoOverlap2.png b/BookGPU/Chapters/chapter4/img/convoOverlap2.png
new file mode 100644
index 0000000..606a4bc
Binary files /dev/null and b/BookGPU/Chapters/chapter4/img/convoOverlap2.png differ
diff --git a/BookGPU/Chapters/chapter4/img/convoShMem.png b/BookGPU/Chapters/chapter4/img/convoShMem.png
new file mode 100644
index 0000000..1c48504
Binary files /dev/null and b/BookGPU/Chapters/chapter4/img/convoShMem.png differ
diff --git a/BookGPU/Chapters/chapter4/img/convos.svg b/BookGPU/Chapters/chapter4/img/convos.svg
new file mode 100644
index 0000000..dbc4b53
--- /dev/null
+++ b/BookGPU/Chapters/chapter4/img/convos.svg
@@ -0,0 +1,20238 @@
[20238 lines of SVG markup omitted — figure sources for the chapter's diagrams: the convolution mask h applied to a pixel neighborhood window (entries I-2,-2 … I2,2 weighted by h-2,-2 … h2,2), and the shared-memory tiling scheme (thread block blockDim.x = 8, blockDim.y = 4; region of interest (ROI) width = 8x8+2x2 = 68 pixels, ROI height = 8, halo r = 2).]
diff --git a/BookGPU/Chapters/chapter4/img/kernLeft.png b/BookGPU/Chapters/chapter4/img/kernLeft.png
new file mode 100755
index 0000000..ae5d0ca
Binary files /dev/null and b/BookGPU/Chapters/chapter4/img/kernLeft.png differ
diff --git a/BookGPU/Chapters/chapter4/img/kernRight.png b/BookGPU/Chapters/chapter4/img/kernRight.png
new file mode 100755
index 0000000..34ae66a
Binary files /dev/null and b/BookGPU/Chapters/chapter4/img/kernRight.png differ
diff --git a/BookGPU/Makefile b/BookGPU/Makefile
index
847d098..e6e4e03 100644
--- a/BookGPU/Makefile
+++ b/BookGPU/Makefile
@@ -23,6 +23,7 @@ all:
 	bibtex bu16
 	bibtex bu17
 	bibtex bu18
+	bibtex bu19
 	makeindex ${BOOK}.idx
 	pdflatex ${BOOK}