diff --git a/BookGPU/Chapters/chapter3/ch3.tex b/BookGPU/Chapters/chapter3/ch3.tex
index 78396a9d77639a501ea825164021995d4109a222..8cd1767cb12b39dd3edebf49319d4b7bdfe0af26 100755
--- a/BookGPU/Chapters/chapter3/ch3.tex
+++ b/BookGPU/Chapters/chapter3/ch3.tex
@@ -1,4 +1,4 @@
-\chapterauthor{Gilles Perrot}{FEMTO-ST Institute}
+\chapterauthor{Gilles Perrot}{Femto-ST Institute, University of Franche-Comte, France}
  %\graphicspath{{img/}}
  
  
@@ -181,8 +181,9 @@ Last, like many authors, we chose to use the pixel throughput value of each proc
  In order to estimate the potential for improvement of each kernel, a reference throughput measurement, involving identity kernel of Listing \ref{lst:fkern1}, was performed. As this kernel only fetches input values from texture memory and outputs them to global memory without doing any computation, it represents the smallest, thus fastest, possible process and is taken as the reference throughput value (100\%). The same measurement was performed on CPU, with a maximum effective pixel throughput of 130~Mpixel per second. On GPU, depending on grid parameters it amounts to 800~MPixels/s on GTX280 and 1300~Mpixels/s on C2070.
  
  
+\chapterauthor{Gilles Perrot}{Femto-ST Institute, University of Franche-Comte, France}
+
  \chapter{Implementing a fast median filter}
-\chapterauthor{Gilles Perrot}{FEMTO-ST Institute}
  \section{Introduction}
  Median filtering is a well-known method used in a wide range of application frameworks as well as a standalone filter especially for \textit{salt and pepper} denoising. It is able to highly reduce power of noise without blurring edges too much.
  
@@ -205,7 +206,7 @@ Figure \ref{fig:sap_examples} shows an example of a $512\times 512$ pixel image,
     \subfigure[Image denoised by a $3\times 3$ median filter]{\label{img:sap_example_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3.png}}\\
     \subfigure[Image denoised by a $5\times 5$ median filter]{\label{img:sap_example_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med5.png}}\qquad
     \subfigure[Image denoised by 2 iterations of a $3\times 3$ median filter]{\label{img:sap_example_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_med3_it2.png}}\\
-   \caption{Exemple of median filtering, applied to salt \& pepper noise reduction.}
+   \caption{Example of median filtering, applied to salt \& pepper noise reduction.}
     \label{fig:sap_examples}
  \end{figure}
  
@@ -221,16 +222,18 @@ The first observation to make when analysing results of Table \ref{tab:medianHis
  Since inner loops that fill the histogram vector contain very few fetching instructions (from 9 to 49, depending on the window size), it is not surprising to note their neglectable impact compared to outer loops that fetch image pixels (from 256k to 16M instructions). 
  One could be tempted to claim that CPU has no chance to win, which is not so obvious as it highly depends on what kind of algorithm is run and above all, how it is implemented. To illustrate this, we can notice that, despite a maximum effective throughput potential that is almost five times higher, measured GTX280 throughput values sometimes prove slower than CPU values, as shown in Table \ref{tab:medianHisto1}.
  
+
+\lstinputlisting[label={lst:medianGeneric},caption=Generic CUDA kernel achieving median filtering]{Chapters/chapter3/code/medianGeneric.cu}
+
+
  On the GPU's side, we note high dependence on window size due to the redundancy induced by the multiple fetches of each pixel inside each block, becoming higher with the window size as illustrated by Figure \ref{fig:median_overlap}. On C2070 card, thanks to a more efficient caching mechanism, this effect is lesser. On GPUs, dependency over image size is low, and due to slightly more efficient data transfers when copying larger data amounts, pixel throughputs increases with image size. As an example, transferring a 4096$\times$4096 pixel image (32~MBytes) is a bit faster than transferring 64 times a 512$\times$512 pixel image (0.5~MBytes).
  
  %% mettre l'eau à la bouche
  
-\lstinputlisting[label={lst:medianGeneric},caption=Generic CUDA kernel achieving median filtering]{Chapters/chapter3/code/medianGeneric.cu}
-
  \begin{figure}
     \centering
     \includegraphics[width=8cm]{Chapters/chapter3/img/median_1.png}
-   \caption{Exemple of 5x5 median filtering}
+   \caption{Example of 5x5 median filtering}
     \label{fig:median_1}
  \end{figure}
  
@@ -401,14 +404,14 @@ The minimum register count required to apply the forgetful selection method to a
  \begin{figure}
     \centering
     \includegraphics[width=6cm]{Chapters/chapter3/img/median5_overlap4.png}
-   \caption{Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously. The first 7 forgetful selection stages are common to both processed center pixels. Only the last 5 selections have to be done separately.}
+   \caption[Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously.]{Reducing register count in a 5$\times$5 register-only median kernel outputting 2 pixels simultaneously. The first 7 forgetful selection stages are common to both processed center pixels. Only the last 5 selections have to be done separately.}
     \label{fig:median5overlap}
  \end{figure}
  
  \begin{figure}
     \centering
     \includegraphics[width=6cm]{Chapters/chapter3/img/forgetful_selection4.png}
-   \caption{First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network. Arrows represent the result of the swapping function, with the lowest value at the starting point and the highest value at the end point.}
+   \caption[First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network.]{First iteration of the $5\times 5$ selection process, with $k_{25}=14$, which shows how Instruction Level Parallelism is maximized by the use of an incomplete sorting network. Arrows represent the result of the swapping function, with the lowest value at the starting point and the highest value at the end point.}
     \label{fig:median5overlap}
  \end{figure}
  
@@ -451,7 +454,7 @@ Figure \ref{fig:sap_examples2} shows an example of a $512\times 512$ pixel image
     \subfigure[Image denoised by a $3\times 3$ separable smoother]{\label{img:sap_example_sep_med3} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med3.png}}\\
     \subfigure[Image denoised by a $5\times 5$ separable smoother]{\label{img:sap_example_sep_med5} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med5.png}}\qquad
     \subfigure[Image background estimation by a $55\times 55$ separable smoother]{\label{img:sap_example_sep_med3_it2} \includegraphics[width=5cm]{Chapters/chapter3/img/airplane_sap25_sep_med111.png}}\\
-   \caption{Exemple of separable median filtering (smoother), applied to salt \& pepper noise reduction.}
+   \caption{Example of separable median filtering (smoother), applied to salt \& pepper noise reduction.}
     \label{fig:sap_examples2}
  \end{figure}