X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/11f93a2e8880680f6b192298e5ce0697d2596a31..1874c46934f4ba7e8c2013d3829f65309456d292:/BookGPU/Chapters/chapter3/ch3.tex

diff --git a/BookGPU/Chapters/chapter3/ch3.tex b/BookGPU/Chapters/chapter3/ch3.tex
index 7078aaf..95bce29 100755
--- a/BookGPU/Chapters/chapter3/ch3.tex
+++ b/BookGPU/Chapters/chapter3/ch3.tex
@@ -61,7 +61,7 @@ The Makefile given in Listing \ref{lst:mkfile} shows how to adapt examples given
 \section{Performance measurements}
 As our goal is to design very fast implementations of basic image processing algorithms, we need to make quite accurate time measurements, within the order of magnitude of $0.01$~ms. Again, the easiest way of doing so is to use the helper functions of the \textbf{cutil} library. As usual, because the durations we are measuring are short and possibly subject to non-negligible variations, a good practice is to measure multiple executions and report the mean runtime. All time results given in this chapter have been obtained through 1000 calls to each kernel.
-Listing \ref{lst:chronos} shows how to use the dedicated \textbf{cutil} functions \index{Cutil library!Timer usage}. Timer declaration and creation need to be performed only once, while the reset, start, and stop functions can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid biasing the runtime measurement.
+Listing \ref{lst:chronos} shows how to use the dedicated \textbf{cutil} functions \index{Cutil library!timer usage}. Timer declaration and creation need to be performed only once, while the reset, start, and stop functions can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid biasing the runtime measurement.
 \lstinputlisting[label={lst:chronos},caption=Time measurement technique using cutil functions]{Chapters/chapter3/code/exChronos.cu}
 In an attempt to provide relevant speedup values, we either implemented CPU versions of the algorithms studied or used the values found in existing literature. Still, the large number and diversity of hardware platforms and GPU cards make it impossible to benchmark every possible combination, and significant differences may occur between the speedups we report and those obtained with different devices. As a reference, our development platform is detailed as follows:
@@ -204,7 +204,7 @@ To overcome this, the most frequent choice made in efficient implementations fou
 As for registers, designing a generic median filter that would use only that type of memory seems difficult, due to the above-mentioned 63 register-per-thread limitation. \index{register count} Yet, nothing forbids us from designing fixed-size filters, each of them specific to one of the most popular window sizes. It might be worth the effort, as a dramatic increase in performance could be expected.
-Another track to follow in order to improve performance of GPU implementations consists of hiding latencies generated by arithmetic instruction calls and memory accesses. Both can be partially hidden by introducing Instruction-Level Parallelism \index{Instruction-Level Parallelism}(ILP) and by increasing the data count output by each thread. Though such techniques may seem to break the NVIDIA occupancy paradigm, they can lead to dramatically higher data throughput values.
+Another track to follow in order to improve performance of GPU implementations consists of hiding latencies generated by arithmetic instruction calls and memory accesses. Both can be partially hidden by introducing Instruction-Level Parallelism \index{instruction-level parallelism}(ILP) and by increasing the data count output by each thread. Though such techniques may seem to break the NVIDIA occupancy paradigm, they can lead to dramatically higher data throughput values.
 The following sections illustrate these ideas and detail the design of the fastest CUDA median filter known to date.
 \section{A 3$\times$3 median filter: using registers}
@@ -252,7 +252,7 @@ In our $3\times 3$ pixel window example, the minimum register count becomes $k_9
 This iterative process is illustrated in Figure \ref{fig:forgetful3}, where it achieves one entire $3\times 3$ median selection, beginning with $k_9=6$ elements. The \textit{forgetful selection} method, used in \cite{mcguire2008median}, does not imply fully sorting the values, but only selecting the minimum and maximum values, which, at the price of a few iteration steps ($n^2-k$), reduces arithmetic complexity.
-Listing \ref{lst:medianForget1pix3} details this process, where forgetful selection is achieved by use of a simple 2-value swapping function ($s()$, lines 1 to 5) that swaps input values if necessary, so as to achieve the first steps of an incomplete sorting network \cite{Batcher:1968:SNA:1468075.1468121}. Moreover, whenever possible, in order to increase the ILP, \index{Instruction-Level Parallelism} successive calls to $s()$ are done with independent elements as arguments. This is illustrated by the macro definitions of lines 7 to 12 and by Figure \ref{fig:bitonic}, which details the first iteration of the $5\times 5$ selection, starting with $k_{25}=14$ elements.
+Listing \ref{lst:medianForget1pix3} details this process, where forgetful selection is achieved by use of a simple 2-value swapping function ($s()$, lines 1 to 5) that swaps input values if necessary, so as to achieve the first steps of an incomplete sorting network \cite{Batcher:1968:SNA:1468075.1468121}. Moreover, whenever possible, in order to increase the ILP, \index{instruction-level parallelism} successive calls to $s()$ are done with independent elements as arguments. This is illustrated by the macro definitions of lines 7 to 12 and by Figure \ref{fig:bitonic}, which details the first iteration of the $5\times 5$ selection, starting with $k_{25}=14$ elements.
 \begin{figure}[b]
 \centering
 \includegraphics[width=6cm]{Chapters/chapter3/img/forgetful_selection.png}
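For readers without access to the exChronos.cu file included above via \lstinputlisting, the cutil timing pattern described in the first hunk can be sketched as follows. This is a minimal illustration only, not the book's listing: the kernel name, launch configuration, and data pointer are placeholders, and it assumes the legacy cutil.h helpers shipped with the CUDA SDK of that era (cutCreateTimer, cutResetTimer, cutStartTimer, cutStopTimer, cutGetTimerValue).

#include <stdio.h>
#include <cuda_runtime.h>
#include <cutil.h>                      // legacy CUDA SDK helper library

__global__ void myKernel(int *d_data);  // placeholder kernel

void timeKernel(int *d_data, dim3 grid, dim3 block)
{
    unsigned int timer = 0;
    cutCreateTimer(&timer);             // declaration and creation: performed once only
    cutResetTimer(timer);
    cutStartTimer(timer);
    for (int i = 0; i < 1000; i++)      // 1000 calls, as used for the chapter's timings
        myKernel<<<grid, block>>>(d_data);
    cudaThreadSynchronize();            // mandatory before stopping the timer
    cutStopTimer(timer);
    printf("mean runtime: %f ms\n", cutGetTimerValue(timer) / 1000.0f);
    cutDeleteTimer(timer);
}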
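The swap-based forgetful selection described in the last hunk can likewise be sketched in a few lines. This is an illustration of the technique only, not a reproduction of Listing medianForget1pix3: the names s and MINMAX6 are chosen here for the example, and writing independent swap pairs on the same line is what exposes the instruction-level parallelism the text refers to.

// Conditional 2-value swap: orders the pair so that a <= b afterwards.
__device__ inline void s(int &a, int &b)
{
    int t = a;
    if (a > b) { a = b; b = t; }
}

// One forgetful-selection step on six values (the 3x3 case starts with
// k9 = 6 elements): after these seven swaps, a holds the minimum and f the
// maximum of the six, and both can be discarded before the next pixel is
// loaded. Swaps grouped on the same line act on independent pairs, so the
// compiler can schedule them concurrently (ILP).
#define MINMAX6(a,b,c,d,e,f)  s(a,b); s(c,d); s(e,f); \
                              s(a,c); s(d,f);         \
                              s(a,e); s(b,f);

Repeating analogous min/max extractions on five, four, and finally three values, while feeding in the remaining pixels of the window, eventually leaves only the median in a register, as described for Figure fig:forgetful3.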