-In the case of a 3$\times$3 median filter, the simplest solution consists in associating one register to each gray-level value, then sorting those 9 values and selecting the fifth one, i.e. the median value. For such a small amount of data to sort, a simple selection method is well indicated. As shown in Listing \ref{lst:kernelMedian3RegTri9} (\texttt{kernelMedian3RegTri9()}), the constraint of only using registers leads to adopt an unusual manner of coding. However, results are persuasive: runtimes are divided by around 120 on GTX280 and 80 on C2070, while only reduced by a 3.5 factor on CPU.
-The diagram of Figure \ref{fig:compMedians1} summarizes these first results. Only C2070 throughputs are shown and compared to CPU results. We included the maximum effective pixel throughput in order to see the improvement potential of the different implementations. We also introduced throughputd achieved by \textit{libJacket}, a commercial implementation, as it was the fastest known implementation of a 3$\times$3 median filter to date, as illustrated in \cite{chen09}. One of the authors of libJacket kindly posted the CUDA code of its 3$\times$3 median filter, that we inserted into our own coding structure. The algorithm itself is quite similar to ours, but running it in our own environement produced higher throughput values than those published in \cite{chen09}, not due to different hardware capabilities between our GTX280 and the GTX260 used in the paper, but to the way we perform memory transfers and to our register-only method of storing temporary data.
+In the case of a 3$\times$3 median filter, the simplest solution consists in associating one register to each gray-level value, then sorting those 9 values and selecting the fifth one, i.e. the median value. For such a small amount of data to sort, a simple selection method is well indicated. As shown in Listing \ref{lst:kernelMedian3RegTri9} (\texttt{kernelMedian3RegSort9()}), the constraint of only using registers leads to adopt an unusual manner of coding. However, results are persuasive: runtimes are divided by around 120 on GTX280 and 80 on C2070, while only reduced by a 3.5 factor on CPU.
+The diagram of Figure \ref{fig:compMedians1} summarizes these first results, obtained with a block size of 128 threads on GTX280 and 256 on C2070. Only C2070 throughputs are shown and compared to CPU results. We included the maximum effective pixel throughput in order to see the improvement potential of the different implementations. We also introduced throughputd achieved by \textit{libJacket}, a commercial implementation, as it was the fastest known implementation of a 3$\times$3 median filter to date, as illustrated in \cite{chen09}. One of the authors of libJacket kindly posted the CUDA code of its 3$\times$3 median filter, that we inserted into our own coding structure. The algorithm itself is quite similar to ours, but running it in our own environement produced higher throughput values than those published in \cite{chen09}, not due to different hardware capabilities between our GTX280 and the GTX260 used in the paper, but to the way we perform memory transfers and to our register-only method of storing temporary data.