-8 or 16~bit-coded gray-level input images whose dimensions $H\times W$ are multiples of 512 pixels.
-
-\section{Data transfers, memory management.}
-This section deals with the following issues:
-\begin{enumerate}
-\item data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2-D caching mechanism of texture memory, specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has lead us to choose \textbf{texture memory} as primary GPU memory area for images.
-\item data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching in thread block shared memory.
-\item data outputting from kernels to GPU memory: there is actually no alternative to global memory, as kernels can not directly write into texture memory and as copying from texture to CPU memory would not be faster than from simple global memory.
-\item data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, keeping in mind it has to be used sparingly.
-\end{enumerate}
-Algorithm \ref{algo:memcopy} summarizes all the above considerations and describe how data are handled in our examples. For more information on how to handle the different types of GPU memory, we suggest to refer to CUDA programmer's guide.
-
-At debug stage, for simplicity's sake, we use the \textbf{cutil} library supplied by the NVidia developpement kit (SDK). Thus, in order to easily implement our examples, we suggest readers download download and install the latest NVidia-SDK (ours is SDK4.0), create a new directory \textit{SDK-root-dir/C/src/fast\_kernels} and adapt the generic \textit{Makefile} present in each sub-directory of \textit{SDK-root-dir/C/src/}. Then, only two more files will be enough to have a fully operational environnement: \textit{main.cu} and \textit{fast\_kernels.cu}.
-Listings \ref{lst:main1}, \ref{lst:fkern1} and \ref{lst:mkfile} implement all the above considerations minimally, while remaining functional.
-
-The main file of Listing \ref{lst:main1} is a simplified version of our actual main file.
-It has to be noticed that cutil functions \texttt{cutLoadPGMi} and \texttt{cutSavePGMi} only operate on unsigned integer data. As data is coded in short integer format for performance reasons, the use of these functions involves casting data after loading and before saving. This may be overcome by use of a different library. Actually, our choice was to modify the above mentioned cutil functions.
-
-Listing \ref{lst:fkern1} gives a minimal kernel skeleton that will serve as the basis for all other kernels. Lines 5 and 6 determine the coordinates $(i, j)$ of the pixel to be processed. Each pixel is associated with one thread.
-The instruction in line 8 combines writing the output gray-level value into global memory and fetching the input gray-level value from 2-D texture memory.
-The Makefile given in Listing \ref{lst:mkfile} shows how to adapt examples given in SDK.