\chapterauthor{Zulu pero}{Zulumachine Institute}
\graphicspath{{img/}}



%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
%      Listings
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\lstset{
  language=C,
  columns=fixed,
  basicstyle=\footnotesize\ttfamily,
  numbers=left,
  firstnumber=1,
  numberstyle=\tiny,
  stepnumber=5,             
  numbersep=5pt,              
  tabsize=3,                  
  extendedchars=true,         
  breaklines=true,       
  keywordstyle=\textbf,
  frame=single,         
  % keywordstyle=[1]\textbf,   
  %identifierstyle=\textbf,
  commentstyle=\color{white}\textbf,
  stringstyle=\color{white}\ttfamily,
  % xleftmargin=17pt,
  % framexleftmargin=17pt,
  % framexrightmargin=5pt,
  % framexbottommargin=4pt,
  backgroundcolor=\color{lightgray},
  }

%\DeclareCaptionFont{blue}{\color{blue}} 
%\captionsetup[lstlisting]{singlelinecheck=false, labelfont={blue}, textfont={blue}}

%\DeclareCaptionFont{white}{\color{white}}
%\DeclareCaptionFormat{listing}{\colorbox{gray}{\parbox{\textwidth}{\hspace{15pt}#1#2#3}}}
%\captionsetup[lstlisting]{format=listing,labelfont=white,textfont=white, singleline}
%%%%%%%%%%%%%%%%%%%%%%%% Fin Listings %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\newcommand{\kl}{\includegraphics[scale=0.6]{kernLeft.png}~}
\newcommand{\kr}{\includegraphics[scale=0.6]{kernRight.png}}

\chapter{Setting up the environment.}
Image processing on a GPU usually means using it as a general-purpose computing processor, which quickly raises the issue of data transfers, especially when kernel runtimes are short and/or when large data sets are processed.
In certain cases, data transfers between GPU and CPU even take longer than the actual computation on the GPU.
The global runtime can nevertheless remain shorter than that of a similar process run on the CPU.
Therefore, to fully optimize global runtimes, it is important to pay attention to how memory transfers are performed.
This leads us to propose, in the following section, an overall code structure to be used with all our kernel examples.

Our original code accepts various image dimensions and can process color images.
However, to keep the listings concise and readable, we assume the following restrictions:
8-~or 16-bit gray-level input images whose dimensions $H\times W$ are multiples of 512 pixels.
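
The restrictions above can be checked on the host before any data is sent to the GPU. The following plain C function is only a sketch: the name \texttt{image\_supported}, the constant \texttt{TILE} and the argument list are hypothetical and belong neither to cutil nor to the SDK.

\begin{lstlisting}
#include <stdbool.h>

/* Hypothetical constant: both dimensions must be multiples of it. */
#define TILE 512u

/* Hypothetical helper: returns true when an image satisfies the
   restrictions assumed in this chapter (8- or 16-bit gray levels,
   width and height multiples of 512 pixels). */
bool image_supported(unsigned int width, unsigned int height,
                     unsigned int bits_per_pixel)
{
    bool depth_ok = (bits_per_pixel == 8u || bits_per_pixel == 16u);
    bool dims_ok  = (width > 0u && height > 0u &&
                     width % TILE == 0u && height % TILE == 0u);
    return depth_ok && dims_ok;
}
\end{lstlisting}

With these assumptions, a 4096$\times$2048 image coded on 16~bits is accepted, whereas a 500$\times$512 image or a 24-bit color image is rejected.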

\section{Data transfers, memory management.}
This section deals with the following issues: 
\begin{enumerate}
\item data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory, but the 2-D caching mechanism of texture memory, specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as the primary GPU memory area for images.
\item data fetching from GPU global memory to kernel local memory: as stated above, we use texture memory. Depending on the process being run, texture data is either fetched directly into kernel local memory or prefetched into thread block shared memory.
\item data output from kernels to GPU memory: there is actually no alternative to global memory, as kernels cannot write directly into texture memory and as copying from texture memory to CPU memory would not be faster than copying from plain global memory.
\item data transfer from GPU global memory to CPU memory: it can be drastically accelerated by the use of \textbf{pinned memory}, keeping in mind that it has to be used sparingly.
\end{enumerate}
Algorithm \ref{algo:memcopy} summarizes all the above considerations and describes how data is handled in our examples. For more information on how to handle the different types of GPU memory, we refer the reader to the CUDA programming guide.

At the debug stage, for simplicity's sake, we use the \textbf{cutil} library supplied with the NVidia software development kit (SDK). Thus, in order to easily implement our examples, we suggest that readers download and install the latest NVidia SDK (ours is SDK~4.0), create a new directory \textit{SDK-root-dir/C/src/fast\_kernels}, and adapt the generic \textit{Makefile} present in each sub-directory of \textit{SDK-root-dir/C/src/}. Then, only two more files are needed to obtain a fully operational environment: \textit{main.cu} and \textit{fast\_kernels.cu}.
Listings \ref{lst:main1}, \ref{lst:fkern1} and \ref{lst:mkfile} implement all the above considerations in a minimal yet functional way.

The main file of Listing \ref{lst:main1} is a simplified version of our actual main file. 
Note that the cutil functions \texttt{cutLoadPGMi} and \texttt{cutSavePGMi} only operate on unsigned integer data. As our data is coded in short integer format for performance reasons, using these functions requires casting the data after loading and before saving. This can be avoided by using a different library; our own choice was to modify the above-mentioned cutil functions.
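
The casting step mentioned above can be sketched in plain C as follows. This is an illustration only, not an excerpt from our main file: the function names are hypothetical, and we assume unsigned 16-bit storage (\texttt{unsigned short}) so that every 8-~or 16-bit gray level fits.

\begin{lstlisting}
#include <stddef.h>

/* Hypothetical helper: narrows the unsigned int buffer filled by
   the image loader to 16-bit storage, one gray level per element.
   Assumes all values are gray levels that fit in 16 bits. */
void uint_to_ushort(const unsigned int *src, unsigned short *dst,
                    size_t n)
{
    for (size_t k = 0; k < n; ++k)
        dst[k] = (unsigned short)src[k];
}

/* Hypothetical helper: widens 16-bit results back to unsigned int
   before they are handed to the image writer. */
void ushort_to_uint(const unsigned short *src, unsigned int *dst,
                    size_t n)
{
    for (size_t k = 0; k < n; ++k)
        dst[k] = src[k];
}
\end{lstlisting}

Both loops touch every pixel once on the CPU, which is the overhead that a loader working directly on short integers avoids.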

Listing \ref{lst:fkern1} gives a minimal kernel skeleton that will serve as the basis for all other kernels. Lines 5 and 6 determine the coordinates $(i, j)$ of the pixel to be processed. Each pixel is associated with one thread.
The instruction in line 8 combines writing the output gray-level value into global memory and fetching the input gray-level value from 2-D texture memory.
The Makefile given in Listing \ref{lst:mkfile} shows how to adapt the examples provided in the SDK.

\begin{algorithm}
 \SetNlSty{textbf}{}{:}
 allocate and populate CPU memory \textbf{h\_in}\;
 allocate CPU pinned-memory \textbf{h\_out}\;
 allocate GPU global memory \textbf{d\_out}\;
 declare GPU texture reference \textbf{tex\_img\_in}\;
 allocate GPU array in global memory \textbf{array\_img\_in}\;
 bind GPU array \textbf{array\_img\_in} to texture \textbf{tex\_img\_in}\;
 copy data from \textbf{h\_in} to \textbf{array\_img\_in}\label{algo:memcopy:H2D}\; 
 kernel\kl gridDim,blockDim\kr()\tcc*[f]{outputs to d\_out}\label{algo:memcopy:kernel}\;
 copy data from \textbf{d\_out} to \textbf{h\_out} \label{algo:memcopy:D2H}\;
\caption{Global memory management on CPU and GPU sides.}
\label{algo:memcopy}
\end{algorithm}

\lstinputlisting[label={lst:main1},caption=Generic main.cu file used to launch CUDA kernels]{code/mainSkel.cu}

\lstinputlisting[label={lst:fkern1},caption=fast\_kernels.cu file featuring one kernel skeleton]{code/kernSkel.cu}

\lstinputlisting[label={lst:mkfile},caption=Generic Makefile based on those provided by NV SDK]{code/Makefile}


\section{Performance measurements}
As our goal is to design very fast implementations of basic image processing algorithms, we need quite accurate time measurements, of the order of $0.01$~ms. Again, the easiest way of obtaining them is to use the helper functions of the cutil library. Since the durations we measure are short and possibly subject to non-negligible variations, good practice is to measure multiple executions and report the mean runtime. All timing results given in this chapter have been obtained from 1000 calls to each kernel.
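
The averaging principle itself does not depend on cutil. The host-only C sketch below illustrates it under stated assumptions: \texttt{mean\_runtime\_ms} and \texttt{dummy\_workload} are hypothetical names, the workload is a placeholder for a kernel launch, and the C11 function \texttt{timespec\_get} provides a wall-clock time rather than a monotonic one.

\begin{lstlisting}
#include <stddef.h>
#include <time.h>

/* Hypothetical placeholder standing for one call to the process
   being measured. */
void dummy_workload(void)
{
    volatile unsigned int acc = 0;
    for (unsigned int k = 0; k < 1000u; ++k)
        acc += k;
}

/* Hypothetical helper: runs `run` `runs` times and returns the mean
   duration of one call in milliseconds, or -1.0 on invalid input. */
double mean_runtime_ms(void (*run)(void), int runs)
{
    struct timespec start, stop;

    if (run == NULL || runs <= 0)
        return -1.0;

    timespec_get(&start, TIME_UTC);
    for (int r = 0; r < runs; ++r)
        run();
    /* With GPU kernels, synchronize with the device here,
       before reading the clock again. */
    timespec_get(&stop, TIME_UTC);

    double elapsed_ms =
        (double)(stop.tv_sec - start.tv_sec) * 1e3 +
        (double)(stop.tv_nsec - start.tv_nsec) * 1e-6;
    return elapsed_ms / runs;
}
\end{lstlisting}

Calling \texttt{mean\_runtime\_ms(dummy\_workload, 1000)} mirrors the 1000 calls used for our measurements; timing the whole loop rather than each call keeps the clock overhead out of the mean.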

Listing \ref{lst:chronos} shows how to use the dedicated cutil functions. Timer declaration and creation only need to be performed once, while reset, start and stop can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid biasing the runtime measurement.
\lstinputlisting[label={lst:chronos},caption=Time measurement technique using cutil functions]{code/exChronos.cu}

In an attempt to provide relevant speedup values, we either implemented CPU versions of the algorithms studied or used values found in the existing literature. Still, the large number and diversity of hardware platforms and GPU cards make it impossible to benchmark every possible combination, and significant differences may occur between the speedups we report and those obtained with different devices. As a reference, our development platforms are as follows:

\begin{itemize}
\item CPU codes run on: 
  \begin{itemize}
  \item Quad Core Xeon E31245 at 3.3~GHz, 8~GByte RAM, running Linux kernel 3.2
  \item Quad Core Xeon E5620 at 2.40~GHz, 12~GByte RAM, running Linux kernel 2.6.18
  \end{itemize}
\item GPU codes run on:
\begin{itemize}
  \item NVidia Tesla C2070 hosted by a PC Quad Core Xeon E5620 at 2.4~GHz, 12~GByte RAM, running Linux kernel 2.6.18
  \item NVidia GeForce GTX 280 hosted by a PC Quad Core Xeon X5482 at 3.20~GHz, 4~GByte RAM, running Linux kernel 2.6.32
  \end{itemize}
\end{itemize}

All kernels have also been tested with various image sizes, from 512$\times$512 to 4096$\times$4096 pixels. This makes it possible to estimate how runtime depends on image size.

Last, like many authors, we chose to use the pixel throughput of each process, in megapixels per second (MP/s), as the performance indicator; it includes both data transfers and kernel runtimes.
In order to estimate the potential for improvement of each kernel, a reference throughput measurement, involving the identity kernel of Listing \ref{lst:fkern1}, was performed. As this kernel only fetches input values from texture memory and outputs them to global memory without doing any computation, it represents the smallest, and thus fastest, possible process and is taken as the reference throughput value (100\%). The same measurement was performed on CPU, with a maximum effective pixel throughput of 130~MP/s. On GPU, depending on grid parameters, it amounts to 800~MP/s on the GTX~280 and 1300~MP/s on the C2070.
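
The conversion from a mean runtime to a throughput figure can be written as the short C sketch below. The function names are hypothetical, and we assume that one megapixel means $10^6$ pixels and that the mean runtime, in milliseconds, already includes data transfers.

\begin{lstlisting}
/* Hypothetical helper: throughput in MP/s of a process handling one
   width x height image in mean_ms milliseconds.
   pixels / (ms * 1e-3 s) / 1e6 = pixels / (ms * 1e3). */
double throughput_mps(unsigned int width, unsigned int height,
                      double mean_ms)
{
    if (mean_ms <= 0.0)
        return 0.0;                 /* invalid measurement */
    double pixels = (double)width * (double)height;
    return pixels / (mean_ms * 1e3);
}

/* Hypothetical helper: throughput as a percentage of the reference
   value measured with the identity kernel. */
double relative_throughput(double mps, double reference_mps)
{
    if (reference_mps <= 0.0)
        return 0.0;                 /* invalid reference */
    return 100.0 * mps / reference_mps;
}
\end{lstlisting}

As an order of magnitude, the 1300~MP/s reference measured on the C2070 corresponds to roughly 12.9~ms for a 4096$\times$4096 image, and a kernel reaching 650~MP/s on that card would be rated at 50\% of the reference.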

\section{Glossary}
\begin{Glossary}
\item[CUDA] Compute Unified Device Architecture.
\end{Glossary}

\putbib[biblio]