1 \chapterauthor{Zulu pero}{Zulumachine Institute}
8 % \VA{Thomas Davenport}{Senior Adjutant to the Junior Marketing VP}
14 % A component part for an electronic item is
15 % manufactured at one of three different factories, and then delivered to
16 % the main assembly line.Of the total number supplied, factory A supplies
17 % 50\%, factory B 30\%, and factory C 20\%. Of the components
18 % manufactured at factory A, 1\% are faulty and the corresponding
19 % proportions for factories B and C are 4\% and 2\% respectively. A
20 % component is picked at random from the assembly line. What is the
21 % probability that it is faulty?
26 % \mbox{var}\widehat{\Delta} = \sum_{j = 1}^t \sum_{k = j+1}^t
27 % \mbox{var}\,(\hat{\alpha}_j - \hat{\alpha}_k) = \sum_{j = 1}^t
28 % \sum_{k = j+1}^t \sigma^2(1/n_j + 1/n_k). \label{2delvart2}
33 % \Boxhead{Box Title Here}
36 % \begin{theorem}\label{1th:Z_m}
37 % Let $m$ be a prime number. With the addition and multiplication as
38 % defined above, $Z_m$ is a field.
44 % \begin{notelist}{000000}
45 % \notes{Note:}{The process of integrating reengineering is best accomplished with an engineer, a dog, and a cat.}
50 % \VH{Think About It...}
53 % \VTA{The Information Revolution}{Business Week}
57 %\begin{definition}\label{1def:linearcomb}{}\end{definition}
65 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
67 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
71 basicstyle=\footnotesize\ttfamily,
82 % keywordstyle=[1]\textbf,
83 %identifierstyle=\textbf,
84 commentstyle=\color{white}\textbf,
85 stringstyle=\color{white}\ttfamily,
87 % framexleftmargin=17pt,
88 % framexrightmargin=5pt,
89 % framexbottommargin=4pt,
90 backgroundcolor=\color{lightgray},
93 %\DeclareCaptionFont{blue}{\color{blue}}
94 %\captionsetup[lstlisting]{singlelinecheck=false, labelfont={blue}, textfont={blue}}
96 %\DeclareCaptionFont{white}{\color{white}}
97 %\DeclareCaptionFormat{listing}{\colorbox{gray}{\parbox{\textwidth}{\hspace{15pt}#1#2#3}}}
98 %\captionsetup[lstlisting]{format=listing,labelfont=white,textfont=white, singleline}
99 %%%%%%%%%%%%%%%%%%%%%%%% Fin Listings %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
101 \newcommand{\kl}{\includegraphics[scale=0.6]{kernLeft.png}~}
102 \newcommand{\kr}{\includegraphics[scale=0.6]{kernRight.png}}
104 \chapter{Setting up the environnement.}
105 Image processing using a GPU often means using it as a general purpose computing processor, which soon brings up the issue of data transfers, especially when kernel runtime is fast and/or when large data sets are processed.
106 The truth is that, in certain cases, data transfers between GPU and CPU are slower than the actual computation on GPU.
107 It remains that global runtime can still be faster than similar processes run on CPU.
108 Therefore, to fully optimize global runtimes, it is important to pay attention to how memory transfers are done.
109 This leads us to propose, in the following section, an overall code structure to be used with all our kernel examples.
111 Obviously, our code originally accepts various image dimensions and can process color images.
112 However, so as to propose concise and more readable code, we will assume the following limitations:
113 8 or 16~bit-coded gray-level input images whose dimensions $H\times W$ are multiples of 512 pixels.
115 \section{Data transfers, memory management.}
116 This section deals with the following issues:
118 \item data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2-D caching mechanism of texture memory, specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has lead us to choose \textbf{texture memory} as primary GPU memory area for images.
119 \item data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching in thread block shared memory.
120 \item data outputting from kernels to GPU memory: there is actually no alternative to global memory, as kernels can not directly write into texture memory and as copying from texture to CPU memory would not be faster than from simple global memory.
121 \item data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, keeping in mind it has to be used sparingly.
123 Algorithm \ref{algo:memcopy} summarizes all the above considerations and describe how data are handled in our examples. For more information on how to handle the different types of GPU memory, we suggest to refer to CUDA programmer's guide.
125 At debug stage, for simplicity's sake, we use the \textbf{cutil} library supplied by the NVidia developpement kit (SDK). Thus, in order to easily implement our examples, we suggest readers download download and install the latest NVidia-SDK (ours is SDK4.0), create a new directory \textit{SDK-root-dir/C/src/fast\_kernels} and adapt the generic \textit{Makefile} present in each sub-directory of \textit{SDK-root-dir/C/src/}. Then, only two more files will be enough to have a fully operational environnement: \textit{main.cu} and \textit{fast\_kernels.cu}.
126 Listings \ref{lst:main1}, \ref{lst:fkern1} and \ref{lst:mkfile} implement all the above considerations minimally, while remaining functional.
128 The main file of Listing \ref{lst:main1} is a simplified version of our actual main file.
129 It has to be noticed that cutil functions \texttt{cutLoadPGMi} and \texttt{cutSavePGMi} only operate on unsigned integer data. As data is coded in short integer format for performance reasons, the use of these functions involves casting data after loading and before saving. This may be overcome by use of a different library. Actually, our choice was to modify the above mentioned cutil functions.
131 Listing \ref{lst:fkern1} gives a minimal kernel skeleton that will serve as the basis for all other kernels. Lines 5 and 6 determine the coordinates $(i, j)$ of the pixel to be processed. Each pixel is associated with one thread.
132 The instruction in line 8 combines writing the output gray-level value into global memory and fetching the input gray-level value from 2-D texture memory.
133 The Makefile given in Listing \ref{lst:mkfile} shows how to adapt examples given in SDK.
136 \SetNlSty{textbf}{}{:}
137 allocate and populate CPU memory \textbf{h\_in}\;
138 allocate CPU pinned-memory \textbf{h\_out}\;
139 allocate GPU global memory \textbf{d\_out}\;
140 declare GPU texture reference \textbf{tex\_img\_in}\;
141 allocate GPU array in global memory \textbf{array\_img\_in}\;
142 bind GPU array \textbf{array\_img\_in} to texture \textbf{tex\_img\_in}\;
143 copy data from \textbf{h\_in} to \textbf{array\_img\_in}\label{algo:memcopy:H2D}\;
144 kernel\kl gridDim,blockDim\kr()\tcc*[f]{outputs to d\_out}\label{algo:memcopy:kernel}\;
145 copy data from \textbf{d\_out} to \textbf{h\_out} \label{algo:memcopy:D2H}\;
146 \caption{Global memory management on CPU and GPU sides.}
150 \lstinputlisting[label={lst:main1},caption=Generic main.cu file used to launch CUDA kernels]{code/mainSkel.cu}
152 \lstinputlisting[label={lst:fkern1},caption=fast\_kernels.cu file featuring one kernel skeleton]{code/kernSkel.cu}
154 \lstinputlisting[label={lst:mkfile},caption=Generic Makefile based on those provided by NV SDK]{code/Makefile}
157 \section{Performance measurements}
158 As our goal is to design very fast implementations of basic image processing algorithms, we need to make quite accurate time-measurements, within the order of magnitude of $0.01~ms$. Again, the easiest way of doing so is to use the helper functions of the cutil library. As usual, as the durations we are measuring are short and possibly suject to non neglectable variations, a good practice is to measure multiple executions and issue the mean runtime. All time results given in this chapter have been obtained through 1000 calls to each kernel.
160 Listing \ref{lst:chronos} shows how to use the dedicated cutil functions. Timer declaration and creation only need to be performed once while reset, start and stop can be used as often as necessary. Synchronization is mandatory before stopping the timer (Line 7), to avoid runtime measure being biased.
161 \lstinputlisting[label={lst:chronos},caption=Time measurement technique using cutil functions]{code/exChronos.cu}
163 In an attempt to provide relevant speedup values, we either implemented CPU versions of the algorithms studied, or used the values found in existing literature. Still, the large number and diversity of hardware platforms and GPU cards make it impossible to benchmark every possible combination and significant differences may occur between the speedups we announce and those obtained with different devices. As a reference, our developing platform details as follows:
166 \item CPU codes run on:
168 \item Quad Core Xeon E31245 at 3.3GHz-8GByte RAM running Linux kernel 3.2
169 \item Quad Core Xeon E5620 at 2.40GHz-12GByte RAM running Linux kernel 2.6.18
171 \item GPU codes run on:
173 \item Nvidia Tesla C2070 hosted by a PC QuadCore Xeon E5620 at 2.4GHz-12GByte RAM, running Linux kernel 2.6.18
174 \item NVidia GeForce GTX 280 hosted by a PC QuadCore Xeon X5482 at 3.20GHz-4GByte RAM, running Linux kernel 2.6.32
178 All kernels have also been tested with various image sizes from 512$\times$512 to 4096$\times$4096 pixels. This allows to guess runtime dependancy over image size.
180 Last, like many authors, we chose to use the pixel throughput value of each process in Mega Pixels per second (MP/s) as a performance indicator, including data transfers and kernel runtimes.
181 In order to estimate the potential for improvement of each kernel, a reference throughput measurement, involving identity kernel of Listing \ref{lst:fkern1}, was performed. As this kernel only fetches input values from texture memory and outputs them to global memory without doing any computation, it represents the smallest, thus fastest, possible process and is taken as the reference throughput value (100\%). The same measurement was performed on CPU, with a maximum effective pixel throughput of 130~Mpixel per second. On GPU, depending on grid parameters it amounts to 800~MPixels/s on GTX280 and 1300~Mpixels/s on C2070.
185 \item[CUDA] Compute Unified Device Architecture.