diff --git a/BookGPU/Chapters/chapter4/ch4.tex b/BookGPU/Chapters/chapter4/ch4.tex
index 0a0d6cb28edc6733c3f9349d9beabf491262ea69..90612c9d798fdaee746e263d63901772d95ed11c 100644
--- a/BookGPU/Chapters/chapter4/ch4.tex
+++ b/BookGPU/Chapters/chapter4/ch4.tex
@@ -8,7 +8,7 @@
  
  \section{Overview}
  In this chapter, after dealing with GPU median filter implementations,
-we propose to explore how convolutions\index{Convolution}  can be implemented on modern
+we propose to explore how convolutions\index{convolution}  can be implemented on modern
  GPUs. Widely used in digital image processing filters, the \emph{convolution
  operation} basically consists of taking the sum of products of elements
  from two 2D functions, letting one of the two functions move over
@@ -20,7 +20,7 @@ to $I$ as an $H\times L$ pixel gray-level image and to $I(x,y)$ as the gray-leve
  value of each pixel of coordinates $(x,y)$.
  
  
-
+\clearpage
  \section{Definition}
  Within a digital image $I$, the convolution operation is performed between
  image $I$ and convolution mask \emph{h} (To avoid confusion with other
@@ -81,7 +81,7 @@ This first implementation consists of a rather naive application to
  convolutions of the techniques applied to median filters in the
  previous chapter, as a reminder: texture memory used with incoming
  data, pinned memory with output data, optimized use of registers
-while processing data and multiple output per thread\index{Multiple output per thread}. 
+while processing data and multiple output per thread\index{multiple output per thread}. 
  One significant difference lies in the fact
  that the median filter uses only one parameter, the size of the window mask,
  which can be hard-coded, while a convolution mask requires referring to several parameters; hard-coding
@@ -239,8 +239,8 @@ However, our technique requires writing one kernel per mask size, which can be s
  
  \lstinputlisting[label={lst:convoGene8x8pL3},caption=CUDA kernel achieving a $3\times 3$ convolution operation with the mask in symbol memory and direct data fetches in texture memory]{Chapters/chapter4/code/convoGene8x8pL3.cu}
  
-\subsection{Using shared memory to store prefetched data\index{Prefetching}.}
- \index{memory~hierarchy!shared~memory}
+\subsection{Using shared memory to store prefetched data\index{prefetching}.}
+ \index{memory hierarchy!shared memory}
  A more convenient way of coding a convolution kernel is to use shared memory to perform a prefetching stage of the whole halo before computing the convolution sums.
  This proves to be quite efficient and more versatile, but it obviously generates some overhead because 
  \begin{itemize}
@@ -356,7 +356,7 @@ $\mathbf{4096\times 4096}$&1.533 \\\hline
  \label{tab:cpyToArray}
  \end{table}
  \lstinputlisting[label={lst:convoSepSh},caption=data copy between the calls to 1D convolution kernels achieving a 2D separable convolution operation]{Chapters/chapter4/code/convoSepSh.cu}
-\lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a horizontal 1D convolution operation after a preloading \index{Prefetching} of data into shared memory]{Chapters/chapter4/code/convoSepShV.cu}
+\lstinputlisting[label={lst:convoSepShV},caption=CUDA kernel achieving a horizontal 1D convolution operation after a preloading \index{prefetching} of data into shared memory]{Chapters/chapter4/code/convoSepShV.cu}
  \lstinputlisting[label={lst:convoSepShH},caption=CUDA kernel achieving a vertical 1D convolution operation after a preloading of data into shared memory]{Chapters/chapter4/code/convoSepShH.cu}
   
  \section{Conclusion}