X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/1874c46934f4ba7e8c2013d3829f65309456d292..c08513a856650905e4751c8b52c7cbb661368a5f:/BookGPU/Chapters/chapter3/ch3.tex diff --git a/BookGPU/Chapters/chapter3/ch3.tex b/BookGPU/Chapters/chapter3/ch3.tex index 95bce29..1b2e263 100755 --- a/BookGPU/Chapters/chapter3/ch3.tex +++ b/BookGPU/Chapters/chapter3/ch3.tex @@ -18,10 +18,10 @@ However, so as to propose concise and more readable code, we will assume the fol \section{Data transfers, memory management.} This section deals with the following issues: \begin{enumerate} -\item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory~hierarchy!texture~memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images. -\item Data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. \index{memory~hierarchy!texture~memory} Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching \index{prefetching} in shared memory. \index{memory~hierarchy!shared~memory} +\item Data transfer from CPU memory to GPU global memory: several GPU memory areas are available as destination memory but the 2D caching mechanism of texture memory, \index{memory hierarchy!texture memory} specifically designed for fetching neighboring pixels, is currently the fastest way to fetch gray-level pixel values inside a kernel computation. This has led us to choose \textbf{texture memory} as primary GPU memory area for input images. +\item Data fetching from GPU global memory to kernel local memory: as said above, we use texture memory. \index{memory hierarchy!texture memory} Depending on which process is run, texture data is used either by direct fetching in kernel local memory or through a prefetching \index{prefetching} in shared memory. \index{memory hierarchy!shared memory} \item Data outputting from kernels to GPU memory: there is actually no alternative to global memory, as kernels cannot directly write into texture memory and as copying from texture to CPU memory would not be faster than from simple global memory. -\item Data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, \index{memory~hierarchy!pinned~memory} keeping in mind it has to be used sparingly. +\item Data transfer from GPU global memory to CPU memory: it can be drastically accelerated by use of \textbf{pinned memory}, \index{memory hierarchy!pinned memory} keeping in mind it has to be used sparingly. \end{enumerate} Algorithm \ref{algo:memcopy} summarizes all the above considerations and describes how data are handled in our examples. For more information on how to handle the different types of GPU memory, we suggest referring to the CUDA programmer's guide. @@ -127,7 +127,7 @@ copy data from GPU global memory to CPU memory\label{algoMedianGeneric:memcpyD2H As mentioned earlier, the selection of the median value can be performed by more than one technique, using either histogram-based or sorting methods, each having its own benefits and drawbacks as will be discussed further down. \subsection{A naive implementation} -As a reference, Listing \ref{lst:medianGeneric} gives a simple, not to say simplistic, implementation of a CUDA kernel (\texttt{kernel\_medianR}) achieving generic $n\times n$ histogram-based median filtering. Its runtime has a very low data dependency, but this implementation does not suit GPU architecture very well. Each pixel loads the whole of its $n\times n$ neighborhood, meaning that one pixel is loaded multiple times inside one single thread block, and even more time-consuming, the use of a local vector (histogram[]) considerably downgrades performance, as the compiler automatically stores such vectors in local memory (slow) \index{memory~hierarchy!local~memory}. +As a reference, Listing \ref{lst:medianGeneric} gives a simple, not to say simplistic, implementation of a CUDA kernel (\texttt{kernel\_medianR}) achieving generic $n\times n$ histogram-based median filtering. Its runtime has a very low data dependency, but this implementation does not suit GPU architecture very well. Each pixel loads the whole of its $n\times n$ neighborhood, meaning that one pixel is loaded multiple times inside one single thread block, and even more time-consuming, the use of a local vector (histogram[]) considerably downgrades performance, as the compiler automatically stores such vectors in local memory (slow) \index{memory hierarchy!local memory}. Table \ref{tab:medianHisto1} displays measured runtimes of \texttt{kernel\_medianR} and pixel throughputs for each GPU version (C2070 and GTX480 targets) and for both CPU and GPU implementations. Usual window sizes of $3\times 3$, $5\times 5$, and $7\times 7$ are shown. Though some specific applications require larger window sizes and dedicated algorithms, such small square window sizes are most widely used in general purpose image processing. GPU runtimes have been obtained with a grid of 64-thread blocks. @@ -192,9 +192,9 @@ On the GPU's side, we note high dependence on window size due to the redundancy \section{NVIDIA GPU tuning recipes} When designing GPU code, besides thinking of the actual data computing process, one must choose the memory type in which to store temporary data. Three types of GPU memory are available: \begin{enumerate} -\item \textbf{Global memory, the most versatile:} \index{memory~hierarchy!global~memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism. -\item \textbf{Registers, the fastest:} \index{memory~hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count} -\item \textbf{Shared memory, a complex compromise:} \index{memory~hierarchy!shared~memory}\\All threads in one block can access $48~KBytes$ of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers. +\item \textbf{Global memory, the most versatile:} \index{memory hierarchy!global memory}\\Offers the largest storing space and global scope but is the slowest (400 to 800 clock cycles latency). \textbf{Texture memory} is physically included into it, but allows access through an efficient 2D caching mechanism. +\item \textbf{Registers, the fastest:} \index{memory hierarchy!registers}\\Allow access without latency, but only 63 registers are available per thread (thread scope), with a maximum of 32K per Streaming Multiprocessor (SM). \index{register count} +\item \textbf{Shared memory, a complex compromise:} \index{memory hierarchy!shared memory}\\All threads in one block can access $48~KBytes$ of shared memory, which is faster than global memory (20 clock cycles latency) but slower than registers. However, bank conflicts can occur if two threads of a warp try to access data stored in one single memory bank. In such cases, the parallel process is serialized which may cause significant performance decrease. One easy way to avoid this is to ensure that two consecutive threads in one block always access 32-bit data at two consecutive addresses. \end{enumerate}