X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/17bff40b83bcdcc39769f9e59c70ffae1c525b72..e9bec782df9908a1d94ab5629dbe7710293019bd:/BookGPU/Chapters/chapter10/ch10.tex?ds=inline diff --git a/BookGPU/Chapters/chapter10/ch10.tex b/BookGPU/Chapters/chapter10/ch10.tex index 7a39dca..17b3b4f 100644 --- a/BookGPU/Chapters/chapter10/ch10.tex +++ b/BookGPU/Chapters/chapter10/ch10.tex @@ -1,5 +1,5 @@ \chapterauthor{Xavier Meyer and Bastien Chopard}{Department of Computer Science, University of Geneva, Switzerland} -\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, hepia, \\ University of Applied Sciences of Western Switzerland -- Geneva, Switzerland} +\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, HEPIA, \\ University of Applied Sciences of Western Switzerland -- Geneva, Switzerland} %\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva} %\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm} @@ -297,7 +297,7 @@ and $\bs\beta = \mbf{N^T} \mbf{(B^{-1})^T} \mbf{d}$ \subsubsection*{Choice of the leaving variable} The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis. - +\clearpage \section{Branch-and-bound\index{branch-and-bound} algorithm} \label{chXXX:sec:bnb} \subsection{Integer linear programming\index{integer linear programming}} @@ -428,7 +428,7 @@ Various types or families of cutting-planes exist. Those that are most applied a The cutting-planes generated must be carefully selected in order to avoid a huge increase in the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function. Cutting-planes having the most impact on the problem are then selected, while the others are dropped. - +\clearpage \section{CUDA considerations} \label{chXXX:sec:cuda} The most expensive operations in the simplex algorithm are linear algebra functions. @@ -442,7 +442,7 @@ In the following section, we first quickly describe the CUDA reduction operation % Reduce operation \subsection{Parallel reduction} \label{chXXX:subsec:reduction} -A parallel reduction\index{parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU. +A parallel reduction\index{parallel!reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU. An optimized way of doing the reduction can be found in the examples\footnote{Available at http://docs.nvidia.com/cuda/cuda-samples/index.html\#advanced} provided by NVIDIA. %In order to keep code listings compact hereafter, the reduction of values among a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}). @@ -549,7 +549,7 @@ The bulk of the data representing the problem is stored on the GPU. Only variables required for decision-making operations are updated on the CPU. The communications arising from the aforementioned scheme are illustrated in Figure~\ref{chXXX:fig:diagSTD}. The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases. - +\clearpage \begin{figure}[!h] \centering \includegraphics[width=90mm]{Chapters/chapter10/figures/DiagSTD_cap.pdf} @@ -937,7 +937,7 @@ nesm.mps & 663 & 2923 & 13988 & 4.4\\ \hline maros.mps & 847 & 1443 & 10006 & 1.7\\ \hline perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline \end{tabular} -\caption{NETLIB problems solved in the range of 1 to 4 seconds} +\caption{NETLIB problems solved in the range of 1 to 4 seconds.} \label{chXXX:tab:medium} \end{center} \end{table} @@ -950,7 +950,7 @@ perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline \end{figure} Finally, the biggest problems, and slowest to solve, are given in Table~\ref{chXXX:tab:big}. A new tendency can be observed in Figure~\ref{chXXX:fig:SlowSolve}. The \textit{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 such as bnl2.mps, pilot.mps, or greenbeb.mps. However, when this ratio is greater, the \textit{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on 80bau3b.mps (4.5) and fit2p.mps (4.3). Although the C-to-R ratio of d6cube.mps (14.9) exceeds the ones previously mentioned, the \textit{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which doesn't fully benefit from the lower complexity of sparse operations. - +\clearpage \begin{table}[!h] \begin{center} \begin{tabular}{|l|r|r|r|r|}