\chapterauthor{Xavier Meyer and Bastien Chopard}{Department of Computer Science, University of Geneva, Switzerland}
-\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, hepia, \\ University of Applied Sciences of Western Switzerland -- Geneva, Switzerland}
+\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, HEPIA, \\ University of Applied Sciences of Western Switzerland -- Geneva, Switzerland}
%\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
%\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm}
-\chapter{Linear Programming on a GPU: A~Case~Study}
+\chapter{Linear programming on a GPU: a~case~study}
\section{Introduction}
\label{chXXX:sec:intro}
The simplex method~\cite{VCLP} is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947. A more efficient method, named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power.
\section{Simplex algorithm}
\label{chXXX:sec:simplex-algo}
-\subsection{Linear programming model\index{Linear programming}}
+\subsection{Linear programming model\index{linear programming}}
\label{chXXX:subsec:lp-model}
%% Mathematics of LP
An LP model in its canonical form can be expressed as the following optimization problem:
\subsection{Standard simplex algorithm}
\label{chXXX:subsec:simplex-std}
% Simple explanation of the simple algorithm : 3 operations
-The simplex method\index{Simplex!Standard method}~\cite{VCLP} is an algorithm for solving LP models.
+The simplex method\index{simplex!standard method}~\cite{VCLP} is an algorithm for solving LP models.
It proceeds by iteratively visiting vertices on the boundary of the feasible region.
This amounts to performing algebraic manipulations on the system of linear equations.
\subsection{Revised simplex method}
\label{chXXX:subsec:simplex-rev}
% Global revised simplex idea
-The operation that takes the most time in the standard method is the pivoting operation, and more specifically, the update of the constraints matrix~$\mbf{A}$. The revised method\index{Simplex!Revised method} tries to avoid this costly operation by updating only a smaller part of this matrix.
+The operation that takes the most time in the standard method is the pivoting operation, and more specifically, the update of the constraints matrix~$\mbf{A}$. The revised method\index{simplex!revised method} tries to avoid this costly operation by updating only a smaller part of this matrix.
%% Revised simplex method operations, useful for the algorithm
The revised simplex method uses the same operations as in the standard method to choose the entering and leaving variables.
\subsubsection*{Choice of the leaving variable}
The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis.
-
-\section{Branch-and-bound\index{Branch-and-bound} algorithm}
+\clearpage
+\section{Branch-and-bound\index{branch-and-bound} algorithm}
\label{chXXX:sec:bnb}
-\subsection{Integer linear programming\index{Integer linear programming}}
+\subsection{Integer linear programming\index{integer linear programming}}
\label{chXXX:subsec:ilp}
% Why and what are integer linear programming problem
In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. ILP problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important.
\subsection{Branching strategy}
\label{chXXX:subsec:bnb-branching}
% What it is and why it is important
-The branching strategy\index{Branch-and-bound!Branching} defines the method used to select the variable on which branching will occur.
+The branching strategy\index{branch-and-bound!branching} defines the method used to select the variable on which branching will occur.
The objective value of the child node depends greatly on the choice of this variable.
Branching on a variable may lead to a drop on the upper bound and thus speed up the exploration,
while branching on other variables could leave this bound unchanged.
\subsection{Node selection strategy}
\label{chXXX:subsec:bnb-node-select}
-The node selection strategy\index{Branch-and-bound!Node selection} defines the methodology used to explore the tree.
+The node selection strategy\index{branch-and-bound!node selection} defines the methodology used to explore the tree.
While the usual depth-first search and breadth-first search are considered and used,
some remarks about the tree exploration must be made.
First let us mention a few facts:
\subsection{Cutting-plane methods}
\label{chXXX:subsec:cuts}
-\textit{Cutting-planes}\index{Branch-and-bound!Cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space.
+\textit{Cutting-planes}\index{branch-and-bound!cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space.
They may be generated from the first LP solution (\textit{cut-and-branch}) or periodically during the B\&B (\textit{branch-and-cut}).
On the one hand cutting-planes may considerably reduce the size of the solution space, but on the other hand they increase the problem size.
Moreover, generating cutting-planes is costly since it requires a thorough analysis of the current state of the problem.
The cutting-planes generated must be carefully selected in order to avoid a huge increase in the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function.
Cutting-planes having the most impact on the problem are then selected, while the others are dropped.
-
+\clearpage
\section{CUDA considerations}
\label{chXXX:sec:cuda}
The most expensive operations in the simplex algorithm are linear algebra functions.
% Reduce operation
\subsection{Parallel reduction}
\label{chXXX:subsec:reduction}
-A parallel reduction\index{Parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU.
+A parallel reduction\index{parallel!reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU.
An optimized way of doing the reduction can be found in the examples\footnote{Available at http://docs.nvidia.com/cuda/cuda-samples/index.html\#advanced} provided by NVIDIA.
%In order to keep code listings compact hereafter, the reduction of values among a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}).
\begin{figure}[!h]
\centering
\includegraphics[width=10cm]{Chapters/chapter10/figures/Reduc3.pdf}
-\caption{Example of a parallel reduction at block level (courtesy NVIDIA).}
+\caption{Example of a parallel reduction at block level. (Courtesy NVIDIA).}
\label{chXXX:fig:reduc}
\end{figure}
Only variables required for decision-making operations are updated on the CPU.
The communications arising from the aforementioned scheme are illustrated in Figure~\ref{chXXX:fig:diagSTD}.
The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases.
-
+\clearpage
\begin{figure}[!h]
\centering
\includegraphics[width=90mm]{Chapters/chapter10/figures/DiagSTD_cap.pdf}
There are two cases where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could, for example, be the one found after solving the root node of the tree.
% Global flow
-\subsubsection*{Multi-GPU\index{Multi-GPU} exploration}
+\subsubsection*{Multi-GPU\index{multi-GPU} exploration}
Having the CPU act as a decision maker and the GPU as an explorer, allows for the possibility of using multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The CPU assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task.
\section{Performance model}
\label{chXXX:sec:MODEL}
-Performance models\index{Performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size.
+Performance models\index{performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size.
CUDA kernels require a different kind of modeling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps, and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency.
maros.mps & 847 & 1443 & 10006 & 1.7\\ \hline
perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline
\end{tabular}
-\caption{NETLIB problems solved in the range of 1 to 4 seconds}
+\caption{NETLIB problems solved in the range of 1 to 4 seconds.}
\label{chXXX:tab:medium}
\end{center}
\end{table}
\end{figure}
Finally, the biggest problems, and slowest to solve, are given in Table~\ref{chXXX:tab:big}. A new tendency can be observed in Figure~\ref{chXXX:fig:SlowSolve}. The \textit{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 such as bnl2.mps, pilot.mps, or greenbeb.mps. However, when this ratio is greater, the \textit{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on 80bau3b.mps (4.5) and fit2p.mps (4.3). Although the C-to-R ratio of d6cube.mps (14.9) exceeds the ones previously mentioned, the \textit{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which doesn't fully benefit from the lower complexity of sparse operations.
-
+\clearpage
\begin{table}[!h]
\begin{center}
\begin{tabular}{|l|r|r|r|r|}