BookGPU/Chapters/chapter10/ch10.tex

   1
   2
   3 \chapterauthor{Xavier Meyer and Bastien Chopard}{Department of Computer Science, University of Geneva, Switzerland}
   4 \chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, Hepia, \\ University of Applied Sciences of Western Switzerland - Geneva, Switzerland}
   5 %\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
   6
   7 %\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm}
   8 \chapter{Linear programming on a GPU: a case study}
   9 \section{Introduction}
  10 \label{chXXX:sec:intro}
  11 The simplex method is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947~\cite{VCLP}. A more efficient method named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power. Indeed, current implementations of the revised simplex strive to produce the expected results, if any. In this context, parallelization is the natural idea to investigate~\cite{HALL}. Already in 1996, Thomadakis et al.~\cite{THOMAD} implemented the standard method on a massively parallel computer and obtained an increase in performances when solving dense or large problems.
  12
  13 A few years back, in order to meet the demand for processing power, graphics card vendors made their graphical processing units (GPU) available for general-purpose computing. Since then GPUs have gained a lot of popularity as they offer an opportunity to accelerate algorithms having an architecture well-adapted to the GPU model. The simplex method falls into this category. Indeed, GPUs exhibit a massively parallel architecture optimized for matrix processing. To our knowledge, there are only few simplex implementations on GPU. Bieling et al.~\cite{BIELING} presented encouraging results while solving small to mid-size LP problems with the revised simplex. However, the complexity of their algorithm seems to be rather close to the one of the standard simplex with similar heuristics. Following the steps of this first work, an implementation of the revised simplex~\cite{LALAMI11} showed interesting results on non-sparse and square matrices. Finally, an implementation of the interior point method~\cite{JUNG08} outperformed its CPU equivalent on mid-size problems.
  14
  15 The Branch-and-Bound (B\&B) algorithm is extensively used for solving integer linear programming (ILP) problems.
  16 This tree-based exploration method subdivides the feasible region of the relaxed LP model into successively smaller subsets to obtain bounds on the objective value.
  17 Each corresponding submodel can be solved with the simplex method, and the bounds computed determine whether further branching is required.
  18 Hence, a natural parallelization of the B\&B method is to let the CPU manage the B\&B tree and despatch the relaxed submodels to a LP solver on a GPU. We refer the reader to chapter~\ref{ch8:GPU-accelerated-tree-based-exact-optimization-methods}, and the references therein, for a more complete introduction to parallel B\&B algorithms.
  19
  20 In this chapter, we present a GPU implementation of the standard and revised simplex methods, based on the CUDA technology of NVIDIA. We also derive a performance model and establish its accurateness. Let us mention that there are many technicalities involved in CUDA programming, in particular regarding the management of tasks and memories on the GPU. Thus fine tuning is indispensable to avoid a breakdown on performance.
  21 %With respect to the simplex algorithm, special attention must be given to numerical stability.
  22
  23 The chapter is organized as follows. First, we begin with a description of the standard and revised simplex methods
  24 and introduce the heuristics used in our implementations. This is followed by a presentation of the B\&B algorithm.
  25 The next section points out CUDA aspects which are important for our implementations.
  26 In the following one, we focus on the GPU implementations of the simplex method as well as the B\&B algorithm.
  27 In the sixth section, we describe a performance model for the standard simplex implementation.
  28 In the seventh section, a performance comparison between our implementations is made on real-life problems and an analysis is given.
  29 Finally, we summarize the results obtained and consider new perspectives.
  30
  31 \section{Simplex algorithm}
  32 \label{chXXX:sec:simplex-algo}
  33 \subsection{Linear programming model}
  34 \label{chXXX:subsec:lp-model}
  35 %% Mathematics of LP
  36 A linear programming\index{linear programming} (LP) model in its canonical form can be expressed as the following optimization problem:
  37 \begin{equation}
  38 \label{chXXX:equ:canon-form}
  39 %\centering
  40 \begin{array}{lcl}
  41  \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
  42  \text{subject to} &\hspace{10pt} &\mbf{Ax \leq b} \hspace{30pt}\\
  43                             &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
  44 \end{array}
  45 \end{equation}
  46 where $\mbf{x}=(x_j),\mbf{c}=(c_j) \in \mathbb{R}^n,\mbf{b}=(b_i) \in \mathbb{R}^m$ and $\mbf{A}=(a_{ij})$ is the $m \times n$ constraints matrix.
  47 The objective function $z = \mbf{c^T x}$ is the inner product of the cost vector $\mbf{c}$ and the unknown variables $\mbf{x}$.
  48 An element $\mbf{x}$ is called a solution which is said to be feasible
  49 if it satisfies the $m$ linear constraints imposed by $\mbf{A}$ and the bound $\mbf{b}$.
  50 The feasible region $\{\mbf{x}\in \mathbb{R}^n\, |\, \mbf{Ax \leq b},\, \mbf{x \geq 0}\}$ is a convex polytope.
  51 An optimal solution to the LP problem will therefore reside on a vertex of this polytope.
  52
  53 \subsection{Standard simplex algorithm}
  54 \label{chXXX:subsec:simplex-std}
  55 % Simple explanation of the simple algorithm : 3 operations
  56 The simplex method\index{simplex!standard method}~\cite{VCLP} is an algorithm for solving linear programming (LP) models.
  57 It proceeds by iteratively visiting vertices on the boundary of the feasible region.
  58 This amounts to performing algebraic manipulations on the system of linear equations.
  59
  60 We begin by reformulating the model.
  61 So-called \textit{slack variables} $x_{n+i}$ are added to the canonical form in order to replace inequalities by equalities in equation~(\ref{chXXX:equ:canon-form}):
  62 \begin{equation}
  63 \label{chXXX:equ:augm-form}
  64 %\centering
  65 \begin{array}{cccccccc}
  66 x_{n+i} & = &  b_i - & \displaystyle\sum\limits^{n}_{j=1} & a_{ij}x_j & & (i={1,2,...,m})\\
  67 \end{array}
  68 \end{equation}
  69 The resulting problem is called the \textit{augmented form} in which the variables are divided into two disjoint index sets,
  70 $\B$ and $\N$, which correspond to the basic and nonbasic variables.
  71 Basic variables, which form the basis of the problem, are on the left-hand side of equation~(\ref{chXXX:equ:augm-form}),
  72 while nonbasic variables, which form the core of the equations, appear on the right-hand~side.
  73 We can thus consider the following LP form:
  74 \begin{equation}
  75 %\centering
  76 \begin{array}{lcl}
  77  \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
  78  \text{subject to} &\hspace{10pt} &\mbf{Ax = b} \hspace{30pt}\\
  79                             &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
  80 \end{array}
  81 \end{equation}
  82 where $\mbf{x} \in \mathbb{R}^{n+m}$ and $\mbf{A}$ is now the $m\times (n+m)$ matrix obtained by concatenating the constraints matrix with the $m\times m$ identity matrix $\mbf{I_m}$.
  83 The cost vector has been padded with zeros so that $\mbf{c} \in \mathbb{R}^{n+m}$.
  84
  85 The basic and nonbasic variables can be separated given the expression $\mbf{A_\N x_\N + x_\B = b}$.
  86 Similarly, $z = \mbf{c^T x = c_\N^T x_\N + c_\B^T x_\B}$ with $\mbf{c_\B = 0}$.
  87 By definition, a basic solution is obtained by assigning null values to all nonbasic variables ($\mbf{x_\N=0}$).
  88 Hence, $\mbf{x=x_\B = b}$ is a basic solution.
  89
  90 The simplex algorithm searches for the optimal solution through an iterative process.
  91 For the sake of simplicity, we will assume here that $\mbf{b>0}$ (\textit{i.e.} the origin belongs to the feasible region).
  92 Otherwise a preliminary treatment is required to generate a feasible initial solution (see~\S~\ref{chXXX:sec:heuristics}).
  93 A typical iteration then consists of three operations (summarized in algorithm~\ref{chXXX:algo:simplex-std}).
  94 \begin{enumerate}
  95 %%%%%%%%%%%%%%%%%
  96 \item \textbf{Choosing the entering variable.}
  97 The entering variable is a nonbasic variable whose increase will lead to an increase of the value of the objective function $z$.
  98 This variable must be selected with care so as to yield a substantial leap towards the optimal solution.
  99 The standard way of making this choice is to select the variable $x_e$ with the largest positive coefficient
 100 $c_e = \max\{c_j>0\, |\, j\in \N\}$ in the objective function $z$.
 101 However, other strategies such as choosing the positive coefficient with the smallest index, prove to be useful.
 102 %%%%%%%%%%%%%%%%%
 103 \item \textbf{Choosing the leaving variable.}
 104 This variable is the basic variable which first violates its constraint as the entering variable $x_e$ increases.
 105 The choice of the leaving variable must guarantee that the solution remains feasible.
 106 More precisely, setting all nonbasic variables except $x_e$ to zero, $x_e$ is bounded by
 107 $$t=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$$
 108 If $t=+\infty$, the LP problem is unbounded. \\
 109 Otherwise, $t=\dfrac{b_\ell}{a_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable; \\
 110 the element $a_{\ell e}$ is called the pivot.
 111 %%%%%%%%%%%%%%%%%
 112 \item \textbf{Pivoting.}
 113 Once both variables are defined, the pivoting operation switches these variables from one set to the other:
 114 the entering variable enters the basis $\mathcal{B}$, taking the place of the leaving variable which now belongs to $\mathcal{N}$,
 115 namely $\mathcal{B} \leftarrow (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$ and $\mathcal{N} \leftarrow (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$.
 116 Correspondingly, the columns with index $\ell$ and $e$ are exchanged between $\mbf{I_m}$ and $\mbf{A_\N}$, and similarly for $\mbf{c_\B=0}$ and $\mbf{c_\N}$.
 117 We then update the constraints matrix $\mbf{A}$, the bound $\mbf{b}$ and the cost $\mbf{c}$ using Gaussian elimination.
 118 More precisely, denoting $\mbf{\tilde{I}_m}$, $\mbf{\tilde{A}_\N}$, $\mbf{\tilde{c}_\B}$ and $\mbf{\tilde{c}_\N}$ the resulting elements after the exchange, Gaussian elimination then transforms the tableau
 119
 120 \begin{center}
 121 \begin{tabular}{|c|c||c|}
 122 $\mbf{\tilde{A}_\N}$ & $\mbf{\tilde{I}_m}$ & $\mbf{b}$ \\
 123 \hline
 124 $\mbf{\tilde{c}^T_\N}$ & $\mbf{\tilde{c}^T_\B}$ & $z$
 125 \end{tabular}
 126 \end{center}
 127 \noindent
 128 into a tableau with updated values for $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
 129 \begin{center}
 130 \begin{tabular}{|c|c||c|}
 131 $\mbf{A_\N}$ & $\mbf{I_m}$ & $\mbf{b}$ \\
 132 \hline
 133 $\mbf{c^T_\N}$ & $\mbf{c^T_\B}$ & $z-tc_e$
 134 \end{tabular}
 135 \end{center}
 136 The latter is obtained by first dividing the $\ell$-th row by $a_{\ell e}$;
 137 the resulting row multiplied by $a_{ie}$ ($i\not=\ell$), resp. by $c_e$, is then added to the $i$-th row, resp. the last row.
 138 Hence, $\mbf{c_\B=0}$.
 139 These operations amount to jumping from the current vertex to an adjacent vertex with objective value $z=tc_e$.
 140 \end{enumerate}
 141 % End condition
 142 The algorithm ends when no more entering variable can be found, that is when~$\mbf{c_\N \le 0}$.
 143
 144 %% \begin{algorithm}
 145 %% \caption{Standard simplex algorithm}
 146 %% \label{chXXX:algo:simplex-std}
 147 %% \begin{algorithmic}
 148 %%      \STATE \texttt{//1. Find entering variable}
 149 %%      \IF{$\mbf{c_\N \le 0}$}
 150 %%      \RETURN Optimal
 151 %%    \ENDIF
 152 %%    \STATE Choose an index $e\in \N$ such that $c_e>0$
 153 %%      \STATE \texttt{//2. Find leaving variable}
 154 %%      \IF{$(\mbf{A_\N})_e \mbf{\le 0}$}
 155 %%      \RETURN Unbounded
 156 %%      \ENDIF
 157 %%      \STATE Let $\ell \in \B$ be the index such that \\
 158 %%      \hspace*{10mm} $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$
 159 %%      \STATE \texttt{//3. Pivoting - update }
 160 %%    \STATE $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$
 161 %%      \STATE Compute $z_{best}:=z_{best}+t c_e$
 162 %%    \STATE Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$
 163 %%      \STATE Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
 164 %%    \STATE Go to 1.
 165 %% \end{algorithmic}
 166 %% \end{algorithm}
 167
 168 \begin{algorithm}
 169 \SetNlSty{}{}{:}
 170     \textit{//1. Find entering variable}\;
 171      \If{$\mbf{c_\N \le 0}$} {
 172         \Return Optimal\;
 173    }
 174    Choose an index $e\in \N$ such that $c_e>0$ \;
 175    \textit{//2. Find leaving variable}\;
 176         \If{$(\mbf{A_\N})_e \mbf{\le 0}$} {
 177         \Return Unbounded
 178         }
 179         Let $\ell \in \B$ be the index such that \;
 180          $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$\;
 181   \textit{//3. Pivoting - update }\;
 182    $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$\;
 183  Compute $z_{best}:=z_{best}+t c_e$\;
 184     Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$\;
 185 Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$\;
 186     Go to 1.\;
 187 \caption{Standard simplex algorithm}
 188 \label{chXXX:algo:simplex-std}
 189
 190 \end{algorithm}
 191
 192
 193 \subsection{Revised simplex method}
 194 \label{chXXX:subsec:simplex-rev}
 195 % Global revised simplex idea
 196 The operation which takes the most time in the standard method, is the pivoting operation, and more specifically the update of the constraints matrix~$\mbf{A}$. The revised method\index{simplex!revised method} tries to avoid this costly operation by only updating a smaller part of this matrix.
 197
 198 %% Revised simplex method operations, useful for the algorithm
 199 The revised simplex method uses the same operations as in the standard method to choose the entering and leaving variables.
 200 However, since the constraints matrix $\mbf{A}$ need not be fully updated, some additional reformulation is required.
 201
 202 At any stage of the algorithm, basic and nonbasic variables can be separated according to $\mbf{Ax = A_\N x_\N + A_\B x_\B = b}$.
 203 We can then transform the system $\mbf{Ax=b}$ into $\mbf{A_\B x_\B = b-A_\N x_\N}$.
 204 Let us denote $\mbf{B=A_\B}, \mbf{N=A_\N}$.
 205 Since $\mbf{B}$ is invertible, we can write
 206 \begin{align*}
 207 \mbf{x}_\B&=\mbf{B^{-1}b-B^{-1}Nx_\N} \\
 208 %\hline
 209 z&=\mbf{c_\B^T B^{-1}b+(c_\N^T-c_\B^T B^{-1}N)x_\N}
 210 \end{align*}
 211 The vector $\mbf{c_\N^T-c_\B^T B^{-1}N}$ is called the reduced cost vector.
 212
 213 The choice of the leaving variable can be rewritten.
 214 Setting all nonbasic variables save the entering variable $x_e$, to zero, $x_e$ is then bounded by
 215 $$t=\min\left\{\dfrac{(\mbf{B^{-1}b})_i}{(\mbf{B^{-1}N})_{ie}}\, \bigg|\, (\mbf{B^{-1}N})_{ie}>0,\, i=1,\dots,m \right\}$$
 216 If $t=+\infty$, the LP problem is unbounded. Otherwise, $t=\frac{(\mbf{B^{-1}b})_\ell}{(\mbf{B^{-1}N})_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable.
 217
 218 Recall that the pivoting operation begins by exchanging columns with index $\ell$ and $e$ between $\mbf{B}$ and $\mbf{N}$, and similarly for $\mbf{c_\B}$ and $\mbf{c_\N}$.
 219
 220 To emphasize the difference between standard and revised simplex, we first express the updating for the standard simplex. This amounts to
 221 \begin{align*}
 222 [\mbf{B} \quad \mbf{N} \quad \mbf{b}] & \leftarrow [\mbf{I_m} \quad \mbf{B^{-1}N} \quad \mbf{B^{-1}b}] \\
 223 [\mbf{c_\B^T} \quad \mbf{c_\N^T}] & \leftarrow [\mbf{0} \quad \mbf{c_\N^T-c_\B^T B^{-1}N}]
 224 \end{align*}
 225 which moves the current vertex to an adjacent vertex $\mbf{x=B^{-1}b}$ with objective value $z=\mbf{c_\B^T B^{-1}b}$ (the end condition remains $\mbf{c_\N \le 0}$). Many computations performed in this update phase can in fact be avoided.
 226
 227 The main point of the revised simplex method is that the only values which really need to be computed at each step are: $\mbf{B^{-1}}$, $\mbf{B^{-1}b}$,$\mbf{B^{-1}N}_{e}$, $\mbf{c_\B^T B^{-1}}$ and $\mbf{c_\B^T B^{-1}b}$. However, matrix inversion is a time consuming operation (of cubic order for Gaussian elimination).
 228 Fortunately, there are efficient ways of computing an update for $\mbf{B^{-1}}$.
 229 One way is to take advantage of the sparsity of the LP problem by using the LU sparse decomposition.
 230 This decomposition may be updated at a small cost at each iteration~\cite{SUHL93}.
 231 Another way is to use the \textit{product form of the inverse}~\cite{dantzig1953product}, which we describe hereafter.
 232
 233 Let $\mbf{b}_1,\dots,\mbf{b}_m$ be the columns of $\mbf{B}$, $\mbf{v}\in \mathbb{R}^m$
 234 and $\mbf{a}=\mbf{Bv}=\sum_{i=1}^m v_i\mbf{b}_i$.
 235 Denote $\mbf{B_a}=(\mbf{b}_1,\dots,\mbf{b}_{p-1},\mbf{a},\mbf{b}_{p+1},\dots,\mbf{b}_m)$ for a given $1\le p\le m$ such that $v_p\not= 0$.
 236 We want to compute $\mbf{B_a}^{-1}$.
 237 We first write $$\mbf{b}_p=\dfrac{1}{v_p}+\mbf{a}\sum_{i\not=p} \dfrac{-v_i}{v_p}\mbf{b}_i$$
 238 Define $$\bs\eta = \left(\dfrac{-v_1}{v_p},\dots,\dfrac{-v_{p-1}}{v_p},\dfrac{1}{v_p},\dfrac{-v_{p+1}}{v_p},\dots,\dfrac{-v_m}{v_p}\right)^T$$
 239 and $$\mbf{E}=(\bs\varepsilon_1,\dots,\bs\varepsilon_{p-1},\bs\eta,\bs\varepsilon_{p+1},\dots,\bs\varepsilon_m)$$
 240 where $\bs\varepsilon_j = (0,\dots,0,1,0,\dots,0)^T$ is the $j$-th element of the canonical basis of~$\mathbb{R}^m$.
 241 Then $\mbf{B_a E=B}$, so $\mbf{B_a}^{-1}=\mbf{E\; B}^{-1}$.
 242
 243 We apply these preliminary considerations to the simplex algorithm
 244 with $\mbf{a=N}_e, \mbf{v=(B^{-1}N)}_e$ (recall that $x_e$ is the entering variable).
 245 If initially $\mbf{B}$ is the identity matrix $\mbf{I_m}$, at the $k$-th iteration of the algorithm,
 246 the inverse matrix is given by $\mbf{B}^{-1}=\mbf{E}_k \mbf{E}_{k-1}\cdots \mbf{E}_2 \mbf{E}_1$,
 247 where $\mbf{E}_i$ is the matrix constructed at the $i$-th iteration.
 248 %Note that matrices like $\mbf{E}$ are sparse and can therefore be stored compactly.
 249
 250 \subsection{Heuristics and improvements}
 251 \label{chXXX:sec:heuristics}
 252 Specific heuristics or methods are needed to improve the performance and stability of the simplex algorithm.
 253 We will explain below how we find an initial feasible solution and how we choose the entering and leaving variables.
 254
 255 \subsubsection*{Finding an initial feasible solution}
 256 Our implementations use the \textit{two-phase simplex}~\cite{VCLP}.
 257 The first phase aims at finding a feasible solution required by the second phase to solve the original problem.
 258 If the origin $\mbf{x=0}$ is not a feasible solution, we proceed to find such a solution by solving a so-called \textit{auxiliary problem} with the simplex algorithm.
 259 This can be achieved by adding a non-negative artificial variable to each constraint equation corresponding to a basic variable which violates its non-negativity condition,
 260 before minimizing the sum of these artificial variables. The algorithm will thus try to drive all artificial variables towards zero.
 261 If it succeeds, then a basic feasible solution is available as an initial solution for the second phase
 262 in which it attempts to find an optimal solution to the original problem. Otherwise the latter is infeasible.
 263
 264 To avoid having to introduce these additional auxiliary variables, we use an alternate version of this procedure.
 265 For basic variables with index in $\mathcal{I}=\{j\in~\B \,\,|\,\, x_j<0 \}$, we temporarily relax the non-negativity condition in order to have a feasible problem,
 266 and then apply the simplex algorithm to minimize $w=-\sum_{j\in \mathcal{I}} x_j$.
 267 However, we update $\mathcal{I}$ at each iteration and modify accordingly the objective function,
 268 whose role is to drive these infeasible basic variables towards their original bound.
 269 If at some stage $\mathcal{I}=\emptyset$, we end up with an initial feasible solution.
 270 Otherwise, the algorithm terminates with $\mathcal{I}\not=\emptyset$ and $w>0$, which indicates that the original problem is infeasible.
 271 The alteration introduced above involves more computations during the first phase,
 272 but it offers the advantage of preserving the problem size and making good use of the GPU processing power.
 273
 274 \subsubsection*{Choice of the entering variable}
 275 The number of iterations required to solve a problem depends on the method used to select the entering variable. The one described in section~\ref{chXXX:subsec:simplex-std} chooses the most promising variable $x_e$ in terms of cost. While being inexpensive to compute, this method can lead to an important number of iterations before the best solution is found.
 276
 277 There exists various heuristics to select this variable $x_e$.
 278 One of the most commonly used is the \textit{steepest-edge} method~\cite{GOLDFRAB77}.
 279 To improve the speed at which the best solution is found, this method takes into account the coefficients of $\mbf{B}^{-1}\mbf{N}$.
 280 This can be explained from the geometrical point of view.
 281 The constraints $\mbf{Ax=b}$ form the hull of a convex polytope.
 282 The simplex algorithm moves from one vertex (\textit{i.e.}~a~solution) to another while trying to improve the objective function.
 283 The steepest-edge method searches for the edge along which the rate of improvement of the objective function is the best.
 284 The entering variable $x_e$ is then determined by
 285 $$ e = \argmax \left\{ \dfrac{c_j}{\sqrt{\gamma_j}} \, \bigg| \, c_j > 0 , j \in \N \right\} $$
 286 with $ \gamma_j = \Vert \mbf{\mbf{B}^{-1}\mbf{N}_j} \Vert^2 $.
 287
 288 This method is quite costly to compute but it reduces significantly the number of iterations required to solve a problem. This heuristic can be directly applied to the standard simplex since the full-tableau is updated at each iteration. However it defeats the purpose of the revised simplex algorithm since the aim is precisely to avoid updating the whole constraints matrix at each iteration.
 289 Taking this into account, the steepest-edge coefficients $\bs\gamma$ are updated based only on their current value. For the sake of clarity, the hat notation is used to differentiate the next iteration value of a variable from its current value: for example if $\bs\gamma$ denotes the steepest-edge coefficients at the current iteration, then $\hat{\bs\gamma}$ are the updated coefficients.
 290
 291 Given the column of the entering variable $\mbf{d} = \mbf{(B^{-1}N)}_e$, we may process afresh the steepest-edge coefficient of the entering variable as $\gamma_e = \Vert \mbf{d} \Vert^2$.
 292 The updated steepest-edge coefficients are then given by
 293 $$\hat{\gamma_j} = \max\left\{\gamma_j - 2\hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\} \quad \textrm{ for } j \neq e $$
 294 $$\hat{\gamma_e} = \gamma_e / d_e^2$$
 295 with $\hat{\bs\alpha} = \mbf{N^T} (\mbf{(\hat{B}^{-1})^T})_\ell $
 296 and $\bs\beta = \mbf{N^T} \mbf{(B^{-1})^T} \mbf{d}$
 297
 298 \subsubsection*{Choice of the leaving variable}
 299 The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis.
 300
 301 \section{Branch-and-Bound\index{Branch-and-Bound} algorithm}
 302 \label{chXXX:sec:bnb}
 303 \subsection{Integer linear programming}
 304 \label{chXXX:subsec:ilp}
 305 % Why and what are integer linear programming problem
 306 In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. Integer linear programming\index{integer linear programming} (ILP) problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important.
 307
 308 % What is the cost of adding this constraint
 309 From a geometrical perspective, the simplex algorithm is a method for finding an optimum in a convex polytope defined by a LP problem. However, the additional integrality condition results in the loss of this convexity property. The resolution method must then be altered in order to find a solution to the ILP problem. The idea is to explore the solution space by a divide-and-conquer approach in which the simplex algorithm is used to find a local optimum in subspaces.
 310
 311 \subsection{Branch-and-Bound tree}
 312 \label{chXXX:subsec:bnb-tree}
 313 % Intro
 314 The solution space is explored by the \textit{Branch-and-Bound} (B\&B) algorithm~\cite{ACHTERBERG07}.
 315 The strategy used is conceptually close to a tree traversal.
 316
 317 % Branching
 318 %At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it.
 319 %This initial relaxed problem represents the root of the tree about to be built.
 320 %From the obtained solution, a variable $x_j$ (with bounds $l_j \leq x_j \leq u_j$) that violates its integrality condition, is chosen as a branching variable.
 321 %The current problem will then be divided, if possible, into two subproblems: one having $x_j$ constrained to $l_j' = l_j$ and $u_j' = \lfloor x_j \rfloor$ and the other to $l_j' = \lceil x_j \rceil$ and $u_j' = u_j$.
 322 %Each of these LP subproblems represents a child node waiting to be solved.
 323
 324 % Branching
 325 At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it.
 326 This initial relaxed problem represents the root of the tree about to be built.
 327 From the obtained solution $\bs{\xi}$, if $\xi_j$ is not integral for some $j$, then $x_j$ can be chosen as a branching variable.
 328 The current problem will then be divided, if possible, into two subproblems: one having $x_j\le \lfloor \xi_j \rfloor$ and the other $x_j\ge \lceil \xi_j \rceil$.
 329 Each of these LP subproblems represents a child node waiting to be solved.
 330
 331 % When does it ends
 332 This is repeated for each subproblems in such a way that all variables are led towards integral values.
 333 At some point a subproblem will either be infeasible or lead to a feasible ILP solution (leaf nodes).
 334 The algorithm ends when the tree has been fully visited, returning the best feasible ILP solution.
 335
 336 % Upper, lower bounds and pruning
 337 During the exploration, lower and upper bounds on the objective value are computed.
 338 The upper bound is represented by the best objective value encountered in a child node.
 339 This non-admissible solution hints towards what the objective value of the ILP problem could be.
 340 The lower bound is the best ILP solution yet found, in other words the objective value of the most promising leaf node.
 341 The bounds may be used to prune subtrees when further branching cannot improve the best current solution.
 342
 343 % How can we be improve the exploration
 344 In the B\&B algorithm, the main two operations that impact the convergence of the bounds towards the optimal solution are
 345 the branching strategy and the node selection strategy.
 346
 347 \subsubsection*{Example}
 348 Figure~\ref{chXXX:fig:bnb} illustrates the use of the B\&B algorithm for solving an ILP.
 349 The ILP problem is the following:
 350 \begin{align*}
 351 \textrm{Maximize}  \quad z= x_1 + 4x_2 & \\
 352 \textrm{Subject to}\quad \quad 5x_1 + 8x_2 & \leq 40\\
 353  -2x_1 + 3x_2 & \leq 9\\
 354  x_1, x_2 & \geq 0 \quad \textrm{ integer-valued}
 355 \end{align*}
 356 The nodes are solved with the simplex method in the order written on the figure.
 357 At the 3rd step of the B\&B, the first feasible solution is encountered (light grey leaf).
 358 The optimal solution is encountered at the 7th step (dark grey leaf).
 359 However, this is assessed only after solving the last two nodes (8th and 9th)
 360 whose objective value $z$ is inferior to the best one yet found.
 361
 362 \begin{figure}[!h]
 363 \centering
 364 \includegraphics[width=12cm]{Chapters/chapter10/figures/tree2_conv.pdf}
 365 \caption{Solving an ILP problem using a Branch-and-Bound algorithm}
 366 \label{chXXX:fig:bnb}
 367 \end{figure}
 368
 369 \subsection{Branching strategy}
 370 \label{chXXX:subsec:bnb-branching}
 371 % What it is and why it is important
 372 The branching strategy\index{Branch-and-Bound!branching} defines the method used to select the variable on which branching will occur.
 373 The objective value of the child node depends greatly on the choice of this variable.
 374 Branching on a variable may lead to a drop on the upper bound and thus speed up the exploration,
 375 while branching on other variables could leave this bound unchanged.
 376
 377 % Example of strategy
 378 Several branching strategies exist~\cite{Achterberg05}. Let us briefly comment on some of them in terms of improvement of the objective function and processing cost.
 379 \begin{itemize}
 380 \item \textit{Smallest index strategy} is a greedy approach that always selects the variable with the smallest index as branching variable. This method is simple, has a cheap processing cost but does not try to select the best variable.
 381 \item \textit{Strong branching strategy} is an exhaustive strategy. The branching variable selected is the one amongst all the potential variables that leads to the best solution. This means that for each potential branching variable, its potential child nodes must be solved. This method is easy to implement, but its computational cost makes it inefficient.
 382 \item \textit{Reliability branching strategy} is a strategy which maintains a pseudo-cost~\cite{BENICHOU71} for each potential branching variable. The pseudo-cost of a variable is based on the result obtained when branching on it at previous steps. Since at the start, pseudo-costs are unreliable, a limited strong branching approach is used until pseudo-costs are deemed reliable. This method is more complex than the two others and requires fine tuning. It offers however the best trade-off between improvement and computional cost.
 383 \end{itemize}
 384
 385 \subsection{Node selection strategy}
 386 \label{chXXX:subsec:bnb-node-select}
 387 The node selection strategy\index{Branch-and-Bound!node selection} defines the methodology used to explore the tree.
 388 While the usual depth-first search and breadth-first search are considered and used,
 389 some remarks about the tree exploration must be made.
 390 First let us mention a few facts:
 391 \begin{enumerate}
 392  \item Solutions obtained from child nodes cannot be better than the one of their parent node.
 393  \item An LP solver is able to quickly find a solution if the subproblem (child node) is only slightly different than its parent.
 394 \end{enumerate}
 395 Given the above statements, it is of interest to quickly find a feasible solution.
 396 Indeed, this allows to prune all pending nodes which do not improve the solution found.
 397 However, the quality of the latter impacts the amount of nodes pruned.
 398 It takes more time to produce a good solution because one must search for the best nodes in the tree.
 399 Consequently, a trade-off must be made between the quality and the time required to find a solution.
 400
 401 Two types of strategies~\cite{Linderoth97} can then be considered:
 402 \begin{itemize}
 403 \item \textit{Depth-first search} aims at always selecting one of the child nodes until a leaf
 404 (infeasible subproblem or feasible solution) is reached.
 405 This strategy is characterized by fast solving and it quickly finds a feasible solution.
 406 It mostly improves the lower bound.
 407 \item  \textit{Best-first search} aims at always selecting one of the most promising nodes in the tree.
 408 This strategy is characterized by slower solving but guarantees that the first feasible solution found is the optimal.
 409 It mostly improves the upper bound.
 410 \end{itemize}
 411 A problem occurs with the best-first search strategy:
 412 there might be numerous nodes having solutions of the same quality, thus making the choice of a node difficult.
 413 To avoid this problem, the \textit{best-estimate search} strategy~\cite{FORREST74} differentiates the best nodes by estimating their potential cost with the help of a pseudo-cost (see previous section).
 414
 415 The most promising variant is a hybrid strategy: the base strategy being the best-estimate search,
 416 the subtrees of the best-estimated nodes are then subject to a limited depth-first search,
 417 more commonly called \textit{plunging}.
 418 This method launches a fast search for a feasible solution in the most promising subtrees,
 419 thus improving the upper and lower bounds at the same time.
 420
 421 \subsection{Cutting-plane methods}
 422 \label{chXXX:subsec:cuts}
 423 \textit{Cutting-planes}\index{Branch-and-Bound!cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space.
 424 They may be generated from the first LP solution (\textit{Cut-and-Branch}) or periodically during the B\&B (\textit{Branch-and-Cut}).
 425 On the one hand cutting-planes may considerably reduce the size of the solution space, but on the other hand they increase the problem size.
 426 Moreover, generating cutting-planes is costly since it requires a thorough analysis of the current state of the problem.
 427
 428 Various types or families of cutting-planes exist. Those that are most applied are the \textit{Gomory cutting-planes} and the \textit{complemented mixed integer rounding inequalities} (c-MIR). Other kinds of cutting-planes target specific families of problems, for example the \textit{0-1 Knapsack cutting-planes} or \textit{flow cover cuts}.
 429
 430 The cutting-planes generated must be carefully selected in order to avoid a huge increase of the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function.
 431 Cutting-planes having the most impact on the problem are then selected, while the others are dropped.
 432
 433 \section{CUDA considerations}
 434 \label{chXXX:sec:cuda}
 435 Most expensive operations in the simplex algorithm are linear algebra functions. The CUBLAS library is used for dense matrix-vector multiplications (\texttt{cublasDgemv}) and dense vector sums (\texttt{cublasDaxpy}). The CUSPARSE library is used for the sparse matrix-vector multiplication (\texttt{cusparseDcsrmv}).
 436
 437 However, complex operations are required to implement the simplex and must be coded from scratch.
 438 For example, reduction operations (\textit{argmax, argmin}) are fundamental for selecting variables.
 439 In the following section, we first quickly describe the CUDA reduction operation, before making some global remarks on kernel optimization.
 440
 441 % Reduce operation
 442 \subsection{Parallel reduction}
 443 \label{chXXX:subsec:reduction}
 444 A parallel reduction\index{Parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step ends the reduction on a single block or on the CPU.
 445 \begin{figure}[!h]
 446 \centering
 447 \includegraphics[width=10cm]{Chapters/chapter10/figures/Reduc2.pdf}
 448 \caption{Example of a parallel reduction at block level (courtesy NVIDIA)}
 449 \label{chXXX:fig:reduc}
 450 \end{figure}
 451
 452 An optimized way of doing the reduction can be found in the example provided by NVIDIA.
 453 In order to keep code listings compact hereafter, the reduction of values amongst a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}).
 454
 455 % Volkov - hiding latency, register usage
 456 \subsection{Kernel optimization}
 457 Optimizing a kernel is a difficult task. The most important point is to determine whether the performances are limited by the bandwidth or by the instruction throughput. Depending on the case and the specificities of the problem, various strategies may be applied. This part requires a good understanding of the underlying architecture and its limitations. The CUDA Programming Guide offers some insight into this subject, together with the interesting articles and presentations by Vassily Volkov~\cite{VOLKOV2010}. The CUDA profiler is the best way to monitor the performances of a kernel. This tool proposes multiple performance markers giving indications about potential bottlenecks.
 458
 459 \section{Implementations}
 460 \label{chXXX:sec:impl}
 461 % Foreword - no expand
 462 In this section, the implementations of the simplex algorithm are studied. The emphasis is put on the main algorithms and kernels having a major impact on performance. The \textit{expand} method used for choosing the leaving variable will not be detailed for that reason.
 463 Then the Branch-and-Bound (B\&B) implementation is quickly explained focusing on the interaction between the B\&B algorithm and the simplex solver.
 464
 465 \subsection{Standard simplex}
 466 \label{chXXX:subsec:impl-simplex-std}
 467 % Standard simplex --> cublas but steepest edge
 468 The implementation of the standard simplex algorithm (see~\S~\ref{chXXX:subsec:simplex-std}) is rather straightforward. The main difference is that the basis matrix is not stored in memory since it is equals to the identity matrix most of the time. Instead a proper data structure keeps track of the basic variables and their values.
 469 %% \begin{algorithm}
 470 %% \caption{Standard simplex algorithm}
 471 %% \label{chXXX:algo:impl-simplex-std}
 472 %% \begin{algorithmic}
 473 %%      \STATE \textit{// Find entering variable $e$}
 474 %%      \STATE $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$
 475 %%      \STATE $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $
 476 %%      \IF{$e < 0$}
 477 %%      \RETURN \texttt{optima\_found}
 478 %%      \ENDIF
 479 %%      \STATE \textit{// Find leaving variable $\ell$}
 480 %%      \STATE $\ell, t \leftarrow expand((\mbf{A_\N})_e)$
 481 %%      \IF{$\ell < 0$}
 482 %%      \RETURN \texttt{unbounded}
 483 %%      \ENDIF
 484 %%      \STATE \textit{// Pivoting}
 485 %%      \STATE $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$
 486 %%      \STATE $\mbf{r} \leftarrow \mbf{N^T}_\ell$
 487 %%      \STATE $x_e \leftarrow x_e + t $
 488 %%      \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $
 489 %%      \STATE $\mbf{A_\N} \leftarrow  \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}
 490 %%      \STATE $\mbf{c} \leftarrow \mbf{c} -  c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}
 491 %%      \STATE $swap(x_e, x_{\ell})$
 492 %% \end{algorithmic}
 493 %% \end{algorithm}
 494
 495 \begin{algorithm}
 496 \caption{Standard simplex algorithm}
 497 \label{chXXX:algo:impl-simplex-std}
 498         \textit{// Find entering variable $e$}\;
 499          $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$\;
 500         $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $\;
 501         \If{$e < 0$}     {
 502         \Return \texttt{optima\_found}
 503         }
 504         \textit{// Find leaving variable $\ell$     }\;
 505         $\ell, t \leftarrow expand((\mbf{A_\N})_e)$\;
 506         \If{$\ell < 0$} {
 507         \Return \texttt{unbounded}
 508         }
 509         \textit{// Pivoting}\;
 510         $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$\;
 511         $\mbf{r} \leftarrow \mbf{N^T}_\ell$\;
 512         $x_e \leftarrow x_e + t $\;
 513         $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $\;
 514         $\mbf{A_\N} \leftarrow  \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}\;
 515         $\mbf{c} \leftarrow \mbf{c} -  c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}\;
 516         $swap(x_e, x_{\ell})$\;
 517
 518 \end{algorithm}
 519
 520
 521
 522 The first step of algorithm~\ref{chXXX:algo:impl-simplex-std} is the selection of the entering variable using the steepest-edge heuristic. This step is discussed thoroughly in the next section. Then the leaving variable index $\ell$ and the potential gain on the entering variable $t$ are determined using the \textit{expand} method. Finally, the pivoting is done. The nonbasic matrix $\mbf{A_\N}$ and the objective function coefficients $\mbf{c}$ are updated using the CUBLAS library (with \texttt{cublasDger}, respectively \texttt{cublasDaxpy}). This requires the entering variable column $\mbf{d}$ and the leaving variable row $\mbf{r}$ to be copied and slightly altered. The new value of the variables is computed and, since we use a special structure instead of the basis matrix, the entering and leaving variables are swapped.
 523
 524 % Kernel example ==> steepest edge
 525 \subsubsection*{Choice of the entering variable}
 526 \label{chXXX:sec:impl-simplex-std-inVar}
 527 Let us discuss two different approaches for the selection of the entering variable. The first one relies on the CUBLAS library.
 528 The idea is to split this step in several small operations, starting with the computation of the norms one by one with the \texttt{cublasDnrm2} function.
 529 The score of each variable is then obtained by dividing the cost vector $\mbf{c}$ by the norms previously computed.
 530 The entering variable is finally selected by taking the variable with the biggest steepest-edge coefficient using \texttt{cublasDamax}.
 531
 532 While being easy to implement, such an approach would lead to poor performances. The main problem is a misuse of data parallelism. Each column can be processed independently and so at the same time. Moreover, slicing this step into small operations requires that each temporary results be stored in global memory. This creates a huge amount of slow data transfers between kernels and global memory.
 533
 534 \lstinputlisting[label=chXXX:lst:k1,caption=Kernel for the choice of the entering variable]{Chapters/chapter10/optiSE.cu}
 535
 536 To avoid this, the whole step could be implemented as a unique kernel, as shown in the simplified listing~\ref{chXXX:lst:k1}. Each block evaluates multiple potential entering variables.
 537 The threads of a block process part of the norm of a column.
 538 Then a reduction (see \S~\ref{chXXX:subsec:reduction}) is done inside the block to form the full norm.
 539 The thread having the final value can then compute the score of the current variable and determine if it is the best one encountered in this block.
 540 Once a block has evaluated its variables, the most promising one is stored in global memory.
 541 The final reduction step is finally done on the CPU.
 542
 543 The dimension of the blocks and grid have to be chosen wisely as a function of the problem size.
 544 The block size is directly correlated to the size of a column while the grid size is a trade-off between giving enough work to the scheduler in order to hide the latency and returning as few potential entering variable as possible for the final reduction step.
 545
 546 \subsubsection*{CPU-GPU interactions}
 547 The bulk of the data representing the problem is stored on the GPU.
 548 Only variables required for decision-making operations are updated on the CPU.
 549 \begin{figure}[!h]
 550 \centering
 551 \includegraphics[width=90mm]{Chapters/chapter10/figures/diagSTD2.pdf}
 552 \caption{Communications between CPU and GPU}
 553 \label{chXXX:fig:diagSTD}
 554 \end{figure}
 555 The communications arising from the aforementioned scheme are illustrated in figure~\ref{chXXX:fig:diagSTD}.
 556 The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases.
 557
 558 \subsection{Revised simplex}
 559 \label{chXXX:subsec:impl-simplex-rev}
 560 % Revised simplex
 561 The main steps found in the previous implementation are again present in the revised simplex. The difference comes from the fact that a full update of the matrix $\mbf{N}$ is not necessary at every iteration. As explained in section~\ref{chXXX:subsec:simplex-rev}, the basis matrix $\mbf{B}$ is updated so that the required values such as the entering variable column can be processed from the now \textit{constant} matrix $\mbf{N}$. Due to these changes, the steepest-edge method is adapted. The resulting implementation requires more operations as described in algorithm~\ref{chXXX:algo:simplex-rev}.
 562 % Algorithm
 563 %% \begin{algorithm}
 564 %% \caption{Revised simplex algorithm}
 565 %% \label{chXXX:algo:simplex-rev}
 566 %% \begin{algorithmic}
 567 %%      \STATE \textit{// Find entering variable $x_e,\, e\in \B$}
 568 %%      \STATE $\bs\tau \leftarrow  \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}
 569 %%     \STATE $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}
 570 %%     \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
 571 %%     \STATE \textbf{if} $e < 0$ \textbf{return} \texttt{optima\_found}
 572 %%      \STATE \textit{// Find leaving variable $x_\ell,\, \ell\in \N$}
 573 %%     \STATE $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}
 574 %%       \STATE $\ell, t \leftarrow expand(\mbf{d})$
 575 %%     \STATE \textbf{if} $\ell < 0$ \textbf{return} \texttt{unbounded}
 576 %%      \STATE \textit{// Pivoting - basis update}
 577 %%      \STATE $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}
 578 %%      \STATE $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}
 579 %%      \STATE $\mbf{B^{-1}} \leftarrow  \mbf{E}\, \mbf{B^{-1}}$
 580 %%      %\STATE $basisUpdate(\mbf{B^{-1}}, \mbf{d})$ \textit{// See section ...}
 581 %%      \STATE $x_e \leftarrow x_e + t $
 582 %%      \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $
 583 %%      \STATE $swap(x_\ell,x_e)$
 584 %%      \STATE $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}
 585 %%      \STATE $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$
 586 %%      \STATE $\hat{\gamma_j} \leftarrow  \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$
 587 %% \end{algorithmic}
 588 %% \end{algorithm}
 589
 590
 591 \begin{algorithm}
 592 \caption{Revised simplex algorithm}
 593 \label{chXXX:algo:simplex-rev}
 594         \textit{// Find entering variable $x_e,\, e\in \B$}\;
 595         $\bs\tau \leftarrow  \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}\;
 596        $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}\;
 597     $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
 598     \If { $e < 0$ } {\Return \texttt{optima\_found} }
 599     \textit{// Find leaving variable $x_\ell,\, \ell\in \N$}     \;
 600     $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}\;
 601     $\ell, t \leftarrow expand(\mbf{d})$\;
 602     \If { $\ell < 0$} {\Return \texttt{unbounded}}
 603      \textit{// Pivoting - basis update}\;
 604     $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}\;
 605     $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}\;
 606     $\mbf{B^{-1}} \leftarrow  \mbf{E}\, \mbf{B^{-1}}$\;
 607     $x_e \leftarrow x_e + t $\;
 608     $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $\;
 609     $swap(x_\ell,x_e)$\;
 610     $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}\;
 611     $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$ \;
 612     $\hat{\gamma_j} \leftarrow  \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$\;
 613 \end{algorithm}
 614
 615
 616 Let us compare the complexity, in terms of level 2 and level 3 BLAS operations, of both implementations. The standard one has mainly two costly steps: the selection of the entering variable and the update of the matrix $\mbf{N}$. These are level 2 BLAS operations or an equivalent, and thus the approximate complexity is $\mathcal{O}(2mn)$.
 617
 618 The new implementation proposed has three operations \texttt{cublasDgemv} where the matrix $\mbf{N}$ is involved,
 619 and three others with the matrix $\mbf{B^{-1}}$.
 620 The complexities of these operations are respectively $\mathcal{O}(mn)$ and $\mathcal{O}(m^2)$.
 621 The update of the matrix $\mbf{B}$ is a level 3 BLAS operation costing $\mathcal{O}(m^3)$.
 622 The approximate complexity is then $\mathcal{O}(3mn + 3m^2 + m^3)$.
 623
 624 It appears clearly that, in the current state, the revised simplex implementation is less efficient than the standard one. However this method can be improved and might well be better than the standard one based on two characteristics of LP problems:
 625 their high sparsity and the fact that usually $m \ll n$. In the next sections, we will give details about these improvements.
 626
 627 \subsubsection*{Choice of the entering variable}
 628 Once again, this part of the algorithm can be substantially improved.
 629 Firstly, algorithmic optimizations must be considered.
 630 Since the row $\bs\alpha$ of the leaving variable is processed to update the steepest-edge coefficients, the cost vector $\bs\upsilon$ can be updated directly without using the basis matrix $\mbf{B}$.
 631 This is done as follow
 632 \begin{equation*}
 633 \upsilon_j  = \bar{\upsilon}_j - \bar{\upsilon}_e \alpha_j, \; \forall j \neq e, \qquad \qquad
 634 \upsilon_e  =  -\dfrac{\bar{\upsilon}_e}{\alpha_e}
 635 \end{equation*}
 636 where $\bar{\upsilon_j}$ denotes the value of $\upsilon_j$ at the previous iteration.
 637 It is important to note that all these updated values ($\bs\upsilon, \bs\gamma$) must be processed afresh once in a while to reduce numerical errors.
 638
 639 Including this change, the operations required for the selection of the entering variable $e$ are detailed in algorithm~\ref{chXXX:algo:inVar}. The values related to the entering variable at the previous iteration $\bar{e}=q$ are used. The reduced costs are updated, followed by the steepest-edge coefficients. With these values the entering variable is determined.
 640
 641 %% \begin{algorithm}[!h]
 642 %% \caption{Choice of the entering variable}
 643 %% \label{chXXX:algo:inVar}
 644 %% \begin{algorithmic}
 645 %%      \STATE $q \leftarrow \bar{e}$
 646 %%      \STATE $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$
 647 %%      \STATE $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$
 648 %%      \STATE \textit{// Update of the reduced costs}
 649 %%      \STATE $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$
 650 %%     \STATE $\upsilon_q \leftarrow  \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$
 651 %%      \STATE \textit{// Update of the steepest edge coefficients}
 652 %%      \STATE $\gamma_j \leftarrow  \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $
 653 %%      \STATE $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$
 654 %%      \STATE \textit{// Selection of the entering variable}
 655 %%     \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
 656 %% \end{algorithmic}
 657 %% \end{algorithm}
 658
 659 \begin{algorithm}[!h]
 660 \caption{Choice of the entering variable}
 661 \label{chXXX:algo:inVar}
 662         $q \leftarrow \bar{e}$\;
 663         $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$   \;
 664         $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$\;
 665         \textit{// Update of the reduced costs} \;
 666         $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$\;
 667         $\upsilon_q \leftarrow  \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$ \;
 668         \textit{// Update of the steepest edge coefficients}\;
 669         $\gamma_j \leftarrow  \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $\;
 670         $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$  \;
 671         \textit{// Selection of the entering variable}\;
 672     $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
 673 \end{algorithm}
 674
 675 Coupling these operations into a single kernel is quite beneficial. This lead $\bs\upsilon$ and $\bs\gamma$ to only be loaded once from global memory. Their updated values are stored while the entering variable $e$ is selected. With these changes, the global complexity of this step is reduced from $\mathcal{O}(mn + m^2 + n)$ to $\mathcal{O}(n)$. Moreover the remaining processing is done optimally by reusing data and by overlapping global memory access and computations.
 676
 677 \subsubsection*{Basis update}
 678 % Update operation
 679 The basis update $\mbf{B^{-1}} \leftarrow  \mbf{E} \, \mbf{B^{-1}}$ is a matrix-matrix multiplication. However, due to the special form of the matrix $\mbf{E}$ (see \S~\ref{chXXX:subsec:simplex-rev}), the complexity of this operation can be reduced from $\mathcal{O}(m^3)$ to $\mathcal{O}(m^2)$~\cite{BIELING}. The matrix $\mbf{E}$ is merely the identity matrix having the $\ell$-th column replaced by the vector $\bs\eta$. The update of the matrix $\mbf{B^{-1}}$ can be rewritten as
 680 \begin{equation*}
 681 \hat{B}^{-1}_{ij}  = B^{-1}_{ij}\left(1 - \frac{d_i}{d_\ell}\right)\, , \quad \forall i \neq \ell, \qquad
 682 \qquad \hat{B}^{-1}_{\ell j}  = \frac{B^{-1}_{\ell j}}{d_\ell}
 683 \end{equation*}
 684 As shown in the listing~\ref{chXXX:lst:k2}, each block of the kernel processes a single column while each thread may compute multiple elements of a column.
 685 This organisation ensures that global memory accesses are coalescent since the matrix $\mbf{B}$ is stored column-wise.
 686
 687 \lstinputlisting[label=chXXX:lst:k2,caption=Basis update]{Chapters/chapter10/updateBasis.cu}
 688
 689 % Sparse matrix - vector
 690 \subsubsection*{Sparse matrix operations}
 691 The complexity of the implementation is now $\mathcal{O}(2mn + 3m^2)$ which is close to the one of the standard simplex implementation. The operations where the matrix $\mbf{N}$ is involved remain the more expensive (considering $m \ll n$). However, this matrix is \textit{constant} in the revised simplex allowing the use of sparse matrix-vector multiplication from the CUSPARSE library. On sparse problems, this leads to important gains in performance. The sparse storage of the matrix $\mbf{N}$ reduces significantly the memory space used by the algorithm. All these improvements come at a small cost: columns are no longer directly available in their dense format and must thus be decompressed to their dense representation when needed.
 692
 693 The complexity of the revised simplex implementation is thus finally $\mathcal{O}(2\theta + 3m^2)$ where $\theta$ is a function of $m$, $n$ and the sparsity of the problem. We shall see in section~\ref{chXXX:subsec:implPerf}, what kind of performances we may obtain on various problems.
 694
 695 \subsection{Branch-and-Bound}
 696 \label{chXXX:subsec:impl-bnb}
 697 % Branch and bound on GPU
 698 % Limiting communications
 699 The B\&B algorithm is operated from the CPU.
 700 The simplex implementations, also referred to as LP solver, are used to solve the nodes of the B\&B tree.
 701 Algorithm~\ref{chXXX:algo:bnb} contains the main operations discussed in section~\ref{chXXX:sec:bnb}.
 702 It starts by solving the root node.
 703 The branching strategy is then used to select the next variable to branch on.
 704 From thereon, until no node remains to be solved, the node selection strategy is used to elect the next one to be processed.
 705
 706 The critical factor for coupling the LP solver and the B\&B, is the amount of communications exchanged between them. Since CPU-GPU communications are expensive, the informations required to solve a new node must be minimized. Upon solving a new node, the full transfer of the problem must be avoided. This will be the focus of this section.
 707
 708 % Algorithm
 709 %% \begin{algorithm}
 710 %% \caption{Branch-and-Bound}
 711 %% \label{chXXX:algo:bnb}
 712 %% \begin{algorithmic}
 713 %%     \STATE Solution  $\leftarrow$ Solve(InitialProblem)
 714 %%     \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
 715 %%     \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
 716 %%     \WHILE{NodeList $\neq \emptyset$}
 717 %%     \STATE Node $\leftarrow$ SelectNextNode(NodeList)
 718 %%     \STATE Solution  $\leftarrow$ Solve(Node)
 719 %%     \IF{exists(Solution)}
 720 %%     \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
 721 %%     \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
 722 %%     \ENDIF
 723 %%     \ENDWHILE
 724 %% \end{algorithmic}
 725 %% \end{algorithm}
 726
 727 \begin{algorithm}
 728 \caption{Branch-and-Bound}
 729 \label{chXXX:algo:bnb}
 730     Solution  $\leftarrow$ Solve(InitialProblem)\;
 731     BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
 732     NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
 733     \While{NodeList $\neq \emptyset$} {
 734     Node $\leftarrow$ SelectNextNode(NodeList)\;
 735     Solution  $\leftarrow$ Solve(Node)\;
 736     \If{exists(Solution)}{
 737       BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
 738       NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
 739     }
 740     }
 741 \end{algorithm}
 742
 743 % Algorithmic choice --> Plunging
 744 \subsubsection*{Algorithmic choices}
 745 The first critical choice is to select an efficient tree traversal strategy that also takes advantage of the locality of the problem. At first glance, the best-first search would be a poor choice, since from one node to the next the problem could change substantially. However, when combined with the plunging heuristic, this same strategy becomes really interesting. Indeed, the plunging phase can be completely decoupled from the B\&B. The CPU acting as a decision maker, chooses a promising node and spawns a task that will independently explore the subtree of this node. Once done, this task will report its findings to the decision maker. Moreover, when plunging, the next node to be solved differs only by the new bound on the branching variable. Communications are then minimized and the LP solver is usually able to quickly find the new solution since the problem has only been slightly altered.
 746
 747 % Technical improvement --> warmstart / hotstart
 748 \subsubsection*{Warmstart}
 749 There are two case where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could for example be the one found after solving the root node of the tree.
 750
 751 % Global flow
 752 \subsubsection*{Multi-GPU\index{multi-GPU} exploration}
 753 The organisation, CPU decision maker vs GPU explorer, offers the possibility to use multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The latter assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task.
 754
 755 \section{Performance model}
 756 \label{chXXX:sec:MODEL}
 757 Performance models\index{Performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size.
 758
 759 CUDA kernels require a different kind of modelling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency.
 760
 761 In order to model our implementation, we follow the approach given by K. Kothapalli et al.~\cite{MODEL} (see also~\cite{ELSTER09}). Firstly, we examine the different levels of parallelism on a GPU. Then we determine the amount of work done by a single task. By combining both analysis, we establish the kernel models. The final model then consists of the sum of each kernel.
 762
 763 This model has also been used to model a multi-GPUs version of the standard simplex method~\cite{MEYER2011}.
 764
 765
 766 \subsection{Levels of parallelism}
 767 \label{chXXX:subsec:LVLParallelism}
 768
 769 A kernel can be decomposed into an initialization phase followed by a processing phase.
 770 During the initialization phase the kernel context is setup.
 771 Since this operation does not take a lot of time compared to the processing phase,
 772 it is needless to incorporate it into the model.
 773 The processing phase is more complex and its execution time depends directly on the amount of work it must perform.
 774 We shall now focus on this phase and on the various levels of parallelism on a GPU.
 775
 776 At the first level, the total work is broken down into components, the blocks. They are then dispatched on the available multiprocessors  on the GPU. The execution time of a kernel depends  on the number of blocks $N_B$ per SM (\textit{streaming multiprocessor}) and on the number $N_{SM}$ of SM per GPU.
 777
 778 At a lower level, the work of a block is shared among the various cores of its dedicated SM. This is done by organizing the threads of the block into groups, the warps. A warp is a unit of threads that can be executed in parallel on a SM. The execution time of a block is then linked to the number $N_W$ of warps per block, the number $N_T$ of threads per warp and the number $N_P$ of cores per SM.
 779
 780 The third and lowest level of parallelism is a pipeline. This pipeline enables a pseudo-parallel execution of the tasks forming a warp. The gain produced by this pipeline is expressed by its depth $D$.
 781
 782 \subsection{Amount of work done by a thread}
 783 \label{chXXX:subsec:workTask}
 784
 785 In the previous section, we defined the different levels of parallelism down to the smallest, namely the thread level. We must now estimate how much work is done by a single thread of a kernel in terms of cycles. It depends on the number and the type of instructions. In CUDA, each arithmetic instruction requires a different number of cycles. For example, a single precision add costs 4 cycles while a single precision division costs 36 cycles.
 786
 787 Moreover, since global memory access instructions are non-blocking operations, they must be counted separately from arithmetic instructions. The number of cycles involved in a global memory access, amounts to a 4 cycle instruction (read/write) followed by a non-blocking latency of 400-600 cycles. To correctly estimate the work due to such accesses, one needs to sum only the latency that is not hidden. Two consecutive read instructions executed by a thread, would cost twice the 4 cycles, but only once the latency due to its non-blocking behaviour. Once the amount of cycles involved in these memory accesses has been determined, it is then necessary to take into account the latency hidden by the scheduler (\textit{warp swapping}).
 788
 789 The total work $C_T$ done by a thread can be defined either as the sum or as the maximum of the memory access cycles and the arithmetic instructions cycles. Summing both types of cycles means we consider that latency cannot be used to hide arithmetic instructions. The maximum variant represents the opposite situation where arithmetic instructions and memory accesses are concurrent. Then only the biggest of the two represents the total work of a thread. This could occur for example when the latency is entirely hidden by the pipeline.
 790
 791 It is not trivial to define which scenario is appropriate for each kernel. There are many factors involved: the  dependency on the instructions, the behaviour of the scheduler, the quantity of data to process, and so on. The number of cycles $C_T$ done by a thread can thus be defined by the sum, respectively the maximum, of the global memory access and the arithmetic instructions.
 792 %Moreover, for each kernel the task representing the worst case scenario will be the one modelised.
 793
 794 \subsection{Global performance model}
 795 \label{chXXX:subsec:MODEL}
 796
 797 We now turn to the global performance model for a kernel. We must find out how many threads are run by a core of an SM. Recall that a kernel decomposes into blocks of threads, with each SM running $N_B$ blocks, each block having itself $N_W$ warps consisting of $N_T$ threads. Also recall that an SM has $N_P$ cores which execute batches of threads in a pipeline of depth $D$. Thus the total work $C_{core}$ done by a core is given by
 798 \begin{equation}
 799 \label{chXXX:equ:ccore}
 800 C_{core} = N_B \cdot N_W \cdot N_T \cdot C_T \cdot \dfrac{1}{N_P \cdot D}
 801 \end{equation}
 802 which represents the total work done by a SM divided by the number of cores per SM and by the depth of the pipeline.
 803 Finally, the execution time of a kernel is obtained by dividing $C_{core}$ by the core frequency.
 804
 805 \subsection{A kernel example : \textit{steepest-edge}}
 806 \label{chXXX:subsec:SE}
 807 The selection of the entering variable is done by the \textit{steepest-edge} method as described in section~\ref{chXXX:subsec:impl-simplex-std}.
 808
 809 The processing of a column is done in a single block. Each thread of a block has to compute $N_{el}=\frac{m}{N_W \cdot N_T}$ coefficients of the column. This first computation consists of $N_{el}$ multiplications and additions. The resulting partial sum of squared variables must then be reduced on a single thread of the block. This requires $N_{red} = \log_2(N_W \cdot N_T)$ additions. Since the shared memory is used optimally, there are no added communications. Finally, the coefficient $c_j$ must be divided by the result of the reduction.
 810
 811 Each block of the kernel computes $N_{col}=\frac{n}{N_B \cdot N_{SM}}$ columns, where $N_{SM}$ is the number of SM per GPU. After processing a column, a block keeps only the minimum of its previously computed steepest-edge coefficients. Thus the number of arithmetic instruction cycles for a given thread is given by
 812 \begin{equation}
 813 \label{chXXX:equ:arithmSE}
 814 C_{Arithm} =  N_{col} \cdot (N_{el}  \cdot (C_{add} + C_{mult}) + N_{red} \cdot C_{add} + C_{div} + C_{cmp})
 815 \end{equation}
 816 where $C_{ins}$ denotes the number of cycles to execute instruction $ins$.
 817
 818 Each thread has to load $N_{el}$ variables to compute its partial sum of squared variables. The thread computing the division also loads the coefficient $c_j$. This must be done for the $N_{col}$ columns that a block has to deal with. We must also take into account that the scheduler hides some latency by swapping the warps, so the total latency $C_{latency}$ must be divided by the number of warps $N_W$.
 819 Thus the number of cycles relative to memory accesses is given by:
 820
 821 \begin{equation}
 822 \label{chXXX:equ:latencySE}
 823 C_{Accesses} =  \dfrac{ N_{col} \cdot (N_{el}+1) \cdot C_{latency}} {N_W}
 824 \end{equation}
 825
 826 At the end of the execution of this kernel, each block stores in global memory its local minimum. The CPU will then have to retrieve the $N_B\cdot N_{SM}$ local minimums and reduce them. It is then profitable to minimize the number $N_B$ of blocks per SM. With a maximum of two blocks per SM, the cost of this final operation done by the CPU can be neglected when $m$ and $n$ are big.
 827
 828 It now remains to either maximize or sum $C_{Arithm}$ and $C_{Accesses}$ to obtain~$C_T$. The result of equ.~(\ref{chXXX:equ:ccore}) divided by the core frequency yields the time $T_{KSteepestEdge}$ of the steepest-edge kernel.
 829
 830 \subsection{Standard simplex GPU implementation model}
 831 \label{chXXX:subsec:model-simplex-std}
 832 As seen in section~\ref{chXXX:subsec:impl-simplex-std}, the standard simplex implementation requires only a few communications between the CPU and the GPU. Since each of these communications are constant and small, they will be neglected in the model. For the sake of simplicity, we will consider the second phase of the \textit{two-phase simplex} where we apply iteratively the three main operations: selecting the entering variable, choosing the leaving variable and pivoting. Each of these operations is computed as a kernel. The time of an iteration $T_{Kernels}$ then amounts to the sum of all three kernel times:
 833 \begin{equation}
 834 \label{chXXX:equ:1GPU}
 835 T_{Kernels} = T_{KSteepestEdge} + T_{KExpand} + T_{KPivoting}
 836 \end{equation}
 837 The times $T_{KExpand}$ and $T_{KPivoting}$ for the expand and pivoting kernels are obtained in a similar way as for the steepest-edge kernel described in the previous section.
 838
 839 With the estimated time per iteration $T_{Kernels}$, we can express the total time $T_{prob}$ required for solving a problem as
 840 \begin{equation}
 841 \label{chXXX:equ:tprob}
 842 T_{prob} = T_{init} + r \cdot T_{Kernels}
 843 \end{equation}
 844 where $r$ is the number of iterations.
 845 Note that research by Dantzig~\cite{DANTZIG80} asserts that $r$ is proportional to $m \log_2(n)$.
 846
 847 \section{Measurements and analysis}
 848 \label{chXXX:sec:measurements}
 849 Two sets of measurements are presented here. The first one is a validation of the performance model presented in section~\ref{chXXX:sec:MODEL}. The second is a comparison of the performances of the various implementations of the simplex method detailed in section~\ref{chXXX:sec:impl}.
 850
 851 As a preliminary, we ensured that our implementations were functional. For that matter, we used a set of problems available on the \textit{NETLIB}~\cite{NETBLIB} repository. This dataset usually serves as a benchmark for LP solvers. It consists of a vast variety of real and specific problems for testing the stability and robustness of an implementation. For example, some of these represent real-life models of large refineries, industrial production/allocation or fleet assignments problems. Our implementations are able to solve almost all of these problems.
 852
 853 % Test 1 : Tesla S1070 =>  Performance model + custom problems
 854 \subsection{Performance model validation}
 855 \label{chXXX:subsec:modelVal}
 856
 857 In order to check that our performance model for the standard simplex implementation is correct, a large range of problems of varying size is needed. As none of the existing datasets offered the desired diversity of problems, we used a problem generator. It is then possible to create problems of given size and density. Since usual LP problems have more variables than equations, our generated problems have a ratio of 5 variables for 1 equation ($n=5 \cdot m$).
 858 %All these problems fall into the same category. Hence, generated problems of the same size will be solved within approximately the same number of iterations and the optimum of the objective function will only slightly fluctuate.
 859
 860 The test environment is composed of a CPU server (2 Intel Xeon X5570, 2.93GHz, with 24GB DDR3 RAM)
 861 and a GPU computing system (NVIDIA tesla S1070) with the 3.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 4GB GDDR3 graphics memory, 30 streaming multiprocessors, each holding 8 cores (1.4GHz).
 862 %Such a GPU offers at peak performance up to 1TFLOPS for single precision floating point, 80GFLOPS for double precision floating point and 100GB/s memory bandwidth between threads (registers) and global memory.
 863
 864 We validated our performance model by showing that it accurately fits our measurements.
 865 The correlation between the measurements and the model is above $0.999$ (see fig.~\ref{chXXX:fig:fitting}).
 866 \begin{figure}[!h]
 867 \centering
 868 \includegraphics[width=100mm]{Chapters/chapter10/figures/Model.pdf}
 869 \caption{Performance model and measurements comparison}
 870 \label{chXXX:fig:fitting}
 871 \end{figure}
 872
 873 % Test 2 : Tesla M2090 => Performance comparison + NETLIB problems
 874 \subsection{Performance comparison of implementations}
 875 \label{chXXX:subsec:implPerf}
 876 Four different implementations are compared in this section. We will refer to each of them according to the terminology introduced below.
 877 \begin{itemize}
 878  \item Standard simplex method improved ($\mathcal{O}(2mn)$):  \texttt{Standard}
 879  \item Revised simplex method using basis kernel
 880  \begin{itemize}
 881   \item without improvements ($\mathcal{O}(3mn + 4m^2)$): \texttt{Revised-base}
 882   \item optimized ($\mathcal{O}(2mn + 3m^2)$): \texttt{Revised-opti}
 883   \item optimized with sparse matrix-vector multiplication ($\mathcal{O}(2\theta+3m^2)$): \texttt{Revised-sparse}
 884  \end{itemize}
 885 \end{itemize}
 886 These implementations all use the \textit{steepest-edge} and \textit{expand} methods for the choice of, respectively, the entering and the leaving variables.
 887
 888 We used problems from the \textit{NETLIB} repository to illustrate the improvements discussed in section~\ref{chXXX:sec:impl}. The results expected from the first three methods are quite clear. The \texttt{Standard} method should be the best, followed by the \texttt{Revised-opti} and then the \texttt{Revised-base}. However, the performance of the \texttt{Revised-sparse} implementation remains unclear since the value of $\theta$ is unknown and depends on the density and size of the constraints matrix. This is the main question we shall try to answer with our experiments.
 889
 890 The test environnement is composed of a CPU server (2 Intel Xeon X5650, 2.67GHz, with 96GB DDR3 RAM) and a GPU computing system (NVIDIA tesla M2090) with the 4.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 6GB GDDR5 graphics memory and 512 cores (1.3GHz).
 891
 892 \begin{table}[!h]
 893 \begin{center}
 894 \begin{tabular}{|l|r|r|r|r|}
 895 \hline
 896 \multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 897 etamacro.mps & 401 & 688 & 2489 & 1.7 \\ \hline
 898 fffff800.mps & 525 & 854 & 6235 & 1.6\\ \hline
 899 finnis.mps & 498 & 614 & 2714 & 1.2\\ \hline
 900 gfrd-pnc.mps & 617 & 1092 & 3467 & 1.8\\ \hline
 901 grow15.mps & 301 & 645 & 5665 & 2.1\\ \hline
 902 grow22.mps & 441 & 946 & 8318 & 2.1\\ \hline
 903 scagr25.mps & 472 & 500 & 2029 & 1.1\\ \hline
 904 \end{tabular}
 905 \caption{NETLIB problems solved in less than 1 second}
 906 \label{chXXX:tab:small}
 907 \end{center}
 908 \end{table}
 909
 910 Let us begin by discussing the performances on problems solved in less than one second. The name, size, number of non-zero elements (NNZ) and columns to rows ratio (C-to-R) of each problems are reported in table~\ref{chXXX:tab:small}. The performances shown in figure~\ref{chXXX:fig:FastSolve} corroborate our previous observations. On these problems the \texttt{Revised-sparse} method doesn't outperform the \texttt{Standard} one. This can be explained by two factors: the added communications (kernel calls) for the revised method, and the small size and density of the problems. It is likely that sparse operations on a GPU require larger amounts of data to become more efficient than their dense counterpart.
 911 \begin{figure}[!h]
 912 \centering
 913 \includegraphics[width=100mm]{Chapters/chapter10/figures/Small2.pdf}
 914 \caption{Time required to solve problems of table~\ref{chXXX:tab:small}}
 915 \label{chXXX:fig:FastSolve}
 916 \end{figure}
 917
 918 The problems shown in table~\ref{chXXX:tab:medium} are solved in less than 4 seconds. As we can see in figure~\ref{chXXX:fig:NormalSolve}, the expected trend for the \texttt{Revised-base} and the \texttt{Revised-opti} methods is now confirmed. Let us presently focus on the \texttt{Standard} and \texttt{Revised-sparse} methods. Some of the problems, in particular \textit{czprob.mps} and \textit{nesm.mps}, are solved in a comparable amount of time. The performance gain of the \texttt{Revised-sparse} is related to the C-to-R (columns to rows) ratio of these problems, displaying respectively a 3.8 and a 4.4 ratio.
 919
 920 \begin{table}[!h]
 921 \begin{center}
 922 \begin{tabular}{|l|r|r|r|r|}
 923 \hline
 924 \multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 925 25fv47.mps & 822 & 1571 & 11127 & 1.9\\ \hline
 926 bnl1.mps & 644 & 1175 & 6129 & 1.8\\ \hline
 927 cycle.mps & 1904 & 2857 & 21322 & 1.5\\ \hline
 928 czprob.mps & 930 & 3523 & 14173 & 3.8\\ \hline
 929 ganges.mps & 1310 & 1681 & 7021 & 1.2\\ \hline
 930 nesm.mps & 663 & 2923 & 13988 & 4.4\\ \hline
 931 maros.mps & 847 & 1443 & 10006 & 1.7\\ \hline
 932 perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline
 933 \end{tabular}
 934 \caption{NETLIB problems solved in the range of 1 to 4 seconds}
 935 \label{chXXX:tab:medium}
 936 \end{center}
 937 \end{table}
 938
 939 \begin{figure}[!h]
 940 \centering
 941 \includegraphics[width=100mm]{Chapters/chapter10/figures/Med2.pdf}
 942 \caption{Time required to solve problems of table~\ref{chXXX:tab:medium}}
 943 \label{chXXX:fig:NormalSolve}
 944 \end{figure}
 945
 946 Finally, the biggest problems, and slowest to solve, are given in table~\ref{chXXX:tab:big}. A new tendency can be observed in figure~\ref{chXXX:fig:SlowSolve}. The \texttt{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 like \textit{bnl2.mps}, \textit{pilot.mps} or \textit{greenbeb.mps}. However, when this ratio is greater, the \texttt{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on \textit{80bau3b.mps} (4.5) and \textit{fit2p.mps} (4.3). Although the C-to-R ratio of \textit{d6cube.mps} (14.9) exceeds the ones previously mentioned, the \texttt{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which does thus not fully benefit from the lower complexity of sparse operations.
 947
 948 \begin{table}[!h]
 949 \begin{center}
 950 \begin{tabular}{|l|r|r|r|r|}
 951 \hline
 952 \multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 953 80bau3b.mps & 2263 & 9799 & 29063 & 4.3\\ \hline
 954 bnl2.mps & 2325 & 3489 & 16124 & 1.5\\ \hline
 955 d2q06c.mps & 2172 & 5167 & 35674 & 2.4\\ \hline
 956 d6cube.mps & 416 & 6184 & 43888 & 14.9\\ \hline
 957 fit2p.mps & 3001 & 13525 & 60784 & 4.5\\ \hline
 958 greenbeb.mps & 2393 & 5405 & 31499 & 2.3 \\ \hline
 959 maros-r7.mps & 3137 & 9408 & 151120 & 3.0\\ \hline
 960 pilot.mps & 1442 & 3652 & 43220 & 2.5 \\ \hline
 961 pilot87.mps & 2031 & 4883 & 73804 & 2.4\\ \hline
 962 \end{tabular}
 963 \caption{NETLIB problems solved in more than 5 seconds}
 964 \label{chXXX:tab:big}
 965 \end{center}
 966 \end{table}
 967
 968 \begin{figure}[!h]
 969 \centering
 970 \includegraphics[width=100mm]{Chapters/chapter10/figures/Big2.pdf}
 971 \caption{Time required to solve problems of table \ref{chXXX:tab:big}}
 972 \label{chXXX:fig:SlowSolve}
 973 \end{figure}
 974
 975 \section{Conclusion and perspectives}
 976 \label{chXXX:sec:persp}
 977 In this chapter, we tried to present the knowledge and understanding necessary in our view to produce an efficient integer linear programming solver on a GPU. We proposed various solutions to implement the standard and revised simplex. We discussed the main concepts behind a Branch-and-Bound algorithm and pointed out some major concerns when it is coupled with a GPU solver. The results obtained with the standard simplex implementation allowed us to validate a detailed performance model we had proposed. Finally, we used problems from the \textit{NETLIB} library to compare the performances of our various implementations. The revised simplex implementation with sparse matrix operations, showed the best performances on time consuming problems, while the standard simplex one was more competitive on easier problems. However, sequential open-source solvers like \textit{CLP} of the \textit{COIN-OR} project still outperform such GPU implementations.
 978
 979 We shall now discuss some remaining potential improvements. A first step towards performance would be to consider the dual revised simplex~\cite{LEMKE54}. While being similar to the methods treated in this chapter, it has the capacity to greatly reduce the time needed to solve problems.
 980 Yet, the main improvement would be to tackle larger problems than the ones considered here. Indeed, problems having hundreds of thousands of variables would technically be solvable on GPU devices with 2GB to 4GB of global memory. Moreover, such amounts of data would fully benefit from the potential of these devices. However, solving problems of this size raises an issue not addressed here: numerical instabilities. This phenomenon is due to the limited precision of mathematical operations on computing devices.
 981 % Numerical innacuracies/ Example + problem
 982 Let us illustrate this problem on the inverse $\mbf{B^{-1}}$ of the basis matrix $\mbf{B}$. At each iteration of the revised simplex, $\mbf{B^{-1}}$ is updated from its previous values. Performing this update adds a small error to the result. Unfortunately, these errors accumulate at each iteration, bringing at some point the $\mbf{B^{-1}}$ matrix into a degenerate state. To avoid this situation, the matrix $\mbf{B^{-1}}$ must be recomputed afresh once in a while by inverting the matrix $\mbf{B}$.
 983
 984 % LU decomposition
 985 Instead of directly processing the inverse of the matrix $\mbf{B}$, it is more common to use some specific arithmetical treatment. Most of the simplex implementations use the LU decomposition~\cite{BRAYTON70}.
 986 Since the matrix $\mbf{B}$ is sparse, the sparse version of this decomposition can be used, and the corresponding update performed for $\mbf{B^{-1}}$ at each iteration~\cite{BARTELS69}. The sparse LU decomposition on CPU has been recently subject to a large amount of research, yielding many improvements. Once its GPU counterpart will be available, this might result in improved and competitive simplex implementations on GPU.
 987
 988 \putbib[Chapters/chapter10/biblio]
 989
 990
 991
 992
 993
 994
 995