BookGPU/Chapters/chapter10/ch10.tex

   1 \chapterauthor{Xavier Meyer and Bastien Chopard}{Department of Computer Science, University of Geneva, Switzerland}
   2 \chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, HEPIA, \\ University of Applied Sciences of Western Switzerland -- Geneva, Switzerland}
   3 %\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
   4
   5 %\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm}
   6 \chapter{Linear Programming on a GPU: A~Case~Study}
   7 \section{Introduction}
   8 \label{chXXX:sec:intro}
   9 The simplex method~\cite{VCLP} is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947. A more efficient method, named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power.
  10 %Indeed, current implementations of the revised simplex strive to produce the expected results, if any.
  11 In this context, parallelization is the natural idea to investigate~\cite{HALL}. Already in 1996, Thomadakis and Liu~\cite{THOMAD} implemented the standard method on a massive parallel computer and obtained an increase in performances when solving dense or large problems.
  12
  13 A few years back, in order to meet the demand for processing power, graphics card vendors made their graphical processing units (GPU) available for general-purpose computing. Since then GPUs have gained a lot of popularity as they offer an opportunity to accelerate algorithms having an architecture well-adapted to the GPU model. The simplex method falls into this category. Indeed, GPUs exhibit a massive parallel architecture optimized for matrix processing. To our knowledge, there are only a few simplex implementations on GPU. Bieling et al.~\cite{BIELING} presented encouraging results while solving small to mid-sized LP problems with the revised simplex. However, the complexity of their algorithm seems to be rather close to the one of the standard simplex with similar heuristics. Following the steps of this first work, an implementation of the revised simplex~\cite{LALAMI11} showed interesting results on dense and square matrices. Furthermore, an implementation of the interior point method~\cite{JUNG08} outperformed its CPU equivalent on mid-sized problems.
  14
  15 The Branch-and-Bound (B\&B) algorithm is extensively used for solving integer linear programming (ILP) problems.
  16 This tree-based exploration method subdivides the feasible region of the relaxed LP model into successively smaller subsets to obtain bounds on the objective value.
  17 Each corresponding submodel can be solved with the simplex method, and the bounds computed determine whether further branching is required.
  18 Hence, a natural parallelization of the B\&B method is to let the CPU manage the B\&B tree and dispatch the relaxed submodels to an LP solver on a GPU. We refer the reader to Chapter~\ref{ch8:GPU-accelerated-tree-based-exact-optimization-methods}, and the references therein, for a more complete introduction to parallel B\&B algorithms.
  19
  20 In this chapter, we present a GPU implementation of the standard and revised simplex methods, based on the CUDA technology of NVIDIA. We also derive a performance model and establish its accurateness. Let us mention that there are many technicalities involved in CUDA programming, in particular regarding the management of tasks and memories on the GPU. Thus, fine tuning is indispensable to avoid a breakdown on performance.
  21 %With respect to the simplex algorithm, special attention must be given to numerical stability.
  22
  23 The chapter is organized as follows. First, we begin with a description of the standard and revised simplex methods
  24 and introduce the heuristics used in our implementations. This is followed by a presentation of the B\&B algorithm.
  25 The next section points out CUDA aspects which are important for our implementations.
  26 In the following one, we focus on the GPU implementations of the simplex method as well as the B\&B algorithm.
  27 In the sixth section, we describe a performance model for the standard simplex implementation.
  28 In the seventh section, a performance comparison between our implementations is made on real-life problems and an analysis is given.
  29 Finally, we summarize the results obtained and consider new perspectives.
  30
  31 \section{Simplex algorithm}
  32 \label{chXXX:sec:simplex-algo}
  33 \subsection{Linear programming model\index{linear programming}}
  34 \label{chXXX:subsec:lp-model}
  35 %% Mathematics of LP
  36 An LP model in its canonical form can be expressed as the following optimization problem:
  37 \begin{equation}
  38 \label{chXXX:equ:canon-form}
  39 %\centering
  40 \begin{array}{lcl}
  41  \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
  42  \text{subject to} &\hspace{10pt} &\mbf{Ax \leq b} \hspace{30pt}\\
  43                             &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
  44 \end{array}
  45 \end{equation}
  46 where $\mbf{x}=(x_j),\mbf{c}=(c_j) \in \mathbb{R}^n,\mbf{b}=(b_i) \in \mathbb{R}^m$, and $\mbf{A}=(a_{ij})$ is the $m \times n$ constraints matrix.
  47 The objective function $z = \mbf{c^T x}$ is the inner product of the cost vector $\mbf{c}$ and the unknown variables $\mbf{x}$.
  48 An element $\mbf{x}$ is called a solution which is said to be feasible
  49 if it satisfies the $m$ linear constraints imposed by $\mbf{A}$ and the bound $\mbf{b}$.
  50 The feasible region $\{\mbf{x}\in \mathbb{R}^n\, |\, \mbf{Ax \leq b},\, \mbf{x \geq 0}\}$ is a convex polytope.
  51 An optimal solution to the LP problem will therefore reside on a vertex of this polytope.
  52
  53 \subsection{Standard simplex algorithm}
  54 \label{chXXX:subsec:simplex-std}
  55 % Simple explanation of the simple algorithm : 3 operations
  56 The simplex method\index{simplex!standard method}~\cite{VCLP} is an algorithm for solving LP models.
  57 It proceeds by iteratively visiting vertices on the boundary of the feasible region.
  58 This amounts to performing algebraic manipulations on the system of linear equations.
  59
  60 We begin by reformulating the model.
  61 So-called \textit{slack variables} $x_{n+i}$ are added to the canonical form in order to replace inequalities by equalities in equation~(\ref{chXXX:equ:canon-form}):
  62 \begin{equation}
  63 \label{chXXX:equ:augm-form}
  64 %\centering
  65 \begin{array}{cccccccc}
  66 x_{n+i} & = &  b_i - & \displaystyle\sum\limits^{n}_{j=1} & a_{ij}x_j & & (i={1,2,...,m})\\
  67 \end{array}
  68 \end{equation}
  69 The resulting problem is called the \textit{augmented form} in which the variables are divided into two disjoint index sets,
  70 $\B$ and $\N$, which correspond to the basic and nonbasic variables.
  71 Basic variables, which form the basis of the problem, are on the left-hand side of equation~(\ref{chXXX:equ:augm-form}),
  72 while nonbasic variables, which form the core of the equations, appear on the right-hand~side.
  73 We can thus consider the following LP form:
  74 \begin{equation}
  75 %\centering
  76 \begin{array}{lcl}
  77  \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
  78  \text{subject to} &\hspace{10pt} &\mbf{Ax = b} \hspace{30pt}\\
  79                             &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
  80 \end{array}
  81 \end{equation}
  82 where $\mbf{x} \in \mathbb{R}^{n+m}$ and $\mbf{A}$ is now the $m\times (n+m)$ matrix obtained by concatenating the constraints matrix with the $m\times m$ identity matrix $\mbf{I_m}$.
  83 The cost vector has been padded with zeros so that $\mbf{c} \in \mathbb{R}^{n+m}$.
  84
  85 The basic and nonbasic variables can be separated given the expression $\mbf{A_\N x_\N + x_\B = b}$.
  86 Similarly, $z = \mbf{c^T x = c_\N^T x_\N + c_\B^T x_\B}$ with $\mbf{c_\B = 0}$.
  87 By definition, a basic solution is obtained by assigning null values to all nonbasic variables ($\mbf{x_\N=0}$).
  88 Hence, $\mbf{x=x_\B = b}$ is a basic solution.
  89
  90 The simplex algorithm searches for the optimal solution through an iterative process.
  91 For the sake of simplicity, we will assume here that $\mbf{b>0}$, that is, the origin belongs to the feasible region.
  92 Otherwise a preliminary treatment is required to generate a feasible initial solution (see~Section~\ref{chXXX:sec:heuristics}).
  93 A typical iteration then consists of three operations (summarized in Algorithm~\ref{chXXX:algo:simplex-std}).
  94 \begin{enumerate}
  95 %%%%%%%%%%%%%%%%%
  96 \item \textbf{Choosing the entering variable.}
  97 The entering variable is a nonbasic variable whose increase will lead to an increase in the value of the objective function $z$.
  98 This variable must be selected with care so as to yield a substantial leap towards the optimal solution.
  99 The standard way of making this choice is to select the variable $x_e$ with the largest positive coefficient
 100 $c_e = \max\{c_j>0\, |\, j\in \N\}$ in the objective function $z$.
 101 However, other strategies, such as choosing the positive coefficient with the smallest index, also prove to be useful.
 102 %%%%%%%%%%%%%%%%%
 103 \item \textbf{Choosing the leaving variable.}
 104 This variable is the basic variable which first violates its constraint as the entering variable $x_e$ increases.
 105 The choice of the leaving variable must guarantee that the solution remains feasible.
 106 More precisely, setting all nonbasic variables except $x_e$ to zero, $x_e$ is bounded by
 107 $$t=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$$
 108 If $t=+\infty$, the LP problem is unbounded. \\
 109 Otherwise, $t=\dfrac{b_\ell}{a_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable; \\
 110 the element $a_{\ell e}$ is called the pivot.
 111 %%%%%%%%%%%%%%%%%
 112 \item \textbf{Pivoting.}
 113 Once both variables are defined, the pivoting operation switches these variables from one set to the other:
 114 the entering variable enters the basis $\mathcal{B}$, taking the place of the leaving variable which now belongs to $\mathcal{N}$,
 115 namely, $\mathcal{B} \leftarrow (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$ and $\mathcal{N} \leftarrow (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$.
 116 Correspondingly, the columns with index $\ell$ and $e$ are exchanged between $\mbf{I_m}$ and $\mbf{A_\N}$, and similarly for $\mbf{c_\B=0}$ and $\mbf{c_\N}$.
 117 We then update the constraints matrix $\mbf{A}$, the bound $\mbf{b}$ and the cost $\mbf{c}$ using Gaussian elimination.
 118 More precisely, denoting $\mbf{\tilde{I}_m}$, $\mbf{\tilde{A}_\N}$, $\mbf{\tilde{c}_\B}$, and $\mbf{\tilde{c}_\N}$ as the resulting elements after the exchange, Gaussian elimination then transforms the tableau
 119
 120 \begin{center}
 121 \begin{tabular}{|c|c||c|}
 122 $\mbf{\tilde{A}_\N}$ & $\mbf{\tilde{I}_m}$ & $\mbf{b}$ \\
 123 \hline
 124 $\mbf{\tilde{c}^T_\N}$ & $\mbf{\tilde{c}^T_\B}$ & $z$
 125 \end{tabular}
 126 \end{center}
 127 \noindent
 128 into a tableau with updated values for $\mbf{A_\N}$, $\mbf{c_\N}$, and $\mbf{b}$
 129 \begin{center}
 130 \begin{tabular}{|c|c||c|}
 131 $\mbf{A_\N}$ & $\mbf{I_m}$ & $\mbf{b}$ \\
 132 \hline
 133 $\mbf{c^T_\N}$ & $\mbf{c^T_\B}$ & $z-tc_e$
 134 \end{tabular}
 135 \end{center}
 136 The latter is obtained by first dividing the $\ell$-th row by $a_{\ell e}$;
 137 the resulting row multiplied by $a_{ie}$ ($i\not=\ell$) is then subtracted to the $i^\mathrm{th}$ row; the same operation is performed using $c_e$ and the last row.
 138 Hence, $\mbf{c_\B=0}$.
 139 These operations amount to jumping from the current vertex to an adjacent vertex with objective value $z=tc_e$.
 140 \end{enumerate}
 141 % End condition
 142 The algorithm ends when no more entering variable can be found, that is, when~$\mbf{c_\N \le 0}$.
 143
 144 %% \begin{algorithm}
 145 %% \caption{Standard simplex algorithm}
 146 %% \label{chXXX:algo:simplex-std}
 147 %% \begin{algorithmic}
 148 %%      \STATE \texttt{//1. Find entering variable}
 149 %%      \IF{$\mbf{c_\N \le 0}$}
 150 %%      \RETURN Optimal
 151 %%    \ENDIF
 152 %%    \STATE Choose an index $e\in \N$ such that $c_e>0$
 153 %%      \STATE \texttt{//2. Find leaving variable}
 154 %%      \IF{$(\mbf{A_\N})_e \mbf{\le 0}$}
 155 %%      \RETURN Unbounded
 156 %%      \ENDIF
 157 %%      \STATE Let $\ell \in \B$ be the index such that \\
 158 %%      \hspace*{10mm} $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$
 159 %%      \STATE \texttt{//3. Pivoting - update }
 160 %%    \STATE $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$
 161 %%      \STATE Compute $z_{best}:=z_{best}+t c_e$
 162 %%    \STATE Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$
 163 %%      \STATE Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
 164 %%    \STATE Go to 1.
 165 %% \end{algorithmic}
 166 %% \end{algorithm}
 167
 168 \begin{algorithm}
 169 %\SetNlSty{}{}{:}
 170     \textit{//1. Find entering variable}\;
 171      \If{$\mbf{c_\N \le 0}$} {
 172         \Return Optimal\;
 173    }
 174    Choose an index $e\in \N$ such that $c_e>0$ \;
 175    \textit{//2. Find leaving variable}\;
 176         \If{$(\mbf{A_\N})_e \mbf{\le 0}$} {
 177         \Return Unbounded
 178         }
 179         Let $\ell \in \B$ be the index such that \;
 180          $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$\;
 181   \textit{//3. Pivoting, update }\;
 182    $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$\;
 183  Compute $z_{best}:=z_{best}+t c_e$\;
 184     Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $(\mbf{c_\B})_\ell$ and $(\mbf{c_\N})_e$\;
 185 Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$\;
 186     Go to 1.\;
 187 \caption{standard simplex algorithm}
 188 \label{chXXX:algo:simplex-std}
 189
 190 \end{algorithm}
 191
 192
 193 \subsection{Revised simplex method}
 194 \label{chXXX:subsec:simplex-rev}
 195 % Global revised simplex idea
 196 The operation that takes the most time in the standard method is the pivoting operation, and more specifically, the update of the constraints matrix~$\mbf{A}$. The revised method\index{simplex!revised method} tries to avoid this costly operation by updating only a smaller part of this matrix.
 197
 198 %% Revised simplex method operations, useful for the algorithm
 199 The revised simplex method uses the same operations as in the standard method to choose the entering and leaving variables.
 200 However, since the constraints matrix $\mbf{A}$ need not be fully updated, some additional reformulation is required.
 201
 202 At any stage of the algorithm, basic and nonbasic variables can be separated according to $\mbf{Ax = A_\N x_\N + A_\B x_\B = b}$.
 203 We can then transform the system $\mbf{Ax=b}$ into $\mbf{A_\B x_\B = b-A_\N x_\N}$.
 204 Let us denote $\mbf{B=A_\B}, \mbf{N=A_\N}$.
 205 Since $\mbf{B}$ is invertible, we can write
 206 \begin{align*}
 207 \mbf{x}_\B&=\mbf{B^{-1}b-B^{-1}Nx_\N} \\
 208 %\hline
 209 z&=\mbf{c_\B^T B^{-1}b+(c_\N^T-c_\B^T B^{-1}N)x_\N}
 210 \end{align*}
 211 The vector $\mbf{c_\N^T-c_\B^T B^{-1}N}$ is called the reduced cost vector.
 212
 213 The choice of the leaving variable can be rewritten.
 214 Setting all nonbasic variables except the entering variable $x_e$ to zero, $x_e$ is then bounded by
 215 $$t=\min\left\{\dfrac{(\mbf{B^{-1}b})_i}{(\mbf{B^{-1}N})_{ie}}\, \bigg|\, (\mbf{B^{-1}N})_{ie}>0,\, i=1,\dots,m \right\}$$
 216 If $t=+\infty$, the LP problem is unbounded. Otherwise, $t=\frac{(\mbf{B^{-1}b})_\ell}{(\mbf{B^{-1}N})_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable.
 217
 218 Recall that the pivoting operation begins by exchanging columns with index $\ell$ and $e$ between $\mbf{B}$ and $\mbf{N}$ and similarly for $\mbf{c_\B}$ and $\mbf{c_\N}$.
 219
 220 To emphasize the difference between standard and revised simplex, we first express the updating for the standard simplex. This amounts to
 221 \begin{align*}
 222 [\mbf{B} \quad \mbf{N} \quad \mbf{b}] & \leftarrow [\mbf{I_m} \quad \mbf{B^{-1}N} \quad \mbf{B^{-1}b}] \\
 223 [\mbf{c_\B^T} \quad \mbf{c_\N^T}] & \leftarrow [\mbf{0} \quad \mbf{c_\N^T-c_\B^T B^{-1}N}]
 224 \end{align*}
 225 which moves the current vertex to an adjacent vertex $\mbf{x=B^{-1}b}$ with objective value $z=\mbf{c_\B^T B^{-1}b}$ (the end condition remains $\mbf{c_\N \le 0}$). Many computations performed in this update phase can in fact be avoided.
 226
 227 The main point of the revised simplex method is that the only values which really need to be computed at each step are $\mbf{B^{-1}}$, $\mbf{B^{-1}b}$, $\mbf{B^{-1}N}_{e}$, $\mbf{c_\B^T B^{-1}}$, and $\mbf{c_\B^T B^{-1}b}$. However, matrix inversion is a time-consuming operation (of cubic order for Gaussian elimination).
 228 Fortunately, there are efficient ways of computing an update for $\mbf{B^{-1}}$.
 229 One way is to take advantage of the sparsity of the LP problem by using the so-called LU decomposition\footnote{The LU decomposition is a linear algebra decomposition which allows to write a matrix as a product of a lower and an upper triangular matrix.} for sparse matrices~\cite{BARTELS69}.
 230 This decomposition may be updated at a small cost at each iteration~\cite{SUHL93}.
 231 Another way is to use the \textit{product form of the inverse}~\cite{dantzig1953product}, which we describe hereafter.
 232
 233 Let $\mbf{b}_1,\dots,\mbf{b}_m$ be the columns of $\mbf{B}$, $\mbf{v}\in \mathbb{R}^m$,
 234 and $\mbf{a}=\mbf{Bv}=\sum_{i=1}^m v_i\mbf{b}_i$.
 235 Denote $\mbf{B_a}=(\mbf{b}_1,\dots,\mbf{b}_{p-1},\mbf{a},\mbf{b}_{p+1},\dots,\mbf{b}_m)$ for a given $1\le p\le m$ such that $v_p\not= 0$.
 236 We want to compute $\mbf{B_a}^{-1}$.
 237 We first write $$\mbf{b}_p=\dfrac{1}{v_p}+\mbf{a}\sum_{i\not=p} \dfrac{-v_i}{v_p}\mbf{b}_i$$
 238 Define $$\bs\eta = \left(\dfrac{-v_1}{v_p},\dots,\dfrac{-v_{p-1}}{v_p},\dfrac{1}{v_p},\dfrac{-v_{p+1}}{v_p},\dots,\dfrac{-v_m}{v_p}\right)^T$$
 239 and $$\mbf{E}=(\bs\varepsilon_1,\dots,\bs\varepsilon_{p-1},\bs\eta,\bs\varepsilon_{p+1},\dots,\bs\varepsilon_m)$$
 240 where $\bs\varepsilon_j = (0,\dots,0,1,0,\dots,0)^T$ is the $j^\mathrm{th}$ element of the canonical basis of~$\mathbb{R}^m$.
 241 Then $\mbf{B_a E=B}$, so $\mbf{B_a}^{-1}=\mbf{E\; B}^{-1}$.
 242
 243 We apply these preliminary considerations to the simplex algorithm
 244 with $\mbf{a=N}_e, \mbf{v=(B^{-1}N)}_e$ (recall that $x_e$ is the entering variable).
 245 If initially $\mbf{B}$ is the identity matrix $\mbf{I_m}$, at the $k^\mathrm{th}$ iteration of the algorithm,
 246 the inverse matrix is given by $\mbf{B}^{-1}=\mbf{E}_k \mbf{E}_{k-1}\cdots \mbf{E}_2 \mbf{E}_1$,
 247 where $\mbf{E}_i$ is the matrix constructed at the $i^\mathrm{th}$ iteration.
 248 %Note that matrices like $\mbf{E}$ are sparse and can therefore be stored compactly.
 249
 250 \subsection{Heuristics and improvements}
 251 \label{chXXX:sec:heuristics}
 252 Specific heuristics or methods are needed to improve the performance and stability of the simplex algorithm.
 253 We will explain below how we find an initial feasible solution and how we choose the entering and leaving variables.
 254
 255 \subsubsection*{Finding an initial feasible solution}
 256 Our implementations use the \textit{two-phase simplex}~\cite{VCLP}.
 257 The first phase aims at finding a feasible solution required by the second phase to solve the original problem.
 258 If the origin $\mbf{x=0}$ is not a feasible solution, we proceed to find such a solution by solving a so-called \textit{auxiliary problem} with the simplex algorithm.
 259 This can be achieved by adding a nonnegative artificial variable to each constraint equation corresponding to a basic variable which violates its nonnegativity condition,
 260 before minimizing the sum of these artificial variables. The algorithm will thus try to drive all artificial variables towards zero.
 261 If it succeeds, then a basic feasible solution is available as an initial solution for the second phase
 262 in which it attempts to find an optimal solution to the original problem. Otherwise, the problem is infeasible.
 263
 264 To avoid having to introduce these additional auxiliary variables, we use an alternate version of this procedure.
 265 For basic variables with index in $\mathcal{I}=\{j\in~\B \,\,|\,\, x_j<0 \}$, we temporarily relax the nonnegativity condition in order to have a feasible problem
 266 and then apply the simplex algorithm to minimize $w=-\sum_{j\in \mathcal{I}} x_j$.
 267 However, we update $\mathcal{I}$ at each iteration and modify accordingly the objective function,
 268 whose role is to drive these infeasible basic variables towards their original bound.
 269 If at some stage $\mathcal{I}=\emptyset$, we end up with an initial feasible solution.
 270 Otherwise, the algorithm terminates with $\mathcal{I}\not=\emptyset$ and $w>0$, which indicates that the original problem is infeasible.
 271 The alteration introduced above involves more computations during the first phase,
 272 but it offers the advantage of preserving the problem size and making good use of the GPU processing power.
 273
 274 \subsubsection*{Choice of the entering variable}
 275 The number of iterations required to solve a problem depends on the method used to select the entering variable. The one described in Section~\ref{chXXX:subsec:simplex-std} chooses the most promising variable $x_e$ in terms of cost. While being inexpensive to compute, this method can lead to an important number of iterations before the best solution is found.
 276
 277 There exist various heuristics to select this variable $x_e$.
 278 One of the most commonly used is the \textit{steepest-edge} method~\cite{GOLDFRAB77}.
 279 To improve the speed at which the best solution is found, this method takes into account the coefficients of $\mbf{B}^{-1}\mbf{N}$.
 280 This can be explained from the geometrical point of view.
 281 The constraints $\mbf{Ax=b}$ form the hull of a convex polytope.
 282 The simplex algorithm moves from one vertex (i.e.,~a~solution) to another while trying to improve the objective function.
 283 The steepest-edge method searches for the edge along which the rate of improvement of the objective function is the best.
 284 The entering variable $x_e$ is then determined by
 285 $$e = \argmax \left\{ \dfrac{c_j}{\sqrt{\gamma_j}} \, \bigg| \, c_j > 0 , j \in \N \right\}$$
 286 with $\gamma_j = \Vert \mbf{B}^{-1}\mbf{N}_j \Vert^2$.
 287
 288 This method is quite costly to compute but it reduces significantly the number of iterations required to solve a problem. This heuristic can be directly applied to the standard simplex since the full tableau is updated at each iteration. However, it defeats the purpose of the revised simplex algorithm since the aim is precisely to avoid updating the whole constraints matrix at each iteration.
 289 Taking this into account, the steepest-edge coefficients $\bs\gamma$ are updated based only on their current value. For the sake of clarity, the hat notation is used to differentiate the next iteration value of a variable from its current value: for example if $\bs\gamma$ denotes the steepest-edge coefficients at the current iteration, then $\hat{\bs\gamma}$ are the updated coefficients.
 290
 291 Given the column of the entering variable $\mbf{d} = \mbf{(B^{-1}N)}_e$, we may process afresh the steepest-edge coefficient of the entering variable as $\gamma_e = \Vert \mbf{d} \Vert^2$.
 292 The updated steepest-edge coefficients are then given by
 293 $$\hat{\gamma_j} = \max\left\{\gamma_j - 2\hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\} \quad \textrm{ for } j \neq e $$
 294 $$\hat{\gamma_e} = \gamma_e / d_e^2$$
 295 with $\hat{\bs\alpha} = \mbf{N^T} (\mbf{(\hat{B}^{-1})^T})_\ell $
 296 and $\bs\beta = \mbf{N^T} \mbf{(B^{-1})^T} \mbf{d}$
 297
 298 \subsubsection*{Choice of the leaving variable}
 299 The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis.
 300 \clearpage
 301 \section{Branch-and-bound\index{branch-and-bound} algorithm}
 302 \label{chXXX:sec:bnb}
 303 \subsection{Integer linear programming\index{integer linear programming}}
 304 \label{chXXX:subsec:ilp}
 305 % Why and what are integer linear programming problem
 306 In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. ILP problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important.
 307
 308 % What is the cost of adding this constraint
 309 From a geometrical perspective, the simplex algorithm is a method for finding an optimum in a convex polytope defined by an LP problem. However, the additional integrality condition results in the loss of this convexity property. The resolution method must then be altered in order to find a solution to the ILP problem. The idea is to explore the solution space by a divide-and-conquer approach in which the simplex algorithm is used to find a local optimum in subspaces.
 310
 311 \subsection{Branch-and-bound tree}
 312 \label{chXXX:subsec:bnb-tree}
 313 % Intro
 314 The solution space is explored by the B\&B algorithm~\cite{ACHTERBERG07}.
 315 The strategy used is conceptually close to a tree traversal.
 316
 317 % Branching
 318 %At first, the ILP problem is considered as an LP problem by relaxing the integrality condition before applying an LP solver to it.
 319 %This initial relaxed problem represents the root of the tree about to be built.
 320 %From the obtained solution, a variable $x_j$ (with bounds $l_j \leq x_j \leq u_j$) that violates its integrality condition, is chosen as a branching variable.
 321 %The current problem will then be divided, if possible, into two subproblems: one having $x_j$ constrained to $l_j' = l_j$ and $u_j' = \lfloor x_j \rfloor$ and the other to $l_j' = \lceil x_j \rceil$ and $u_j' = u_j$.
 322 %Each of these LP subproblems represents a child node waiting to be solved.
 323
 324 % Branching
 325 At first, the ILP problem is considered as an LP problem by relaxing the integrality condition before applying an LP solver to it.
 326 This initial relaxed problem represents the root of the tree about to be built.
 327 From the obtained solution $\bs{\xi}$, if $\xi_j$ is not integral for some $j$, then $x_j$ can be chosen as a branching variable.
 328 The current problem will then be divided, if possible, into two subproblems: one having $x_j\le \lfloor \xi_j \rfloor$ and the other $x_j\ge \lceil \xi_j \rceil$.
 329 Each of these LP subproblems represents a child node waiting to be solved.
 330
 331 % When does it ends
 332 This is repeated for each subproblem in such a way that all variables are led towards integral values.
 333 At some point a subproblem will either be infeasible or lead to a feasible ILP solution (leaf nodes).
 334 The algorithm ends when the tree has been fully visited, returning the best feasible ILP solution.
 335
 336 % Upper, lower bounds and pruning
 337 During the exploration, lower and upper bounds on the objective value are computed.
 338 The upper bound is represented by the best objective value encountered in a child node.
 339 This nonadmissible solution hints towards what the objective value of the ILP problem could be.
 340 The lower bound is the best ILP solution yet found, in other words the objective value of the most promising leaf node.
 341 The bounds may be used to prune subtrees when further branching cannot improve the best current solution.
 342
 343 % How can we be improve the exploration
 344 In the B\&B algorithm, the main two operations that impact the convergence of the bounds towards the optimal solution are
 345 the branching strategy and the node selection strategy.
 346
 347 \subsubsection*{Example}
 348 Figure~\ref{chXXX:fig:bnb} illustrates the use of the B\&B algorithm for solving an ILP.
 349 The ILP problem is the following:
 350 \begin{align*}
 351 \textrm{Maximize}  \quad z= x_1 + 4x_2 & \\
 352 \textrm{Subject to}\quad \quad 5x_1 + 8x_2 & \leq 40\\
 353  -2x_1 + 3x_2 & \leq 9\\
 354  x_1, x_2 & \geq 0 \quad \textrm{ integer-valued}
 355 \end{align*}
 356 The nodes are solved with the simplex method in the order written on the figure.
 357 At the 3rd step of the B\&B, the first feasible solution is encountered (light grey leaf).
 358 The optimal solution is encountered at the 7th step (dark grey leaf).
 359 However, this is assessed only after solving the last two nodes (8th and 9th)
 360 whose objective value $z$ is inferior to the best one yet found.
 361
 362 \begin{figure}[!h]
 363 \centering
 364 \includegraphics[width=12cm]{Chapters/chapter10/figures/tree2_conv.pdf}
 365 \caption{Solving an ILP problem using a branch-and-bound algorithm.}
 366 \label{chXXX:fig:bnb}
 367 \end{figure}
 368
 369 \subsection{Branching strategy}
 370 \label{chXXX:subsec:bnb-branching}
 371 % What it is and why it is important
 372 The branching strategy\index{branch-and-bound!branching} defines the method used to select the variable on which branching will occur.
 373 The objective value of the child node depends greatly on the choice of this variable.
 374 Branching on a variable may lead to a drop on the upper bound and thus speed up the exploration,
 375 while branching on other variables could leave this bound unchanged.
 376
 377 % Example of strategy
 378 Several branching strategies exist~\cite{Achterberg05}. Let us briefly comment on some of them in terms of improving the objective function and processing cost.
 379 \begin{itemize}
 380 \item \textit{Smallest-index strategy} is a greedy approach that always selects the variable with the smallest index as branching variable. This method is simple, has a cheap processing cost, but does not try to select the best variable.
 381 \item \textit{Strong branching strategy} is an exhaustive strategy. The branching variable selected is the one among all the potential variables that leads to the best solution. This means that for each potential branching variable, its potential child nodes must be solved. This method is easy to implement, but its computational cost makes it inefficient.
 382 \item \textit{Reliability branching strategy} is a strategy which maintains a pseudocost~\cite{BENICHOU71} for each potential branching variable. The pseudocost of a variable is based on the result obtained when branching on it at previous steps. Since at the start, pseudocosts are unreliable, a limited strong branching approach is used until pseudocosts are deemed reliable. This method is more complex than the two others and requires fine tuning. It offers, however, the best trade-off between improvement and computional cost.
 383 \end{itemize}
 384
 385 \subsection{Node selection strategy}
 386 \label{chXXX:subsec:bnb-node-select}
 387 The node selection strategy\index{branch-and-bound!node selection} defines the methodology used to explore the tree.
 388 While the usual depth-first search and breadth-first search are considered and used,
 389 some remarks about the tree exploration must be made.
 390 First let us mention a few facts:
 391 \begin{enumerate}
 392  \item Solutions obtained from child nodes cannot be better than the one of their parent node.
 393  \item An LP solver is able to quickly find a solution if the subproblem (child node) is only slightly different from its parent.
 394 \end{enumerate}
 395 Given the above statements, it is of interest to quickly find a feasible solution.
 396 Indeed, this allows the pruning of all pending nodes which do not improve the solution found.
 397 However, the quality of the latter solution impacts the amount of nodes pruned.
 398 It takes more time to produce a good solution because one must search for the best nodes in the tree.
 399 Consequently, a trade-off must be made between the quality and the time required to find a solution.
 400
 401 Two types of strategies~\cite{Linderoth97} can then be considered:
 402 \begin{itemize}
 403 \item \textit{Depth-first search} aims at always selecting one of the child nodes until a leaf
 404 (infeasible subproblem or feasible solution) is reached.
 405 This strategy is characterized by fast solving, and it quickly finds a feasible solution.
 406 It mostly improves the lower bound.
 407 \item  \textit{Best-first search} aims at always selecting one of the most promising nodes in the tree.
 408 This strategy is characterized by slower solving but guarantees that the first feasible solution found is the optimal.
 409 It mostly improves the upper bound.
 410 \end{itemize}
 411 A problem occurs with the best-first search strategy:
 412 there might be numerous nodes having solutions of the same quality, thus making the choice of a node difficult.
 413 To avoid this problem, the \textit{best-estimate search} strategy~\cite{FORREST74} differentiates the best nodes by estimating their potential cost with the help of a pseudocost (see previous section).
 414
 415 The most promising variant is a hybrid strategy in which the base strategy is the best-estimate search with the subtrees of the best-estimated nodes being then subject to a limited depth-first search,
 416 more commonly called \textit{plunging}.
 417 This method launches a fast search for a feasible solution in the most promising subtrees,
 418 thus improving the upper and lower bounds at the same time.
 419
 420 \subsection{Cutting-plane methods}
 421 \label{chXXX:subsec:cuts}
 422 \textit{Cutting-planes}\index{branch-and-bound!cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space.
 423 They may be generated from the first LP solution (\textit{cut-and-branch}) or periodically during the B\&B (\textit{branch-and-cut}).
 424 On the one hand cutting-planes may considerably reduce the size of the solution space, but on the other hand they increase the problem size.
 425 Moreover, generating cutting-planes is costly since it requires a thorough analysis of the current state of the problem.
 426
 427 Various types or families of cutting-planes exist. Those that are most applied are the \textit{Gomory cutting-planes} and the \textit{complemented mixed integer rounding inequalities} (c-MIR). Other kinds of cutting-planes target specific families of problems, for example, the \textit{0-1 knapsack cutting-planes} or \textit{flow cover cuts}.
 428
 429 The cutting-planes generated must be carefully selected in order to avoid a huge increase in the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function.
 430 Cutting-planes having the most impact on the problem are then selected, while the others are dropped.
 431 \clearpage
 432 \section{CUDA considerations}
 433 \label{chXXX:sec:cuda}
 434 The most expensive operations in the simplex algorithm are linear algebra functions.
 435 The NVIDIA CUDA Basic Linear Algebra Subroutines\footnote{The libraries CUBLAS and CUSPARSE are available at https://developer.nvidia.com/cublas and https://developer.nvidia.com/cusparse.} (CUBLAS) library is a GPU-accelerated version of the complete standard BLAS library.
 436 It can be used for dense matrix-vector multiplications (\texttt{cublasDgemv}) and dense vector sums (\texttt{cublasDaxpy}). The CUSPARSE library is used for the sparse matrix-vector multiplication (\texttt{cusparseDcsrmv}).
 437
 438 However, complex operations are required to implement the simplex and must be coded from scratch.
 439 For example, reduction operations, such as \textit{argmax} or \textit{argmin}, are fundamental for selecting variables.
 440 In the following section, we first quickly describe the CUDA reduction operation, before making some global remarks on kernel optimization.
 441
 442 % Reduce operation
 443 \subsection{Parallel reduction}
 444 \label{chXXX:subsec:reduction}
 445 A parallel reduction\index{parallel!reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU.
 446 An optimized way of doing the reduction can be found in the examples\footnote{Available at http://docs.nvidia.com/cuda/cuda-samples/index.html\#advanced} provided by NVIDIA.
 447 %In order to keep code listings compact hereafter, the reduction of values among a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}).
 448
 449 \begin{figure}[!h]
 450 \centering
 451 \includegraphics[width=10cm]{Chapters/chapter10/figures/Reduc3.pdf}
 452 \caption{Example of a parallel reduction at block level (courtesy NVIDIA).}
 453 \label{chXXX:fig:reduc}
 454 \end{figure}
 455
 456 % Volkov - hiding latency, register usage
 457 \subsection{Kernel optimization}
 458 Optimizing a kernel is a difficult task. The most important point is to determine whether the performances are limited by the bandwidth or by the instruction throughput. Depending on the case and the specificities of the problem, various strategies may be applied. This part requires a good understanding of the underlying architecture and its limitations. The \textit{CUDA Programming Guide} offers some insight into this subject, as do the interesting articles and presentations by Vassily Volkov~\cite{VOLKOV2010}. The CUDA profiler is the best way to monitor the performances of a kernel. This tool proposes multiple performance markers giving indications about potential bottlenecks.
 459
 460 \section{Implementations}
 461 \label{chXXX:sec:impl}
 462 % Foreword - no expand
 463 In this section, the implementations of the simplex algorithm are studied. The emphasis is put on the main algorithms and kernels having a major impact on performance. The \textit{expand} method used for choosing the leaving variable will not be detailed for that reason.
 464 Then the B\&B implementation is quickly explained focusing on the interaction between the B\&B algorithm and the simplex solver.
 465
 466 \subsection{Standard simplex}
 467 \label{chXXX:subsec:impl-simplex-std}
 468 % Standard simplex --> cublas but steepest edge
 469 The implementation of the standard simplex algorithm (see Section~\ref{chXXX:subsec:simplex-std}) is rather straightforward. The main difference with Algorithm~\ref{chXXX:algo:simplex-std} is that the basis matrix is not stored in memory since it is equal to the identity matrix most of the time. Instead a proper data structure keeps track of the basic variables and their values.
 470 %% \begin{algorithm}
 471 %% \caption{Standard simplex algorithm}
 472 %% \label{chXXX:algo:impl-simplex-std}
 473 %% \begin{algorithmic}
 474 %%      \STATE \textit{// Find entering variable $e$}
 475 %%      \STATE $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$
 476 %%      \STATE $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $
 477 %%      \IF{$e < 0$}
 478 %%      \RETURN \texttt{optima\_found}
 479 %%      \ENDIF
 480 %%      \STATE \textit{// Find leaving variable $\ell$}
 481 %%      \STATE $\ell, t \leftarrow expand((\mbf{A_\N})_e)$
 482 %%      \IF{$\ell < 0$}
 483 %%      \RETURN \texttt{unbounded}
 484 %%      \ENDIF
 485 %%      \STATE \textit{// Pivoting}
 486 %%      \STATE $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$
 487 %%      \STATE $\mbf{r} \leftarrow \mbf{N^T}_\ell$
 488 %%      \STATE $x_e \leftarrow x_e + t $
 489 %%      \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $
 490 %%      \STATE $\mbf{A_\N} \leftarrow  \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}
 491 %%      \STATE $\mbf{c} \leftarrow \mbf{c} -  c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}
 492 %%      \STATE $swap(x_e, x_{\ell})$
 493 %% \end{algorithmic}
 494 %% \end{algorithm}
 495
 496 \begin{algorithm}
 497 \caption{standard simplex algorithm}
 498 \label{chXXX:algo:impl-simplex-std}
 499         \textit{// Find entering variable $x_e$}\;
 500          $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$\;
 501         $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $\;
 502         \If{$e < 0$}     {
 503         \Return \texttt{optima\_found}
 504         }
 505         \textit{// Find leaving variable $x_\ell$}\;
 506         $\ell, t \leftarrow expand((\mbf{A_\N})_e)$\;
 507         \If{$\ell < 0$} {
 508         \Return \texttt{unbounded}
 509         }
 510         \textit{// Pivoting}\;
 511         $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$\;
 512         $\mbf{r} \leftarrow \mbf{N^T}_\ell$\;
 513         $x_e \leftarrow x_e + t $\;
 514         $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $\;
 515         $\mbf{A_\N} \leftarrow  \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}\;
 516         $\mbf{c} \leftarrow \mbf{c} -  c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}\;
 517         $swap(x_e, x_{\ell})$\;
 518
 519 \end{algorithm}
 520
 521
 522
 523 The first step of Algorithm~\ref{chXXX:algo:impl-simplex-std} is the selection of the entering variable using the steepest-edge heuristic. This step is discussed thoroughly in the next section. Then the leaving variable index $\ell$ and the potential gain on the entering variable $t$ are determined using the \textit{expand} method. Finally, the pivoting is done. The nonbasic matrix $\mbf{A_\N}$ and the objective function coefficients $\mbf{c}$ are updated using the CUBLAS library (respectively, with \texttt{cublasDger} and \texttt{cublasDaxpy}). This requires the entering variable column $\mbf{d}$ and the leaving variable row $\mbf{r}$ to be copied and slightly altered. The new value of the variables is computed and, since we use a special structure instead of the basis matrix, the entering and leaving variables are swapped.
 524
 525 % Kernel example ==> steepest edge
 526 \subsubsection*{Choice of the entering variable}
 527 \label{chXXX:sec:impl-simplex-std-inVar}
 528 Let us discuss two different approaches for the selection of the entering variable. The first one relies on the CUBLAS library.
 529 The idea is to split this step into several small operations, starting with the computation of the norms one by one with the \texttt{cublasDnrm2} function.
 530 The score of each variable is then obtained by dividing the cost vector $\mbf{c}$ by the norms previously computed.
 531 The entering variable is finally selected by taking the variable with the biggest steepest-edge coefficient using \texttt{cublasDamax}.
 532
 533 While being easy to implement, such an approach would lead to poor performances. The main problem is a misuse of data parallelism. Each column can be processed independently, and thus at the same time. Moreover, slicing this step into small operations requires that each temporary result be stored in global memory. This creates a huge amount of slow data transfers between kernels and global memory.
 534
 535 \lstinputlisting[label=chXXX:lst:k1,caption=a kernel for the choice of the entering variable]{Chapters/chapter10/optiSE.cu}
 536
 537 To avoid this, the whole step could be implemented as a unique kernel, as shown in the simplified Listing~\ref{chXXX:lst:k1}. Each block evaluates multiple potential entering variables.
 538 The threads of a block process part of the norm of a column.
 539 Then a reduction (see Section~\ref{chXXX:subsec:reduction}) is done inside the block to form the full norm.
 540 The thread having the final value can then compute the score of the current variable and determine if it is the best one encountered in this block.
 541 Once a block has evaluated its variables, the most promising one is stored in global memory.
 542 The final reduction step is finally done on the CPU.
 543
 544 The dimension of the blocks and grid have to be chosen wisely as a function of the problem size.
 545 The block size is directly correlated to the size of a column while the grid size is a trade-off between giving enough work to the scheduler in order to hide the latency and returning as few potential entering variables as possible for the final reduction step.
 546
 547 \subsubsection*{CPU-GPU interactions}
 548 The bulk of the data representing the problem is stored on the GPU.
 549 Only variables required for decision-making operations are updated on the CPU.
 550 The communications arising from the aforementioned scheme are illustrated in Figure~\ref{chXXX:fig:diagSTD}.
 551 The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases.
 552 \clearpage
 553 \begin{figure}[!h]
 554 \centering
 555 \includegraphics[width=90mm]{Chapters/chapter10/figures/DiagSTD_cap.pdf}
 556 \caption{Communications between CPU and GPU.}
 557 \label{chXXX:fig:diagSTD}
 558 \end{figure}
 559
 560
 561 \subsection{Revised simplex}
 562 \label{chXXX:subsec:impl-simplex-rev}
 563 % Revised simplex
 564 The main steps found in the previous implementation are again present in the revised simplex. The difference comes from the fact that a full update of the matrix $\mbf{N}$ is not necessary at every iteration. As explained in Section~\ref{chXXX:subsec:simplex-rev}, the basis matrix $\mbf{B}$ is updated so that the required values, such as the entering variable column, can be processed from the now \textit{constant} matrix $\mbf{N}$. Due to these changes, the steepest-edge method is adapted. The resulting implementation requires more operations as described in Algorithm~\ref{chXXX:algo:simplex-rev}.
 565 % Algorithm
 566 %% \begin{algorithm}
 567 %% \caption{Revised simplex algorithm}
 568 %% \label{chXXX:algo:simplex-rev}
 569 %% \begin{algorithmic}
 570 %%      \STATE \textit{// Find entering variable $x_e,\, e\in \B$}
 571 %%      \STATE $\bs\tau \leftarrow  \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}
 572 %%     \STATE $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}
 573 %%     \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
 574 %%     \STATE \textbf{if} $e < 0$ \textbf{return} \texttt{optima\_found}
 575 %%      \STATE \textit{// Find leaving variable $x_\ell,\, \ell\in \N$}
 576 %%     \STATE $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}
 577 %%       \STATE $\ell, t \leftarrow expand(\mbf{d})$
 578 %%     \STATE \textbf{if} $\ell < 0$ \textbf{return} \texttt{unbounded}
 579 %%      \STATE \textit{// Pivoting - basis update}
 580 %%      \STATE $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}
 581 %%      \STATE $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}
 582 %%      \STATE $\mbf{B^{-1}} \leftarrow  \mbf{E}\, \mbf{B^{-1}}$
 583 %%      %\STATE $basisUpdate(\mbf{B^{-1}}, \mbf{d})$ \textit{// See section ...}
 584 %%      \STATE $x_e \leftarrow x_e + t $
 585 %%      \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $
 586 %%      \STATE $swap(x_\ell,x_e)$
 587 %%      \STATE $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}
 588 %%      \STATE $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$
 589 %%      \STATE $\hat{\gamma_j} \leftarrow  \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$
 590 %% \end{algorithmic}
 591 %% \end{algorithm}
 592
 593
 594 \begin{algorithm}
 595 \caption{revised simplex algorithm}
 596 \label{chXXX:algo:simplex-rev}
 597         \textit{// Find entering variable $x_e,\, e\in \B$}\;
 598         $\bs\tau \leftarrow  \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}\;
 599        $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}\;
 600     $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
 601     \If { $e < 0$ } {\Return \texttt{optima\_found} }
 602     \textit{// Find leaving variable $x_\ell,\, \ell\in \N$}     \;
 603     $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}\;
 604     $\ell, t \leftarrow expand(\mbf{d})$\;
 605     \If { $\ell < 0$} {\Return \texttt{unbounded}}
 606      \textit{// Pivoting, basis update}\;
 607     $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}\;
 608     $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}\;
 609     $\mbf{B^{-1}} \leftarrow  \mbf{E}\, \mbf{B^{-1}}$\;
 610     $x_e \leftarrow x_e + t $\;
 611     $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $\;
 612     $swap(x_\ell,x_e)$\;
 613     $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}\;
 614     $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$ \;
 615     $\hat{\gamma_j} \leftarrow  \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$\;
 616 \end{algorithm}
 617
 618
 619 Let us compare the complexity, in terms of level 2 and level 3 BLAS operations, of both implementations. The standard one has mainly two costly steps: the selection of the entering variable and the update of the matrix $\mbf{N}$. These are level 2 BLAS operations or the equivalent, and thus the approximate complexity is $\mathcal{O}(2mn)$.
 620
 621 The new implementation proposed has three operations \texttt{cublasDgemv} where the matrix $\mbf{N}$ is involved,
 622 and three others with the matrix $\mbf{B^{-1}}$.
 623 The complexities of these operations are respectively $\mathcal{O}(mn)$ and $\mathcal{O}(m^2)$.
 624 The update of the matrix $\mbf{B}$ is a level 3 BLAS operation costing $\mathcal{O}(m^3)$.
 625 The approximate complexity is then $\mathcal{O}(3mn + 3m^2 + m^3)$.
 626
 627 It appears clearly that, in the current state, the revised simplex implementation is less efficient than the standard one. However this method can be improved and might well be better than the standard one based on two characteristics of LP problems:
 628 their high sparsity and the fact that usually $m \ll n$. In the next sections, we will give details about these improvements.
 629
 630 \subsubsection*{Choice of the entering variable}
 631 Once again, this part of the algorithm can be substantially improved.
 632 First, algorithmic optimizations must be considered.
 633 Since row $\bs\alpha$ of the leaving variable is processed to update the steepest-edge coefficients, the cost vector $\bs\upsilon$ can be updated directly without using the basis matrix $\mbf{B}$.
 634 This is done as follow
 635 \begin{equation*}
 636 \upsilon_j  = \bar{\upsilon}_j - \bar{\upsilon}_e \alpha_j, \; \forall j \neq e, \qquad \qquad
 637 \upsilon_e  =  -\dfrac{\bar{\upsilon}_e}{\alpha_e}
 638 \end{equation*}
 639 where $\bar{\upsilon_j}$ denotes the value of $\upsilon_j$ at the previous iteration.
 640 It is important to note that all these updated values ($\bs\upsilon, \bs\gamma$) must be processed afresh once in a while to reduce numerical errors.
 641
 642 Including this change, the operations required for the selection of the entering variable $e$ are detailed in Algorithm~\ref{chXXX:algo:inVar}. The values related to the entering variable at the previous iteration $\bar{e}=q$ are used. The reduced costs are updated, followed by the steepest-edge coefficients. With these values the entering variable is determined.
 643
 644 %% \begin{algorithm}[!h]
 645 %% \caption{Choice of the entering variable}
 646 %% \label{chXXX:algo:inVar}
 647 %% \begin{algorithmic}
 648 %%      \STATE $q \leftarrow \bar{e}$
 649 %%      \STATE $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$
 650 %%      \STATE $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$
 651 %%      \STATE \textit{// Update of the reduced costs}
 652 %%      \STATE $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$
 653 %%     \STATE $\upsilon_q \leftarrow  \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$
 654 %%      \STATE \textit{// Update of the steepest edge coefficients}
 655 %%      \STATE $\gamma_j \leftarrow  \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $
 656 %%      \STATE $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$
 657 %%      \STATE \textit{// Selection of the entering variable}
 658 %%     \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
 659 %% \end{algorithmic}
 660 %% \end{algorithm}
 661
 662 \begin{algorithm}[!h]
 663 \caption{choice of the entering variable}
 664 \label{chXXX:algo:inVar}
 665         $q \leftarrow \bar{e}$\;
 666         $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$   \;
 667         $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$\;
 668         \textit{// Update of the reduced costs}\;
 669         $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$\;
 670         $\upsilon_q \leftarrow  \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$ \;
 671         \textit{// Update of the steepest edge coefficients}\;
 672         $\gamma_j \leftarrow  \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $\;
 673         $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$  \;
 674         \textit{// Selection of the entering variable}\;
 675     $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
 676 \end{algorithm}
 677
 678 Coupling these operations into a single kernel is quite beneficial. This leads $\bs\upsilon$ and $\bs\gamma$ to be loaded only once from global memory. Their updated values are stored while the entering variable $e$ is selected. With these changes, the global complexity of this step is reduced from $\mathcal{O}(mn + m^2 + n)$ to $\mathcal{O}(n)$. Moreover the remaining processing is done optimally by reusing data and by overlapping global memory access and computations.
 679
 680 \subsubsection*{Basis update}
 681 % Update operation
 682 The basis update $\mbf{B^{-1}} \leftarrow  \mbf{E} \, \mbf{B^{-1}}$ is a matrix-matrix multiplication. However, due to the special form of the matrix $\mbf{E}$ (see Section~\ref{chXXX:subsec:simplex-rev}), the complexity of this operation can be reduced from $\mathcal{O}(m^3)$ to $\mathcal{O}(m^2)$~\cite{BIELING}. The matrix $\mbf{E}$ is merely the identity matrix having the $\ell^\mathrm{th}$ column replaced by the vector $\bs\eta$. The update of the matrix $\mbf{B^{-1}}$ can be rewritten as
 683 \begin{equation*}
 684 \hat{B}^{-1}_{ij}  = B^{-1}_{ij}\left(1 - \frac{d_i}{d_\ell}\right)\, , \quad \forall i \neq \ell, \qquad
 685 \qquad \hat{B}^{-1}_{\ell j}  = \frac{B^{-1}_{\ell j}}{d_\ell}
 686 \end{equation*}
 687 As shown in Listing~\ref{chXXX:lst:k2}, each block of the kernel processes a single column while each thread may compute multiple elements of a column.
 688 This organization ensures that global memory accesses are coalescent since the matrix $\mbf{B}$ is stored column-wise.
 689
 690 \lstinputlisting[label=chXXX:lst:k2,caption=basis update]{Chapters/chapter10/updateBasis.cu}
 691
 692 % Sparse matrix - vector
 693 \subsubsection*{Sparse matrix operations}
 694 The complexity of the implementation is now $\mathcal{O}(2mn + 3m^2)$ which is close to the one of the standard simplex implementation. The operations where the matrix $\mbf{N}$ is involved remain the more expensive (considering $m \ll n$). However, this matrix is \textit{constant} in the revised simplex allowing the use of sparse matrix-vector multiplication from the CUSPARSE library. On sparse problems, this leads to important gains in performance. The sparse storage of the matrix $\mbf{N}$ reduces significantly the memory space used by the algorithm. All these improvements come at a small cost: columns are no longer directly available in their dense format and must be decompressed to their dense representation when needed.
 695
 696 The complexity of the revised simplex implementation is thus finally $\mathcal{O}(2\theta + 3m^2)$ where $\theta$ is a function of $m$, $n$, and the sparsity of the problem. We shall see in Section~\ref{chXXX:subsec:implPerf}, what kind of performances we may obtain on various problems.
 697
 698 \subsection{Branch-and-bound}
 699 \label{chXXX:subsec:impl-bnb}
 700 % Branch and bound on GPU
 701 % Limiting communications
 702 The B\&B algorithm is operated from the CPU.
 703 The simplex implementations, also referred to as LP solver, are used to solve the nodes of the B\&B tree.
 704 Algorithm~\ref{chXXX:algo:bnb} contains the main operations discussed in Section~\ref{chXXX:sec:bnb}.
 705 It starts by solving the root node.
 706 The branching strategy is then used to select the next variable to branch on.
 707 From thereon, until no node remains to be solved, the node selection strategy is used to select the next one to be processed.
 708
 709 The critical factor for coupling the LP solver and the B\&B is the amount of communications exchanged between them. Since CPU-GPU communications are expensive, the informations required to solve a new node must be minimized. Upon solving a new node, the full transfer of the problem must be avoided. This will be the focus of this section.
 710
 711 % Algorithm
 712 %% \begin{algorithm}
 713 %% \caption{Branch-and-Bound}
 714 %% \label{chXXX:algo:bnb}
 715 %% \begin{algorithmic}
 716 %%     \STATE Solution  $\leftarrow$ Solve(InitialProblem)
 717 %%     \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
 718 %%     \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
 719 %%     \WHILE{NodeList $\neq \emptyset$}
 720 %%     \STATE Node $\leftarrow$ SelectNextNode(NodeList)
 721 %%     \STATE Solution  $\leftarrow$ Solve(Node)
 722 %%     \IF{exists(Solution)}
 723 %%     \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
 724 %%     \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
 725 %%     \ENDIF
 726 %%     \ENDWHILE
 727 %% \end{algorithmic}
 728 %% \end{algorithm}
 729
 730 \begin{algorithm}
 731 \caption{branch-and-bound}
 732 \label{chXXX:algo:bnb}
 733     Solution  $\leftarrow$ Solve(InitialProblem)\;
 734     BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
 735     NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
 736     \While{NodeList $\neq \emptyset$} {
 737         Node $\leftarrow$ SelectNextNode(NodeList)\;
 738         Solution  $\leftarrow$ Solve(Node)\;
 739         \If{exists(Solution)}{
 740             BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
 741             NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
 742         }
 743     }
 744 \end{algorithm}
 745
 746 % Algorithmic choice --> Plunging
 747 \subsubsection*{Algorithmic choices}
 748 The first critical choice is to select an efficient tree traversal strategy that also takes advantage of the locality of the problem. At first glance, the best-first search would be a poor choice, since from one node to the next the problem could change substantially. However, when combined with the plunging heuristic, this same strategy becomes really interesting. Indeed, the plunging phase can be completely decoupled from the B\&B. The CPU, acting as a decision maker, chooses a promising node and spawns a task that will independently explore the subtree of this node. Once done, this task will report its findings to the decision maker. Moreover, when plunging, the next node to be solved differs by only the new bound on the branching variable. Communications are then minimized and the LP solver is usually able to quickly find the new solution since the problem has been only slightly altered.
 749
 750 % Technical improvement --> warmstart / hotstart
 751 \subsubsection*{Warmstart}
 752 There are two cases where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could, for example, be the one found after solving the root node of the tree.
 753
 754 % Global flow
 755 \subsubsection*{Multi-GPU\index{multi-GPU} exploration}
 756 Having the CPU act as a decision maker and the GPU as an explorer, allows for the possibility of using multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The CPU assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task.
 757
 758 \section{Performance model}
 759 \label{chXXX:sec:MODEL}
 760 Performance models\index{performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size.
 761
 762 CUDA kernels require a different kind of modeling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps, and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency.
 763
 764 In order to model our implementation, we follow the approach given by K. Kothapalli et al.~\cite{MODEL} (see also~\cite{ELSTER09}). First, we examine the different levels of parallelism on a GPU. Then, we determine the amount of work done by a single task. By combining both analyses, we establish the kernel models. The final model then consists of the sum of each kernel.
 765
 766 This model has also been used to model a multi-GPUs version of the standard simplex method~\cite{MEYER2011}.
 767
 768
 769 \subsection{Levels of parallelism}
 770 \label{chXXX:subsec:LVLParallelism}
 771
 772 A kernel can be decomposed into an initialization phase followed by a processing phase.
 773 During the initialization phase the kernel context is set up.
 774 Since this operation does not take a lot of time compared to the processing phase,
 775 it is needless to incorporate it into the model.
 776 The processing phase is more complex and its execution time depends directly on the amount of work it must perform.
 777 We shall now focus on this phase and on the various levels of parallelism on a GPU.
 778
 779 At the first level, the total work is broken down into components, the blocks. They are then dispatched on the available multiprocessors  on the GPU. The execution time of a kernel depends  on the number of blocks $N_B$ per SM (streaming multiprocessor) and on the number $N_{SM}$ of SM per GPU.
 780
 781 At a lower level, the work of a block is shared among the various cores of its dedicated SM. This is done by organizing the threads of the block into groups, the warps. A warp is a unit of threads that can be executed in parallel on an SM. The execution time of a block is then linked to the number $N_W$ of warps per block, the number $N_T$ of threads per warp, and the number $N_P$ of cores per SM.
 782
 783 The third and lowest level of parallelism is a pipeline. This pipeline enables a pseudoparallel execution of the tasks forming a warp. The gain produced by this pipeline is expressed by its depth $D$.
 784
 785 \subsection{Amount of work done by a thread}
 786 \label{chXXX:subsec:workTask}
 787
 788 In the previous section, we defined the different levels of parallelism down to the smallest, namely the thread level. We must now estimate how much work is done by a single thread of a kernel in terms of cycles. It depends on the number and the type of instructions. In CUDA, each arithmetic instruction requires a different number of cycles. For example, a single precision add costs 4 cycles while a single precision division costs 36 cycles.
 789
 790 Moreover, since global memory access instructions are nonblocking operations, they must be counted separately from arithmetic instructions. The number of cycles involved in a global memory access amounts to a 4 cycle instruction (read/write) followed by a nonblocking latency of 400--600 cycles. To correctly estimate the work due to such accesses, one needs to sum only the latency that is not hidden. Two consecutive read instructions executed by a thread would cost twice the 4 cycles, but only once the latency due to its nonblocking behaviour. Once the amount of cycles involved in these memory accesses has been determined, it is then necessary to take into account the latency hidden by the scheduler (warp swapping).
 791
 792 The total work $C_T$ done by a thread can be defined either as the sum or as the maximum of the memory access cycles and the arithmetic instructions cycles. Summing both types of cycles means we consider that latency cannot be used to hide arithmetic instructions. The maximum variant represents the opposite situation where arithmetic instructions and memory accesses are concurrent. Then only the biggest of the two represents the total work of a thread. This could occur for example when the latency is entirely hidden by the pipeline.
 793
 794 It is not trivial to define which scenario is appropriate for each kernel. There are many factors involved: the dependency on the instructions, the behaviour of the scheduler, the quantity of data to process, and so on.
 795
 796 If latency cannot be used to hide arithmetic instructions, the number of cycles $C_T$ done by a thread can be defined as the sum of the global memory access and the arithmetic instructions.
 797 Otherwise, one must consider the maximum instead of the sum.
 798 %Moreover, for each kernel the task representing the worst case scenario will be the one modelised.
 799
 800 \subsection{Global performance model}
 801 \label{chXXX:subsec:MODEL}
 802
 803 We now turn to the global performance model for a kernel. We must find out how many threads are run by a core of an SM. Recall that a kernel decomposes into blocks of threads, with each SM running $N_B$ blocks, each block having $N_W$ warps consisting of $N_T$ threads. Also recall that an SM has $N_P$ cores which execute batches of threads in a pipeline of depth $D$. Thus, the total work $C_{core}$ done by a core is given by
 804 \begin{equation}
 805 \label{chXXX:equ:ccore}
 806 C_{core} = N_B \cdot N_W \cdot N_T \cdot C_T \cdot \dfrac{1}{N_P \cdot D}
 807 \end{equation}
 808 which represents the total work done by an SM divided by the number of cores per SM and by the depth of the pipeline.
 809 Finally, the execution time of a kernel is obtained by dividing $C_{core}$ by the core frequency.
 810
 811 \subsection{A kernel example: \textit{steepest-edge}}
 812 \label{chXXX:subsec:SE}
 813 The selection of the entering variable is done by the \textit{steepest-edge} method as described in Section~\ref{chXXX:subsec:impl-simplex-std}.
 814
 815 The processing of a column is done in a single block. Each thread of a block has to compute $N_{el}=\frac{m}{N_W \cdot N_T}$ coefficients of the column. This first computation consists of $N_{el}$ multiplications and additions. The resulting partial sum of squared variables must then be reduced on a single thread of the block. This requires $N_{red} = \log_2(N_W \cdot N_T)$ additions. Since the shared memory is used optimally, there are no added communications. Finally, the coefficient $c_j$ must be divided by the result of the reduction.
 816
 817 Each block of the kernel computes $N_{col}=\frac{n}{N_B \cdot N_{SM}}$ columns, where $N_{SM}$ is the number of SM per GPU. After processing a column, a block keeps only the minimum of its previously computed steepest-edge coefficients. Thus, the number of arithmetic instruction cycles for a given thread is given by
 818 \begin{equation}
 819 \label{chXXX:equ:arithmSE}
 820 C_{Arithm} =  N_{col} \cdot (N_{el}  \cdot (C_{add} + C_{mult}) + N_{red} \cdot C_{add} + C_{div} + C_{cmp})
 821 \end{equation}
 822 where $C_{ins}$ denotes the number of cycles to execute instruction $ins\in \{add, div, mult, cmp\}$.
 823
 824 Each thread has to load $N_{el}$ variables to compute its partial sum of squared variables. The thread computing the division also loads the coefficient $c_j$. This must be done for the $N_{col}$ columns with which a block has to deal. We must also take into account that the scheduler hides some latency by swapping the warps, so the total latency $C_{latency}$ must be divided by the number of warps $N_W$.
 825 Thus, the number of cycles relative to memory accesses is given by
 826
 827 \begin{equation}
 828 \label{chXXX:equ:latencySE}
 829 C_{Accesses} =  \dfrac{ N_{col} \cdot (N_{el}+1) \cdot C_{latency}} {N_W}
 830 \end{equation}
 831
 832 At the end of the execution of this kernel, each block stores in global memory its local minimum. The CPU will then have to retrieve the $N_B\cdot N_{SM}$ local minimums and reduce them. It is then profitable to minimize the number $N_B$ of blocks per SM. With a maximum of two blocks per SM, the cost of this final operation done by the CPU can be neglected when $m$ and $n$ are big.
 833
 834 It now remains to either maximize or sum $C_{Arithm}$ and $C_{Accesses}$ to obtain~$C_T$. The result of Equation~(\ref{chXXX:equ:ccore}) divided by the core frequency yields the time $T_{KSteepestEdge}$ of the steepest-edge kernel.
 835
 836 \subsection{Standard simplex GPU implementation model}
 837 \label{chXXX:subsec:model-simplex-std}
 838 As seen in Section~\ref{chXXX:subsec:impl-simplex-std}, the standard simplex implementation requires only a few communications between the CPU and the GPU. Since all of these communications are constant and small, they will be neglected in the model. For the sake of simplicity, we will consider the second phase of the \textit{two-phase simplex} where we apply iteratively the three main operations: selecting the entering variable, choosing the leaving variable, and pivoting. Each of these operations is computed as a kernel. The time of an iteration $T_{Kernels}$ then amounts to the sum of all three kernel times:
 839 \begin{equation}
 840 \label{chXXX:equ:1GPU}
 841 T_{Kernels} = T_{KSteepestEdge} + T_{KExpand} + T_{KPivoting}
 842 \end{equation}
 843 The times $T_{KExpand}$ and $T_{KPivoting}$ for the expand and pivoting kernels are obtained in a similar way as for the steepest-edge kernel described in the previous section.
 844
 845 With the estimated time per iteration $T_{Kernels}$, we can express the total time $T_{prob}$ required for solving a problem as
 846 \begin{equation}
 847 \label{chXXX:equ:tprob}
 848 T_{prob} = T_{init} + r \cdot T_{Kernels}
 849 \end{equation}
 850 where $r$ is the number of iterations.
 851 Note that research by Dantzig~\cite{DANTZIG80} asserts that $r$ is proportional to $m \log_2(n)$.
 852
 853 \section{Measurements and analysis}
 854 \label{chXXX:sec:measurements}
 855 Two sets of measurements are presented here. The first one is a validation of the performance model presented in Section~\ref{chXXX:sec:MODEL}. The second is a comparison of the performances of the various implementations of the simplex method detailed in Section~\ref{chXXX:sec:impl}.
 856
 857 As a preliminary, we ensured that our implementations were functional. For that matter, we used a set of problems available on the NETLIB Repository~\cite{NETBLIB}. This dataset usually serves as a benchmark for LP solvers. It consists of a vast variety of real and specific problems for testing the stability and robustness of an implementation. For example, some of these represent real-life models of large refineries, industrial production/allocation, or fleet assignments problems. Our implementations are able to solve almost all of these problems.
 858
 859 % Test 1 : Tesla S1070 =>  Performance model + custom problems
 860 \subsection{Performance model validation}
 861 \label{chXXX:subsec:modelVal}
 862
 863 In order to check that our performance model for the standard simplex implementation is correct, a large range of problems of varying size is needed. As none of the existing datasets offered the desired diversity of problems, we used a problem generator. It is then possible to create problems of given size and density. Since usual LP problems have more variables than equations, our generated problems have a ratio of 5 variables for 1 equation ($n=5 \cdot m$).
 864 %All these problems fall into the same category. Hence, generated problems of the same size will be solved within approximately the same number of iterations and the optimum of the objective function will only slightly fluctuate.
 865
 866 The test environment is composed of a CPU server (2 Intel Xeon X5570, 2.93GHz, with 24GB DDR3 RAM)
 867 and a GPU computing system (NVIDIA Tesla S1070) with the 3.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 4GB GDDR3 graphics memory and 30 streaming multiprocessors, each holding 8 cores (1.4GHz).
 868 %Such a GPU offers at peak performance up to 1TFLOPS for single precision floating point, 80GFLOPS for double precision floating point and 100GB/s memory bandwidth between threads (registers) and global memory.
 869
 870 We validated our performance model by showing that it accurately fits our measurements.
 871 The correlation between the measurements and the model is above $0.999$ (see Figure~\ref{chXXX:fig:fitting}).
 872 \begin{figure}[!h]
 873 \centering
 874 \includegraphics[width=100mm]{Chapters/chapter10/figures/Model.pdf}
 875 \caption{Performance model and measurements comparison.}
 876 \label{chXXX:fig:fitting}
 877 \end{figure}
 878
 879 % Test 2 : Tesla M2090 => Performance comparison + NETLIB problems
 880 \subsection{Performance comparison of implementations}
 881 \label{chXXX:subsec:implPerf}
 882 Four different implementations are compared in this section. We will refer to each of them according to the terminology introduced below.
 883 \begin{itemize}
 884  \item Standard simplex method improved ($\mathcal{O}(2mn)$):  \textit{Standard}
 885  \item Revised simplex method using basis kernel
 886  \begin{itemize}
 887   \item without improvements ($\mathcal{O}(3mn + 4m^2)$): \textit{Revised-base}
 888   \item optimized ($\mathcal{O}(2mn + 3m^2)$): \textit{Revised-opti}
 889   \item optimized with sparse matrix-vector multiplication ($\mathcal{O}(2\theta+3m^2)$): \textit{Revised-sparse}
 890  \end{itemize}
 891 \end{itemize}
 892 These implementations all use the \textit{steepest-edge} and \textit{expand} methods for the choice of, respectively, the entering and the leaving variables.
 893
 894 We used problems from the NETLIB Repository to illustrate the improvements discussed in Section~\ref{chXXX:sec:impl}. The results expected from the first three methods are quite clear. The \textit{Standard} method should be the best, followed by the \textit{Revised-opti}, and then the \textit{Revised-base}. However, the performance of the \textit{Revised-sparse} implementation remains unclear since the value of $\theta$ is unknown and depends on the density and size of the constraints matrix. This is the main question we shall try to answer with our experiments.
 895
 896 The test environnement is composed of a CPU server (2 Intel Xeon X5650, 2.67GHz, with 96GB DDR3 RAM) and a GPU computing system (NVIDIA Tesla M2090) with the 4.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 6GB GDDR5 graphics memory and 512 cores (1.3GHz).
 897
 898 \begin{table}[!h]
 899 \begin{center}
 900 \begin{tabular}{|l|r|r|r|r|}
 901 \hline
 902 \multicolumn{1}{|l|}{Problem Name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 903 etamacro.mps & 401 & 688 & 2489 & 1.7 \\ \hline
 904 fffff800.mps & 525 & 854 & 6235 & 1.6\\ \hline
 905 finnis.mps & 498 & 614 & 2714 & 1.2\\ \hline
 906 gfrd-pnc.mps & 617 & 1092 & 3467 & 1.8\\ \hline
 907 grow15.mps & 301 & 645 & 5665 & 2.1\\ \hline
 908 grow22.mps & 441 & 946 & 8318 & 2.1\\ \hline
 909 scagr25.mps & 472 & 500 & 2029 & 1.1\\ \hline
 910 \end{tabular}
 911 \caption{NETLIB problems solved in less than 1 second.}
 912 \label{chXXX:tab:small}
 913 \end{center}
 914 \end{table}
 915
 916 Let us begin by discussing the performances on problems solved in less than one second. The name, size, number of nonzero elements (NNZ), and columns to rows ratio (C-to-R) of each problem are reported in Table~\ref{chXXX:tab:small}. The performances shown in Figure~\ref{chXXX:fig:FastSolve} corroborate our previous observations. On these problems the \textit{Revised-sparse} method doesn't outperform the \textit{Standard} one. This can be explained by two factors: the added communications (kernel calls) for the revised method, and the small size and density of the problems. It is likely that sparse operations on a GPU require larger amounts of data to become more efficient than their dense counterparts.
 917 \begin{figure}[!h]
 918 \centering
 919 \includegraphics[width=100mm]{Chapters/chapter10/figures/Small_N.pdf}
 920 \caption{Time required to solve problems of Table~\ref{chXXX:tab:small}.}
 921 \label{chXXX:fig:FastSolve}
 922 \end{figure}
 923
 924 The problems shown in Table~\ref{chXXX:tab:medium} are solved in less than 4 seconds. As we can see in Figure~\ref{chXXX:fig:NormalSolve}, the expected trend for the \textit{Revised-base} and the \textit{Revised-opti} methods is now confirmed. Let us presently focus on the \textit{Standard} and \textit{Revised-sparse} methods. Some of the problems, in particular czprob.mps and nesm.mps, are solved in a comparable amount of time. The performance gain of the \textit{Revised-sparse} is related to the C-to-R ratio of these problems, displaying, respectively, a 3.8 and a 4.4 ratio.
 925
 926 \begin{table}[!h]
 927 \begin{center}
 928 \begin{tabular}{|l|r|r|r|r|}
 929 \hline
 930 \multicolumn{1}{|l|}{Problem Name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 931 25fv47.mps & 822 & 1571 & 11127 & 1.9\\ \hline
 932 bnl1.mps & 644 & 1175 & 6129 & 1.8\\ \hline
 933 cycle.mps & 1904 & 2857 & 21322 & 1.5\\ \hline
 934 czprob.mps & 930 & 3523 & 14173 & 3.8\\ \hline
 935 ganges.mps & 1310 & 1681 & 7021 & 1.2\\ \hline
 936 nesm.mps & 663 & 2923 & 13988 & 4.4\\ \hline
 937 maros.mps & 847 & 1443 & 10006 & 1.7\\ \hline
 938 perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline
 939 \end{tabular}
 940 \caption{NETLIB problems solved in the range of 1 to 4 seconds.}
 941 \label{chXXX:tab:medium}
 942 \end{center}
 943 \end{table}
 944
 945 \begin{figure}[!h]
 946 \centering
 947 \includegraphics[width=100mm]{Chapters/chapter10/figures/Med_N.pdf}
 948 \caption{Time required to solve problems of Table~\ref{chXXX:tab:medium}.}
 949 \label{chXXX:fig:NormalSolve}
 950 \end{figure}
 951
 952 Finally, the biggest problems, and slowest to solve, are given in Table~\ref{chXXX:tab:big}. A new tendency can be observed in Figure~\ref{chXXX:fig:SlowSolve}. The \textit{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 such as bnl2.mps, pilot.mps, or greenbeb.mps. However, when this ratio is greater, the \textit{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on 80bau3b.mps (4.5) and fit2p.mps (4.3). Although the C-to-R ratio of d6cube.mps (14.9) exceeds the ones previously mentioned, the \textit{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which doesn't fully benefit from the lower complexity of sparse operations.
 953 \clearpage
 954 \begin{table}[!h]
 955 \begin{center}
 956 \begin{tabular}{|l|r|r|r|r|}
 957 \hline
 958 \multicolumn{1}{|l|}{Problem Name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 959 80bau3b.mps & 2263 & 9799 & 29063 & 4.3\\ \hline
 960 bnl2.mps & 2325 & 3489 & 16124 & 1.5\\ \hline
 961 d2q06c.mps & 2172 & 5167 & 35674 & 2.4\\ \hline
 962 d6cube.mps & 416 & 6184 & 43888 & 14.9\\ \hline
 963 fit2p.mps & 3001 & 13525 & 60784 & 4.5\\ \hline
 964 greenbeb.mps & 2393 & 5405 & 31499 & 2.3 \\ \hline
 965 maros-r7.mps & 3137 & 9408 & 151120 & 3.0\\ \hline
 966 pilot.mps & 1442 & 3652 & 43220 & 2.5 \\ \hline
 967 pilot87.mps & 2031 & 4883 & 73804 & 2.4\\ \hline
 968 \end{tabular}
 969 \caption{NETLIB problems solved in more than 5 seconds.}
 970 \label{chXXX:tab:big}
 971 \end{center}
 972 \end{table}
 973
 974 \begin{figure}[!h]
 975 \centering
 976 \includegraphics[width=100mm]{Chapters/chapter10/figures/Big_N.pdf}
 977 \caption{Time required to solve problems of Table \ref{chXXX:tab:big}.}
 978 \label{chXXX:fig:SlowSolve}
 979 \end{figure}
 980
 981 \section{Conclusion and perspectives}
 982 \label{chXXX:sec:persp}
 983 In this chapter, we have tried to present the knowledge and understanding necessary in our view to produce an efficient integer linear programming solver on a GPU. We proposed various solutions to implement the standard and revised simplex. We have discussed the main concepts behind a branch-and-bound algorithm and pointed out some major concerns when it is coupled with a GPU solver. The results obtained with the standard simplex implementation allowed us to validate a detailed performance model we had proposed. Finally, we used problems from the NETLIB library to compare the performances of our various implementations. The revised simplex implementation with sparse matrix operations showed the best performances on time-consuming problems, while the standard simplex one was more competitive on easier problems. However, sequential open-source solvers such as CLP of the COIN-OR project still outperform such GPU implementations.
 984
 985 We shall now discuss some remaining potential improvements. A first step towards performance would be to consider the dual revised simplex~\cite{LEMKE54}. While being similar to the methods treated in this chapter, it has the capacity to greatly reduce the time needed to solve problems.
 986 Yet, the main improvement would be seen when tackling larger problems than the ones considered here. Indeed, problems having hundreds of thousands of variables would technically be solvable on GPU devices with 2GB to 4GB of global memory. Moreover, such amounts of data would fully benefit from the potential of these devices. However, solving problems of this size raises an issue not addressed here: numerical instabilities. This phenomenon is due to the limited precision of mathematical operations on computing devices.
 987 % Numerical innacuracies/ Example + problem
 988 Let us illustrate this problem on the inverse $\mbf{B^{-1}}$ of the basis matrix $\mbf{B}$. At each iteration of the revised simplex, $\mbf{B^{-1}}$ is updated from its previous values. Performing this update adds a small error to the result. Unfortunately, these errors accumulate at each iteration, bringing at some point the $\mbf{B^{-1}}$ matrix into a degenerate state. To avoid this situation, the matrix $\mbf{B^{-1}}$ must be recomputed afresh once in a while by inverting the matrix $\mbf{B}$.
 989
 990 % LU decomposition
 991 Instead of directly processing the inverse of the matrix $\mbf{B}$, it is more common to use some specific arithmetical treatment. Most simplex implementations use the so-called LU decomposition of $\mbf{B}$ as a product of a lower and an upper triangular matrix~\cite{BARTELS69}.
 992 Since the matrix $\mbf{B}$ is sparse, the sparse version of this decomposition can be used and the corresponding update performed for $\mbf{B^{-1}}$ at each iteration~\cite{BRAYTON70}. The sparse LU decomposition on CPU has been recently the subject of a large amount of research, yielding many improvements. Once its GPU counterpart is available, this might result in improved and competitive simplex implementations on GPU.
 993
 994 \putbib[Chapters/chapter10/biblio]
 995
 996
 997
 998
 999
1000
1001