BookGPU/Chapters/chapter10/ch10.tex

   1
   2
   3 \chapterauthor{Xavier Meyer}{Department of Computer Science, University of Geneva}
   4 \chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, hepia, \\ University of Applied Sciences of Western Switzerland - Geneva}
   5 \chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
   6
   7 %\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm}
   8 \chapter{Linear programming on a GPU: a~study~case}
   9 \section{Introduction}
  10 \label{chXXX:sec:intro}
  11 The simplex method is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947~\cite{VCLP}. A more efficient method named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power. Indeed, current implementations of the revised simplex strive to produce the expected results, if any. In this context, parallelization is the natural idea to investigate~\cite{HALL}. Already in 1996, Thomadakis et al.~\cite{THOMAD} implemented the standard method on a massively parallel computer and obtained an increase in performances when solving dense or large problems.
  12
  13 A few years back, in order to meet the demand for processing power, graphics card vendors made their graphical processing units (GPU) available for general-purpose computing. Since then GPUs have gained a lot of popularity as they offer an opportunity to accelerate algorithms having an architecture well-adapted to the GPU model. The simplex method falls into this category. Indeed, GPUs exhibit a massively parallel architecture optimized for matrix processing. To our knowledge, there are only few simplex implementations on GPU. Bieling et al.~\cite{BIELING} presented encouraging results while solving small to mid-size LP problems with the revised simplex. However, the complexity of their algorithm seems to be rather close to the one of the standard simplex with similar heuristics. Following the steps of this first work, an implementation of the revised simplex~\cite{LALAMI11} showed interesting results on non-sparse and square matrices. Finally, an implementation of the interior point method~\cite{JUNG08} outperformed its CPU equivalent on mid-size problems.
  14
  15 The Branch-and-Bound (B\&B) algorithm is extensively used for solving integer linear programming (ILP) problems.
  16 This tree-based exploration method subdivides the feasible region of the relaxed LP model into successively smaller subsets to obtain bounds on the objective value.
  17 Each corresponding submodel can be solved with the simplex method, and the bounds computed determine whether further branching is required.
  18 Hence, a natural parallelization of the B\&B method is to let the CPU manage the B\&B tree and despatch the relaxed submodels to a LP solver on a GPU. We refer the reader to chapter~\ref{ch8:GPU-accelerated-tree-based-exact-optimization-methods}, and the references therein, for a more complete introduction to parallel B\&B algorithms.
  19
  20 In this chapter, we present a GPU implementation of the standard and revised simplex methods, based on the CUDA technology of NVIDIA. We also derive a performance model and establish its accurateness. Let us mention that there are many technicalities involved in CUDA programming, in particular regarding the management of tasks and memories on the GPU. Thus fine tuning is indispensable to avoid a breakdown on performance.
  21 %With respect to the simplex algorithm, special attention must be given to numerical stability.
  22
  23 The chapter is organized as follows. First, we begin with a description of the standard and revised simplex methods
  24 and introduce the heuristics used in our implementations. This is followed by a presentation of the B\&B algorithm.
  25 The next section points out CUDA aspects which are important for our implementations.
  26 In the following one, we focus on the GPU implementations of the simplex method as well as the B\&B algorithm.
  27 In the sixth section, we describe a performance model for the standard simplex implementation.
  28 In the seventh section, a performance comparison between our implementations is made on real-life problems and an analysis is given.
  29 Finally, we summarize the results obtained and consider new perspectives.
  30
  31 \section{Simplex algorithm}
  32 \label{chXXX:sec:simplex-algo}
  33 \subsection{Linear programming model}
  34 \label{chXXX:subsec:lp-model}
  35 %% Mathematics of LP
  36 A linear programming\index{linear programming} (LP) model in its canonical form can be expressed as the following optimization problem:
  37 \begin{equation}
  38 \label{chXXX:equ:canon-form}
  39 %\centering
  40 \begin{array}{lcl}
  41  \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
  42  \text{subject to} &\hspace{10pt} &\mbf{Ax \leq b} \hspace{30pt}\\
  43                             &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
  44 \end{array}
  45 \end{equation}
  46 where $\mbf{x}=(x_j),\mbf{c}=(c_j) \in \mathbb{R}^n,\mbf{b}=(b_i) \in \mathbb{R}^m$ and $\mbf{A}=(a_{ij})$ is the $m \times n$ constraints matrix.
  47 The objective function $z = \mbf{c^T x}$ is the inner product of the cost vector $\mbf{c}$ and the unknown variables $\mbf{x}$.
  48 An element $\mbf{x}$ is called a solution which is said to be feasible
  49 if it satisfies the $m$ linear constraints imposed by $\mbf{A}$ and the bound $\mbf{b}$.
  50 The feasible region $\{\mbf{x}\in \mathbb{R}^n\, |\, \mbf{Ax \leq b},\, \mbf{x \geq 0}\}$ is a convex polytope.
  51 An optimal solution to the LP problem will therefore reside on a vertex of this polytope.
  52
  53 \subsection{Standard simplex algorithm}
  54 \label{chXXX:subsec:simplex-std}
  55 % Simple explanation of the simple algorithm : 3 operations
  56 The simplex method\index{simplex!standard method}~\cite{VCLP} is an algorithm for solving linear programming (LP) models.
  57 It proceeds by iteratively visiting vertices on the boundary of the feasible region.
  58 This amounts to performing algebraic manipulations on the system of linear equations.
  59
  60 We begin by reformulating the model.
  61 So-called \textit{slack variables} $x_{n+i}$ are added to the canonical form in order to replace inequalities by equalities in equation~(\ref{chXXX:equ:canon-form}):
  62 \begin{equation}
  63 \label{chXXX:equ:augm-form}
  64 %\centering
  65 \begin{array}{cccccccc}
  66 x_{n+i} & = &  b_i - & \displaystyle\sum\limits^{n}_{j=1} & a_{ij}x_j & & (i={1,2,...,m})\\
  67 \end{array}
  68 \end{equation}
  69 The resulting problem is called the \textit{augmented form} in which the variables are divided into two disjoint index sets,
  70 $\B$ and $\N$, which correspond to the basic and nonbasic variables.
  71 Basic variables, which form the basis of the problem, are on the left-hand side of equation~(\ref{chXXX:equ:augm-form}),
  72 while nonbasic variables, which form the core of the equations, appear on the right-hand~side.
  73 We can thus consider the following LP form:
  74 \begin{equation}
  75 %\centering
  76 \begin{array}{lcl}
  77  \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
  78  \text{subject to} &\hspace{10pt} &\mbf{Ax = b} \hspace{30pt}\\
  79                             &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
  80 \end{array}
  81 \end{equation}
  82 where $\mbf{x} \in \mathbb{R}^{n+m}$ and $\mbf{A}$ is now the $m\times (n+m)$ matrix obtained by concatenating the constraints matrix with the $m\times m$ identity matrix $\mbf{I_m}$.
  83 The cost vector has been padded with zeros so that $\mbf{c} \in \mathbb{R}^{n+m}$.
  84
  85 The basic and nonbasic variables can be separated given the expression $\mbf{A_\N x_\N + x_\B = b}$.
  86 Similarly, $z = \mbf{c^T x = c_\N^T x_\N + c_\B^T x_\B}$ with $\mbf{c_\B = 0}$.
  87 By definition, a basic solution is obtained by assigning null values to all nonbasic variables ($\mbf{x_\N=0}$).
  88 Hence, $\mbf{x=x_\B = b}$ is a basic solution.
  89
  90 The simplex algorithm searches for the optimal solution through an iterative process.
  91 For the sake of simplicity, we will assume here that $\mbf{b>0}$ (\textit{i.e.} the origin belongs to the feasible region).
  92 Otherwise a preliminary treatment is required to generate a feasible initial solution (see~\S~\ref{chXXX:sec:heuristics}).
  93 A typical iteration then consists of three operations (summarized in algorithm~\ref{chXXX:algo:simplex-std}).
  94 \begin{enumerate}
  95 %%%%%%%%%%%%%%%%%
  96 \item \textbf{Choosing the entering variable.}
  97 The entering variable is a nonbasic variable whose increase will lead to an increase of the value of the objective function $z$.
  98 This variable must be selected with care so as to yield a substantial leap towards the optimal solution.
  99 The standard way of making this choice is to select the variable $x_e$ with the largest positive coefficient
 100 $c_e = \max\{c_j>0\, |\, j\in \N\}$ in the objective function $z$.
 101 However, other strategies such as choosing the positive coefficient with the smallest index, prove to be useful.
 102 %%%%%%%%%%%%%%%%%
 103 \item \textbf{Choosing the leaving variable.}
 104 This variable is the basic variable which first violates its constraint as the entering variable $x_e$ increases.
 105 The choice of the leaving variable must guarantee that the solution remains feasible.
 106 More precisely, setting all nonbasic variables except $x_e$ to zero, $x_e$ is bounded by
 107 $$t=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$$
 108 If $t=+\infty$, the LP problem is unbounded. \\
 109 Otherwise, $t=\dfrac{b_\ell}{a_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable; \\
 110 the element $a_{\ell e}$ is called the pivot.
 111 %%%%%%%%%%%%%%%%%
 112 \item \textbf{Pivoting.}
 113 Once both variables are defined, the pivoting operation switches these variables from one set to the other:
 114 the entering variable enters the basis $\mathcal{B}$, taking the place of the leaving variable which now belongs to $\mathcal{N}$,
 115 namely $\mathcal{B} \leftarrow (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$ and $\mathcal{N} \leftarrow (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$.
 116 Correspondingly, the columns with index $\ell$ and $e$ are exchanged between $\mbf{I_m}$ and $\mbf{A_\N}$, and similarly for $\mbf{c_\B=0}$ and $\mbf{c_\N}$.
 117 We then update the constraints matrix $\mbf{A}$, the bound $\mbf{b}$ and the cost $\mbf{c}$ using Gaussian elimination.
 118 More precisely, denoting $\mbf{\tilde{I}_m}$, $\mbf{\tilde{A}_\N}$, $\mbf{\tilde{c}_\B}$ and $\mbf{\tilde{c}_\N}$ the resulting elements after the exchange, Gaussian elimination then transforms the tableau
 119
 120 \begin{tabular}{|c|c||c|}
 121 $\mbf{\tilde{A}_\N}$ & $\mbf{\tilde{I}_m}$ & $\mbf{b}$ \\
 122 \hline
 123 $\mbf{\tilde{c}^T_\N}$ & $\mbf{\tilde{c}^T_\B}$ & $z$
 124 \end{tabular}
 125 into a tableau with updated values for $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
 126
 127 \begin{tabular}{|c|c||c|}
 128 $\mbf{A_\N}$ & $\mbf{I_m}$ & $\mbf{b}$ \\
 129 \hline
 130 $\mbf{c^T_\N}$ & $\mbf{c^T_\B}$ & $z-tc_e$
 131 \end{tabular}
 132
 133 The latter is obtained by first dividing the $\ell$-th row by $a_{\ell e}$;
 134 the resulting row multiplied by $a_{ie}$ ($i\not=\ell$), resp. by $c_e$, is then added to the $i$-th row, resp. the last row.
 135 Hence, $\mbf{c_\B=0}$.
 136 These operations amount to jumping from the current vertex to an adjacent vertex with objective value $z=tc_e$.
 137 \end{enumerate}
 138 % End condition
 139 The algorithm ends when no more entering variable can be found, that is when~$\mbf{c_\N \le 0}$.
 140
 141 %% \begin{algorithm}
 142 %% \caption{Standard simplex algorithm}
 143 %% \label{chXXX:algo:simplex-std}
 144 %% \begin{algorithmic}
 145 %%      \STATE \texttt{//1. Find entering variable}
 146 %%      \IF{$\mbf{c_\N \le 0}$}
 147 %%      \RETURN Optimal
 148 %%    \ENDIF
 149 %%    \STATE Choose an index $e\in \N$ such that $c_e>0$
 150 %%      \STATE \texttt{//2. Find leaving variable}
 151 %%      \IF{$(\mbf{A_\N})_e \mbf{\le 0}$}
 152 %%      \RETURN Unbounded
 153 %%      \ENDIF
 154 %%      \STATE Let $\ell \in \B$ be the index such that \\
 155 %%      \hspace*{10mm} $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$
 156 %%      \STATE \texttt{//3. Pivoting - update }
 157 %%    \STATE $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$
 158 %%      \STATE Compute $z_{best}:=z_{best}+t c_e$
 159 %%    \STATE Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$
 160 %%      \STATE Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
 161 %%    \STATE Go to 1.
 162 %% \end{algorithmic}
 163 %% \end{algorithm}
 164
 165 \begin{algorithm}
 166 \SetNlSty{}{}{:}
 167     \textit{//1. Find entering variable}\;
 168      \If{$\mbf{c_\N \le 0}$} {
 169         \Return Optimal\;
 170    }
 171    Choose an index $e\in \N$ such that $c_e>0$ \;
 172    \textit{//2. Find leaving variable}\;
 173         \If{$(\mbf{A_\N})_e \mbf{\le 0}$} {
 174         \Return Unbounded
 175         }
 176         Let $\ell \in \B$ be the index such that \;
 177          $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$\;
 178   \textit{//3. Pivoting - update }\;
 179    $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$\;
 180  Compute $z_{best}:=z_{best}+t c_e$\;
 181     Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$\;
 182 Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$\;
 183     Go to 1.\;
 184 \caption{Standard simplex algorithm}
 185 \label{chXXX:algo:simplex-std}
 186
 187 \end{algorithm}
 188
 189
 190 \subsection{Revised simplex method}
 191 \label{chXXX:subsec:simplex-rev}
 192 % Global revised simplex idea
 193 The operation which takes the most time in the standard method, is the pivoting operation, and more specifically the update of the constraints matrix~$\mbf{A}$. The revised method\index{simplex!revised method} tries to avoid this costly operation by only updating a smaller part of this matrix.
 194
 195 %% Revised simplex method operations, useful for the algorithm
 196 The revised simplex method uses the same operations as in the standard method to choose the entering and leaving variables.
 197 However, since the constraints matrix $\mbf{A}$ need not be fully updated, some additional reformulation is required.
 198
 199 At any stage of the algorithm, basic and nonbasic variables can be separated according to $\mbf{Ax = A_\N x_\N + A_\B x_\B = b}$.
 200 We can then transform the system $\mbf{Ax=b}$ into $\mbf{A_\B x_\B = b-A_\N x_\N}$.
 201 Let us denote $\mbf{B=A_\B}, \mbf{N=A_\N}$.
 202 Since $\mbf{B}$ is invertible, we can write
 203 \begin{align*}
 204 \mbf{x}_\B&=\mbf{B^{-1}b-B^{-1}Nx_\N} \\
 205 %\hline
 206 z&=\mbf{c_\B^T B^{-1}b+(c_\N^T-c_\B^T B^{-1}N)x_\N}
 207 \end{align*}
 208 The vector $\mbf{c_\N^T-c_\B^T B^{-1}N}$ is called the reduced cost vector.
 209
 210 The choice of the leaving variable can be rewritten.
 211 Setting all nonbasic variables save the entering variable $x_e$, to zero, $x_e$ is then bounded by
 212 $$t=\min\left\{\dfrac{(\mbf{B^{-1}b})_i}{(\mbf{B^{-1}N})_{ie}}\, \bigg|\, (\mbf{B^{-1}N})_{ie}>0,\, i=1,\dots,m \right\}$$
 213 If $t=+\infty$, the LP problem is unbounded. Otherwise, $t=\frac{(\mbf{B^{-1}b})_\ell}{(\mbf{B^{-1}N})_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable.
 214
 215 Recall that the pivoting operation begins by exchanging columns with index $\ell$ and $e$ between $\mbf{B}$ and $\mbf{N}$, and similarly for $\mbf{c_\B}$ and $\mbf{c_\N}$.
 216
 217 To emphasize the difference between standard and revised simplex, we first express the updating for the standard simplex. This amounts to
 218 \begin{align*}
 219 [\mbf{B} \quad \mbf{N} \quad \mbf{b}] & \leftarrow [\mbf{I_m} \quad \mbf{B^{-1}N} \quad \mbf{B^{-1}b}] \\
 220 [\mbf{c_\B^T} \quad \mbf{c_\N^T}] & \leftarrow [\mbf{0} \quad \mbf{c_\N^T-c_\B^T B^{-1}N}]
 221 \end{align*}
 222 which moves the current vertex to an adjacent vertex $\mbf{x=B^{-1}b}$ with objective value $z=\mbf{c_\B^T B^{-1}b}$ (the end condition remains $\mbf{c_\N \le 0}$). Many computations performed in this update phase can in fact be avoided.
 223
 224 The main point of the revised simplex method is that the only values which really need to be computed at each step are: $\mbf{B^{-1}}$, $\mbf{B^{-1}b}$,$\mbf{B^{-1}N}_{e}$, $\mbf{c_\B^T B^{-1}}$ and $\mbf{c_\B^T B^{-1}b}$. However, matrix inversion is a time consuming operation (of cubic order for Gaussian elimination).
 225 Fortunately, there are efficient ways of computing an update for $\mbf{B^{-1}}$.
 226 One way is to take advantage of the sparsity of the LP problem by using the LU sparse decomposition.
 227 This decomposition may be updated at a small cost at each iteration~\cite{SUHL93}.
 228 Another way is to use the \textit{product form of the inverse}~\cite{dantzig1953product}, which we describe hereafter.
 229
 230 Let $\mbf{b}_1,\dots,\mbf{b}_m$ be the columns of $\mbf{B}$, $\mbf{v}\in \mathbb{R}^m$
 231 and $\mbf{a}=\mbf{Bv}=\sum_{i=1}^m v_i\mbf{b}_i$.
 232 Denote $\mbf{B_a}=(\mbf{b}_1,\dots,\mbf{b}_{p-1},\mbf{a},\mbf{b}_{p+1},\dots,\mbf{b}_m)$ for a given $1\le p\le m$ such that $v_p\not= 0$.
 233 We want to compute $\mbf{B_a}^{-1}$.
 234 We first write $$\mbf{b}_p=\dfrac{1}{v_p}+\mbf{a}\sum_{i\not=p} \dfrac{-v_i}{v_p}\mbf{b}_i$$
 235 Define $$\bs\eta = \left(\dfrac{-v_1}{v_p},\dots,\dfrac{-v_{p-1}}{v_p},\dfrac{1}{v_p},\dfrac{-v_{p+1}}{v_p},\dots,\dfrac{-v_m}{v_p}\right)^T$$
 236 and $$\mbf{E}=(\bs\varepsilon_1,\dots,\bs\varepsilon_{p-1},\bs\eta,\bs\varepsilon_{p+1},\dots,\bs\varepsilon_m)$$
 237 where $\bs\varepsilon_j = (0,\dots,0,1,0,\dots,0)^T$ is the $j$-th element of the canonical basis of~$\mathbb{R}^m$.
 238 Then $\mbf{B_a E=B}$, so $\mbf{B_a}^{-1}=\mbf{E\; B}^{-1}$.
 239
 240 We apply these preliminary considerations to the simplex algorithm
 241 with $\mbf{a=N}_e, \mbf{v=(B^{-1}N)}_e$ (recall that $x_e$ is the entering variable).
 242 If initially $\mbf{B}$ is the identity matrix $\mbf{I_m}$, at the $k$-th iteration of the algorithm,
 243 the inverse matrix is given by $\mbf{B}^{-1}=\mbf{E}_k \mbf{E}_{k-1}\cdots \mbf{E}_2 \mbf{E}_1$,
 244 where $\mbf{E}_i$ is the matrix constructed at the $i$-th iteration.
 245 %Note that matrices like $\mbf{E}$ are sparse and can therefore be stored compactly.
 246
 247 \subsection{Heuristics and improvements}
 248 \label{chXXX:sec:heuristics}
 249 Specific heuristics or methods are needed to improve the performance and stability of the simplex algorithm.
 250 We will explain below how we find an initial feasible solution and how we choose the entering and leaving variables.
 251
 252 \subsubsection*{Finding an initial feasible solution}
 253 Our implementations use the \textit{two-phase simplex}~\cite{VCLP}.
 254 The first phase aims at finding a feasible solution required by the second phase to solve the original problem.
 255 If the origin $\mbf{x=0}$ is not a feasible solution, we proceed to find such a solution by solving a so-called \textit{auxiliary problem} with the simplex algorithm.
 256 This can be achieved by adding a non-negative artificial variable to each constraint equation corresponding to a basic variable which violates its non-negativity condition,
 257 before minimizing the sum of these artificial variables. The algorithm will thus try to drive all artificial variables towards zero.
 258 If it succeeds, then a basic feasible solution is available as an initial solution for the second phase
 259 in which it attempts to find an optimal solution to the original problem. Otherwise the latter is infeasible.
 260
 261 To avoid having to introduce these additional auxiliary variables, we use an alternate version of this procedure.
 262 For basic variables with index in $\mathcal{I}=\{j\in~\B \,\,|\,\, x_j<0 \}$, we temporarily relax the non-negativity condition in order to have a feasible problem,
 263 and then apply the simplex algorithm to minimize $w=-\sum_{j\in \mathcal{I}} x_j$.
 264 However, we update $\mathcal{I}$ at each iteration and modify accordingly the objective function,
 265 whose role is to drive these infeasible basic variables towards their original bound.
 266 If at some stage $\mathcal{I}=\emptyset$, we end up with an initial feasible solution.
 267 Otherwise, the algorithm terminates with $\mathcal{I}\not=\emptyset$ and $w>0$, which indicates that the original problem is infeasible.
 268 The alteration introduced above involves more computations during the first phase,
 269 but it offers the advantage of preserving the problem size and making good use of the GPU processing power.
 270
 271 \subsubsection*{Choice of the entering variable}
 272 The number of iterations required to solve a problem depends on the method used to select the entering variable. The one described in section~\ref{chXXX:subsec:simplex-std} chooses the most promising variable $x_e$ in terms of cost. While being inexpensive to compute, this method can lead to an important number of iterations before the best solution is found.
 273
 274 There exists various heuristics to select this variable $x_e$.
 275 One of the most commonly used is the \textit{steepest-edge} method~\cite{GOLDFRAB77}.
 276 To improve the speed at which the best solution is found, this method takes into account the coefficients of $\mbf{B}^{-1}\mbf{N}$.
 277 This can be explained from the geometrical point of view.
 278 The constraints $\mbf{Ax=b}$ form the hull of a convex polytope.
 279 The simplex algorithm moves from one vertex (\textit{i.e.}~a~solution) to another while trying to improve the objective function.
 280 The steepest-edge method searches for the edge along which the rate of improvement of the objective function is the best.
 281 The entering variable $x_e$ is then determined by
 282 $$ e = \argmax \left\{ \dfrac{c_j}{\sqrt{\gamma_j}} \, \bigg| \, c_j > 0 , j \in \N \right\} $$
 283 with $ \gamma_j = \Vert \mbf{\mbf{B}^{-1}\mbf{N}_j} \Vert^2 $.
 284
 285 This method is quite costly to compute but it reduces significantly the number of iterations required to solve a problem. This heuristic can be directly applied to the standard simplex since the full-tableau is updated at each iteration. However it defeats the purpose of the revised simplex algorithm since the aim is precisely to avoid updating the whole constraints matrix at each iteration.
 286 Taking this into account, the steepest-edge coefficients $\bs\gamma$ are updated based only on their current value. For the sake of clarity, the hat notation is used to differentiate the next iteration value of a variable from its current value: for example if $\bs\gamma$ denotes the steepest-edge coefficients at the current iteration, then $\hat{\bs\gamma}$ are the updated coefficients.
 287
 288 Given the column of the entering variable $\mbf{d} = \mbf{(B^{-1}N)}_e$, we may process afresh the steepest-edge coefficient of the entering variable as $\gamma_e = \Vert \mbf{d} \Vert^2$.
 289 The updated steepest-edge coefficients are then given by
 290 $$\hat{\gamma_j} = \max\left\{\gamma_j - 2\hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\} \quad \textrm{ for } j \neq e $$
 291 $$\hat{\gamma_e} = \gamma_e / d_e^2$$
 292 with $\hat{\bs\alpha} = \mbf{N^T} (\mbf{(\hat{B}^{-1})^T})_\ell $
 293 and $\bs\beta = \mbf{N^T} \mbf{(B^{-1})^T} \mbf{d}$
 294
 295 \subsubsection*{Choice of the leaving variable}
 296 The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis.
 297
 298 \section{Branch-and-Bound\index{Branch-and-Bound} algorithm}
 299 \label{chXXX:sec:bnb}
 300 \subsection{Integer linear programming}
 301 \label{chXXX:subsec:ilp}
 302 % Why and what are integer linear programming problem
 303 In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. Integer linear programming\index{integer linear programming} (ILP) problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important.
 304
 305 % What is the cost of adding this constraint
 306 From a geometrical perspective, the simplex algorithm is a method for finding an optimum in a convex polytope defined by a LP problem. However, the additional integrality condition results in the loss of this convexity property. The resolution method must then be altered in order to find a solution to the ILP problem. The idea is to explore the solution space by a divide-and-conquer approach in which the simplex algorithm is used to find a local optimum in subspaces.
 307
 308 \subsection{Branch-and-Bound tree}
 309 \label{chXXX:subsec:bnb-tree}
 310 % Intro
 311 The solution space is explored by the \textit{Branch-and-Bound} (B\&B) algorithm~\cite{ACHTERBERG07}.
 312 The strategy used is conceptually close to a tree traversal.
 313
 314 % Branching
 315 %At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it.
 316 %This initial relaxed problem represents the root of the tree about to be built.
 317 %From the obtained solution, a variable $x_j$ (with bounds $l_j \leq x_j \leq u_j$) that violates its integrality condition, is chosen as a branching variable.
 318 %The current problem will then be divided, if possible, into two subproblems: one having $x_j$ constrained to $l_j' = l_j$ and $u_j' = \lfloor x_j \rfloor$ and the other to $l_j' = \lceil x_j \rceil$ and $u_j' = u_j$.
 319 %Each of these LP subproblems represents a child node waiting to be solved.
 320
 321 % Branching
 322 At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it.
 323 This initial relaxed problem represents the root of the tree about to be built.
 324 From the obtained solution $\bs{\xi}$, if $\xi_j$ is not integral for some $j$, then $x_j$ can be chosen as a branching variable.
 325 The current problem will then be divided, if possible, into two subproblems: one having $x_j\le \lfloor \xi_j \rfloor$ and the other $x_j\ge \lceil \xi_j \rceil$.
 326 Each of these LP subproblems represents a child node waiting to be solved.
 327
 328 % When does it ends
 329 This is repeated for each subproblems in such a way that all variables are led towards integral values.
 330 At some point a subproblem will either be infeasible or lead to a feasible ILP solution (leaf nodes).
 331 The algorithm ends when the tree has been fully visited, returning the best feasible ILP solution.
 332
 333 % Upper, lower bounds and pruning
 334 During the exploration, lower and upper bounds on the objective value are computed.
 335 The upper bound is represented by the best objective value encountered in a child node.
 336 This non-admissible solution hints towards what the objective value of the ILP problem could be.
 337 The lower bound is the best ILP solution yet found, in other words the objective value of the most promising leaf node.
 338 The bounds may be used to prune subtrees when further branching cannot improve the best current solution.
 339
 340 % How can we be improve the exploration
 341 In the B\&B algorithm, the main two operations that impact the convergence of the bounds towards the optimal solution are
 342 the branching strategy and the node selection strategy.
 343
 344 \subsubsection*{Example}
 345 Figure~\ref{chXXX:fig:bnb} illustrates the use of the B\&B algorithm for solving an ILP.
 346 The ILP problem is the following:
 347 \begin{align*}
 348 \textrm{Maximize}  \quad z= x_1 + 4x_2 & \\
 349 \textrm{Subject to}\quad \quad 5x_1 + 8x_2 & \leq 40\\
 350  -2x_1 + 3x_2 & \leq 9\\
 351  x_1, x_2 & \geq 0 \quad \textrm{ integer-valued}
 352 \end{align*}
 353 The nodes are solved with the simplex method in the order written on the figure.
 354 At the 3rd step of the B\&B, the first feasible solution is encountered (light grey leaf).
 355 The optimal solution is encountered at the 7th step (dark grey leaf).
 356 However, this is assessed only after solving the last two nodes (8th and 9th)
 357 whose objective value $z$ is inferior to the best one yet found.
 358
 359 \begin{figure}[!h]
 360 \centering
 361 \includegraphics[width=12cm]{Chapters/chapter10/figures/tree2_conv.pdf}
 362 \caption{Solving an ILP problem using a Branch-and-Bound algorithm}
 363 \label{chXXX:fig:bnb}
 364 \end{figure}
 365
 366 \subsection{Branching strategy}
 367 \label{chXXX:subsec:bnb-branching}
 368 % What it is and why it is important
 369 The branching strategy\index{Branch-and-Bound!branching} defines the method used to select the variable on which branching will occur.
 370 The objective value of the child node depends greatly on the choice of this variable.
 371 Branching on a variable may lead to a drop on the upper bound and thus speed up the exploration,
 372 while branching on other variables could leave this bound unchanged.
 373
 374 % Example of strategy
 375 Several branching strategies exist~\cite{Achterberg05}. Let us briefly comment on some of them in terms of improvement of the objective function and processing cost.
 376 \begin{itemize}
 377 \item \textit{Smallest index strategy} is a greedy approach that always selects the variable with the smallest index as branching variable. This method is simple, has a cheap processing cost but does not try to select the best variable.
 378 \item \textit{Strong branching strategy} is an exhaustive strategy. The branching variable selected is the one amongst all the potential variables that leads to the best solution. This means that for each potential branching variable, its potential child nodes must be solved. This method is easy to implement, but its computational cost makes it inefficient.
 379 \item \textit{Reliability branching strategy} is a strategy which maintains a pseudo-cost~\cite{BENICHOU71} for each potential branching variable. The pseudo-cost of a variable is based on the result obtained when branching on it at previous steps. Since at the start, pseudo-costs are unreliable, a limited strong branching approach is used until pseudo-costs are deemed reliable. This method is more complex than the two others and requires fine tuning. It offers however the best trade-off between improvement and computional cost.
 380 \end{itemize}
 381
 382 \subsection{Node selection strategy}
 383 \label{chXXX:subsec:bnb-node-select}
 384 The node selection strategy\index{Branch-and-Bound!node selection} defines the methodology used to explore the tree.
 385 While the usual depth-first search and breadth-first search are considered and used,
 386 some remarks about the tree exploration must be made.
 387 First let us mention a few facts:
 388 \begin{enumerate}
 389  \item Solutions obtained from child nodes cannot be better than the one of their parent node.
 390  \item An LP solver is able to quickly find a solution if the subproblem (child node) is only slightly different than its parent.
 391 \end{enumerate}
 392 Given the above statements, it is of interest to quickly find a feasible solution.
 393 Indeed, this allows to prune all pending nodes which do not improve the solution found.
 394 However, the quality of the latter impacts the amount of nodes pruned.
 395 It takes more time to produce a good solution because one must search for the best nodes in the tree.
 396 Consequently, a trade-off must be made between the quality and the time required to find a solution.
 397
 398 Two types of strategies~\cite{Linderoth97} can then be considered:
 399 \begin{itemize}
 400 \item \textit{Depth-first search} aims at always selecting one of the child nodes until a leaf
 401 (infeasible subproblem or feasible solution) is reached.
 402 This strategy is characterized by fast solving and it quickly finds a feasible solution.
 403 It mostly improves the lower bound.
 404 \item  \textit{Best-first search} aims at always selecting one of the most promising nodes in the tree.
 405 This strategy is characterized by slower solving but guarantees that the first feasible solution found is the optimal.
 406 It mostly improves the upper bound.
 407 \end{itemize}
 408 A problem occurs with the best-first search strategy:
 409 there might be numerous nodes having solutions of the same quality, thus making the choice of a node difficult.
 410 To avoid this problem, the \textit{best-estimate search} strategy~\cite{FORREST74} differentiates the best nodes by estimating their potential cost with the help of a pseudo-cost (see previous section).
 411
 412 The most promising variant is a hybrid strategy: the base strategy being the best-estimate search,
 413 the subtrees of the best-estimated nodes are then subject to a limited depth-first search,
 414 more commonly called \textit{plunging}.
 415 This method launches a fast search for a feasible solution in the most promising subtrees,
 416 thus improving the upper and lower bounds at the same time.
 417
 418 \subsection{Cutting-plane methods}
 419 \label{chXXX:subsec:cuts}
 420 \textit{Cutting-planes}\index{Branch-and-Bound!cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space.
 421 They may be generated from the first LP solution (\textit{Cut-and-Branch}) or periodically during the B\&B (\textit{Branch-and-Cut}).
 422 On the one hand cutting-planes may considerably reduce the size of the solution space, but on the other hand they increase the problem size.
 423 Moreover, generating cutting-planes is costly since it requires a thorough analysis of the current state of the problem.
 424
 425 Various types or families of cutting-planes exist. Those that are most applied are the \textit{Gomory cutting-planes} and the \textit{complemented mixed integer rounding inequalities} (c-MIR). Other kinds of cutting-planes target specific families of problems, for example the \textit{0-1 Knapsack cutting-planes} or \textit{flow cover cuts}.
 426
 427 The cutting-planes generated must be carefully selected in order to avoid a huge increase of the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function.
 428 Cutting-planes having the most impact on the problem are then selected, while the others are dropped.
 429
 430 \section{CUDA considerations}
 431 \label{chXXX:sec:cuda}
 432 Most expensive operations in the simplex algorithm are linear algebra functions. The CUBLAS library is used for dense matrix-vector multiplications (\texttt{cublasDgemv}) and dense vector sums (\texttt{cublasDaxpy}). The CUSPARSE library is used for the sparse matrix-vector multiplication (\texttt{cusparseDcsrmv}).
 433
 434 However, complex operations are required to implement the simplex and must be coded from scratch.
 435 For example, reduction operations (\textit{argmax, argmin}) are fundamental for selecting variables.
 436 In the following section, we first quickly describe the CUDA reduction operation, before making some global remarks on kernel optimization.
 437
 438 % Reduce operation
 439 \subsection{Parallel reduction}
 440 \label{chXXX:subsec:reduction}
 441 A parallel reduction\index{Parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step ends the reduction on a single block or on the CPU.
 442 \begin{figure}[!h]
 443 \centering
 444 \includegraphics[width=10cm]{Chapters/chapter10/figures/Reduc2.pdf}
 445 \caption{Example of a parallel reduction at block level (courtesy NVIDIA)}
 446 \label{chXXX:fig:reduc}
 447 \end{figure}
 448
 449 An optimized way of doing the reduction can be found in the example provided by NVIDIA.
 450 In order to keep code listings compact hereafter, the reduction of values amongst a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}).
 451
 452 % Volkov - hiding latency, register usage
 453 \subsection{Kernel optimization}
 454 Optimizing a kernel is a difficult task. The most important point is to determine whether the performances are limited by the bandwidth or by the instruction throughput. Depending on the case and the specificities of the problem, various strategies may be applied. This part requires a good understanding of the underlying architecture and its limitations. The CUDA Programming Guide offers some insight into this subject, together with the interesting articles and presentations by Vassily Volkov~\cite{VOLKOV2010}. The CUDA profiler is the best way to monitor the performances of a kernel. This tool proposes multiple performance markers giving indications about potential bottlenecks.
 455
 456 \section{Implementations}
 457 \label{chXXX:sec:impl}
 458 % Foreword - no expand
 459 In this section, the implementations of the simplex algorithm are studied. The emphasis is put on the main algorithms and kernels having a major impact on performance. The \textit{expand} method used for choosing the leaving variable will not be detailed for that reason.
 460 Then the Branch-and-Bound (B\&B) implementation is quickly explained focusing on the interaction between the B\&B algorithm and the simplex solver.
 461
 462 \subsection{Standard simplex}
 463 \label{chXXX:subsec:impl-simplex-std}
 464 % Standard simplex --> cublas but steepest edge
 465 The implementation of the standard simplex algorithm (see~\S~\ref{chXXX:subsec:simplex-std}) is rather straightforward. The main difference is that the basis matrix is not stored in memory since it is equals to the identity matrix most of the time. Instead a proper data structure keeps track of the basic variables and their values.
 466 %% \begin{algorithm}
 467 %% \caption{Standard simplex algorithm}
 468 %% \label{chXXX:algo:impl-simplex-std}
 469 %% \begin{algorithmic}
 470 %%      \STATE \textit{// Find entering variable $e$}
 471 %%      \STATE $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$
 472 %%      \STATE $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $
 473 %%      \IF{$e < 0$}
 474 %%      \RETURN \texttt{optima\_found}
 475 %%      \ENDIF
 476 %%      \STATE \textit{// Find leaving variable $\ell$}
 477 %%      \STATE $\ell, t \leftarrow expand((\mbf{A_\N})_e)$
 478 %%      \IF{$\ell < 0$}
 479 %%      \RETURN \texttt{unbounded}
 480 %%      \ENDIF
 481 %%      \STATE \textit{// Pivoting}
 482 %%      \STATE $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$
 483 %%      \STATE $\mbf{r} \leftarrow \mbf{N^T}_\ell$
 484 %%      \STATE $x_e \leftarrow x_e + t $
 485 %%      \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $
 486 %%      \STATE $\mbf{A_\N} \leftarrow  \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}
 487 %%      \STATE $\mbf{c} \leftarrow \mbf{c} -  c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}
 488 %%      \STATE $swap(x_e, x_{\ell})$
 489 %% \end{algorithmic}
 490 %% \end{algorithm}
 491
 492 \begin{algorithm}
 493 \caption{Standard simplex algorithm}
 494 \label{chXXX:algo:impl-simplex-std}
 495         \textit{// Find entering variable $e$}\;
 496          $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$\;
 497         $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $\;
 498         \If{$e < 0$}     {
 499         \Return \texttt{optima\_found}
 500         }
 501         \textit{// Find leaving variable $\ell$     }\;
 502         $\ell, t \leftarrow expand((\mbf{A_\N})_e)$\;
 503         \If{$\ell < 0$} {
 504         \Return \texttt{unbounded}
 505         }
 506         \textit{// Pivoting}\;
 507         $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$\;
 508         $\mbf{r} \leftarrow \mbf{N^T}_\ell$\;
 509         $x_e \leftarrow x_e + t $\;
 510         $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $\;
 511         $\mbf{A_\N} \leftarrow  \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}\;
 512         $\mbf{c} \leftarrow \mbf{c} -  c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}\;
 513         $swap(x_e, x_{\ell})$\;
 514
 515 \end{algorithm}
 516
 517
 518
 519 The first step of algorithm~\ref{chXXX:algo:impl-simplex-std} is the selection of the entering variable using the steepest-edge heuristic. This step is discussed thoroughly in the next section. Then the leaving variable index $\ell$ and the potential gain on the entering variable $t$ are determined using the \textit{expand} method. Finally, the pivoting is done. The nonbasic matrix $\mbf{A_\N}$ and the objective function coefficients $\mbf{c}$ are updated using the CUBLAS library (with \texttt{cublasDger}, respectively \texttt{cublasDaxpy}). This requires the entering variable column $\mbf{d}$ and the leaving variable row $\mbf{r}$ to be copied and slightly altered. The new value of the variables is computed and, since we use a special structure instead of the basis matrix, the entering and leaving variables are swapped.
 520
 521 % Kernel example ==> steepest edge
 522 \subsubsection*{Choice of the entering variable}
 523 \label{chXXX:sec:impl-simplex-std-inVar}
 524 Let us discuss two different approaches for the selection of the entering variable. The first one relies on the CUBLAS library.
 525 The idea is to split this step in several small operations, starting with the computation of the norms one by one with the \texttt{cublasDnrm2} function.
 526 The score of each variable is then obtained by dividing the cost vector $\mbf{c}$ by the norms previously computed.
 527 The entering variable is finally selected by taking the variable with the biggest steepest-edge coefficient using \texttt{cublasDamax}.
 528
 529 While being easy to implement, such an approach would lead to poor performances. The main problem is a misuse of data parallelism. Each column can be processed independently and so at the same time. Moreover, slicing this step into small operations requires that each temporary results be stored in global memory. This creates a huge amount of slow data transfers between kernels and global memory.
 530
 531 \lstinputlisting[label=chXXX:lst:k1,caption=Kernel for the choice of the entering variable]{Chapters/chapter10/optiSE.cu}
 532
 533 To avoid this, the whole step could be implemented as a unique kernel, as shown in the simplified listing~\ref{chXXX:lst:k1}. Each block evaluates multiple potential entering variables.
 534 The threads of a block process part of the norm of a column.
 535 Then a reduction (see \S~\ref{chXXX:subsec:reduction}) is done inside the block to form the full norm.
 536 The thread having the final value can then compute the score of the current variable and determine if it is the best one encountered in this block.
 537 Once a block has evaluated its variables, the most promising one is stored in global memory.
 538 The final reduction step is finally done on the CPU.
 539
 540 The dimension of the blocks and grid have to be chosen wisely as a function of the problem size.
 541 The block size is directly correlated to the size of a column while the grid size is a trade-off between giving enough work to the scheduler in order to hide the latency and returning as few potential entering variable as possible for the final reduction step.
 542
 543 \subsubsection*{CPU-GPU interactions}
 544 The bulk of the data representing the problem is stored on the GPU.
 545 Only variables required for decision-making operations are updated on the CPU.
 546 \begin{figure}[!h]
 547 \centering
 548 \includegraphics[width=90mm]{Chapters/chapter10/figures/diagSTD2.pdf}
 549 \caption{Communications between CPU and GPU}
 550 \label{chXXX:fig:diagSTD}
 551 \end{figure}
 552 The communications arising from the aforementioned scheme are illustrated in figure~\ref{chXXX:fig:diagSTD}.
 553 The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases.
 554
 555 \subsection{Revised simplex}
 556 \label{chXXX:subsec:impl-simplex-rev}
 557 % Revised simplex
 558 The main steps found in the previous implementation are again present in the revised simplex. The difference comes from the fact that a full update of the matrix $\mbf{N}$ is not necessary at every iteration. As explained in section~\ref{chXXX:subsec:simplex-rev}, the basis matrix $\mbf{B}$ is updated so that the required values such as the entering variable column can be processed from the now \textit{constant} matrix $\mbf{N}$. Due to these changes, the steepest-edge method is adapted. The resulting implementation requires more operations as described in algorithm~\ref{chXXX:algo:simplex-rev}.
 559 % Algorithm
 560 %% \begin{algorithm}
 561 %% \caption{Revised simplex algorithm}
 562 %% \label{chXXX:algo:simplex-rev}
 563 %% \begin{algorithmic}
 564 %%      \STATE \textit{// Find entering variable $x_e,\, e\in \B$}
 565 %%      \STATE $\bs\tau \leftarrow  \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}
 566 %%     \STATE $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}
 567 %%     \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
 568 %%     \STATE \textbf{if} $e < 0$ \textbf{return} \texttt{optima\_found}
 569 %%      \STATE \textit{// Find leaving variable $x_\ell,\, \ell\in \N$}
 570 %%     \STATE $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}
 571 %%       \STATE $\ell, t \leftarrow expand(\mbf{d})$
 572 %%     \STATE \textbf{if} $\ell < 0$ \textbf{return} \texttt{unbounded}
 573 %%      \STATE \textit{// Pivoting - basis update}
 574 %%      \STATE $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}
 575 %%      \STATE $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}
 576 %%      \STATE $\mbf{B^{-1}} \leftarrow  \mbf{E}\, \mbf{B^{-1}}$
 577 %%      %\STATE $basisUpdate(\mbf{B^{-1}}, \mbf{d})$ \textit{// See section ...}
 578 %%      \STATE $x_e \leftarrow x_e + t $
 579 %%      \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $
 580 %%      \STATE $swap(x_\ell,x_e)$
 581 %%      \STATE $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}
 582 %%      \STATE $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$
 583 %%      \STATE $\hat{\gamma_j} \leftarrow  \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$
 584 %% \end{algorithmic}
 585 %% \end{algorithm}
 586
 587
 588 \begin{algorithm}
 589 \caption{Revised simplex algorithm}
 590 \label{chXXX:algo:simplex-rev}
 591         \textit{// Find entering variable $x_e,\, e\in \B$}\;
 592         $\bs\tau \leftarrow  \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}\;
 593        $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}
 594     $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
 595     \If { $e < 0$ } {\Return \texttt{optima\_found} }
 596     \textit{// Find leaving variable $x_\ell,\, \ell\in \N$}     \;
 597     $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}\;
 598     $\ell, t \leftarrow expand(\mbf{d})$\;
 599     \If { $\ell < 0$} {\Return \texttt{unbounded}}
 600      \textit{// Pivoting - basis update}\;
 601     $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}\;
 602     $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}\;
 603     $\mbf{B^{-1}} \leftarrow  \mbf{E}\, \mbf{B^{-1}}$\;
 604     $x_e \leftarrow x_e + t $\;
 605     $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $\;
 606     $swap(x_\ell,x_e)$\;
 607     $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}\;
 608     $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$ \;
 609     $\hat{\gamma_j} \leftarrow  \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$\;
 610 \end{algorithm}
 611
 612
 613 Let us compare the complexity, in terms of level 2 and level 3 BLAS operations, of both implementations. The standard one has mainly two costly steps: the selection of the entering variable and the update of the matrix $\mbf{N}$. These are level 2 BLAS operations or an equivalent, and thus the approximate complexity is $\mathcal{O}(2mn)$.
 614
 615 The new implementation proposed has three operations \texttt{cublasDgemv} where the matrix $\mbf{N}$ is involved,
 616 and three others with the matrix $\mbf{B^{-1}}$.
 617 The complexities of these operations are respectively $\mathcal{O}(mn)$ and $\mathcal{O}(m^2)$.
 618 The update of the matrix $\mbf{B}$ is a level 3 BLAS operation costing $\mathcal{O}(m^3)$.
 619 The approximate complexity is then $\mathcal{O}(3mn + 3m^2 + m^3)$.
 620
 621 It appears clearly that, in the current state, the revised simplex implementation is less efficient than the standard one. However this method can be improved and might well be better than the standard one based on two characteristics of LP problems:
 622 their high sparsity and the fact that usually $m \ll n$. In the next sections, we will give details about these improvements.
 623
 624 \subsubsection*{Choice of the entering variable}
 625 Once again, this part of the algorithm can be substantially improved.
 626 Firstly, algorithmic optimizations must be considered.
 627 Since the row $\bs\alpha$ of the leaving variable is processed to update the steepest-edge coefficients, the cost vector $\bs\upsilon$ can be updated directly without using the basis matrix $\mbf{B}$.
 628 This is done as follow
 629 \begin{equation*}
 630 \upsilon_j  = \bar{\upsilon}_j - \bar{\upsilon}_e \alpha_j, \; \forall j \neq e, \qquad \qquad
 631 \upsilon_e  =  -\dfrac{\bar{\upsilon}_e}{\alpha_e}
 632 \end{equation*}
 633 where $\bar{\upsilon_j}$ denotes the value of $\upsilon_j$ at the previous iteration.
 634 It is important to note that all these updated values ($\bs\upsilon, \bs\gamma$) must be processed afresh once in a while to reduce numerical errors.
 635
 636 Including this change, the operations required for the selection of the entering variable $e$ are detailed in algorithm~\ref{chXXX:algo:inVar}. The values related to the entering variable at the previous iteration $\bar{e}=q$ are used. The reduced costs are updated, followed by the steepest-edge coefficients. With these values the entering variable is determined.
 637
 638 %% \begin{algorithm}[!h]
 639 %% \caption{Choice of the entering variable}
 640 %% \label{chXXX:algo:inVar}
 641 %% \begin{algorithmic}
 642 %%      \STATE $q \leftarrow \bar{e}$
 643 %%      \STATE $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$
 644 %%      \STATE $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$
 645 %%      \STATE \textit{// Update of the reduced costs}
 646 %%      \STATE $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$
 647 %%     \STATE $\upsilon_q \leftarrow  \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$
 648 %%      \STATE \textit{// Update of the steepest edge coefficients}
 649 %%      \STATE $\gamma_j \leftarrow  \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $
 650 %%      \STATE $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$
 651 %%      \STATE \textit{// Selection of the entering variable}
 652 %%     \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
 653 %% \end{algorithmic}
 654 %% \end{algorithm}
 655
 656 \begin{algorithm}[!h]
 657 \caption{Choice of the entering variable}
 658 \label{chXXX:algo:inVar}
 659         $q \leftarrow \bar{e}$\;
 660         $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$   \;
 661         $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$\;
 662         \textit{// Update of the reduced costs} \;
 663         $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$\;
 664         $\upsilon_q \leftarrow  \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$ \;
 665         \textit{// Update of the steepest edge coefficients}\;
 666         $\gamma_j \leftarrow  \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $\;
 667         $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$  \;
 668         \textit{// Selection of the entering variable}\;
 669     $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
 670 \end{algorithm}
 671
 672 Coupling these operations into a single kernel is quite beneficial. This lead $\bs\upsilon$ and $\bs\gamma$ to only be loaded once from global memory. Their updated values are stored while the entering variable $e$ is selected. With these changes, the global complexity of this step is reduced from $\mathcal{O}(mn + m^2 + n)$ to $\mathcal{O}(n)$. Moreover the remaining processing is done optimally by reusing data and by overlapping global memory access and computations.
 673
 674 \subsubsection*{Basis update}
 675 % Update operation
 676 The basis update $\mbf{B^{-1}} \leftarrow  \mbf{E} \, \mbf{B^{-1}}$ is a matrix-matrix multiplication. However, due to the special form of the matrix $\mbf{E}$ (see \S~\ref{chXXX:subsec:simplex-rev}), the complexity of this operation can be reduced from $\mathcal{O}(m^3)$ to $\mathcal{O}(m^2)$~\cite{BIELING}. The matrix $\mbf{E}$ is merely the identity matrix having the $\ell$-th column replaced by the vector $\bs\eta$. The update of the matrix $\mbf{B^{-1}}$ can be rewritten as
 677 \begin{equation*}
 678 \hat{B}^{-1}_{ij}  = B^{-1}_{ij}\left(1 - \frac{d_i}{d_\ell}\right)\, , \quad \forall i \neq \ell, \qquad
 679 \qquad \hat{B}^{-1}_{\ell j}  = \frac{B^{-1}_{\ell j}}{d_\ell}
 680 \end{equation*}
 681 As shown in the listing~\ref{chXXX:lst:k2}, each block of the kernel processes a single column while each thread may compute multiple elements of a column.
 682 This organisation ensures that global memory accesses are coalescent since the matrix $\mbf{B}$ is stored column-wise.
 683
 684 \lstinputlisting[label=chXXX:lst:k2,caption=Basis update]{Chapters/chapter10/updateBasis.cu}
 685
 686 % Sparse matrix - vector
 687 \subsubsection*{Sparse matrix operations}
 688 The complexity of the implementation is now $\mathcal{O}(2mn + 3m^2)$ which is close to the one of the standard simplex implementation. The operations where the matrix $\mbf{N}$ is involved remain the more expensive (considering $m \ll n$). However, this matrix is \textit{constant} in the revised simplex allowing the use of sparse matrix-vector multiplication from the CUSPARSE library. On sparse problems, this leads to important gains in performance. The sparse storage of the matrix $\mbf{N}$ reduces significantly the memory space used by the algorithm. All these improvements come at a small cost: columns are no longer directly available in their dense format and must thus be decompressed to their dense representation when needed.
 689
 690 The complexity of the revised simplex implementation is thus finally $\mathcal{O}(2\theta + 3m^2)$ where $\theta$ is a function of $m$, $n$ and the sparsity of the problem. We shall see in section~\ref{chXXX:subsec:implPerf}, what kind of performances we may obtain on various problems.
 691
 692 \subsection{Branch-and-Bound}
 693 \label{chXXX:subsec:impl-bnb}
 694 % Branch and bound on GPU
 695 % Limiting communications
 696 The B\&B algorithm is operated from the CPU.
 697 The simplex implementations, also referred to as LP solver, are used to solve the nodes of the B\&B tree.
 698 Algorithm~\ref{chXXX:algo:bnb} contains the main operations discussed in section~\ref{chXXX:sec:bnb}.
 699 It starts by solving the root node.
 700 The branching strategy is then used to select the next variable to branch on.
 701 From thereon, until no node remains to be solved, the node selection strategy is used to elect the next one to be processed.
 702
 703 The critical factor for coupling the LP solver and the B\&B, is the amount of communications exchanged between them. Since CPU-GPU communications are expensive, the informations required to solve a new node must be minimized. Upon solving a new node, the full transfer of the problem must be avoided. This will be the focus of this section.
 704
 705 % Algorithm
 706 %% \begin{algorithm}
 707 %% \caption{Branch-and-Bound}
 708 %% \label{chXXX:algo:bnb}
 709 %% \begin{algorithmic}
 710 %%     \STATE Solution  $\leftarrow$ Solve(InitialProblem)
 711 %%     \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
 712 %%     \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
 713 %%     \WHILE{NodeList $\neq \emptyset$}
 714 %%     \STATE Node $\leftarrow$ SelectNextNode(NodeList)
 715 %%     \STATE Solution  $\leftarrow$ Solve(Node)
 716 %%     \IF{exists(Solution)}
 717 %%     \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
 718 %%     \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
 719 %%     \ENDIF
 720 %%     \ENDWHILE
 721 %% \end{algorithmic}
 722 %% \end{algorithm}
 723
 724 \begin{algorithm}
 725 \caption{Branch-and-Bound}
 726 \label{chXXX:algo:bnb}
 727     Solution  $\leftarrow$ Solve(InitialProblem)\;
 728     BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
 729     NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
 730     \While{NodeList $\neq \emptyset$} {
 731     Node $\leftarrow$ SelectNextNode(NodeList)\;
 732     Solution  $\leftarrow$ Solve(Node)\;
 733     \If{exists(Solution)}{
 734       BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
 735       NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
 736     }
 737     }
 738 \end{algorithm}
 739
 740 % Algorithmic choice --> Plunging
 741 \subsubsection*{Algorithmic choices}
 742 The first critical choice is to select an efficient tree traversal strategy that also takes advantage of the locality of the problem. At first glance, the best-first search would be a poor choice, since from one node to the next the problem could change substantially. However, when combined with the plunging heuristic, this same strategy becomes really interesting. Indeed, the plunging phase can be completely decoupled from the B\&B. The CPU acting as a decision maker, chooses a promising node and spawns a task that will independently explore the subtree of this node. Once done, this task will report its findings to the decision maker. Moreover, when plunging, the next node to be solved differs only by the new bound on the branching variable. Communications are then minimized and the LP solver is usually able to quickly find the new solution since the problem has only been slightly altered.
 743
 744 % Technical improvement --> warmstart / hotstart
 745 \subsubsection*{Warmstart}
 746 There are two case where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could for example be the one found after solving the root node of the tree.
 747
 748 % Global flow
 749 \subsubsection*{Multi-GPU\index{multi-GPU} exploration}
 750 The organisation, CPU decision maker vs GPU explorer, offers the possibility to use multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The latter assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task.
 751
 752 \section{Performance model}
 753 \label{chXXX:sec:MODEL}
 754 Performance models\index{Performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size.
 755
 756 CUDA kernels require a different kind of modelling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency.
 757
 758 In order to model our implementation, we follow the approach given by K. Kothapalli et al.~\cite{MODEL} (see also~\cite{ELSTER09}). Firstly, we examine the different levels of parallelism on a GPU. Then we determine the amount of work done by a single task. By combining both analysis, we establish the kernel models. The final model then consists of the sum of each kernel.
 759
 760 This model has also been used to model a multi-GPUs version of the standard simplex method~\cite{MEYER2011}.
 761
 762
 763 \subsection{Levels of parallelism}
 764 \label{chXXX:subsec:LVLParallelism}
 765
 766 A kernel can be decomposed into an initialization phase followed by a processing phase.
 767 During the initialization phase the kernel context is setup.
 768 Since this operation does not take a lot of time compared to the processing phase,
 769 it is needless to incorporate it into the model.
 770 The processing phase is more complex and its execution time depends directly on the amount of work it must perform.
 771 We shall now focus on this phase and on the various levels of parallelism on a GPU.
 772
 773 At the first level, the total work is broken down into components, the blocks. They are then dispatched on the available multiprocessors  on the GPU. The execution time of a kernel depends  on the number of blocks $N_B$ per SM (\textit{streaming multiprocessor}) and on the number $N_{SM}$ of SM per GPU.
 774
 775 At a lower level, the work of a block is shared among the various cores of its dedicated SM. This is done by organizing the threads of the block into groups, the warps. A warp is a unit of threads that can be executed in parallel on a SM. The execution time of a block is then linked to the number $N_W$ of warps per block, the number $N_T$ of threads per warp and the number $N_P$ of cores per SM.
 776
 777 The third and lowest level of parallelism is a pipeline. This pipeline enables a pseudo-parallel execution of the tasks forming a warp. The gain produced by this pipeline is expressed by its depth $D$.
 778
 779 \subsection{Amount of work done by a thread}
 780 \label{chXXX:subsec:workTask}
 781
 782 In the previous section, we defined the different levels of parallelism down to the smallest, namely the thread level. We must now estimate how much work is done by a single thread of a kernel in terms of cycles. It depends on the number and the type of instructions. In CUDA, each arithmetic instruction requires a different number of cycles. For example, a single precision add costs 4 cycles while a single precision division costs 36 cycles.
 783
 784 Moreover, since global memory access instructions are non-blocking operations, they must be counted separately from arithmetic instructions. The number of cycles involved in a global memory access, amounts to a 4 cycle instruction (read/write) followed by a non-blocking latency of 400-600 cycles. To correctly estimate the work due to such accesses, one needs to sum only the latency that is not hidden. Two consecutive read instructions executed by a thread, would cost twice the 4 cycles, but only once the latency due to its non-blocking behaviour. Once the amount of cycles involved in these memory accesses has been determined, it is then necessary to take into account the latency hidden by the scheduler (\textit{warp swapping}).
 785
 786 The total work $C_T$ done by a thread can be defined either as the sum or as the maximum of the memory access cycles and the arithmetic instructions cycles. Summing both types of cycles means we consider that latency cannot be used to hide arithmetic instructions. The maximum variant represents the opposite situation where arithmetic instructions and memory accesses are concurrent. Then only the biggest of the two represents the total work of a thread. This could occur for example when the latency is entirely hidden by the pipeline.
 787
 788 It is not trivial to define which scenario is appropriate for each kernel. There are many factors involved: the  dependency on the instructions, the behaviour of the scheduler, the quantity of data to process, and so on. The number of cycles $C_T$ done by a thread can thus be defined by the sum, respectively the maximum, of the global memory access and the arithmetic instructions.
 789 %Moreover, for each kernel the task representing the worst case scenario will be the one modelised.
 790
 791 \subsection{Global performance model}
 792 \label{chXXX:subsec:MODEL}
 793
 794 We now turn to the global performance model for a kernel. We must find out how many threads are run by a core of an SM. Recall that a kernel decomposes into blocks of threads, with each SM running $N_B$ blocks, each block having itself $N_W$ warps consisting of $N_T$ threads. Also recall that an SM has $N_P$ cores which execute batches of threads in a pipeline of depth $D$. Thus the total work $C_{core}$ done by a core is given by
 795 \begin{equation}
 796 \label{chXXX:equ:ccore}
 797 C_{core} = N_B \cdot N_W \cdot N_T \cdot C_T \cdot \dfrac{1}{N_P \cdot D}
 798 \end{equation}
 799 which represents the total work done by a SM divided by the number of cores per SM and by the depth of the pipeline.
 800 Finally, the execution time of a kernel is obtained by dividing $C_{core}$ by the core frequency.
 801
 802 \subsection{A kernel example : \textit{steepest-edge}}
 803 \label{chXXX:subsec:SE}
 804 The selection of the entering variable is done by the \textit{steepest-edge} method as described in section~\ref{chXXX:subsec:impl-simplex-std}.
 805
 806 The processing of a column is done in a single block. Each thread of a block has to compute $N_{el}=\frac{m}{N_W \cdot N_T}$ coefficients of the column. This first computation consists of $N_{el}$ multiplications and additions. The resulting partial sum of squared variables must then be reduced on a single thread of the block. This requires $N_{red} = \log_2(N_W \cdot N_T)$ additions. Since the shared memory is used optimally, there are no added communications. Finally, the coefficient $c_j$ must be divided by the result of the reduction.
 807
 808 Each block of the kernel computes $N_{col}=\frac{n}{N_B \cdot N_{SM}}$ columns, where $N_{SM}$ is the number of SM per GPU. After processing a column, a block keeps only the minimum of its previously computed steepest-edge coefficients. Thus the number of arithmetic instruction cycles for a given thread is given by
 809 \begin{equation}
 810 \label{chXXX:equ:arithmSE}
 811 C_{Arithm} =  N_{col} \cdot (N_{el}  \cdot (C_{add} + C_{mult}) + N_{red} \cdot C_{add} + C_{div} + C_{cmp})
 812 \end{equation}
 813 where $C_{ins}$ denotes the number of cycles to execute instruction $ins$.
 814
 815 Each thread has to load $N_{el}$ variables to compute its partial sum of squared variables. The thread computing the division also loads the coefficient $c_j$. This must be done for the $N_{col}$ columns that a block has to deal with. We must also take into account that the scheduler hides some latency by swapping the warps, so the total latency $C_{latency}$ must be divided by the number of warps $N_W$.
 816 Thus the number of cycles relative to memory accesses is given by:
 817
 818 \begin{equation}
 819 \label{chXXX:equ:latencySE}
 820 C_{Accesses} =  \dfrac{ N_{col} \cdot (N_{el}+1) \cdot C_{latency}} {N_W}
 821 \end{equation}
 822
 823 At the end of the execution of this kernel, each block stores in global memory its local minimum. The CPU will then have to retrieve the $N_B\cdot N_{SM}$ local minimums and reduce them. It is then profitable to minimize the number $N_B$ of blocks per SM. With a maximum of two blocks per SM, the cost of this final operation done by the CPU can be neglected when $m$ and $n$ are big.
 824
 825 It now remains to either maximize or sum $C_{Arithm}$ and $C_{Accesses}$ to obtain~$C_T$. The result of equ.~(\ref{chXXX:equ:ccore}) divided by the core frequency yields the time $T_{KSteepestEdge}$ of the steepest-edge kernel.
 826
 827 \subsection{Standard simplex GPU implementation model}
 828 \label{chXXX:subsec:model-simplex-std}
 829 As seen in section~\ref{chXXX:subsec:impl-simplex-std}, the standard simplex implementation requires only a few communications between the CPU and the GPU. Since each of these communications are constant and small, they will be neglected in the model. For the sake of simplicity, we will consider the second phase of the \textit{two-phase simplex} where we apply iteratively the three main operations: selecting the entering variable, choosing the leaving variable and pivoting. Each of these operations is computed as a kernel. The time of an iteration $T_{Kernels}$ then amounts to the sum of all three kernel times:
 830 \begin{equation}
 831 \label{chXXX:equ:1GPU}
 832 T_{Kernels} = T_{KSteepestEdge} + T_{KExpand} + T_{KPivoting}
 833 \end{equation}
 834 The times $T_{KExpand}$ and $T_{KPivoting}$ for the expand and pivoting kernels are obtained in a similar way as for the steepest-edge kernel described in the previous section.
 835
 836 With the estimated time per iteration $T_{Kernels}$, we can express the total time $T_{prob}$ required for solving a problem as
 837 \begin{equation}
 838 \label{chXXX:equ:tprob}
 839 T_{prob} = T_{init} + r \cdot T_{Kernels}
 840 \end{equation}
 841 where $r$ is the number of iterations.
 842 Note that research by Dantzig~\cite{DANTZIG80} asserts that $r$ is proportional to $m \log_2(n)$.
 843
 844 \section{Measurements and analysis}
 845 \label{chXXX:sec:measurements}
 846 Two sets of measurements are presented here. The first one is a validation of the performance model presented in section~\ref{chXXX:sec:MODEL}. The second is a comparison of the performances of the various implementations of the simplex method detailed in section~\ref{chXXX:sec:impl}.
 847
 848 As a preliminary, we ensured that our implementations were functional. For that matter, we used a set of problems available on the \textit{NETLIB}~\cite{NETBLIB} repository. This dataset usually serves as a benchmark for LP solvers. It consists of a vast variety of real and specific problems for testing the stability and robustness of an implementation. For example, some of these represent real-life models of large refineries, industrial production/allocation or fleet assignments problems. Our implementations are able to solve almost all of these problems.
 849
 850 % Test 1 : Tesla S1070 =>  Performance model + custom problems
 851 \subsection{Performance model validation}
 852 \label{chXXX:subsec:modelVal}
 853
 854 In order to check that our performance model for the standard simplex implementation is correct, a large range of problems of varying size is needed. As none of the existing datasets offered the desired diversity of problems, we used a problem generator. It is then possible to create problems of given size and density. Since usual LP problems have more variables than equations, our generated problems have a ratio of 5 variables for 1 equation ($n=5 \cdot m$).
 855 %All these problems fall into the same category. Hence, generated problems of the same size will be solved within approximately the same number of iterations and the optimum of the objective function will only slightly fluctuate.
 856
 857 The test environment is composed of a CPU server (2 Intel Xeon X5570, 2.93GHz, with 24GB DDR3 RAM)
 858 and a GPU computing system (NVIDIA tesla S1070) with the 3.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 4GB GDDR3 graphics memory, 30 streaming multiprocessors, each holding 8 cores (1.4GHz).
 859 %Such a GPU offers at peak performance up to 1TFLOPS for single precision floating point, 80GFLOPS for double precision floating point and 100GB/s memory bandwidth between threads (registers) and global memory.
 860
 861 We validated our performance model by showing that it accurately fits our measurements.
 862 The correlation between the measurements and the model is above $0.999$ (see fig.~\ref{chXXX:fig:fitting}).
 863 \begin{figure}[!h]
 864 \centering
 865 \includegraphics[width=100mm]{Chapters/chapter10/figures/Model.pdf}
 866 \caption{Performance model and measurements comparison}
 867 \label{chXXX:fig:fitting}
 868 \end{figure}
 869
 870 % Test 2 : Tesla M2090 => Performance comparison + NETLIB problems
 871 \subsection{Performance comparison of implementations}
 872 \label{chXXX:subsec:implPerf}
 873 Four different implementations are compared in this section. We will refer to each of them according to the terminology introduced below.
 874 \begin{itemize}
 875  \item Standard simplex method improved ($\mathcal{O}(2mn)$):  \texttt{Standard}
 876  \item Revised simplex method using basis kernel
 877  \begin{itemize}
 878   \item without improvements ($\mathcal{O}(3mn + 4m^2)$): \texttt{Revised-base}
 879   \item optimized ($\mathcal{O}(2mn + 3m^2)$): \texttt{Revised-opti}
 880   \item optimized with sparse matrix-vector multiplication ($\mathcal{O}(2\theta+3m^2)$): \texttt{Revised-sparse}
 881  \end{itemize}
 882 \end{itemize}
 883 These implementations all use the \textit{steepest-edge} and \textit{expand} methods for the choice of, respectively, the entering and the leaving variables.
 884
 885 We used problems from the \textit{NETLIB} repository to illustrate the improvements discussed in section~\ref{chXXX:sec:impl}. The results expected from the first three methods are quite clear. The \texttt{Standard} method should be the best, followed by the \texttt{Revised-opti} and then the \texttt{Revised-base}. However, the performance of the \texttt{Revised-sparse} implementation remains unclear since the value of $\theta$ is unknown and depends on the density and size of the constraints matrix. This is the main question we shall try to answer with our experiments.
 886
 887 The test environnement is composed of a CPU server (2 Intel Xeon X5650, 2.67GHz, with 96GB DDR3 RAM) and a GPU computing system (NVIDIA tesla M2090) with the 4.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 6GB GDDR5 graphics memory and 512 cores (1.3GHz).
 888
 889 \begin{table}[!h]
 890 \begin{tabular}{|l|r|r|r|r|}
 891 \hline
 892 \multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 893 etamacro.mps & 401 & 688 & 2489 & 1.7 \\ \hline
 894 fffff800.mps & 525 & 854 & 6235 & 1.6\\ \hline
 895 finnis.mps & 498 & 614 & 2714 & 1.2\\ \hline
 896 gfrd-pnc.mps & 617 & 1092 & 3467 & 1.8\\ \hline
 897 grow15.mps & 301 & 645 & 5665 & 2.1\\ \hline
 898 grow22.mps & 441 & 946 & 8318 & 2.1\\ \hline
 899 scagr25.mps & 472 & 500 & 2029 & 1.1\\ \hline
 900 \end{tabular}
 901 \caption{NETLIB problems solved in less than 1 second}
 902 \label{chXXX:tab:small}
 903 \end{table}
 904
 905 Let us begin by discussing the performances on problems solved in less than one second. The name, size, number of non-zero elements (NNZ) and columns to rows ratio (C-to-R) of each problems are reported in table~\ref{chXXX:tab:small}. The performances shown in figure~\ref{chXXX:fig:FastSolve} corroborate our previous observations. On these problems the \texttt{Revised-sparse} method doesn't outperform the \texttt{Standard} one. This can be explained by two factors: the added communications (kernel calls) for the revised method, and the small size and density of the problems. It is likely that sparse operations on a GPU require larger amounts of data to become more efficient than their dense counterpart.
 906 \begin{figure}[!h]
 907 \centering
 908 \includegraphics[width=100mm]{Chapters/chapter10/figures/Small2.pdf}
 909 \caption{Time required to solve problems of table~\ref{chXXX:tab:small}}
 910 \label{chXXX:fig:FastSolve}
 911 \end{figure}
 912
 913 The problems shown in table~\ref{chXXX:tab:medium} are solved in less than 4 seconds. As we can see in figure~\ref{chXXX:fig:NormalSolve}, the expected trend for the \texttt{Revised-base} and the \texttt{Revised-opti} methods is now confirmed. Let us presently focus on the \texttt{Standard} and \texttt{Revised-sparse} methods. Some of the problems, in particular \textit{czprob.mps} and \textit{nesm.mps}, are solved in a comparable amount of time. The performance gain of the \texttt{Revised-sparse} is related to the C-to-R (columns to rows) ratio of these problems, displaying respectively a 3.8 and a 4.4 ratio.
 914
 915 \begin{table}[!h]
 916 \begin{tabular}{|l|r|r|r|r|}
 917 \hline
 918 \multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 919 25fv47.mps & 822 & 1571 & 11127 & 1.9\\ \hline
 920 bnl1.mps & 644 & 1175 & 6129 & 1.8\\ \hline
 921 cycle.mps & 1904 & 2857 & 21322 & 1.5\\ \hline
 922 czprob.mps & 930 & 3523 & 14173 & 3.8\\ \hline
 923 ganges.mps & 1310 & 1681 & 7021 & 1.2\\ \hline
 924 nesm.mps & 663 & 2923 & 13988 & 4.4\\ \hline
 925 maros.mps & 847 & 1443 & 10006 & 1.7\\ \hline
 926 perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline
 927 \end{tabular}
 928 \caption{NETLIB problems solved in the range of 1 to 4 seconds}
 929 \label{chXXX:tab:medium}
 930 \end{table}
 931
 932 \begin{figure}[!h]
 933 \centering
 934 \includegraphics[width=100mm]{Chapters/chapter10/figures/Med2.pdf}
 935 \caption{Time required to solve problems of table~\ref{chXXX:tab:medium}}
 936 \label{chXXX:fig:NormalSolve}
 937 \end{figure}
 938
 939 Finally, the biggest problems, and slowest to solve, are given in table~\ref{chXXX:tab:big}. A new tendency can be observed in figure~\ref{chXXX:fig:SlowSolve}. The \texttt{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 like \textit{bnl2.mps}, \textit{pilot.mps} or \textit{greenbeb.mps}. However, when this ratio is greater, the \texttt{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on \textit{80bau3b.mps} (4.5) and \textit{fit2p.mps} (4.3). Although the C-to-R ratio of \textit{d6cube.mps} (14.9) exceeds the ones previously mentioned, the \texttt{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which does thus not fully benefit from the lower complexity of sparse operations.
 940
 941 \begin{table}[!h]
 942 \begin{tabular}{|l|r|r|r|r|}
 943 \hline
 944 \multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
 945 80bau3b.mps & 2263 & 9799 & 29063 & 4.3\\ \hline
 946 bnl2.mps & 2325 & 3489 & 16124 & 1.5\\ \hline
 947 d2q06c.mps & 2172 & 5167 & 35674 & 2.4\\ \hline
 948 d6cube.mps & 416 & 6184 & 43888 & 14.9\\ \hline
 949 fit2p.mps & 3001 & 13525 & 60784 & 4.5\\ \hline
 950 greenbeb.mps & 2393 & 5405 & 31499 & 2.3 \\ \hline
 951 maros-r7.mps & 3137 & 9408 & 151120 & 3.0\\ \hline
 952 pilot.mps & 1442 & 3652 & 43220 & 2.5 \\ \hline
 953 pilot87.mps & 2031 & 4883 & 73804 & 2.4\\ \hline
 954 \end{tabular}
 955 \caption{NETLIB problems solved in more than 5 seconds}
 956 \label{chXXX:tab:big}
 957 \end{table}
 958
 959 \begin{figure}[!h]
 960 \centering
 961 \includegraphics[width=100mm]{Chapters/chapter10/figures/Big2.pdf}
 962 \caption{Time required to solve problems of table \ref{chXXX:tab:big}}
 963 \label{chXXX:fig:SlowSolve}
 964 \end{figure}
 965
 966 \section{Conclusion and perspectives}
 967 \label{chXXX:sec:persp}
 968 In this chapter, we tried to present the knowledge and understanding necessary in our view to produce an efficient integer linear programming solver on a GPU. We proposed various solutions to implement the standard and revised simplex. We discussed the main concepts behind a Branch-and-Bound algorithm and pointed out some major concerns when it is coupled with a GPU solver. The results obtained with the standard simplex implementation allowed us to validate a detailed performance model we had proposed. Finally, we used problems from the \textit{NETLIB} library to compare the performances of our various implementations. The revised simplex implementation with sparse matrix operations, showed the best performances on time consuming problems, while the standard simplex one was more competitive on easier problems. However, sequential open-source solvers like \textit{CLP} of the \textit{COIN-OR} project still outperform such GPU implementations.
 969
 970 We shall now discuss some remaining potential improvements. A first step towards performance would be to consider the dual revised simplex~\cite{LEMKE54}. While being similar to the methods treated in this chapter, it has the capacity to greatly reduce the time needed to solve problems.
 971 Yet, the main improvement would be to tackle larger problems than the ones considered here. Indeed, problems having hundreds of thousands of variables would technically be solvable on GPU devices with 2GB to 4GB of global memory. Moreover, such amounts of data would fully benefit from the potential of these devices. However, solving problems of this size raises an issue not addressed here: numerical instabilities. This phenomenon is due to the limited precision of mathematical operations on computing devices.
 972 % Numerical innacuracies/ Example + problem
 973 Let us illustrate this problem on the inverse $\mbf{B^{-1}}$ of the basis matrix $\mbf{B}$. At each iteration of the revised simplex, $\mbf{B^{-1}}$ is updated from its previous values. Performing this update adds a small error to the result. Unfortunately, these errors accumulate at each iteration, bringing at some point the $\mbf{B^{-1}}$ matrix into a degenerate state. To avoid this situation, the matrix $\mbf{B^{-1}}$ must be recomputed afresh once in a while by inverting the matrix $\mbf{B}$.
 974
 975 % LU decomposition
 976 Instead of directly processing the inverse of the matrix $\mbf{B}$, it is more common to use some specific arithmetical treatment. Most of the simplex implementations use the LU decomposition~\cite{BRAYTON70}.
 977 Since the matrix $\mbf{B}$ is sparse, the sparse version of this decomposition can be used, and the corresponding update performed for $\mbf{B^{-1}}$ at each iteration~\cite{BARTELS69}. The sparse LU decomposition on CPU has been recently subject to a large amount of research, yielding many improvements. Once its GPU counterpart will be available, this might result in improved and competitive simplex implementations on GPU.
 978
 979 \putbib[Chapters/chapter10/biblio]
 980
 981
 982
 983
 984
 985
 986