+
+
+\chapterauthor{Xavier Meyer}{Department of Computer Science, University of Geneva}
+\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, hepia, \\ University of Applied Sciences of Western Switzerland - Geneva}
+\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
+
+%\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm}
+\chapter{Linear programming on a GPU: a~study~case}
+\section{Introduction}
+\label{chXXX:sec:intro}
+The simplex method is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947~\cite{VCLP}. A more efficient method named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power. Indeed, current implementations of the revised simplex strive to produce the expected results, if any. In this context, parallelization is the natural idea to investigate~\cite{HALL}. Already in 1996, Thomadakis et al.~\cite{THOMAD} implemented the standard method on a massively parallel computer and obtained an increase in performances when solving dense or large problems.
+
+A few years back, in order to meet the demand for processing power, graphics card vendors made their graphical processing units (GPU) available for general-purpose computing. Since then GPUs have gained a lot of popularity as they offer an opportunity to accelerate algorithms having an architecture well-adapted to the GPU model. The simplex method falls into this category. Indeed, GPUs exhibit a massively parallel architecture optimized for matrix processing. To our knowledge, there are only few simplex implementations on GPU. Bieling et al.~\cite{BIELING} presented encouraging results while solving small to mid-size LP problems with the revised simplex. However, the complexity of their algorithm seems to be rather close to the one of the standard simplex with similar heuristics. Following the steps of this first work, an implementation of the revised simplex~\cite{LALAMI11} showed interesting results on non-sparse and square matrices. Finally, an implementation of the interior point method~\cite{JUNG08} outperformed its CPU equivalent on mid-size problems.
+
+The Branch-and-Bound (B\&B) algorithm is extensively used for solving integer linear programming (ILP) problems.
+This tree-based exploration method subdivides the feasible region of the relaxed LP model into successively smaller subsets to obtain bounds on the objective value.
+Each corresponding submodel can be solved with the simplex method, and the bounds computed determine whether further branching is required.
+Hence, a natural parallelization of the B\&B method is to let the CPU manage the B\&B tree and despatch the relaxed submodels to a LP solver on a GPU. We refer the reader to chapter~\ref{ch8:GPU-accelerated-tree-based-exact-optimization-methods}, and the references therein, for a more complete introduction to parallel B\&B algorithms.
+
+In this chapter, we present a GPU implementation of the standard and revised simplex methods, based on the CUDA technology of NVIDIA. We also derive a performance model and establish its accurateness. Let us mention that there are many technicalities involved in CUDA programming, in particular regarding the management of tasks and memories on the GPU. Thus fine tuning is indispensable to avoid a breakdown on performance.
+%With respect to the simplex algorithm, special attention must be given to numerical stability.
+
+The chapter is organized as follows. First, we begin with a description of the standard and revised simplex methods
+and introduce the heuristics used in our implementations. This is followed by a presentation of the B\&B algorithm.
+The next section points out CUDA aspects which are important for our implementations.
+In the following one, we focus on the GPU implementations of the simplex method as well as the B\&B algorithm.
+In the sixth section, we describe a performance model for the standard simplex implementation.
+In the seventh section, a performance comparison between our implementations is made on real-life problems and an analysis is given.
+Finally, we summarize the results obtained and consider new perspectives.
+
+\section{Simplex algorithm}
+\label{chXXX:sec:simplex-algo}
+\subsection{Linear programming model}
+\label{chXXX:subsec:lp-model}
+%% Mathematics of LP
+A linear programming\index{linear programming} (LP) model in its canonical form can be expressed as the following optimization problem:
+\begin{equation}
+\label{chXXX:equ:canon-form}
+%\centering
+\begin{array}{lcl}
+ \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
+ \text{subject to} &\hspace{10pt} &\mbf{Ax \leq b} \hspace{30pt}\\
+ &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
+\end{array}
+\end{equation}
+where $\mbf{x}=(x_j),\mbf{c}=(c_j) \in \mathbb{R}^n,\mbf{b}=(b_i) \in \mathbb{R}^m$ and $\mbf{A}=(a_{ij})$ is the $m \times n$ constraints matrix.
+The objective function $z = \mbf{c^T x}$ is the inner product of the cost vector $\mbf{c}$ and the unknown variables $\mbf{x}$.
+An element $\mbf{x}$ is called a solution which is said to be feasible
+if it satisfies the $m$ linear constraints imposed by $\mbf{A}$ and the bound $\mbf{b}$.
+The feasible region $\{\mbf{x}\in \mathbb{R}^n\, |\, \mbf{Ax \leq b},\, \mbf{x \geq 0}\}$ is a convex polytope.
+An optimal solution to the LP problem will therefore reside on a vertex of this polytope.
+
+\subsection{Standard simplex algorithm}
+\label{chXXX:subsec:simplex-std}
+% Simple explanation of the simple algorithm : 3 operations
+The simplex method\index{simplex!standard method}~\cite{VCLP} is an algorithm for solving linear programming (LP) models.
+It proceeds by iteratively visiting vertices on the boundary of the feasible region.
+This amounts to performing algebraic manipulations on the system of linear equations.
+
+We begin by reformulating the model.
+So-called \textit{slack variables} $x_{n+i}$ are added to the canonical form in order to replace inequalities by equalities in equation~(\ref{chXXX:equ:canon-form}):
+\begin{equation}
+\label{chXXX:equ:augm-form}
+%\centering
+\begin{array}{cccccccc}
+x_{n+i} & = & b_i - & \displaystyle\sum\limits^{n}_{j=1} & a_{ij}x_j & & (i={1,2,...,m})\\
+\end{array}
+\end{equation}
+The resulting problem is called the \textit{augmented form} in which the variables are divided into two disjoint index sets,
+$\B$ and $\N$, which correspond to the basic and nonbasic variables.
+Basic variables, which form the basis of the problem, are on the left-hand side of equation~(\ref{chXXX:equ:augm-form}),
+while nonbasic variables, which form the core of the equations, appear on the right-hand~side.
+We can thus consider the following LP form:
+\begin{equation}
+%\centering
+\begin{array}{lcl}
+ \text{maximize} &\hspace{10pt}& z = \mbf{c^T x} \hspace{30pt}\\
+ \text{subject to} &\hspace{10pt} &\mbf{Ax = b} \hspace{30pt}\\
+ &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt}
+\end{array}
+\end{equation}
+where $\mbf{x} \in \mathbb{R}^{n+m}$ and $\mbf{A}$ is now the $m\times (n+m)$ matrix obtained by concatenating the constraints matrix with the $m\times m$ identity matrix $\mbf{I_m}$.
+The cost vector has been padded with zeros so that $\mbf{c} \in \mathbb{R}^{n+m}$.
+
+The basic and nonbasic variables can be separated given the expression $\mbf{A_\N x_\N + x_\B = b}$.
+Similarly, $z = \mbf{c^T x = c_\N^T x_\N + c_\B^T x_\B}$ with $\mbf{c_\B = 0}$.
+By definition, a basic solution is obtained by assigning null values to all nonbasic variables ($\mbf{x_\N=0}$).
+Hence, $\mbf{x=x_\B = b}$ is a basic solution.
+
+The simplex algorithm searches for the optimal solution through an iterative process.
+For the sake of simplicity, we will assume here that $\mbf{b>0}$ (\textit{i.e.} the origin belongs to the feasible region).
+Otherwise a preliminary treatment is required to generate a feasible initial solution (see~\S~\ref{chXXX:sec:heuristics}).
+A typical iteration then consists of three operations (summarized in algorithm~\ref{chXXX:algo:simplex-std}).
+\begin{enumerate}
+%%%%%%%%%%%%%%%%%
+\item \textbf{Choosing the entering variable.}
+The entering variable is a nonbasic variable whose increase will lead to an increase of the value of the objective function $z$.
+This variable must be selected with care so as to yield a substantial leap towards the optimal solution.
+The standard way of making this choice is to select the variable $x_e$ with the largest positive coefficient
+$c_e = \max\{c_j>0\, |\, j\in \N\}$ in the objective function $z$.
+However, other strategies such as choosing the positive coefficient with the smallest index, prove to be useful.
+%%%%%%%%%%%%%%%%%
+\item \textbf{Choosing the leaving variable.}
+This variable is the basic variable which first violates its constraint as the entering variable $x_e$ increases.
+The choice of the leaving variable must guarantee that the solution remains feasible.
+More precisely, setting all nonbasic variables except $x_e$ to zero, $x_e$ is bounded by
+$$t=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$$
+If $t=+\infty$, the LP problem is unbounded. \\
+Otherwise, $t=\dfrac{b_\ell}{a_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable; \\
+the element $a_{\ell e}$ is called the pivot.
+%%%%%%%%%%%%%%%%%
+\item \textbf{Pivoting.}
+Once both variables are defined, the pivoting operation switches these variables from one set to the other:
+the entering variable enters the basis $\mathcal{B}$, taking the place of the leaving variable which now belongs to $\mathcal{N}$,
+namely $\mathcal{B} \leftarrow (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$ and $\mathcal{N} \leftarrow (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$.
+Correspondingly, the columns with index $\ell$ and $e$ are exchanged between $\mbf{I_m}$ and $\mbf{A_\N}$, and similarly for $\mbf{c_\B=0}$ and $\mbf{c_\N}$.
+We then update the constraints matrix $\mbf{A}$, the bound $\mbf{b}$ and the cost $\mbf{c}$ using Gaussian elimination.
+More precisely, denoting $\mbf{\tilde{I}_m}$, $\mbf{\tilde{A}_\N}$, $\mbf{\tilde{c}_\B}$ and $\mbf{\tilde{c}_\N}$ the resulting elements after the exchange, Gaussian elimination then transforms the tableau
+
+\begin{tabular}{|c|c||c|}
+$\mbf{\tilde{A}_\N}$ & $\mbf{\tilde{I}_m}$ & $\mbf{b}$ \\
+\hline
+$\mbf{\tilde{c}^T_\N}$ & $\mbf{\tilde{c}^T_\B}$ & $z$
+\end{tabular}
+into a tableau with updated values for $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
+
+\begin{tabular}{|c|c||c|}
+$\mbf{A_\N}$ & $\mbf{I_m}$ & $\mbf{b}$ \\
+\hline
+$\mbf{c^T_\N}$ & $\mbf{c^T_\B}$ & $z-tc_e$
+\end{tabular}
+
+The latter is obtained by first dividing the $\ell$-th row by $a_{\ell e}$;
+the resulting row multiplied by $a_{ie}$ ($i\not=\ell$), resp. by $c_e$, is then added to the $i$-th row, resp. the last row.
+Hence, $\mbf{c_\B=0}$.
+These operations amount to jumping from the current vertex to an adjacent vertex with objective value $z=tc_e$.
+\end{enumerate}
+% End condition
+The algorithm ends when no more entering variable can be found, that is when~$\mbf{c_\N \le 0}$.
+
+%% \begin{algorithm}
+%% \caption{Standard simplex algorithm}
+%% \label{chXXX:algo:simplex-std}
+%% \begin{algorithmic}
+%% \STATE \texttt{//1. Find entering variable}
+%% \IF{$\mbf{c_\N \le 0}$}
+%% \RETURN Optimal
+%% \ENDIF
+%% \STATE Choose an index $e\in \N$ such that $c_e>0$
+%% \STATE \texttt{//2. Find leaving variable}
+%% \IF{$(\mbf{A_\N})_e \mbf{\le 0}$}
+%% \RETURN Unbounded
+%% \ENDIF
+%% \STATE Let $\ell \in \B$ be the index such that \\
+%% \hspace*{10mm} $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$
+%% \STATE \texttt{//3. Pivoting - update }
+%% \STATE $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$
+%% \STATE Compute $z_{best}:=z_{best}+t c_e$
+%% \STATE Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$
+%% \STATE Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
+%% \STATE Go to 1.
+%% \end{algorithmic}
+%% \end{algorithm}
+
+\begin{algorithm}
+\SetNlSty{}{}{:}
+ \textit{//1. Find entering variable}\;
+ \If{$\mbf{c_\N \le 0}$} {
+ \Return Optimal\;
+ }
+ Choose an index $e\in \N$ such that $c_e>0$ \;
+ \textit{//2. Find leaving variable}\;
+ \If{$(\mbf{A_\N})_e \mbf{\le 0}$} {
+ \Return Unbounded
+ }
+ Let $\ell \in \B$ be the index such that \;
+ $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$\;
+ \textit{//3. Pivoting - update }\;
+ $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$\;
+ Compute $z_{best}:=z_{best}+t c_e$\;
+ Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$\;
+Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$\;
+ Go to 1.\;
+\caption{Standard simplex algorithm}
+\label{chXXX:algo:simplex-std}
+
+\end{algorithm}
+
+
+\subsection{Revised simplex method}
+\label{chXXX:subsec:simplex-rev}
+% Global revised simplex idea
+The operation which takes the most time in the standard method, is the pivoting operation, and more specifically the update of the constraints matrix~$\mbf{A}$. The revised method\index{simplex!revised method} tries to avoid this costly operation by only updating a smaller part of this matrix.
+
+%% Revised simplex method operations, useful for the algorithm
+The revised simplex method uses the same operations as in the standard method to choose the entering and leaving variables.
+However, since the constraints matrix $\mbf{A}$ need not be fully updated, some additional reformulation is required.
+
+At any stage of the algorithm, basic and nonbasic variables can be separated according to $\mbf{Ax = A_\N x_\N + A_\B x_\B = b}$.
+We can then transform the system $\mbf{Ax=b}$ into $\mbf{A_\B x_\B = b-A_\N x_\N}$.
+Let us denote $\mbf{B=A_\B}, \mbf{N=A_\N}$.
+Since $\mbf{B}$ is invertible, we can write
+\begin{align*}
+\mbf{x}_\B&=\mbf{B^{-1}b-B^{-1}Nx_\N} \\
+%\hline
+z&=\mbf{c_\B^T B^{-1}b+(c_\N^T-c_\B^T B^{-1}N)x_\N}
+\end{align*}
+The vector $\mbf{c_\N^T-c_\B^T B^{-1}N}$ is called the reduced cost vector.
+
+The choice of the leaving variable can be rewritten.
+Setting all nonbasic variables save the entering variable $x_e$, to zero, $x_e$ is then bounded by
+$$t=\min\left\{\dfrac{(\mbf{B^{-1}b})_i}{(\mbf{B^{-1}N})_{ie}}\, \bigg|\, (\mbf{B^{-1}N})_{ie}>0,\, i=1,\dots,m \right\}$$
+If $t=+\infty$, the LP problem is unbounded. Otherwise, $t=\frac{(\mbf{B^{-1}b})_\ell}{(\mbf{B^{-1}N})_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable.
+
+Recall that the pivoting operation begins by exchanging columns with index $\ell$ and $e$ between $\mbf{B}$ and $\mbf{N}$, and similarly for $\mbf{c_\B}$ and $\mbf{c_\N}$.
+
+To emphasize the difference between standard and revised simplex, we first express the updating for the standard simplex. This amounts to
+\begin{align*}
+[\mbf{B} \quad \mbf{N} \quad \mbf{b}] & \leftarrow [\mbf{I_m} \quad \mbf{B^{-1}N} \quad \mbf{B^{-1}b}] \\
+[\mbf{c_\B^T} \quad \mbf{c_\N^T}] & \leftarrow [\mbf{0} \quad \mbf{c_\N^T-c_\B^T B^{-1}N}]
+\end{align*}
+which moves the current vertex to an adjacent vertex $\mbf{x=B^{-1}b}$ with objective value $z=\mbf{c_\B^T B^{-1}b}$ (the end condition remains $\mbf{c_\N \le 0}$). Many computations performed in this update phase can in fact be avoided.
+
+The main point of the revised simplex method is that the only values which really need to be computed at each step are: $\mbf{B^{-1}}$, $\mbf{B^{-1}b}$,$\mbf{B^{-1}N}_{e}$, $\mbf{c_\B^T B^{-1}}$ and $\mbf{c_\B^T B^{-1}b}$. However, matrix inversion is a time consuming operation (of cubic order for Gaussian elimination).
+Fortunately, there are efficient ways of computing an update for $\mbf{B^{-1}}$.
+One way is to take advantage of the sparsity of the LP problem by using the LU sparse decomposition.
+This decomposition may be updated at a small cost at each iteration~\cite{SUHL93}.
+Another way is to use the \textit{product form of the inverse}~\cite{dantzig1953product}, which we describe hereafter.
+
+Let $\mbf{b}_1,\dots,\mbf{b}_m$ be the columns of $\mbf{B}$, $\mbf{v}\in \mathbb{R}^m$
+and $\mbf{a}=\mbf{Bv}=\sum_{i=1}^m v_i\mbf{b}_i$.
+Denote $\mbf{B_a}=(\mbf{b}_1,\dots,\mbf{b}_{p-1},\mbf{a},\mbf{b}_{p+1},\dots,\mbf{b}_m)$ for a given $1\le p\le m$ such that $v_p\not= 0$.
+We want to compute $\mbf{B_a}^{-1}$.
+We first write $$\mbf{b}_p=\dfrac{1}{v_p}+\mbf{a}\sum_{i\not=p} \dfrac{-v_i}{v_p}\mbf{b}_i$$
+Define $$\bs\eta = \left(\dfrac{-v_1}{v_p},\dots,\dfrac{-v_{p-1}}{v_p},\dfrac{1}{v_p},\dfrac{-v_{p+1}}{v_p},\dots,\dfrac{-v_m}{v_p}\right)^T$$
+and $$\mbf{E}=(\bs\varepsilon_1,\dots,\bs\varepsilon_{p-1},\bs\eta,\bs\varepsilon_{p+1},\dots,\bs\varepsilon_m)$$
+where $\bs\varepsilon_j = (0,\dots,0,1,0,\dots,0)^T$ is the $j$-th element of the canonical basis of~$\mathbb{R}^m$.
+Then $\mbf{B_a E=B}$, so $\mbf{B_a}^{-1}=\mbf{E\; B}^{-1}$.
+
+We apply these preliminary considerations to the simplex algorithm
+with $\mbf{a=N}_e, \mbf{v=(B^{-1}N)}_e$ (recall that $x_e$ is the entering variable).
+If initially $\mbf{B}$ is the identity matrix $\mbf{I_m}$, at the $k$-th iteration of the algorithm,
+the inverse matrix is given by $\mbf{B}^{-1}=\mbf{E}_k \mbf{E}_{k-1}\cdots \mbf{E}_2 \mbf{E}_1$,
+where $\mbf{E}_i$ is the matrix constructed at the $i$-th iteration.
+%Note that matrices like $\mbf{E}$ are sparse and can therefore be stored compactly.
+
+\subsection{Heuristics and improvements}
+\label{chXXX:sec:heuristics}
+Specific heuristics or methods are needed to improve the performance and stability of the simplex algorithm.
+We will explain below how we find an initial feasible solution and how we choose the entering and leaving variables.
+
+\subsubsection*{Finding an initial feasible solution}
+Our implementations use the \textit{two-phase simplex}~\cite{VCLP}.
+The first phase aims at finding a feasible solution required by the second phase to solve the original problem.
+If the origin $\mbf{x=0}$ is not a feasible solution, we proceed to find such a solution by solving a so-called \textit{auxiliary problem} with the simplex algorithm.
+This can be achieved by adding a non-negative artificial variable to each constraint equation corresponding to a basic variable which violates its non-negativity condition,
+before minimizing the sum of these artificial variables. The algorithm will thus try to drive all artificial variables towards zero.
+If it succeeds, then a basic feasible solution is available as an initial solution for the second phase
+in which it attempts to find an optimal solution to the original problem. Otherwise the latter is infeasible.
+
+To avoid having to introduce these additional auxiliary variables, we use an alternate version of this procedure.
+For basic variables with index in $\mathcal{I}=\{j\in~\B \,\,|\,\, x_j<0 \}$, we temporarily relax the non-negativity condition in order to have a feasible problem,
+and then apply the simplex algorithm to minimize $w=-\sum_{j\in \mathcal{I}} x_j$.
+However, we update $\mathcal{I}$ at each iteration and modify accordingly the objective function,
+whose role is to drive these infeasible basic variables towards their original bound.
+If at some stage $\mathcal{I}=\emptyset$, we end up with an initial feasible solution.
+Otherwise, the algorithm terminates with $\mathcal{I}\not=\emptyset$ and $w>0$, which indicates that the original problem is infeasible.
+The alteration introduced above involves more computations during the first phase,
+but it offers the advantage of preserving the problem size and making good use of the GPU processing power.
+
+\subsubsection*{Choice of the entering variable}
+The number of iterations required to solve a problem depends on the method used to select the entering variable. The one described in section~\ref{chXXX:subsec:simplex-std} chooses the most promising variable $x_e$ in terms of cost. While being inexpensive to compute, this method can lead to an important number of iterations before the best solution is found.
+
+There exists various heuristics to select this variable $x_e$.
+One of the most commonly used is the \textit{steepest-edge} method~\cite{GOLDFRAB77}.
+To improve the speed at which the best solution is found, this method takes into account the coefficients of $\mbf{B}^{-1}\mbf{N}$.
+This can be explained from the geometrical point of view.
+The constraints $\mbf{Ax=b}$ form the hull of a convex polytope.
+The simplex algorithm moves from one vertex (\textit{i.e.}~a~solution) to another while trying to improve the objective function.
+The steepest-edge method searches for the edge along which the rate of improvement of the objective function is the best.
+The entering variable $x_e$ is then determined by
+$$ e = \argmax \left\{ \dfrac{c_j}{\sqrt{\gamma_j}} \, \bigg| \, c_j > 0 , j \in \N \right\} $$
+with $ \gamma_j = \Vert \mbf{\mbf{B}^{-1}\mbf{N}_j} \Vert^2 $.
+
+This method is quite costly to compute but it reduces significantly the number of iterations required to solve a problem. This heuristic can be directly applied to the standard simplex since the full-tableau is updated at each iteration. However it defeats the purpose of the revised simplex algorithm since the aim is precisely to avoid updating the whole constraints matrix at each iteration.
+Taking this into account, the steepest-edge coefficients $\bs\gamma$ are updated based only on their current value. For the sake of clarity, the hat notation is used to differentiate the next iteration value of a variable from its current value: for example if $\bs\gamma$ denotes the steepest-edge coefficients at the current iteration, then $\hat{\bs\gamma}$ are the updated coefficients.
+
+Given the column of the entering variable $\mbf{d} = \mbf{(B^{-1}N)}_e$, we may process afresh the steepest-edge coefficient of the entering variable as $\gamma_e = \Vert \mbf{d} \Vert^2$.
+The updated steepest-edge coefficients are then given by
+$$\hat{\gamma_j} = \max\left\{\gamma_j - 2\hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\} \quad \textrm{ for } j \neq e $$
+$$\hat{\gamma_e} = \gamma_e / d_e^2$$
+with $\hat{\bs\alpha} = \mbf{N^T} (\mbf{(\hat{B}^{-1})^T})_\ell $
+and $\bs\beta = \mbf{N^T} \mbf{(B^{-1})^T} \mbf{d}$
+
+\subsubsection*{Choice of the leaving variable}
+The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis.
+
+\section{Branch-and-Bound\index{Branch-and-Bound} algorithm}
+\label{chXXX:sec:bnb}
+\subsection{Integer linear programming}
+\label{chXXX:subsec:ilp}
+% Why and what are integer linear programming problem
+In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. Integer linear programming\index{integer linear programming} (ILP) problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important.
+
+% What is the cost of adding this constraint
+From a geometrical perspective, the simplex algorithm is a method for finding an optimum in a convex polytope defined by a LP problem. However, the additional integrality condition results in the loss of this convexity property. The resolution method must then be altered in order to find a solution to the ILP problem. The idea is to explore the solution space by a divide-and-conquer approach in which the simplex algorithm is used to find a local optimum in subspaces.
+
+\subsection{Branch-and-Bound tree}
+\label{chXXX:subsec:bnb-tree}
+% Intro
+The solution space is explored by the \textit{Branch-and-Bound} (B\&B) algorithm~\cite{ACHTERBERG07}.
+The strategy used is conceptually close to a tree traversal.
+
+% Branching
+%At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it.
+%This initial relaxed problem represents the root of the tree about to be built.
+%From the obtained solution, a variable $x_j$ (with bounds $l_j \leq x_j \leq u_j$) that violates its integrality condition, is chosen as a branching variable.
+%The current problem will then be divided, if possible, into two subproblems: one having $x_j$ constrained to $l_j' = l_j$ and $u_j' = \lfloor x_j \rfloor$ and the other to $l_j' = \lceil x_j \rceil$ and $u_j' = u_j$.
+%Each of these LP subproblems represents a child node waiting to be solved.
+
+% Branching
+At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it.
+This initial relaxed problem represents the root of the tree about to be built.
+From the obtained solution $\bs{\xi}$, if $\xi_j$ is not integral for some $j$, then $x_j$ can be chosen as a branching variable.
+The current problem will then be divided, if possible, into two subproblems: one having $x_j\le \lfloor \xi_j \rfloor$ and the other $x_j\ge \lceil \xi_j \rceil$.
+Each of these LP subproblems represents a child node waiting to be solved.
+
+% When does it ends
+This is repeated for each subproblems in such a way that all variables are led towards integral values.
+At some point a subproblem will either be infeasible or lead to a feasible ILP solution (leaf nodes).
+The algorithm ends when the tree has been fully visited, returning the best feasible ILP solution.
+
+% Upper, lower bounds and pruning
+During the exploration, lower and upper bounds on the objective value are computed.
+The upper bound is represented by the best objective value encountered in a child node.
+This non-admissible solution hints towards what the objective value of the ILP problem could be.
+The lower bound is the best ILP solution yet found, in other words the objective value of the most promising leaf node.
+The bounds may be used to prune subtrees when further branching cannot improve the best current solution.
+
+% How can we be improve the exploration
+In the B\&B algorithm, the main two operations that impact the convergence of the bounds towards the optimal solution are
+the branching strategy and the node selection strategy.
+
+\subsubsection*{Example}
+Figure~\ref{chXXX:fig:bnb} illustrates the use of the B\&B algorithm for solving an ILP.
+The ILP problem is the following:
+\begin{align*}
+\textrm{Maximize} \quad z= x_1 + 4x_2 & \\
+\textrm{Subject to}\quad \quad 5x_1 + 8x_2 & \leq 40\\
+ -2x_1 + 3x_2 & \leq 9\\
+ x_1, x_2 & \geq 0 \quad \textrm{ integer-valued}
+\end{align*}
+The nodes are solved with the simplex method in the order written on the figure.
+At the 3rd step of the B\&B, the first feasible solution is encountered (light grey leaf).
+The optimal solution is encountered at the 7th step (dark grey leaf).
+However, this is assessed only after solving the last two nodes (8th and 9th)
+whose objective value $z$ is inferior to the best one yet found.
+
+\begin{figure}[!h]
+\centering
+\includegraphics[width=12cm]{Chapters/chapter10/figures/tree2_conv.pdf}
+\caption{Solving an ILP problem using a Branch-and-Bound algorithm}
+\label{chXXX:fig:bnb}
+\end{figure}
+
+\subsection{Branching strategy}
+\label{chXXX:subsec:bnb-branching}
+% What it is and why it is important
+The branching strategy\index{Branch-and-Bound!branching} defines the method used to select the variable on which branching will occur.
+The objective value of the child node depends greatly on the choice of this variable.
+Branching on a variable may lead to a drop on the upper bound and thus speed up the exploration,
+while branching on other variables could leave this bound unchanged.
+
+% Example of strategy
+Several branching strategies exist~\cite{Achterberg05}. Let us briefly comment on some of them in terms of improvement of the objective function and processing cost.
+\begin{itemize}
+\item \textit{Smallest index strategy} is a greedy approach that always selects the variable with the smallest index as branching variable. This method is simple, has a cheap processing cost but does not try to select the best variable.
+\item \textit{Strong branching strategy} is an exhaustive strategy. The branching variable selected is the one amongst all the potential variables that leads to the best solution. This means that for each potential branching variable, its potential child nodes must be solved. This method is easy to implement, but its computational cost makes it inefficient.
+\item \textit{Reliability branching strategy} is a strategy which maintains a pseudo-cost~\cite{BENICHOU71} for each potential branching variable. The pseudo-cost of a variable is based on the result obtained when branching on it at previous steps. Since at the start, pseudo-costs are unreliable, a limited strong branching approach is used until pseudo-costs are deemed reliable. This method is more complex than the two others and requires fine tuning. It offers however the best trade-off between improvement and computional cost.
+\end{itemize}
+
+\subsection{Node selection strategy}
+\label{chXXX:subsec:bnb-node-select}
+The node selection strategy\index{Branch-and-Bound!node selection} defines the methodology used to explore the tree.
+While the usual depth-first search and breadth-first search are considered and used,
+some remarks about the tree exploration must be made.
+First let us mention a few facts:
+\begin{enumerate}
+ \item Solutions obtained from child nodes cannot be better than the one of their parent node.
+ \item An LP solver is able to quickly find a solution if the subproblem (child node) is only slightly different than its parent.
+\end{enumerate}
+Given the above statements, it is of interest to quickly find a feasible solution.
+Indeed, this allows to prune all pending nodes which do not improve the solution found.
+However, the quality of the latter impacts the amount of nodes pruned.
+It takes more time to produce a good solution because one must search for the best nodes in the tree.
+Consequently, a trade-off must be made between the quality and the time required to find a solution.
+
+Two types of strategies~\cite{Linderoth97} can then be considered:
+\begin{itemize}
+\item \textit{Depth-first search} aims at always selecting one of the child nodes until a leaf
+(infeasible subproblem or feasible solution) is reached.
+This strategy is characterized by fast solving and it quickly finds a feasible solution.
+It mostly improves the lower bound.
+\item \textit{Best-first search} aims at always selecting one of the most promising nodes in the tree.
+This strategy is characterized by slower solving but guarantees that the first feasible solution found is the optimal.
+It mostly improves the upper bound.
+\end{itemize}
+A problem occurs with the best-first search strategy:
+there might be numerous nodes having solutions of the same quality, thus making the choice of a node difficult.
+To avoid this problem, the \textit{best-estimate search} strategy~\cite{FORREST74} differentiates the best nodes by estimating their potential cost with the help of a pseudo-cost (see previous section).
+
+The most promising variant is a hybrid strategy: the base strategy being the best-estimate search,
+the subtrees of the best-estimated nodes are then subject to a limited depth-first search,
+more commonly called \textit{plunging}.
+This method launches a fast search for a feasible solution in the most promising subtrees,
+thus improving the upper and lower bounds at the same time.
+
+\subsection{Cutting-plane methods}
+\label{chXXX:subsec:cuts}
+\textit{Cutting-planes}\index{Branch-and-Bound!cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space.
+They may be generated from the first LP solution (\textit{Cut-and-Branch}) or periodically during the B\&B (\textit{Branch-and-Cut}).
+On the one hand cutting-planes may considerably reduce the size of the solution space, but on the other hand they increase the problem size.
+Moreover, generating cutting-planes is costly since it requires a thorough analysis of the current state of the problem.
+
+Various types or families of cutting-planes exist. Those that are most applied are the \textit{Gomory cutting-planes} and the \textit{complemented mixed integer rounding inequalities} (c-MIR). Other kinds of cutting-planes target specific families of problems, for example the \textit{0-1 Knapsack cutting-planes} or \textit{flow cover cuts}.
+
+The cutting-planes generated must be carefully selected in order to avoid a huge increase of the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function.
+Cutting-planes having the most impact on the problem are then selected, while the others are dropped.
+
+\section{CUDA considerations}
+\label{chXXX:sec:cuda}
+Most expensive operations in the simplex algorithm are linear algebra functions. The CUBLAS library is used for dense matrix-vector multiplications (\texttt{cublasDgemv}) and dense vector sums (\texttt{cublasDaxpy}). The CUSPARSE library is used for the sparse matrix-vector multiplication (\texttt{cusparseDcsrmv}).
+
+However, complex operations are required to implement the simplex and must be coded from scratch.
+For example, reduction operations (\textit{argmax, argmin}) are fundamental for selecting variables.
+In the following section, we first quickly describe the CUDA reduction operation, before making some global remarks on kernel optimization.
+
+% Reduce operation
+\subsection{Parallel reduction}
+\label{chXXX:subsec:reduction}
+A parallel reduction\index{Parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step ends the reduction on a single block or on the CPU.
+\begin{figure}[!h]
+\centering
+\includegraphics[width=10cm]{Chapters/chapter10/figures/Reduc2.pdf}
+\caption{Example of a parallel reduction at block level (courtesy NVIDIA)}
+\label{chXXX:fig:reduc}
+\end{figure}
+
+An optimized way of doing the reduction can be found in the example provided by NVIDIA.
+In order to keep code listings compact hereafter, the reduction of values amongst a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}).
+
+% Volkov - hiding latency, register usage
+\subsection{Kernel optimization}
+Optimizing a kernel is a difficult task. The most important point is to determine whether the performances are limited by the bandwidth or by the instruction throughput. Depending on the case and the specificities of the problem, various strategies may be applied. This part requires a good understanding of the underlying architecture and its limitations. The CUDA Programming Guide offers some insight into this subject, together with the interesting articles and presentations by Vassily Volkov~\cite{VOLKOV2010}. The CUDA profiler is the best way to monitor the performances of a kernel. This tool proposes multiple performance markers giving indications about potential bottlenecks.
+
+\section{Implementations}
+\label{chXXX:sec:impl}
+% Foreword - no expand
+In this section, the implementations of the simplex algorithm are studied. The emphasis is put on the main algorithms and kernels having a major impact on performance. The \textit{expand} method used for choosing the leaving variable will not be detailed for that reason.
+Then the Branch-and-Bound (B\&B) implementation is quickly explained focusing on the interaction between the B\&B algorithm and the simplex solver.
+
+\subsection{Standard simplex}
+\label{chXXX:subsec:impl-simplex-std}
+% Standard simplex --> cublas but steepest edge
+The implementation of the standard simplex algorithm (see~\S~\ref{chXXX:subsec:simplex-std}) is rather straightforward. The main difference is that the basis matrix is not stored in memory since it is equals to the identity matrix most of the time. Instead a proper data structure keeps track of the basic variables and their values.
+%% \begin{algorithm}
+%% \caption{Standard simplex algorithm}
+%% \label{chXXX:algo:impl-simplex-std}
+%% \begin{algorithmic}
+%% \STATE \textit{// Find entering variable $e$}
+%% \STATE $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$
+%% \STATE $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $
+%% \IF{$e < 0$}
+%% \RETURN \texttt{optima\_found}
+%% \ENDIF
+%% \STATE \textit{// Find leaving variable $\ell$}
+%% \STATE $\ell, t \leftarrow expand((\mbf{A_\N})_e)$
+%% \IF{$\ell < 0$}
+%% \RETURN \texttt{unbounded}
+%% \ENDIF
+%% \STATE \textit{// Pivoting}
+%% \STATE $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$
+%% \STATE $\mbf{r} \leftarrow \mbf{N^T}_\ell$
+%% \STATE $x_e \leftarrow x_e + t $
+%% \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $
+%% \STATE $\mbf{A_\N} \leftarrow \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}
+%% \STATE $\mbf{c} \leftarrow \mbf{c} - c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}
+%% \STATE $swap(x_e, x_{\ell})$
+%% \end{algorithmic}
+%% \end{algorithm}
+
+\begin{algorithm}
+\caption{Standard simplex algorithm}
+\label{chXXX:algo:impl-simplex-std}
+ \textit{// Find entering variable $e$}\;
+ $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$\;
+ $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $\;
+ \If{$e < 0$} {
+ \Return \texttt{optima\_found}
+ }
+ \textit{// Find leaving variable $\ell$ }\;
+ $\ell, t \leftarrow expand((\mbf{A_\N})_e)$\;
+ \If{$\ell < 0$} {
+ \Return \texttt{unbounded}
+ }
+ \textit{// Pivoting}\;
+ $\mbf{d} \leftarrow (\mbf{A_\N})_e \, , \, d_e \leftarrow (A_\N)_{\ell e}-1$\;
+ $\mbf{r} \leftarrow \mbf{N^T}_\ell$\;
+ $x_e \leftarrow x_e + t $\;
+ $\mbf{x_\B} \leftarrow \mbf{x_\B} - t (\mbf{A_\N})_e $\;
+ $\mbf{A_\N} \leftarrow \mbf{A_\N} - \mbf{d^T} \mbf{r} / (A_\N)_{\ell e}$ \hspace{2mm}\textit{// cublasDger}\;
+ $\mbf{c} \leftarrow \mbf{c} - c_\ell \mbf{r} / (A_\N)_{\ell e}$ \hspace{12mm}\textit{// cublasDaxpy}\;
+ $swap(x_e, x_{\ell})$\;
+
+\end{algorithm}
+
+
+
+The first step of algorithm~\ref{chXXX:algo:impl-simplex-std} is the selection of the entering variable using the steepest-edge heuristic. This step is discussed thoroughly in the next section. Then the leaving variable index $\ell$ and the potential gain on the entering variable $t$ are determined using the \textit{expand} method. Finally, the pivoting is done. The nonbasic matrix $\mbf{A_\N}$ and the objective function coefficients $\mbf{c}$ are updated using the CUBLAS library (with \texttt{cublasDger}, respectively \texttt{cublasDaxpy}). This requires the entering variable column $\mbf{d}$ and the leaving variable row $\mbf{r}$ to be copied and slightly altered. The new value of the variables is computed and, since we use a special structure instead of the basis matrix, the entering and leaving variables are swapped.
+
+% Kernel example ==> steepest edge
+\subsubsection*{Choice of the entering variable}
+\label{chXXX:sec:impl-simplex-std-inVar}
+Let us discuss two different approaches for the selection of the entering variable. The first one relies on the CUBLAS library.
+The idea is to split this step in several small operations, starting with the computation of the norms one by one with the \texttt{cublasDnrm2} function.
+The score of each variable is then obtained by dividing the cost vector $\mbf{c}$ by the norms previously computed.
+The entering variable is finally selected by taking the variable with the biggest steepest-edge coefficient using \texttt{cublasDamax}.
+
+While being easy to implement, such an approach would lead to poor performances. The main problem is a misuse of data parallelism. Each column can be processed independently and so at the same time. Moreover, slicing this step into small operations requires that each temporary results be stored in global memory. This creates a huge amount of slow data transfers between kernels and global memory.
+
+\lstinputlisting[label=chXXX:lst:k1,caption=Kernel for the choice of the entering variable]{Chapters/chapter10/optiSE.cu}
+
+To avoid this, the whole step could be implemented as a unique kernel, as shown in the simplified listing~\ref{chXXX:lst:k1}. Each block evaluates multiple potential entering variables.
+The threads of a block process part of the norm of a column.
+Then a reduction (see \S~\ref{chXXX:subsec:reduction}) is done inside the block to form the full norm.
+The thread having the final value can then compute the score of the current variable and determine if it is the best one encountered in this block.
+Once a block has evaluated its variables, the most promising one is stored in global memory.
+The final reduction step is finally done on the CPU.
+
+The dimension of the blocks and grid have to be chosen wisely as a function of the problem size.
+The block size is directly correlated to the size of a column while the grid size is a trade-off between giving enough work to the scheduler in order to hide the latency and returning as few potential entering variable as possible for the final reduction step.
+
+\subsubsection*{CPU-GPU interactions}
+The bulk of the data representing the problem is stored on the GPU.
+Only variables required for decision-making operations are updated on the CPU.
+\begin{figure}[!h]
+\centering
+\includegraphics[width=90mm]{Chapters/chapter10/figures/diagSTD2.pdf}
+\caption{Communications between CPU and GPU}
+\label{chXXX:fig:diagSTD}
+\end{figure}
+The communications arising from the aforementioned scheme are illustrated in figure~\ref{chXXX:fig:diagSTD}.
+The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases.
+
+\subsection{Revised simplex}
+\label{chXXX:subsec:impl-simplex-rev}
+% Revised simplex
+The main steps found in the previous implementation are again present in the revised simplex. The difference comes from the fact that a full update of the matrix $\mbf{N}$ is not necessary at every iteration. As explained in section~\ref{chXXX:subsec:simplex-rev}, the basis matrix $\mbf{B}$ is updated so that the required values such as the entering variable column can be processed from the now \textit{constant} matrix $\mbf{N}$. Due to these changes, the steepest-edge method is adapted. The resulting implementation requires more operations as described in algorithm~\ref{chXXX:algo:simplex-rev}.
+% Algorithm
+%% \begin{algorithm}
+%% \caption{Revised simplex algorithm}
+%% \label{chXXX:algo:simplex-rev}
+%% \begin{algorithmic}
+%% \STATE \textit{// Find entering variable $x_e,\, e\in \B$}
+%% \STATE $\bs\tau \leftarrow \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}
+%% \STATE $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}
+%% \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
+%% \STATE \textbf{if} $e < 0$ \textbf{return} \texttt{optima\_found}
+%% \STATE \textit{// Find leaving variable $x_\ell,\, \ell\in \N$}
+%% \STATE $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}
+%% \STATE $\ell, t \leftarrow expand(\mbf{d})$
+%% \STATE \textbf{if} $\ell < 0$ \textbf{return} \texttt{unbounded}
+%% \STATE \textit{// Pivoting - basis update}
+%% \STATE $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}
+%% \STATE $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}
+%% \STATE $\mbf{B^{-1}} \leftarrow \mbf{E}\, \mbf{B^{-1}}$
+%% %\STATE $basisUpdate(\mbf{B^{-1}}, \mbf{d})$ \textit{// See section ...}
+%% \STATE $x_e \leftarrow x_e + t $
+%% \STATE $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $
+%% \STATE $swap(x_\ell,x_e)$
+%% \STATE $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}
+%% \STATE $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$
+%% \STATE $\hat{\gamma_j} \leftarrow \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$
+%% \end{algorithmic}
+%% \end{algorithm}
+
+
+\begin{algorithm}
+\caption{Revised simplex algorithm}
+\label{chXXX:algo:simplex-rev}
+ \textit{// Find entering variable $x_e,\, e\in \B$}\;
+ $\bs\tau \leftarrow \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}\;
+ $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}
+ $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
+ \If { $e < 0$ } {\Return \texttt{optima\_found} }
+ \textit{// Find leaving variable $x_\ell,\, \ell\in \N$} \;
+ $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}\;
+ $\ell, t \leftarrow expand(\mbf{d})$\;
+ \If { $\ell < 0$} {\Return \texttt{unbounded}}
+ \textit{// Pivoting - basis update}\;
+ $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}\;
+ $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}\;
+ $\mbf{B^{-1}} \leftarrow \mbf{E}\, \mbf{B^{-1}}$\;
+ $x_e \leftarrow x_e + t $\;
+ $\mbf{x_\B} \leftarrow \mbf{x_\B} - t\,\mbf{d} $\;
+ $swap(x_\ell,x_e)$\;
+ $\hat{\bs\alpha} \leftarrow \mbf{N^T} \left( \mbf{(B^{-1})^T} \right)_\ell$ \hspace*{4mm} \textit{// cublasDgemv}\;
+ $\gamma_e \leftarrow \Vert \mbf{d} \Vert^2$ \;
+ $\hat{\gamma_j} \leftarrow \max\left\{\gamma_j - 2 \hat{\alpha_j} \beta_j + \gamma_e \hat{\alpha_j}^2, 1+\hat{\alpha_j}^{2}\right\},\; \forall j \neq e $,$\quad \hat{\gamma_e} \leftarrow \gamma_{e} / d_e^2$\;
+\end{algorithm}
+
+
+Let us compare the complexity, in terms of level 2 and level 3 BLAS operations, of both implementations. The standard one has mainly two costly steps: the selection of the entering variable and the update of the matrix $\mbf{N}$. These are level 2 BLAS operations or an equivalent, and thus the approximate complexity is $\mathcal{O}(2mn)$.
+
+The new implementation proposed has three operations \texttt{cublasDgemv} where the matrix $\mbf{N}$ is involved,
+and three others with the matrix $\mbf{B^{-1}}$.
+The complexities of these operations are respectively $\mathcal{O}(mn)$ and $\mathcal{O}(m^2)$.
+The update of the matrix $\mbf{B}$ is a level 3 BLAS operation costing $\mathcal{O}(m^3)$.
+The approximate complexity is then $\mathcal{O}(3mn + 3m^2 + m^3)$.
+
+It appears clearly that, in the current state, the revised simplex implementation is less efficient than the standard one. However this method can be improved and might well be better than the standard one based on two characteristics of LP problems:
+their high sparsity and the fact that usually $m \ll n$. In the next sections, we will give details about these improvements.
+
+\subsubsection*{Choice of the entering variable}
+Once again, this part of the algorithm can be substantially improved.
+Firstly, algorithmic optimizations must be considered.
+Since the row $\bs\alpha$ of the leaving variable is processed to update the steepest-edge coefficients, the cost vector $\bs\upsilon$ can be updated directly without using the basis matrix $\mbf{B}$.
+This is done as follow
+\begin{equation*}
+\upsilon_j = \bar{\upsilon}_j - \bar{\upsilon}_e \alpha_j, \; \forall j \neq e, \qquad \qquad
+\upsilon_e = -\dfrac{\bar{\upsilon}_e}{\alpha_e}
+\end{equation*}
+where $\bar{\upsilon_j}$ denotes the value of $\upsilon_j$ at the previous iteration.
+It is important to note that all these updated values ($\bs\upsilon, \bs\gamma$) must be processed afresh once in a while to reduce numerical errors.
+
+Including this change, the operations required for the selection of the entering variable $e$ are detailed in algorithm~\ref{chXXX:algo:inVar}. The values related to the entering variable at the previous iteration $\bar{e}=q$ are used. The reduced costs are updated, followed by the steepest-edge coefficients. With these values the entering variable is determined.
+
+%% \begin{algorithm}[!h]
+%% \caption{Choice of the entering variable}
+%% \label{chXXX:algo:inVar}
+%% \begin{algorithmic}
+%% \STATE $q \leftarrow \bar{e}$
+%% \STATE $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$
+%% \STATE $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$
+%% \STATE \textit{// Update of the reduced costs}
+%% \STATE $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$
+%% \STATE $\upsilon_q \leftarrow \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$
+%% \STATE \textit{// Update of the steepest edge coefficients}
+%% \STATE $\gamma_j \leftarrow \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $
+%% \STATE $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$
+%% \STATE \textit{// Selection of the entering variable}
+%% \STATE $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $
+%% \end{algorithmic}
+%% \end{algorithm}
+
+\begin{algorithm}[!h]
+\caption{Choice of the entering variable}
+\label{chXXX:algo:inVar}
+ $q \leftarrow \bar{e}$\;
+ $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$ \;
+ $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$\;
+ \textit{// Update of the reduced costs} \;
+ $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$\;
+ $\upsilon_q \leftarrow \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$ \;
+ \textit{// Update of the steepest edge coefficients}\;
+ $\gamma_j \leftarrow \max\left\{\bar{\gamma}_j - 2 \alpha_j \bar{\beta}_j + \bar{\gamma}_q \alpha_j^2, 1+\alpha_{j}^{2} \right\},\; \forall j \neq q $\;
+ $\gamma_q \leftarrow \bar{\gamma}_q / \bar{d}_q^2$ \;
+ \textit{// Selection of the entering variable}\;
+ $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
+\end{algorithm}
+
+Coupling these operations into a single kernel is quite beneficial. This lead $\bs\upsilon$ and $\bs\gamma$ to only be loaded once from global memory. Their updated values are stored while the entering variable $e$ is selected. With these changes, the global complexity of this step is reduced from $\mathcal{O}(mn + m^2 + n)$ to $\mathcal{O}(n)$. Moreover the remaining processing is done optimally by reusing data and by overlapping global memory access and computations.
+
+\subsubsection*{Basis update}
+% Update operation
+The basis update $\mbf{B^{-1}} \leftarrow \mbf{E} \, \mbf{B^{-1}}$ is a matrix-matrix multiplication. However, due to the special form of the matrix $\mbf{E}$ (see \S~\ref{chXXX:subsec:simplex-rev}), the complexity of this operation can be reduced from $\mathcal{O}(m^3)$ to $\mathcal{O}(m^2)$~\cite{BIELING}. The matrix $\mbf{E}$ is merely the identity matrix having the $\ell$-th column replaced by the vector $\bs\eta$. The update of the matrix $\mbf{B^{-1}}$ can be rewritten as
+\begin{equation*}
+\hat{B}^{-1}_{ij} = B^{-1}_{ij}\left(1 - \frac{d_i}{d_\ell}\right)\, , \quad \forall i \neq \ell, \qquad
+\qquad \hat{B}^{-1}_{\ell j} = \frac{B^{-1}_{\ell j}}{d_\ell}
+\end{equation*}
+As shown in the listing~\ref{chXXX:lst:k2}, each block of the kernel processes a single column while each thread may compute multiple elements of a column.
+This organisation ensures that global memory accesses are coalescent since the matrix $\mbf{B}$ is stored column-wise.
+
+\lstinputlisting[label=chXXX:lst:k2,caption=Basis update]{Chapters/chapter10/updateBasis.cu}
+
+% Sparse matrix - vector
+\subsubsection*{Sparse matrix operations}
+The complexity of the implementation is now $\mathcal{O}(2mn + 3m^2)$ which is close to the one of the standard simplex implementation. The operations where the matrix $\mbf{N}$ is involved remain the more expensive (considering $m \ll n$). However, this matrix is \textit{constant} in the revised simplex allowing the use of sparse matrix-vector multiplication from the CUSPARSE library. On sparse problems, this leads to important gains in performance. The sparse storage of the matrix $\mbf{N}$ reduces significantly the memory space used by the algorithm. All these improvements come at a small cost: columns are no longer directly available in their dense format and must thus be decompressed to their dense representation when needed.
+
+The complexity of the revised simplex implementation is thus finally $\mathcal{O}(2\theta + 3m^2)$ where $\theta$ is a function of $m$, $n$ and the sparsity of the problem. We shall see in section~\ref{chXXX:subsec:implPerf}, what kind of performances we may obtain on various problems.
+
+\subsection{Branch-and-Bound}
+\label{chXXX:subsec:impl-bnb}
+% Branch and bound on GPU
+% Limiting communications
+The B\&B algorithm is operated from the CPU.
+The simplex implementations, also referred to as LP solver, are used to solve the nodes of the B\&B tree.
+Algorithm~\ref{chXXX:algo:bnb} contains the main operations discussed in section~\ref{chXXX:sec:bnb}.
+It starts by solving the root node.
+The branching strategy is then used to select the next variable to branch on.
+From thereon, until no node remains to be solved, the node selection strategy is used to elect the next one to be processed.
+
+The critical factor for coupling the LP solver and the B\&B, is the amount of communications exchanged between them. Since CPU-GPU communications are expensive, the informations required to solve a new node must be minimized. Upon solving a new node, the full transfer of the problem must be avoided. This will be the focus of this section.
+
+% Algorithm
+%% \begin{algorithm}
+%% \caption{Branch-and-Bound}
+%% \label{chXXX:algo:bnb}
+%% \begin{algorithmic}
+%% \STATE Solution $\leftarrow$ Solve(InitialProblem)
+%% \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
+%% \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
+%% \WHILE{NodeList $\neq \emptyset$}
+%% \STATE Node $\leftarrow$ SelectNextNode(NodeList)
+%% \STATE Solution $\leftarrow$ Solve(Node)
+%% \IF{exists(Solution)}
+%% \STATE BranchVar $\leftarrow$ SelectNextVariable(Solution)
+%% \STATE NodeList $\leftarrow$ AddNewNodes(BranchVar)
+%% \ENDIF
+%% \ENDWHILE
+%% \end{algorithmic}
+%% \end{algorithm}
+
+\begin{algorithm}
+\caption{Branch-and-Bound}
+\label{chXXX:algo:bnb}
+ Solution $\leftarrow$ Solve(InitialProblem)\;
+ BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
+ NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
+ \While{NodeList $\neq \emptyset$} {
+ Node $\leftarrow$ SelectNextNode(NodeList)\;
+ Solution $\leftarrow$ Solve(Node)\;
+ \If{exists(Solution)}{
+ BranchVar $\leftarrow$ SelectNextVariable(Solution)\;
+ NodeList $\leftarrow$ AddNewNodes(BranchVar)\;
+ }
+ }
+\end{algorithm}
+
+% Algorithmic choice --> Plunging
+\subsubsection*{Algorithmic choices}
+The first critical choice is to select an efficient tree traversal strategy that also takes advantage of the locality of the problem. At first glance, the best-first search would be a poor choice, since from one node to the next the problem could change substantially. However, when combined with the plunging heuristic, this same strategy becomes really interesting. Indeed, the plunging phase can be completely decoupled from the B\&B. The CPU acting as a decision maker, chooses a promising node and spawns a task that will independently explore the subtree of this node. Once done, this task will report its findings to the decision maker. Moreover, when plunging, the next node to be solved differs only by the new bound on the branching variable. Communications are then minimized and the LP solver is usually able to quickly find the new solution since the problem has only been slightly altered.
+
+% Technical improvement --> warmstart / hotstart
+\subsubsection*{Warmstart}
+There are two case where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could for example be the one found after solving the root node of the tree.
+
+% Global flow
+\subsubsection*{Multi-GPU\index{multi-GPU} exploration}
+The organisation, CPU decision maker vs GPU explorer, offers the possibility to use multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The latter assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task.
+
+\section{Performance model}
+\label{chXXX:sec:MODEL}
+Performance models\index{Performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size.
+
+CUDA kernels require a different kind of modelling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency.
+
+In order to model our implementation, we follow the approach given by K. Kothapalli et al.~\cite{MODEL} (see also~\cite{ELSTER09}). Firstly, we examine the different levels of parallelism on a GPU. Then we determine the amount of work done by a single task. By combining both analysis, we establish the kernel models. The final model then consists of the sum of each kernel.
+
+This model has also been used to model a multi-GPUs version of the standard simplex method~\cite{MEYER2011}.
+
+
+\subsection{Levels of parallelism}
+\label{chXXX:subsec:LVLParallelism}
+
+A kernel can be decomposed into an initialization phase followed by a processing phase.
+During the initialization phase the kernel context is setup.
+Since this operation does not take a lot of time compared to the processing phase,
+it is needless to incorporate it into the model.
+The processing phase is more complex and its execution time depends directly on the amount of work it must perform.
+We shall now focus on this phase and on the various levels of parallelism on a GPU.
+
+At the first level, the total work is broken down into components, the blocks. They are then dispatched on the available multiprocessors on the GPU. The execution time of a kernel depends on the number of blocks $N_B$ per SM (\textit{streaming multiprocessor}) and on the number $N_{SM}$ of SM per GPU.
+
+At a lower level, the work of a block is shared among the various cores of its dedicated SM. This is done by organizing the threads of the block into groups, the warps. A warp is a unit of threads that can be executed in parallel on a SM. The execution time of a block is then linked to the number $N_W$ of warps per block, the number $N_T$ of threads per warp and the number $N_P$ of cores per SM.
+
+The third and lowest level of parallelism is a pipeline. This pipeline enables a pseudo-parallel execution of the tasks forming a warp. The gain produced by this pipeline is expressed by its depth $D$.
+
+\subsection{Amount of work done by a thread}
+\label{chXXX:subsec:workTask}
+
+In the previous section, we defined the different levels of parallelism down to the smallest, namely the thread level. We must now estimate how much work is done by a single thread of a kernel in terms of cycles. It depends on the number and the type of instructions. In CUDA, each arithmetic instruction requires a different number of cycles. For example, a single precision add costs 4 cycles while a single precision division costs 36 cycles.
+
+Moreover, since global memory access instructions are non-blocking operations, they must be counted separately from arithmetic instructions. The number of cycles involved in a global memory access, amounts to a 4 cycle instruction (read/write) followed by a non-blocking latency of 400-600 cycles. To correctly estimate the work due to such accesses, one needs to sum only the latency that is not hidden. Two consecutive read instructions executed by a thread, would cost twice the 4 cycles, but only once the latency due to its non-blocking behaviour. Once the amount of cycles involved in these memory accesses has been determined, it is then necessary to take into account the latency hidden by the scheduler (\textit{warp swapping}).
+
+The total work $C_T$ done by a thread can be defined either as the sum or as the maximum of the memory access cycles and the arithmetic instructions cycles. Summing both types of cycles means we consider that latency cannot be used to hide arithmetic instructions. The maximum variant represents the opposite situation where arithmetic instructions and memory accesses are concurrent. Then only the biggest of the two represents the total work of a thread. This could occur for example when the latency is entirely hidden by the pipeline.
+
+It is not trivial to define which scenario is appropriate for each kernel. There are many factors involved: the dependency on the instructions, the behaviour of the scheduler, the quantity of data to process, and so on. The number of cycles $C_T$ done by a thread can thus be defined by the sum, respectively the maximum, of the global memory access and the arithmetic instructions.
+%Moreover, for each kernel the task representing the worst case scenario will be the one modelised.
+
+\subsection{Global performance model}
+\label{chXXX:subsec:MODEL}
+
+We now turn to the global performance model for a kernel. We must find out how many threads are run by a core of an SM. Recall that a kernel decomposes into blocks of threads, with each SM running $N_B$ blocks, each block having itself $N_W$ warps consisting of $N_T$ threads. Also recall that an SM has $N_P$ cores which execute batches of threads in a pipeline of depth $D$. Thus the total work $C_{core}$ done by a core is given by
+\begin{equation}
+\label{chXXX:equ:ccore}
+C_{core} = N_B \cdot N_W \cdot N_T \cdot C_T \cdot \dfrac{1}{N_P \cdot D}
+\end{equation}
+which represents the total work done by a SM divided by the number of cores per SM and by the depth of the pipeline.
+Finally, the execution time of a kernel is obtained by dividing $C_{core}$ by the core frequency.
+
+\subsection{A kernel example : \textit{steepest-edge}}
+\label{chXXX:subsec:SE}
+The selection of the entering variable is done by the \textit{steepest-edge} method as described in section~\ref{chXXX:subsec:impl-simplex-std}.
+
+The processing of a column is done in a single block. Each thread of a block has to compute $N_{el}=\frac{m}{N_W \cdot N_T}$ coefficients of the column. This first computation consists of $N_{el}$ multiplications and additions. The resulting partial sum of squared variables must then be reduced on a single thread of the block. This requires $N_{red} = \log_2(N_W \cdot N_T)$ additions. Since the shared memory is used optimally, there are no added communications. Finally, the coefficient $c_j$ must be divided by the result of the reduction.
+
+Each block of the kernel computes $N_{col}=\frac{n}{N_B \cdot N_{SM}}$ columns, where $N_{SM}$ is the number of SM per GPU. After processing a column, a block keeps only the minimum of its previously computed steepest-edge coefficients. Thus the number of arithmetic instruction cycles for a given thread is given by
+\begin{equation}
+\label{chXXX:equ:arithmSE}
+C_{Arithm} = N_{col} \cdot (N_{el} \cdot (C_{add} + C_{mult}) + N_{red} \cdot C_{add} + C_{div} + C_{cmp})
+\end{equation}
+where $C_{ins}$ denotes the number of cycles to execute instruction $ins$.
+
+Each thread has to load $N_{el}$ variables to compute its partial sum of squared variables. The thread computing the division also loads the coefficient $c_j$. This must be done for the $N_{col}$ columns that a block has to deal with. We must also take into account that the scheduler hides some latency by swapping the warps, so the total latency $C_{latency}$ must be divided by the number of warps $N_W$.
+Thus the number of cycles relative to memory accesses is given by:
+
+\begin{equation}
+\label{chXXX:equ:latencySE}
+C_{Accesses} = \dfrac{ N_{col} \cdot (N_{el}+1) \cdot C_{latency}} {N_W}
+\end{equation}
+
+At the end of the execution of this kernel, each block stores in global memory its local minimum. The CPU will then have to retrieve the $N_B\cdot N_{SM}$ local minimums and reduce them. It is then profitable to minimize the number $N_B$ of blocks per SM. With a maximum of two blocks per SM, the cost of this final operation done by the CPU can be neglected when $m$ and $n$ are big.
+
+It now remains to either maximize or sum $C_{Arithm}$ and $C_{Accesses}$ to obtain~$C_T$. The result of equ.~(\ref{chXXX:equ:ccore}) divided by the core frequency yields the time $T_{KSteepestEdge}$ of the steepest-edge kernel.
+
+\subsection{Standard simplex GPU implementation model}
+\label{chXXX:subsec:model-simplex-std}
+As seen in section~\ref{chXXX:subsec:impl-simplex-std}, the standard simplex implementation requires only a few communications between the CPU and the GPU. Since each of these communications are constant and small, they will be neglected in the model. For the sake of simplicity, we will consider the second phase of the \textit{two-phase simplex} where we apply iteratively the three main operations: selecting the entering variable, choosing the leaving variable and pivoting. Each of these operations is computed as a kernel. The time of an iteration $T_{Kernels}$ then amounts to the sum of all three kernel times:
+\begin{equation}
+\label{chXXX:equ:1GPU}
+T_{Kernels} = T_{KSteepestEdge} + T_{KExpand} + T_{KPivoting}
+\end{equation}
+The times $T_{KExpand}$ and $T_{KPivoting}$ for the expand and pivoting kernels are obtained in a similar way as for the steepest-edge kernel described in the previous section.
+
+With the estimated time per iteration $T_{Kernels}$, we can express the total time $T_{prob}$ required for solving a problem as
+\begin{equation}
+\label{chXXX:equ:tprob}
+T_{prob} = T_{init} + r \cdot T_{Kernels}
+\end{equation}
+where $r$ is the number of iterations.
+Note that research by Dantzig~\cite{DANTZIG80} asserts that $r$ is proportional to $m \log_2(n)$.
+
+\section{Measurements and analysis}
+\label{chXXX:sec:measurements}
+Two sets of measurements are presented here. The first one is a validation of the performance model presented in section~\ref{chXXX:sec:MODEL}. The second is a comparison of the performances of the various implementations of the simplex method detailed in section~\ref{chXXX:sec:impl}.
+
+As a preliminary, we ensured that our implementations were functional. For that matter, we used a set of problems available on the \textit{NETLIB}~\cite{NETBLIB} repository. This dataset usually serves as a benchmark for LP solvers. It consists of a vast variety of real and specific problems for testing the stability and robustness of an implementation. For example, some of these represent real-life models of large refineries, industrial production/allocation or fleet assignments problems. Our implementations are able to solve almost all of these problems.
+
+% Test 1 : Tesla S1070 => Performance model + custom problems
+\subsection{Performance model validation}
+\label{chXXX:subsec:modelVal}
+
+In order to check that our performance model for the standard simplex implementation is correct, a large range of problems of varying size is needed. As none of the existing datasets offered the desired diversity of problems, we used a problem generator. It is then possible to create problems of given size and density. Since usual LP problems have more variables than equations, our generated problems have a ratio of 5 variables for 1 equation ($n=5 \cdot m$).
+%All these problems fall into the same category. Hence, generated problems of the same size will be solved within approximately the same number of iterations and the optimum of the objective function will only slightly fluctuate.
+
+The test environment is composed of a CPU server (2 Intel Xeon X5570, 2.93GHz, with 24GB DDR3 RAM)
+and a GPU computing system (NVIDIA tesla S1070) with the 3.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 4GB GDDR3 graphics memory, 30 streaming multiprocessors, each holding 8 cores (1.4GHz).
+%Such a GPU offers at peak performance up to 1TFLOPS for single precision floating point, 80GFLOPS for double precision floating point and 100GB/s memory bandwidth between threads (registers) and global memory.
+
+We validated our performance model by showing that it accurately fits our measurements.
+The correlation between the measurements and the model is above $0.999$ (see fig.~\ref{chXXX:fig:fitting}).
+\begin{figure}[!h]
+\centering
+\includegraphics[width=100mm]{Chapters/chapter10/figures/Model.pdf}
+\caption{Performance model and measurements comparison}
+\label{chXXX:fig:fitting}
+\end{figure}
+
+% Test 2 : Tesla M2090 => Performance comparison + NETLIB problems
+\subsection{Performance comparison of implementations}
+\label{chXXX:subsec:implPerf}
+Four different implementations are compared in this section. We will refer to each of them according to the terminology introduced below.
+\begin{itemize}
+ \item Standard simplex method improved ($\mathcal{O}(2mn)$): \texttt{Standard}
+ \item Revised simplex method using basis kernel
+ \begin{itemize}
+ \item without improvements ($\mathcal{O}(3mn + 4m^2)$): \texttt{Revised-base}
+ \item optimized ($\mathcal{O}(2mn + 3m^2)$): \texttt{Revised-opti}
+ \item optimized with sparse matrix-vector multiplication ($\mathcal{O}(2\theta+3m^2)$): \texttt{Revised-sparse}
+ \end{itemize}
+\end{itemize}
+These implementations all use the \textit{steepest-edge} and \textit{expand} methods for the choice of, respectively, the entering and the leaving variables.
+
+We used problems from the \textit{NETLIB} repository to illustrate the improvements discussed in section~\ref{chXXX:sec:impl}. The results expected from the first three methods are quite clear. The \texttt{Standard} method should be the best, followed by the \texttt{Revised-opti} and then the \texttt{Revised-base}. However, the performance of the \texttt{Revised-sparse} implementation remains unclear since the value of $\theta$ is unknown and depends on the density and size of the constraints matrix. This is the main question we shall try to answer with our experiments.
+
+The test environnement is composed of a CPU server (2 Intel Xeon X5650, 2.67GHz, with 96GB DDR3 RAM) and a GPU computing system (NVIDIA tesla M2090) with the 4.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 6GB GDDR5 graphics memory and 512 cores (1.3GHz).
+
+\begin{table}[!h]
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
+etamacro.mps & 401 & 688 & 2489 & 1.7 \\ \hline
+fffff800.mps & 525 & 854 & 6235 & 1.6\\ \hline
+finnis.mps & 498 & 614 & 2714 & 1.2\\ \hline
+gfrd-pnc.mps & 617 & 1092 & 3467 & 1.8\\ \hline
+grow15.mps & 301 & 645 & 5665 & 2.1\\ \hline
+grow22.mps & 441 & 946 & 8318 & 2.1\\ \hline
+scagr25.mps & 472 & 500 & 2029 & 1.1\\ \hline
+\end{tabular}
+\caption{NETLIB problems solved in less than 1 second}
+\label{chXXX:tab:small}
+\end{table}
+
+Let us begin by discussing the performances on problems solved in less than one second. The name, size, number of non-zero elements (NNZ) and columns to rows ratio (C-to-R) of each problems are reported in table~\ref{chXXX:tab:small}. The performances shown in figure~\ref{chXXX:fig:FastSolve} corroborate our previous observations. On these problems the \texttt{Revised-sparse} method doesn't outperform the \texttt{Standard} one. This can be explained by two factors: the added communications (kernel calls) for the revised method, and the small size and density of the problems. It is likely that sparse operations on a GPU require larger amounts of data to become more efficient than their dense counterpart.
+\begin{figure}[!h]
+\centering
+\includegraphics[width=100mm]{Chapters/chapter10/figures/Small2.pdf}
+\caption{Time required to solve problems of table~\ref{chXXX:tab:small}}
+\label{chXXX:fig:FastSolve}
+\end{figure}
+
+The problems shown in table~\ref{chXXX:tab:medium} are solved in less than 4 seconds. As we can see in figure~\ref{chXXX:fig:NormalSolve}, the expected trend for the \texttt{Revised-base} and the \texttt{Revised-opti} methods is now confirmed. Let us presently focus on the \texttt{Standard} and \texttt{Revised-sparse} methods. Some of the problems, in particular \textit{czprob.mps} and \textit{nesm.mps}, are solved in a comparable amount of time. The performance gain of the \texttt{Revised-sparse} is related to the C-to-R (columns to rows) ratio of these problems, displaying respectively a 3.8 and a 4.4 ratio.
+
+\begin{table}[!h]
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
+25fv47.mps & 822 & 1571 & 11127 & 1.9\\ \hline
+bnl1.mps & 644 & 1175 & 6129 & 1.8\\ \hline
+cycle.mps & 1904 & 2857 & 21322 & 1.5\\ \hline
+czprob.mps & 930 & 3523 & 14173 & 3.8\\ \hline
+ganges.mps & 1310 & 1681 & 7021 & 1.2\\ \hline
+nesm.mps & 663 & 2923 & 13988 & 4.4\\ \hline
+maros.mps & 847 & 1443 & 10006 & 1.7\\ \hline
+perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline
+\end{tabular}
+\caption{NETLIB problems solved in the range of 1 to 4 seconds}
+\label{chXXX:tab:medium}
+\end{table}
+
+\begin{figure}[!h]
+\centering
+\includegraphics[width=100mm]{Chapters/chapter10/figures/Med2.pdf}
+\caption{Time required to solve problems of table~\ref{chXXX:tab:medium}}
+\label{chXXX:fig:NormalSolve}
+\end{figure}
+
+Finally, the biggest problems, and slowest to solve, are given in table~\ref{chXXX:tab:big}. A new tendency can be observed in figure~\ref{chXXX:fig:SlowSolve}. The \texttt{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 like \textit{bnl2.mps}, \textit{pilot.mps} or \textit{greenbeb.mps}. However, when this ratio is greater, the \texttt{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on \textit{80bau3b.mps} (4.5) and \textit{fit2p.mps} (4.3). Although the C-to-R ratio of \textit{d6cube.mps} (14.9) exceeds the ones previously mentioned, the \texttt{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which does thus not fully benefit from the lower complexity of sparse operations.
+
+\begin{table}[!h]
+\begin{tabular}{|l|r|r|r|r|}
+\hline
+\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
+80bau3b.mps & 2263 & 9799 & 29063 & 4.3\\ \hline
+bnl2.mps & 2325 & 3489 & 16124 & 1.5\\ \hline
+d2q06c.mps & 2172 & 5167 & 35674 & 2.4\\ \hline
+d6cube.mps & 416 & 6184 & 43888 & 14.9\\ \hline
+fit2p.mps & 3001 & 13525 & 60784 & 4.5\\ \hline
+greenbeb.mps & 2393 & 5405 & 31499 & 2.3 \\ \hline
+maros-r7.mps & 3137 & 9408 & 151120 & 3.0\\ \hline
+pilot.mps & 1442 & 3652 & 43220 & 2.5 \\ \hline
+pilot87.mps & 2031 & 4883 & 73804 & 2.4\\ \hline
+\end{tabular}
+\caption{NETLIB problems solved in more than 5 seconds}
+\label{chXXX:tab:big}
+\end{table}
+
+\begin{figure}[!h]
+\centering
+\includegraphics[width=100mm]{Chapters/chapter10/figures/Big2.pdf}
+\caption{Time required to solve problems of table \ref{chXXX:tab:big}}
+\label{chXXX:fig:SlowSolve}
+\end{figure}
+
+\section{Conclusion and perspectives}
+\label{chXXX:sec:persp}
+In this chapter, we tried to present the knowledge and understanding necessary in our view to produce an efficient integer linear programming solver on a GPU. We proposed various solutions to implement the standard and revised simplex. We discussed the main concepts behind a Branch-and-Bound algorithm and pointed out some major concerns when it is coupled with a GPU solver. The results obtained with the standard simplex implementation allowed us to validate a detailed performance model we had proposed. Finally, we used problems from the \textit{NETLIB} library to compare the performances of our various implementations. The revised simplex implementation with sparse matrix operations, showed the best performances on time consuming problems, while the standard simplex one was more competitive on easier problems. However, sequential open-source solvers like \textit{CLP} of the \textit{COIN-OR} project still outperform such GPU implementations.
+
+We shall now discuss some remaining potential improvements. A first step towards performance would be to consider the dual revised simplex~\cite{LEMKE54}. While being similar to the methods treated in this chapter, it has the capacity to greatly reduce the time needed to solve problems.
+Yet, the main improvement would be to tackle larger problems than the ones considered here. Indeed, problems having hundreds of thousands of variables would technically be solvable on GPU devices with 2GB to 4GB of global memory. Moreover, such amounts of data would fully benefit from the potential of these devices. However, solving problems of this size raises an issue not addressed here: numerical instabilities. This phenomenon is due to the limited precision of mathematical operations on computing devices.
+% Numerical innacuracies/ Example + problem
+Let us illustrate this problem on the inverse $\mbf{B^{-1}}$ of the basis matrix $\mbf{B}$. At each iteration of the revised simplex, $\mbf{B^{-1}}$ is updated from its previous values. Performing this update adds a small error to the result. Unfortunately, these errors accumulate at each iteration, bringing at some point the $\mbf{B^{-1}}$ matrix into a degenerate state. To avoid this situation, the matrix $\mbf{B^{-1}}$ must be recomputed afresh once in a while by inverting the matrix $\mbf{B}$.
+
+% LU decomposition
+Instead of directly processing the inverse of the matrix $\mbf{B}$, it is more common to use some specific arithmetical treatment. Most of the simplex implementations use the LU decomposition~\cite{BRAYTON70}.
+Since the matrix $\mbf{B}$ is sparse, the sparse version of this decomposition can be used, and the corresponding update performed for $\mbf{B^{-1}}$ at each iteration~\cite{BARTELS69}. The sparse LU decomposition on CPU has been recently subject to a large amount of research, yielding many improvements. Once its GPU counterpart will be available, this might result in improved and competitive simplex implementations on GPU.
+
+\putbib[Chapters/chapter10/biblio]
+
+
+
+
+
+
+