X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/177d75ae3d1a1061fb9caa43de9afca760ca0d1a..5efdabb8714996a643bae4305878c2f277eaf679:/BookGPU/Chapters/chapter10/ch10.tex?ds=sidebyside diff --git a/BookGPU/Chapters/chapter10/ch10.tex b/BookGPU/Chapters/chapter10/ch10.tex index c47af05..2664ff0 100644 --- a/BookGPU/Chapters/chapter10/ch10.tex +++ b/BookGPU/Chapters/chapter10/ch10.tex @@ -1,23 +1,23 @@ - - -\chapterauthor{Xavier Meyer}{Department of Computer Science, University of Geneva} -\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, hepia, \\ University of Applied Sciences of Western Switzerland - Geneva} -\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva} +\chapterauthor{Xavier Meyer and Bastien Chopard}{Department of Computer Science, University of Geneva, Switzerland} +\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, hepia, \\ University of Applied Sciences of Western Switzerland -- Geneva, Switzerland} +%\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva} %\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm} -\chapter{Linear programming on a GPU: a case study} +\chapter{Linear Programming on a GPU: A~Case~Study} \section{Introduction} \label{chXXX:sec:intro} -The simplex method is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947~\cite{VCLP}. A more efficient method named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power. Indeed, current implementations of the revised simplex strive to produce the expected results, if any. In this context, parallelization is the natural idea to investigate~\cite{HALL}. Already in 1996, Thomadakis et al.~\cite{THOMAD} implemented the standard method on a massively parallel computer and obtained an increase in performances when solving dense or large problems. +The simplex method~\cite{VCLP} is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947. A more efficient method, named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power. +%Indeed, current implementations of the revised simplex strive to produce the expected results, if any. +In this context, parallelization is the natural idea to investigate~\cite{HALL}. Already in 1996, Thomadakis and Liu~\cite{THOMAD} implemented the standard method on a massive parallel computer and obtained an increase in performances when solving dense or large problems. -A few years back, in order to meet the demand for processing power, graphics card vendors made their graphical processing units (GPU) available for general-purpose computing. Since then GPUs have gained a lot of popularity as they offer an opportunity to accelerate algorithms having an architecture well-adapted to the GPU model. The simplex method falls into this category. Indeed, GPUs exhibit a massively parallel architecture optimized for matrix processing. To our knowledge, there are only few simplex implementations on GPU. Bieling et al.~\cite{BIELING} presented encouraging results while solving small to mid-size LP problems with the revised simplex. However, the complexity of their algorithm seems to be rather close to the one of the standard simplex with similar heuristics. Following the steps of this first work, an implementation of the revised simplex~\cite{LALAMI11} showed interesting results on non-sparse and square matrices. Finally, an implementation of the interior point method~\cite{JUNG08} outperformed its CPU equivalent on mid-size problems. +A few years back, in order to meet the demand for processing power, graphics card vendors made their graphical processing units (GPU) available for general-purpose computing. Since then GPUs have gained a lot of popularity as they offer an opportunity to accelerate algorithms having an architecture well-adapted to the GPU model. The simplex method falls into this category. Indeed, GPUs exhibit a massive parallel architecture optimized for matrix processing. To our knowledge, there are only a few simplex implementations on GPU. Bieling et al.~\cite{BIELING} presented encouraging results while solving small to mid-sized LP problems with the revised simplex. However, the complexity of their algorithm seems to be rather close to the one of the standard simplex with similar heuristics. Following the steps of this first work, an implementation of the revised simplex~\cite{LALAMI11} showed interesting results on dense and square matrices. Furthermore, an implementation of the interior point method~\cite{JUNG08} outperformed its CPU equivalent on mid-sized problems. The Branch-and-Bound (B\&B) algorithm is extensively used for solving integer linear programming (ILP) problems. This tree-based exploration method subdivides the feasible region of the relaxed LP model into successively smaller subsets to obtain bounds on the objective value. Each corresponding submodel can be solved with the simplex method, and the bounds computed determine whether further branching is required. -Hence, a natural parallelization of the B\&B method is to let the CPU manage the B\&B tree and despatch the relaxed submodels to a LP solver on a GPU. We refer the reader to chapter~\ref{ch8:GPU-accelerated-tree-based-exact-optimization-methods}, and the references therein, for a more complete introduction to parallel B\&B algorithms. +Hence, a natural parallelization of the B\&B method is to let the CPU manage the B\&B tree and dispatch the relaxed submodels to an LP solver on a GPU. We refer the reader to Chapter~\ref{ch8:GPU-accelerated-tree-based-exact-optimization-methods}, and the references therein, for a more complete introduction to parallel B\&B algorithms. -In this chapter, we present a GPU implementation of the standard and revised simplex methods, based on the CUDA technology of NVIDIA. We also derive a performance model and establish its accurateness. Let us mention that there are many technicalities involved in CUDA programming, in particular regarding the management of tasks and memories on the GPU. Thus fine tuning is indispensable to avoid a breakdown on performance. +In this chapter, we present a GPU implementation of the standard and revised simplex methods, based on the CUDA technology of NVIDIA. We also derive a performance model and establish its accurateness. Let us mention that there are many technicalities involved in CUDA programming, in particular regarding the management of tasks and memories on the GPU. Thus, fine tuning is indispensable to avoid a breakdown on performance. %With respect to the simplex algorithm, special attention must be given to numerical stability. The chapter is organized as follows. First, we begin with a description of the standard and revised simplex methods @@ -30,10 +30,10 @@ Finally, we summarize the results obtained and consider new perspectives. \section{Simplex algorithm} \label{chXXX:sec:simplex-algo} -\subsection{Linear programming model} +\subsection{Linear programming model\index{Linear programming}} \label{chXXX:subsec:lp-model} %% Mathematics of LP -A linear programming\index{linear programming} (LP) model in its canonical form can be expressed as the following optimization problem: +An LP model in its canonical form can be expressed as the following optimization problem: \begin{equation} \label{chXXX:equ:canon-form} %\centering @@ -43,7 +43,7 @@ A linear programming\index{linear programming} (LP) model in its canonical form &\hspace{10pt} &\mbf{x \geq 0} \hspace{30pt} \end{array} \end{equation} -where $\mbf{x}=(x_j),\mbf{c}=(c_j) \in \mathbb{R}^n,\mbf{b}=(b_i) \in \mathbb{R}^m$ and $\mbf{A}=(a_{ij})$ is the $m \times n$ constraints matrix. +where $\mbf{x}=(x_j),\mbf{c}=(c_j) \in \mathbb{R}^n,\mbf{b}=(b_i) \in \mathbb{R}^m$, and $\mbf{A}=(a_{ij})$ is the $m \times n$ constraints matrix. The objective function $z = \mbf{c^T x}$ is the inner product of the cost vector $\mbf{c}$ and the unknown variables $\mbf{x}$. An element $\mbf{x}$ is called a solution which is said to be feasible if it satisfies the $m$ linear constraints imposed by $\mbf{A}$ and the bound $\mbf{b}$. @@ -53,7 +53,7 @@ An optimal solution to the LP problem will therefore reside on a vertex of this \subsection{Standard simplex algorithm} \label{chXXX:subsec:simplex-std} % Simple explanation of the simple algorithm : 3 operations -The simplex method\index{simplex!standard method}~\cite{VCLP} is an algorithm for solving linear programming (LP) models. +The simplex method\index{Simplex!Standard method}~\cite{VCLP} is an algorithm for solving LP models. It proceeds by iteratively visiting vertices on the boundary of the feasible region. This amounts to performing algebraic manipulations on the system of linear equations. @@ -88,17 +88,17 @@ By definition, a basic solution is obtained by assigning null values to all nonb Hence, $\mbf{x=x_\B = b}$ is a basic solution. The simplex algorithm searches for the optimal solution through an iterative process. -For the sake of simplicity, we will assume here that $\mbf{b>0}$ (\textit{i.e.} the origin belongs to the feasible region). -Otherwise a preliminary treatment is required to generate a feasible initial solution (see~\S~\ref{chXXX:sec:heuristics}). -A typical iteration then consists of three operations (summarized in algorithm~\ref{chXXX:algo:simplex-std}). +For the sake of simplicity, we will assume here that $\mbf{b>0}$, that is, the origin belongs to the feasible region. +Otherwise a preliminary treatment is required to generate a feasible initial solution (see~Section~\ref{chXXX:sec:heuristics}). +A typical iteration then consists of three operations (summarized in Algorithm~\ref{chXXX:algo:simplex-std}). \begin{enumerate} %%%%%%%%%%%%%%%%% \item \textbf{Choosing the entering variable.} -The entering variable is a nonbasic variable whose increase will lead to an increase of the value of the objective function $z$. +The entering variable is a nonbasic variable whose increase will lead to an increase in the value of the objective function $z$. This variable must be selected with care so as to yield a substantial leap towards the optimal solution. The standard way of making this choice is to select the variable $x_e$ with the largest positive coefficient $c_e = \max\{c_j>0\, |\, j\in \N\}$ in the objective function $z$. -However, other strategies such as choosing the positive coefficient with the smallest index, prove to be useful. +However, other strategies, such as choosing the positive coefficient with the smallest index, also prove to be useful. %%%%%%%%%%%%%%%%% \item \textbf{Choosing the leaving variable.} This variable is the basic variable which first violates its constraint as the entering variable $x_e$ increases. @@ -112,10 +112,10 @@ the element $a_{\ell e}$ is called the pivot. \item \textbf{Pivoting.} Once both variables are defined, the pivoting operation switches these variables from one set to the other: the entering variable enters the basis $\mathcal{B}$, taking the place of the leaving variable which now belongs to $\mathcal{N}$, -namely $\mathcal{B} \leftarrow (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$ and $\mathcal{N} \leftarrow (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$. +namely, $\mathcal{B} \leftarrow (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$ and $\mathcal{N} \leftarrow (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$. Correspondingly, the columns with index $\ell$ and $e$ are exchanged between $\mbf{I_m}$ and $\mbf{A_\N}$, and similarly for $\mbf{c_\B=0}$ and $\mbf{c_\N}$. We then update the constraints matrix $\mbf{A}$, the bound $\mbf{b}$ and the cost $\mbf{c}$ using Gaussian elimination. -More precisely, denoting $\mbf{\tilde{I}_m}$, $\mbf{\tilde{A}_\N}$, $\mbf{\tilde{c}_\B}$ and $\mbf{\tilde{c}_\N}$ the resulting elements after the exchange, Gaussian elimination then transforms the tableau +More precisely, denoting $\mbf{\tilde{I}_m}$, $\mbf{\tilde{A}_\N}$, $\mbf{\tilde{c}_\B}$, and $\mbf{\tilde{c}_\N}$ as the resulting elements after the exchange, Gaussian elimination then transforms the tableau \begin{center} \begin{tabular}{|c|c||c|} @@ -125,7 +125,7 @@ $\mbf{\tilde{c}^T_\N}$ & $\mbf{\tilde{c}^T_\B}$ & $z$ \end{tabular} \end{center} \noindent -into a tableau with updated values for $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$ +into a tableau with updated values for $\mbf{A_\N}$, $\mbf{c_\N}$, and $\mbf{b}$ \begin{center} \begin{tabular}{|c|c||c|} $\mbf{A_\N}$ & $\mbf{I_m}$ & $\mbf{b}$ \\ @@ -134,12 +134,12 @@ $\mbf{c^T_\N}$ & $\mbf{c^T_\B}$ & $z-tc_e$ \end{tabular} \end{center} The latter is obtained by first dividing the $\ell$-th row by $a_{\ell e}$; -the resulting row multiplied by $a_{ie}$ ($i\not=\ell$), resp. by $c_e$, is then added to the $i$-th row, resp. the last row. +the resulting row multiplied by $a_{ie}$ ($i\not=\ell$) is then subtracted to the $i^\mathrm{th}$ row; the same operation is performed using $c_e$ and the last row. Hence, $\mbf{c_\B=0}$. These operations amount to jumping from the current vertex to an adjacent vertex with objective value $z=tc_e$. \end{enumerate} % End condition -The algorithm ends when no more entering variable can be found, that is when~$\mbf{c_\N \le 0}$. +The algorithm ends when no more entering variable can be found, that is, when~$\mbf{c_\N \le 0}$. %% \begin{algorithm} %% \caption{Standard simplex algorithm} @@ -166,7 +166,7 @@ The algorithm ends when no more entering variable can be found, that is when~$\m %% \end{algorithm} \begin{algorithm} -\SetNlSty{}{}{:} +%\SetNlSty{}{}{:} \textit{//1. Find entering variable}\; \If{$\mbf{c_\N \le 0}$} { \Return Optimal\; @@ -178,13 +178,13 @@ The algorithm ends when no more entering variable can be found, that is when~$\m } Let $\ell \in \B$ be the index such that \; $t := \dfrac{b_\ell}{a_{\ell e}}=\min\left\{\dfrac{b_i}{a_{ie}}\, \bigg|\, a_{ie}>0,\, i=1,\dots,m \right\}$\; - \textit{//3. Pivoting - update }\; + \textit{//3. Pivoting, update }\; $\mathcal{B} := (\mathcal{B} \setminus \{\ell\}) \cup \{e\}$, $\mathcal{N} := (\mathcal{N} \setminus \{e\}) \cup \{\ell\}$\; Compute $z_{best}:=z_{best}+t c_e$\; - Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $c_\ell$ and $c_e$\; + Exchange $(\mbf{I_m})_\ell$ and $(\mbf{A_\N})_e$, $(\mbf{c_\B})_\ell$ and $(\mbf{c_\N})_e$\; Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$\; Go to 1.\; -\caption{Standard simplex algorithm} +\caption{standard simplex algorithm} \label{chXXX:algo:simplex-std} \end{algorithm} @@ -193,7 +193,7 @@ Update $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$\; \subsection{Revised simplex method} \label{chXXX:subsec:simplex-rev} % Global revised simplex idea -The operation which takes the most time in the standard method, is the pivoting operation, and more specifically the update of the constraints matrix~$\mbf{A}$. The revised method\index{simplex!revised method} tries to avoid this costly operation by only updating a smaller part of this matrix. +The operation that takes the most time in the standard method is the pivoting operation, and more specifically, the update of the constraints matrix~$\mbf{A}$. The revised method\index{Simplex!Revised method} tries to avoid this costly operation by updating only a smaller part of this matrix. %% Revised simplex method operations, useful for the algorithm The revised simplex method uses the same operations as in the standard method to choose the entering and leaving variables. @@ -211,11 +211,11 @@ z&=\mbf{c_\B^T B^{-1}b+(c_\N^T-c_\B^T B^{-1}N)x_\N} The vector $\mbf{c_\N^T-c_\B^T B^{-1}N}$ is called the reduced cost vector. The choice of the leaving variable can be rewritten. -Setting all nonbasic variables save the entering variable $x_e$, to zero, $x_e$ is then bounded by +Setting all nonbasic variables except the entering variable $x_e$ to zero, $x_e$ is then bounded by $$t=\min\left\{\dfrac{(\mbf{B^{-1}b})_i}{(\mbf{B^{-1}N})_{ie}}\, \bigg|\, (\mbf{B^{-1}N})_{ie}>0,\, i=1,\dots,m \right\}$$ If $t=+\infty$, the LP problem is unbounded. Otherwise, $t=\frac{(\mbf{B^{-1}b})_\ell}{(\mbf{B^{-1}N})_{\ell e}}$ for some $\ell\in \B$, and $x_\ell$ is the leaving variable. -Recall that the pivoting operation begins by exchanging columns with index $\ell$ and $e$ between $\mbf{B}$ and $\mbf{N}$, and similarly for $\mbf{c_\B}$ and $\mbf{c_\N}$. +Recall that the pivoting operation begins by exchanging columns with index $\ell$ and $e$ between $\mbf{B}$ and $\mbf{N}$ and similarly for $\mbf{c_\B}$ and $\mbf{c_\N}$. To emphasize the difference between standard and revised simplex, we first express the updating for the standard simplex. This amounts to \begin{align*} @@ -224,27 +224,27 @@ To emphasize the difference between standard and revised simplex, we first expre \end{align*} which moves the current vertex to an adjacent vertex $\mbf{x=B^{-1}b}$ with objective value $z=\mbf{c_\B^T B^{-1}b}$ (the end condition remains $\mbf{c_\N \le 0}$). Many computations performed in this update phase can in fact be avoided. -The main point of the revised simplex method is that the only values which really need to be computed at each step are: $\mbf{B^{-1}}$, $\mbf{B^{-1}b}$,$\mbf{B^{-1}N}_{e}$, $\mbf{c_\B^T B^{-1}}$ and $\mbf{c_\B^T B^{-1}b}$. However, matrix inversion is a time consuming operation (of cubic order for Gaussian elimination). +The main point of the revised simplex method is that the only values which really need to be computed at each step are $\mbf{B^{-1}}$, $\mbf{B^{-1}b}$, $\mbf{B^{-1}N}_{e}$, $\mbf{c_\B^T B^{-1}}$, and $\mbf{c_\B^T B^{-1}b}$. However, matrix inversion is a time-consuming operation (of cubic order for Gaussian elimination). Fortunately, there are efficient ways of computing an update for $\mbf{B^{-1}}$. -One way is to take advantage of the sparsity of the LP problem by using the LU sparse decomposition. +One way is to take advantage of the sparsity of the LP problem by using the so-called LU decomposition\footnote{The LU decomposition is a linear algebra decomposition which allows to write a matrix as a product of a lower and an upper triangular matrix.} for sparse matrices~\cite{BARTELS69}. This decomposition may be updated at a small cost at each iteration~\cite{SUHL93}. Another way is to use the \textit{product form of the inverse}~\cite{dantzig1953product}, which we describe hereafter. -Let $\mbf{b}_1,\dots,\mbf{b}_m$ be the columns of $\mbf{B}$, $\mbf{v}\in \mathbb{R}^m$ +Let $\mbf{b}_1,\dots,\mbf{b}_m$ be the columns of $\mbf{B}$, $\mbf{v}\in \mathbb{R}^m$, and $\mbf{a}=\mbf{Bv}=\sum_{i=1}^m v_i\mbf{b}_i$. Denote $\mbf{B_a}=(\mbf{b}_1,\dots,\mbf{b}_{p-1},\mbf{a},\mbf{b}_{p+1},\dots,\mbf{b}_m)$ for a given $1\le p\le m$ such that $v_p\not= 0$. We want to compute $\mbf{B_a}^{-1}$. We first write $$\mbf{b}_p=\dfrac{1}{v_p}+\mbf{a}\sum_{i\not=p} \dfrac{-v_i}{v_p}\mbf{b}_i$$ Define $$\bs\eta = \left(\dfrac{-v_1}{v_p},\dots,\dfrac{-v_{p-1}}{v_p},\dfrac{1}{v_p},\dfrac{-v_{p+1}}{v_p},\dots,\dfrac{-v_m}{v_p}\right)^T$$ and $$\mbf{E}=(\bs\varepsilon_1,\dots,\bs\varepsilon_{p-1},\bs\eta,\bs\varepsilon_{p+1},\dots,\bs\varepsilon_m)$$ -where $\bs\varepsilon_j = (0,\dots,0,1,0,\dots,0)^T$ is the $j$-th element of the canonical basis of~$\mathbb{R}^m$. +where $\bs\varepsilon_j = (0,\dots,0,1,0,\dots,0)^T$ is the $j^\mathrm{th}$ element of the canonical basis of~$\mathbb{R}^m$. Then $\mbf{B_a E=B}$, so $\mbf{B_a}^{-1}=\mbf{E\; B}^{-1}$. We apply these preliminary considerations to the simplex algorithm with $\mbf{a=N}_e, \mbf{v=(B^{-1}N)}_e$ (recall that $x_e$ is the entering variable). -If initially $\mbf{B}$ is the identity matrix $\mbf{I_m}$, at the $k$-th iteration of the algorithm, +If initially $\mbf{B}$ is the identity matrix $\mbf{I_m}$, at the $k^\mathrm{th}$ iteration of the algorithm, the inverse matrix is given by $\mbf{B}^{-1}=\mbf{E}_k \mbf{E}_{k-1}\cdots \mbf{E}_2 \mbf{E}_1$, -where $\mbf{E}_i$ is the matrix constructed at the $i$-th iteration. +where $\mbf{E}_i$ is the matrix constructed at the $i^\mathrm{th}$ iteration. %Note that matrices like $\mbf{E}$ are sparse and can therefore be stored compactly. \subsection{Heuristics and improvements} @@ -256,13 +256,13 @@ We will explain below how we find an initial feasible solution and how we choose Our implementations use the \textit{two-phase simplex}~\cite{VCLP}. The first phase aims at finding a feasible solution required by the second phase to solve the original problem. If the origin $\mbf{x=0}$ is not a feasible solution, we proceed to find such a solution by solving a so-called \textit{auxiliary problem} with the simplex algorithm. -This can be achieved by adding a non-negative artificial variable to each constraint equation corresponding to a basic variable which violates its non-negativity condition, +This can be achieved by adding a nonnegative artificial variable to each constraint equation corresponding to a basic variable which violates its nonnegativity condition, before minimizing the sum of these artificial variables. The algorithm will thus try to drive all artificial variables towards zero. If it succeeds, then a basic feasible solution is available as an initial solution for the second phase -in which it attempts to find an optimal solution to the original problem. Otherwise the latter is infeasible. +in which it attempts to find an optimal solution to the original problem. Otherwise, the problem is infeasible. To avoid having to introduce these additional auxiliary variables, we use an alternate version of this procedure. -For basic variables with index in $\mathcal{I}=\{j\in~\B \,\,|\,\, x_j<0 \}$, we temporarily relax the non-negativity condition in order to have a feasible problem, +For basic variables with index in $\mathcal{I}=\{j\in~\B \,\,|\,\, x_j<0 \}$, we temporarily relax the nonnegativity condition in order to have a feasible problem and then apply the simplex algorithm to minimize $w=-\sum_{j\in \mathcal{I}} x_j$. However, we update $\mathcal{I}$ at each iteration and modify accordingly the objective function, whose role is to drive these infeasible basic variables towards their original bound. @@ -272,20 +272,20 @@ The alteration introduced above involves more computations during the first phas but it offers the advantage of preserving the problem size and making good use of the GPU processing power. \subsubsection*{Choice of the entering variable} -The number of iterations required to solve a problem depends on the method used to select the entering variable. The one described in section~\ref{chXXX:subsec:simplex-std} chooses the most promising variable $x_e$ in terms of cost. While being inexpensive to compute, this method can lead to an important number of iterations before the best solution is found. +The number of iterations required to solve a problem depends on the method used to select the entering variable. The one described in Section~\ref{chXXX:subsec:simplex-std} chooses the most promising variable $x_e$ in terms of cost. While being inexpensive to compute, this method can lead to an important number of iterations before the best solution is found. -There exists various heuristics to select this variable $x_e$. +There exist various heuristics to select this variable $x_e$. One of the most commonly used is the \textit{steepest-edge} method~\cite{GOLDFRAB77}. To improve the speed at which the best solution is found, this method takes into account the coefficients of $\mbf{B}^{-1}\mbf{N}$. This can be explained from the geometrical point of view. The constraints $\mbf{Ax=b}$ form the hull of a convex polytope. -The simplex algorithm moves from one vertex (\textit{i.e.}~a~solution) to another while trying to improve the objective function. +The simplex algorithm moves from one vertex (i.e.,~a~solution) to another while trying to improve the objective function. The steepest-edge method searches for the edge along which the rate of improvement of the objective function is the best. The entering variable $x_e$ is then determined by -$$ e = \argmax \left\{ \dfrac{c_j}{\sqrt{\gamma_j}} \, \bigg| \, c_j > 0 , j \in \N \right\} $$ -with $ \gamma_j = \Vert \mbf{\mbf{B}^{-1}\mbf{N}_j} \Vert^2 $. +$$e = \argmax \left\{ \dfrac{c_j}{\sqrt{\gamma_j}} \, \bigg| \, c_j > 0 , j \in \N \right\}$$ +with $\gamma_j = \Vert \mbf{B}^{-1}\mbf{N}_j \Vert^2$. -This method is quite costly to compute but it reduces significantly the number of iterations required to solve a problem. This heuristic can be directly applied to the standard simplex since the full-tableau is updated at each iteration. However it defeats the purpose of the revised simplex algorithm since the aim is precisely to avoid updating the whole constraints matrix at each iteration. +This method is quite costly to compute but it reduces significantly the number of iterations required to solve a problem. This heuristic can be directly applied to the standard simplex since the full tableau is updated at each iteration. However, it defeats the purpose of the revised simplex algorithm since the aim is precisely to avoid updating the whole constraints matrix at each iteration. Taking this into account, the steepest-edge coefficients $\bs\gamma$ are updated based only on their current value. For the sake of clarity, the hat notation is used to differentiate the next iteration value of a variable from its current value: for example if $\bs\gamma$ denotes the steepest-edge coefficients at the current iteration, then $\hat{\bs\gamma}$ are the updated coefficients. Given the column of the entering variable $\mbf{d} = \mbf{(B^{-1}N)}_e$, we may process afresh the steepest-edge coefficient of the entering variable as $\gamma_e = \Vert \mbf{d} \Vert^2$. @@ -298,45 +298,45 @@ and $\bs\beta = \mbf{N^T} \mbf{(B^{-1})^T} \mbf{d}$ \subsubsection*{Choice of the leaving variable} The stability and robustness of the algorithm depend considerably on the choice of the leaving variable. With respect to this, the \textit{expand} method~\cite{GILL1989} proves to be very useful in the sense that it helps to avoid cycles and reduces the risks of encountering numerical instabilities. This method consists of two steps of complexity $\mathcal{O}(m)$. In the first step, a small perturbation is applied to the bounds of the variables to prevent stalling of the objective value, thus avoiding cycles. These perturbed bounds are then used to determine the greatest gain on the entering variable imposed by the most constraining basic variable. The second phase uses the original bounds to define the basic variable offering the gain closest to the one of the first phase. This variable will then be selected for leaving the basis. -\section{Branch-and-Bound\index{Branch-and-Bound} algorithm} +\section{Branch-and-bound\index{Branch-and-bound} algorithm} \label{chXXX:sec:bnb} -\subsection{Integer linear programming} +\subsection{Integer linear programming\index{Integer linear programming}} \label{chXXX:subsec:ilp} % Why and what are integer linear programming problem -In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. Integer linear programming\index{integer linear programming} (ILP) problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important. +In some problems, variables are integer-valued. For example, in a vehicle routing problem, one cannot assign one third of a vehicule to a specific route. ILP problems restrict LP problems by imposing an integrality condition on the variables. While this change may seem to have little impact on the model, the aftermaths on the resolution method are quite important. % What is the cost of adding this constraint -From a geometrical perspective, the simplex algorithm is a method for finding an optimum in a convex polytope defined by a LP problem. However, the additional integrality condition results in the loss of this convexity property. The resolution method must then be altered in order to find a solution to the ILP problem. The idea is to explore the solution space by a divide-and-conquer approach in which the simplex algorithm is used to find a local optimum in subspaces. +From a geometrical perspective, the simplex algorithm is a method for finding an optimum in a convex polytope defined by an LP problem. However, the additional integrality condition results in the loss of this convexity property. The resolution method must then be altered in order to find a solution to the ILP problem. The idea is to explore the solution space by a divide-and-conquer approach in which the simplex algorithm is used to find a local optimum in subspaces. -\subsection{Branch-and-Bound tree} +\subsection{Branch-and-bound tree} \label{chXXX:subsec:bnb-tree} % Intro -The solution space is explored by the \textit{Branch-and-Bound} (B\&B) algorithm~\cite{ACHTERBERG07}. +The solution space is explored by the B\&B algorithm~\cite{ACHTERBERG07}. The strategy used is conceptually close to a tree traversal. % Branching -%At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it. +%At first, the ILP problem is considered as an LP problem by relaxing the integrality condition before applying an LP solver to it. %This initial relaxed problem represents the root of the tree about to be built. %From the obtained solution, a variable $x_j$ (with bounds $l_j \leq x_j \leq u_j$) that violates its integrality condition, is chosen as a branching variable. %The current problem will then be divided, if possible, into two subproblems: one having $x_j$ constrained to $l_j' = l_j$ and $u_j' = \lfloor x_j \rfloor$ and the other to $l_j' = \lceil x_j \rceil$ and $u_j' = u_j$. %Each of these LP subproblems represents a child node waiting to be solved. % Branching -At first, the ILP problem is considered as a LP problem by relaxing the integrality condition before applying a LP solver to it. +At first, the ILP problem is considered as an LP problem by relaxing the integrality condition before applying an LP solver to it. This initial relaxed problem represents the root of the tree about to be built. From the obtained solution $\bs{\xi}$, if $\xi_j$ is not integral for some $j$, then $x_j$ can be chosen as a branching variable. The current problem will then be divided, if possible, into two subproblems: one having $x_j\le \lfloor \xi_j \rfloor$ and the other $x_j\ge \lceil \xi_j \rceil$. Each of these LP subproblems represents a child node waiting to be solved. % When does it ends -This is repeated for each subproblems in such a way that all variables are led towards integral values. +This is repeated for each subproblem in such a way that all variables are led towards integral values. At some point a subproblem will either be infeasible or lead to a feasible ILP solution (leaf nodes). The algorithm ends when the tree has been fully visited, returning the best feasible ILP solution. % Upper, lower bounds and pruning During the exploration, lower and upper bounds on the objective value are computed. The upper bound is represented by the best objective value encountered in a child node. -This non-admissible solution hints towards what the objective value of the ILP problem could be. +This nonadmissible solution hints towards what the objective value of the ILP problem could be. The lower bound is the best ILP solution yet found, in other words the objective value of the most promising leaf node. The bounds may be used to prune subtrees when further branching cannot improve the best current solution. @@ -362,39 +362,39 @@ whose objective value $z$ is inferior to the best one yet found. \begin{figure}[!h] \centering \includegraphics[width=12cm]{Chapters/chapter10/figures/tree2_conv.pdf} -\caption{Solving an ILP problem using a Branch-and-Bound algorithm} +\caption{Solving an ILP problem using a branch-and-bound algorithm.} \label{chXXX:fig:bnb} \end{figure} \subsection{Branching strategy} \label{chXXX:subsec:bnb-branching} % What it is and why it is important -The branching strategy\index{Branch-and-Bound!branching} defines the method used to select the variable on which branching will occur. +The branching strategy\index{Branch-and-bound!Branching} defines the method used to select the variable on which branching will occur. The objective value of the child node depends greatly on the choice of this variable. Branching on a variable may lead to a drop on the upper bound and thus speed up the exploration, while branching on other variables could leave this bound unchanged. % Example of strategy -Several branching strategies exist~\cite{Achterberg05}. Let us briefly comment on some of them in terms of improvement of the objective function and processing cost. +Several branching strategies exist~\cite{Achterberg05}. Let us briefly comment on some of them in terms of improving the objective function and processing cost. \begin{itemize} -\item \textit{Smallest index strategy} is a greedy approach that always selects the variable with the smallest index as branching variable. This method is simple, has a cheap processing cost but does not try to select the best variable. -\item \textit{Strong branching strategy} is an exhaustive strategy. The branching variable selected is the one amongst all the potential variables that leads to the best solution. This means that for each potential branching variable, its potential child nodes must be solved. This method is easy to implement, but its computational cost makes it inefficient. -\item \textit{Reliability branching strategy} is a strategy which maintains a pseudo-cost~\cite{BENICHOU71} for each potential branching variable. The pseudo-cost of a variable is based on the result obtained when branching on it at previous steps. Since at the start, pseudo-costs are unreliable, a limited strong branching approach is used until pseudo-costs are deemed reliable. This method is more complex than the two others and requires fine tuning. It offers however the best trade-off between improvement and computional cost. +\item \textit{Smallest-index strategy} is a greedy approach that always selects the variable with the smallest index as branching variable. This method is simple, has a cheap processing cost, but does not try to select the best variable. +\item \textit{Strong branching strategy} is an exhaustive strategy. The branching variable selected is the one among all the potential variables that leads to the best solution. This means that for each potential branching variable, its potential child nodes must be solved. This method is easy to implement, but its computational cost makes it inefficient. +\item \textit{Reliability branching strategy} is a strategy which maintains a pseudocost~\cite{BENICHOU71} for each potential branching variable. The pseudocost of a variable is based on the result obtained when branching on it at previous steps. Since at the start, pseudocosts are unreliable, a limited strong branching approach is used until pseudocosts are deemed reliable. This method is more complex than the two others and requires fine tuning. It offers, however, the best trade-off between improvement and computional cost. \end{itemize} \subsection{Node selection strategy} \label{chXXX:subsec:bnb-node-select} -The node selection strategy\index{Branch-and-Bound!node selection} defines the methodology used to explore the tree. +The node selection strategy\index{Branch-and-bound!Node selection} defines the methodology used to explore the tree. While the usual depth-first search and breadth-first search are considered and used, some remarks about the tree exploration must be made. First let us mention a few facts: \begin{enumerate} \item Solutions obtained from child nodes cannot be better than the one of their parent node. - \item An LP solver is able to quickly find a solution if the subproblem (child node) is only slightly different than its parent. + \item An LP solver is able to quickly find a solution if the subproblem (child node) is only slightly different from its parent. \end{enumerate} Given the above statements, it is of interest to quickly find a feasible solution. -Indeed, this allows to prune all pending nodes which do not improve the solution found. -However, the quality of the latter impacts the amount of nodes pruned. +Indeed, this allows the pruning of all pending nodes which do not improve the solution found. +However, the quality of the latter solution impacts the amount of nodes pruned. It takes more time to produce a good solution because one must search for the best nodes in the tree. Consequently, a trade-off must be made between the quality and the time required to find a solution. @@ -402,7 +402,7 @@ Two types of strategies~\cite{Linderoth97} can then be considered: \begin{itemize} \item \textit{Depth-first search} aims at always selecting one of the child nodes until a leaf (infeasible subproblem or feasible solution) is reached. -This strategy is characterized by fast solving and it quickly finds a feasible solution. +This strategy is characterized by fast solving, and it quickly finds a feasible solution. It mostly improves the lower bound. \item \textit{Best-first search} aims at always selecting one of the most promising nodes in the tree. This strategy is characterized by slower solving but guarantees that the first feasible solution found is the optimal. @@ -410,62 +410,63 @@ It mostly improves the upper bound. \end{itemize} A problem occurs with the best-first search strategy: there might be numerous nodes having solutions of the same quality, thus making the choice of a node difficult. -To avoid this problem, the \textit{best-estimate search} strategy~\cite{FORREST74} differentiates the best nodes by estimating their potential cost with the help of a pseudo-cost (see previous section). +To avoid this problem, the \textit{best-estimate search} strategy~\cite{FORREST74} differentiates the best nodes by estimating their potential cost with the help of a pseudocost (see previous section). -The most promising variant is a hybrid strategy: the base strategy being the best-estimate search, -the subtrees of the best-estimated nodes are then subject to a limited depth-first search, +The most promising variant is a hybrid strategy in which the base strategy is the best-estimate search with the subtrees of the best-estimated nodes being then subject to a limited depth-first search, more commonly called \textit{plunging}. This method launches a fast search for a feasible solution in the most promising subtrees, thus improving the upper and lower bounds at the same time. \subsection{Cutting-plane methods} \label{chXXX:subsec:cuts} -\textit{Cutting-planes}\index{Branch-and-Bound!cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space. -They may be generated from the first LP solution (\textit{Cut-and-Branch}) or periodically during the B\&B (\textit{Branch-and-Cut}). +\textit{Cutting-planes}\index{Branch-and-bound!Cutting-plane}~\cite{WOLTER06} (also simply \textit{cuts}) are new constraints whose role is to cut off parts of the search space. +They may be generated from the first LP solution (\textit{cut-and-branch}) or periodically during the B\&B (\textit{branch-and-cut}). On the one hand cutting-planes may considerably reduce the size of the solution space, but on the other hand they increase the problem size. Moreover, generating cutting-planes is costly since it requires a thorough analysis of the current state of the problem. -Various types or families of cutting-planes exist. Those that are most applied are the \textit{Gomory cutting-planes} and the \textit{complemented mixed integer rounding inequalities} (c-MIR). Other kinds of cutting-planes target specific families of problems, for example the \textit{0-1 Knapsack cutting-planes} or \textit{flow cover cuts}. +Various types or families of cutting-planes exist. Those that are most applied are the \textit{Gomory cutting-planes} and the \textit{complemented mixed integer rounding inequalities} (c-MIR). Other kinds of cutting-planes target specific families of problems, for example, the \textit{0-1 knapsack cutting-planes} or \textit{flow cover cuts}. -The cutting-planes generated must be carefully selected in order to avoid a huge increase of the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function. +The cutting-planes generated must be carefully selected in order to avoid a huge increase in the problem size. They are selected according to three criteria: their efficiency, their orthogonality with respect to other cutting-planes, and their parallelism with respect to the objective function. Cutting-planes having the most impact on the problem are then selected, while the others are dropped. \section{CUDA considerations} \label{chXXX:sec:cuda} -Most expensive operations in the simplex algorithm are linear algebra functions. The CUBLAS library is used for dense matrix-vector multiplications (\texttt{cublasDgemv}) and dense vector sums (\texttt{cublasDaxpy}). The CUSPARSE library is used for the sparse matrix-vector multiplication (\texttt{cusparseDcsrmv}). +The most expensive operations in the simplex algorithm are linear algebra functions. +The NVIDIA CUDA Basic Linear Algebra Subroutines\footnote{The libraries CUBLAS and CUSPARSE are available at https://developer.nvidia.com/cublas and https://developer.nvidia.com/cusparse.} (CUBLAS) library is a GPU-accelerated version of the complete standard BLAS library. +It can be used for dense matrix-vector multiplications (\texttt{cublasDgemv}) and dense vector sums (\texttt{cublasDaxpy}). The CUSPARSE library is used for the sparse matrix-vector multiplication (\texttt{cusparseDcsrmv}). However, complex operations are required to implement the simplex and must be coded from scratch. -For example, reduction operations (\textit{argmax, argmin}) are fundamental for selecting variables. +For example, reduction operations, such as \textit{argmax} or \textit{argmin}, are fundamental for selecting variables. In the following section, we first quickly describe the CUDA reduction operation, before making some global remarks on kernel optimization. % Reduce operation \subsection{Parallel reduction} \label{chXXX:subsec:reduction} -A parallel reduction\index{Parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step ends the reduction on a single block or on the CPU. +A parallel reduction\index{Parallel reduction} operation is performed in an efficient manner inside a GPU block as shown in Figure~\ref{chXXX:fig:reduc}. Shared memory is used for a fast and reliable way to communicate between threads. However, at the grid level, reduction cannot be easily implemented due to the lack of direct communication between blocks. The usual way of dealing with this type of limitation is to apply the reduction in two separate steps. The first one involves a GPU kernel reducing the data over multiple blocks, the local result of each block being stored on completion. The second step finishes the reduction on a single block or on the CPU. +An optimized way of doing the reduction can be found in the examples\footnote{Available at http://docs.nvidia.com/cuda/cuda-samples/index.html\#advanced} provided by NVIDIA. +%In order to keep code listings compact hereafter, the reduction of values among a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}). + \begin{figure}[!h] \centering -\includegraphics[width=10cm]{Chapters/chapter10/figures/Reduc2.pdf} -\caption{Example of a parallel reduction at block level (courtesy NVIDIA)} +\includegraphics[width=10cm]{Chapters/chapter10/figures/Reduc3.pdf} +\caption{Example of a parallel reduction at block level (courtesy NVIDIA).} \label{chXXX:fig:reduc} \end{figure} -An optimized way of doing the reduction can be found in the example provided by NVIDIA. -In order to keep code listings compact hereafter, the reduction of values amongst a block will be referred to as \textit{reduceOperation(value)} (per extension \textit{reduceArgMax(maxVal)}). - % Volkov - hiding latency, register usage \subsection{Kernel optimization} -Optimizing a kernel is a difficult task. The most important point is to determine whether the performances are limited by the bandwidth or by the instruction throughput. Depending on the case and the specificities of the problem, various strategies may be applied. This part requires a good understanding of the underlying architecture and its limitations. The CUDA Programming Guide offers some insight into this subject, together with the interesting articles and presentations by Vassily Volkov~\cite{VOLKOV2010}. The CUDA profiler is the best way to monitor the performances of a kernel. This tool proposes multiple performance markers giving indications about potential bottlenecks. +Optimizing a kernel is a difficult task. The most important point is to determine whether the performances are limited by the bandwidth or by the instruction throughput. Depending on the case and the specificities of the problem, various strategies may be applied. This part requires a good understanding of the underlying architecture and its limitations. The \textit{CUDA Programming Guide} offers some insight into this subject, as do the interesting articles and presentations by Vassily Volkov~\cite{VOLKOV2010}. The CUDA profiler is the best way to monitor the performances of a kernel. This tool proposes multiple performance markers giving indications about potential bottlenecks. \section{Implementations} \label{chXXX:sec:impl} % Foreword - no expand In this section, the implementations of the simplex algorithm are studied. The emphasis is put on the main algorithms and kernels having a major impact on performance. The \textit{expand} method used for choosing the leaving variable will not be detailed for that reason. -Then the Branch-and-Bound (B\&B) implementation is quickly explained focusing on the interaction between the B\&B algorithm and the simplex solver. +Then the B\&B implementation is quickly explained focusing on the interaction between the B\&B algorithm and the simplex solver. \subsection{Standard simplex} \label{chXXX:subsec:impl-simplex-std} % Standard simplex --> cublas but steepest edge -The implementation of the standard simplex algorithm (see~\S~\ref{chXXX:subsec:simplex-std}) is rather straightforward. The main difference is that the basis matrix is not stored in memory since it is equals to the identity matrix most of the time. Instead a proper data structure keeps track of the basic variables and their values. +The implementation of the standard simplex algorithm (see Section~\ref{chXXX:subsec:simplex-std}) is rather straightforward. The main difference with Algorithm~\ref{chXXX:algo:simplex-std} is that the basis matrix is not stored in memory since it is equal to the identity matrix most of the time. Instead a proper data structure keeps track of the basic variables and their values. %% \begin{algorithm} %% \caption{Standard simplex algorithm} %% \label{chXXX:algo:impl-simplex-std} @@ -493,15 +494,15 @@ The implementation of the standard simplex algorithm (see~\S~\ref{chXXX:subsec:s %% \end{algorithm} \begin{algorithm} -\caption{Standard simplex algorithm} +\caption{standard simplex algorithm} \label{chXXX:algo:impl-simplex-std} - \textit{// Find entering variable $e$}\; + \textit{// Find entering variable $x_e$}\; $\gamma_j \leftarrow \Vert (\mbf{A_\N})_j \Vert^2$\; $e \leftarrow argmax \left( \mbf{c} / \sqrt{\boldsymbol\gamma} \right) $\; \If{$e < 0$} { \Return \texttt{optima\_found} } - \textit{// Find leaving variable $\ell$ }\; + \textit{// Find leaving variable $x_\ell$}\; $\ell, t \leftarrow expand((\mbf{A_\N})_e)$\; \If{$\ell < 0$} { \Return \texttt{unbounded} @@ -519,46 +520,48 @@ The implementation of the standard simplex algorithm (see~\S~\ref{chXXX:subsec:s -The first step of algorithm~\ref{chXXX:algo:impl-simplex-std} is the selection of the entering variable using the steepest-edge heuristic. This step is discussed thoroughly in the next section. Then the leaving variable index $\ell$ and the potential gain on the entering variable $t$ are determined using the \textit{expand} method. Finally, the pivoting is done. The nonbasic matrix $\mbf{A_\N}$ and the objective function coefficients $\mbf{c}$ are updated using the CUBLAS library (with \texttt{cublasDger}, respectively \texttt{cublasDaxpy}). This requires the entering variable column $\mbf{d}$ and the leaving variable row $\mbf{r}$ to be copied and slightly altered. The new value of the variables is computed and, since we use a special structure instead of the basis matrix, the entering and leaving variables are swapped. +The first step of Algorithm~\ref{chXXX:algo:impl-simplex-std} is the selection of the entering variable using the steepest-edge heuristic. This step is discussed thoroughly in the next section. Then the leaving variable index $\ell$ and the potential gain on the entering variable $t$ are determined using the \textit{expand} method. Finally, the pivoting is done. The nonbasic matrix $\mbf{A_\N}$ and the objective function coefficients $\mbf{c}$ are updated using the CUBLAS library (respectively, with \texttt{cublasDger} and \texttt{cublasDaxpy}). This requires the entering variable column $\mbf{d}$ and the leaving variable row $\mbf{r}$ to be copied and slightly altered. The new value of the variables is computed and, since we use a special structure instead of the basis matrix, the entering and leaving variables are swapped. % Kernel example ==> steepest edge \subsubsection*{Choice of the entering variable} \label{chXXX:sec:impl-simplex-std-inVar} Let us discuss two different approaches for the selection of the entering variable. The first one relies on the CUBLAS library. -The idea is to split this step in several small operations, starting with the computation of the norms one by one with the \texttt{cublasDnrm2} function. +The idea is to split this step into several small operations, starting with the computation of the norms one by one with the \texttt{cublasDnrm2} function. The score of each variable is then obtained by dividing the cost vector $\mbf{c}$ by the norms previously computed. The entering variable is finally selected by taking the variable with the biggest steepest-edge coefficient using \texttt{cublasDamax}. -While being easy to implement, such an approach would lead to poor performances. The main problem is a misuse of data parallelism. Each column can be processed independently and so at the same time. Moreover, slicing this step into small operations requires that each temporary results be stored in global memory. This creates a huge amount of slow data transfers between kernels and global memory. +While being easy to implement, such an approach would lead to poor performances. The main problem is a misuse of data parallelism. Each column can be processed independently, and thus at the same time. Moreover, slicing this step into small operations requires that each temporary result be stored in global memory. This creates a huge amount of slow data transfers between kernels and global memory. -\lstinputlisting[label=chXXX:lst:k1,caption=Kernel for the choice of the entering variable]{Chapters/chapter10/optiSE.cu} +\lstinputlisting[label=chXXX:lst:k1,caption=a kernel for the choice of the entering variable]{Chapters/chapter10/optiSE.cu} -To avoid this, the whole step could be implemented as a unique kernel, as shown in the simplified listing~\ref{chXXX:lst:k1}. Each block evaluates multiple potential entering variables. +To avoid this, the whole step could be implemented as a unique kernel, as shown in the simplified Listing~\ref{chXXX:lst:k1}. Each block evaluates multiple potential entering variables. The threads of a block process part of the norm of a column. -Then a reduction (see \S~\ref{chXXX:subsec:reduction}) is done inside the block to form the full norm. +Then a reduction (see Section~\ref{chXXX:subsec:reduction}) is done inside the block to form the full norm. The thread having the final value can then compute the score of the current variable and determine if it is the best one encountered in this block. Once a block has evaluated its variables, the most promising one is stored in global memory. The final reduction step is finally done on the CPU. The dimension of the blocks and grid have to be chosen wisely as a function of the problem size. -The block size is directly correlated to the size of a column while the grid size is a trade-off between giving enough work to the scheduler in order to hide the latency and returning as few potential entering variable as possible for the final reduction step. +The block size is directly correlated to the size of a column while the grid size is a trade-off between giving enough work to the scheduler in order to hide the latency and returning as few potential entering variables as possible for the final reduction step. \subsubsection*{CPU-GPU interactions} The bulk of the data representing the problem is stored on the GPU. Only variables required for decision-making operations are updated on the CPU. +The communications arising from the aforementioned scheme are illustrated in Figure~\ref{chXXX:fig:diagSTD}. +The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases. + \begin{figure}[!h] \centering -\includegraphics[width=90mm]{Chapters/chapter10/figures/diagSTD2.pdf} -\caption{Communications between CPU and GPU} +\includegraphics[width=90mm]{Chapters/chapter10/figures/DiagSTD_cap.pdf} +\caption{Communications between CPU and GPU.} \label{chXXX:fig:diagSTD} \end{figure} -The communications arising from the aforementioned scheme are illustrated in figure~\ref{chXXX:fig:diagSTD}. -The amount of data exchanged at each iteration is independent of the problem size ensuring that this implementation scales well as the problem size increases. + \subsection{Revised simplex} \label{chXXX:subsec:impl-simplex-rev} % Revised simplex -The main steps found in the previous implementation are again present in the revised simplex. The difference comes from the fact that a full update of the matrix $\mbf{N}$ is not necessary at every iteration. As explained in section~\ref{chXXX:subsec:simplex-rev}, the basis matrix $\mbf{B}$ is updated so that the required values such as the entering variable column can be processed from the now \textit{constant} matrix $\mbf{N}$. Due to these changes, the steepest-edge method is adapted. The resulting implementation requires more operations as described in algorithm~\ref{chXXX:algo:simplex-rev}. +The main steps found in the previous implementation are again present in the revised simplex. The difference comes from the fact that a full update of the matrix $\mbf{N}$ is not necessary at every iteration. As explained in Section~\ref{chXXX:subsec:simplex-rev}, the basis matrix $\mbf{B}$ is updated so that the required values, such as the entering variable column, can be processed from the now \textit{constant} matrix $\mbf{N}$. Due to these changes, the steepest-edge method is adapted. The resulting implementation requires more operations as described in Algorithm~\ref{chXXX:algo:simplex-rev}. % Algorithm %% \begin{algorithm} %% \caption{Revised simplex algorithm} @@ -589,7 +592,7 @@ The main steps found in the previous implementation are again present in the rev \begin{algorithm} -\caption{Revised simplex algorithm} +\caption{revised simplex algorithm} \label{chXXX:algo:simplex-rev} \textit{// Find entering variable $x_e,\, e\in \B$}\; $\bs\tau \leftarrow \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}\; @@ -600,7 +603,7 @@ The main steps found in the previous implementation are again present in the rev $\mbf{d} \leftarrow \mbf{B^{-1}} \mbf{N}_e $ \hspace*{9mm} \textit{// cublasDgemv}\; $\ell, t \leftarrow expand(\mbf{d})$\; \If { $\ell < 0$} {\Return \texttt{unbounded}} - \textit{// Pivoting - basis update}\; + \textit{// Pivoting, basis update}\; $\bs\kappa \leftarrow \mbf{(B^{-1})^T} \mbf{d} $ \hspace*{6mm} \textit{// cublasDgemv}\; $\boldsymbol\beta \leftarrow \mbf{N^T} \bs\kappa$ \hspace*{12mm} \textit{// cublasDgemv}\; $\mbf{B^{-1}} \leftarrow \mbf{E}\, \mbf{B^{-1}}$\; @@ -613,7 +616,7 @@ The main steps found in the previous implementation are again present in the rev \end{algorithm} -Let us compare the complexity, in terms of level 2 and level 3 BLAS operations, of both implementations. The standard one has mainly two costly steps: the selection of the entering variable and the update of the matrix $\mbf{N}$. These are level 2 BLAS operations or an equivalent, and thus the approximate complexity is $\mathcal{O}(2mn)$. +Let us compare the complexity, in terms of level 2 and level 3 BLAS operations, of both implementations. The standard one has mainly two costly steps: the selection of the entering variable and the update of the matrix $\mbf{N}$. These are level 2 BLAS operations or the equivalent, and thus the approximate complexity is $\mathcal{O}(2mn)$. The new implementation proposed has three operations \texttt{cublasDgemv} where the matrix $\mbf{N}$ is involved, and three others with the matrix $\mbf{B^{-1}}$. @@ -626,8 +629,8 @@ their high sparsity and the fact that usually $m \ll n$. In the next sections, w \subsubsection*{Choice of the entering variable} Once again, this part of the algorithm can be substantially improved. -Firstly, algorithmic optimizations must be considered. -Since the row $\bs\alpha$ of the leaving variable is processed to update the steepest-edge coefficients, the cost vector $\bs\upsilon$ can be updated directly without using the basis matrix $\mbf{B}$. +First, algorithmic optimizations must be considered. +Since row $\bs\alpha$ of the leaving variable is processed to update the steepest-edge coefficients, the cost vector $\bs\upsilon$ can be updated directly without using the basis matrix $\mbf{B}$. This is done as follow \begin{equation*} \upsilon_j = \bar{\upsilon}_j - \bar{\upsilon}_e \alpha_j, \; \forall j \neq e, \qquad \qquad @@ -636,7 +639,7 @@ This is done as follow where $\bar{\upsilon_j}$ denotes the value of $\upsilon_j$ at the previous iteration. It is important to note that all these updated values ($\bs\upsilon, \bs\gamma$) must be processed afresh once in a while to reduce numerical errors. -Including this change, the operations required for the selection of the entering variable $e$ are detailed in algorithm~\ref{chXXX:algo:inVar}. The values related to the entering variable at the previous iteration $\bar{e}=q$ are used. The reduced costs are updated, followed by the steepest-edge coefficients. With these values the entering variable is determined. +Including this change, the operations required for the selection of the entering variable $e$ are detailed in Algorithm~\ref{chXXX:algo:inVar}. The values related to the entering variable at the previous iteration $\bar{e}=q$ are used. The reduced costs are updated, followed by the steepest-edge coefficients. With these values the entering variable is determined. %% \begin{algorithm}[!h] %% \caption{Choice of the entering variable} @@ -657,12 +660,12 @@ Including this change, the operations required for the selection of the entering %% \end{algorithm} \begin{algorithm}[!h] -\caption{Choice of the entering variable} +\caption{choice of the entering variable} \label{chXXX:algo:inVar} $q \leftarrow \bar{e}$\; $\bar{\upsilon}_q \leftarrow c_q - \mbf{c_{\B}^T} \bar{\mbf{d}}$ \; $\bar{\gamma}_q \leftarrow \Vert \bar{\mbf{d}} \Vert$\; - \textit{// Update of the reduced costs} \; + \textit{// Update of the reduced costs}\; $\upsilon_{j} \leftarrow \bar{\upsilon}_{j} - \alpha_{j} \bar{\upsilon}_q, \; \forall j \neq q$\; $\upsilon_q \leftarrow \dfrac{\bar{\upsilon}_q}{\bar{d}_q^2}$ \; \textit{// Update of the steepest edge coefficients}\; @@ -672,38 +675,38 @@ Including this change, the operations required for the selection of the entering $e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\; \end{algorithm} -Coupling these operations into a single kernel is quite beneficial. This lead $\bs\upsilon$ and $\bs\gamma$ to only be loaded once from global memory. Their updated values are stored while the entering variable $e$ is selected. With these changes, the global complexity of this step is reduced from $\mathcal{O}(mn + m^2 + n)$ to $\mathcal{O}(n)$. Moreover the remaining processing is done optimally by reusing data and by overlapping global memory access and computations. +Coupling these operations into a single kernel is quite beneficial. This leads $\bs\upsilon$ and $\bs\gamma$ to be loaded only once from global memory. Their updated values are stored while the entering variable $e$ is selected. With these changes, the global complexity of this step is reduced from $\mathcal{O}(mn + m^2 + n)$ to $\mathcal{O}(n)$. Moreover the remaining processing is done optimally by reusing data and by overlapping global memory access and computations. \subsubsection*{Basis update} % Update operation -The basis update $\mbf{B^{-1}} \leftarrow \mbf{E} \, \mbf{B^{-1}}$ is a matrix-matrix multiplication. However, due to the special form of the matrix $\mbf{E}$ (see \S~\ref{chXXX:subsec:simplex-rev}), the complexity of this operation can be reduced from $\mathcal{O}(m^3)$ to $\mathcal{O}(m^2)$~\cite{BIELING}. The matrix $\mbf{E}$ is merely the identity matrix having the $\ell$-th column replaced by the vector $\bs\eta$. The update of the matrix $\mbf{B^{-1}}$ can be rewritten as +The basis update $\mbf{B^{-1}} \leftarrow \mbf{E} \, \mbf{B^{-1}}$ is a matrix-matrix multiplication. However, due to the special form of the matrix $\mbf{E}$ (see Section~\ref{chXXX:subsec:simplex-rev}), the complexity of this operation can be reduced from $\mathcal{O}(m^3)$ to $\mathcal{O}(m^2)$~\cite{BIELING}. The matrix $\mbf{E}$ is merely the identity matrix having the $\ell^\mathrm{th}$ column replaced by the vector $\bs\eta$. The update of the matrix $\mbf{B^{-1}}$ can be rewritten as \begin{equation*} \hat{B}^{-1}_{ij} = B^{-1}_{ij}\left(1 - \frac{d_i}{d_\ell}\right)\, , \quad \forall i \neq \ell, \qquad \qquad \hat{B}^{-1}_{\ell j} = \frac{B^{-1}_{\ell j}}{d_\ell} \end{equation*} -As shown in the listing~\ref{chXXX:lst:k2}, each block of the kernel processes a single column while each thread may compute multiple elements of a column. -This organisation ensures that global memory accesses are coalescent since the matrix $\mbf{B}$ is stored column-wise. +As shown in Listing~\ref{chXXX:lst:k2}, each block of the kernel processes a single column while each thread may compute multiple elements of a column. +This organization ensures that global memory accesses are coalescent since the matrix $\mbf{B}$ is stored column-wise. -\lstinputlisting[label=chXXX:lst:k2,caption=Basis update]{Chapters/chapter10/updateBasis.cu} +\lstinputlisting[label=chXXX:lst:k2,caption=basis update]{Chapters/chapter10/updateBasis.cu} % Sparse matrix - vector \subsubsection*{Sparse matrix operations} -The complexity of the implementation is now $\mathcal{O}(2mn + 3m^2)$ which is close to the one of the standard simplex implementation. The operations where the matrix $\mbf{N}$ is involved remain the more expensive (considering $m \ll n$). However, this matrix is \textit{constant} in the revised simplex allowing the use of sparse matrix-vector multiplication from the CUSPARSE library. On sparse problems, this leads to important gains in performance. The sparse storage of the matrix $\mbf{N}$ reduces significantly the memory space used by the algorithm. All these improvements come at a small cost: columns are no longer directly available in their dense format and must thus be decompressed to their dense representation when needed. +The complexity of the implementation is now $\mathcal{O}(2mn + 3m^2)$ which is close to the one of the standard simplex implementation. The operations where the matrix $\mbf{N}$ is involved remain the more expensive (considering $m \ll n$). However, this matrix is \textit{constant} in the revised simplex allowing the use of sparse matrix-vector multiplication from the CUSPARSE library. On sparse problems, this leads to important gains in performance. The sparse storage of the matrix $\mbf{N}$ reduces significantly the memory space used by the algorithm. All these improvements come at a small cost: columns are no longer directly available in their dense format and must be decompressed to their dense representation when needed. -The complexity of the revised simplex implementation is thus finally $\mathcal{O}(2\theta + 3m^2)$ where $\theta$ is a function of $m$, $n$ and the sparsity of the problem. We shall see in section~\ref{chXXX:subsec:implPerf}, what kind of performances we may obtain on various problems. +The complexity of the revised simplex implementation is thus finally $\mathcal{O}(2\theta + 3m^2)$ where $\theta$ is a function of $m$, $n$, and the sparsity of the problem. We shall see in Section~\ref{chXXX:subsec:implPerf}, what kind of performances we may obtain on various problems. -\subsection{Branch-and-Bound} +\subsection{Branch-and-bound} \label{chXXX:subsec:impl-bnb} % Branch and bound on GPU % Limiting communications The B\&B algorithm is operated from the CPU. The simplex implementations, also referred to as LP solver, are used to solve the nodes of the B\&B tree. -Algorithm~\ref{chXXX:algo:bnb} contains the main operations discussed in section~\ref{chXXX:sec:bnb}. +Algorithm~\ref{chXXX:algo:bnb} contains the main operations discussed in Section~\ref{chXXX:sec:bnb}. It starts by solving the root node. The branching strategy is then used to select the next variable to branch on. -From thereon, until no node remains to be solved, the node selection strategy is used to elect the next one to be processed. +From thereon, until no node remains to be solved, the node selection strategy is used to select the next one to be processed. -The critical factor for coupling the LP solver and the B\&B, is the amount of communications exchanged between them. Since CPU-GPU communications are expensive, the informations required to solve a new node must be minimized. Upon solving a new node, the full transfer of the problem must be avoided. This will be the focus of this section. +The critical factor for coupling the LP solver and the B\&B is the amount of communications exchanged between them. Since CPU-GPU communications are expensive, the informations required to solve a new node must be minimized. Upon solving a new node, the full transfer of the problem must be avoided. This will be the focus of this section. % Algorithm %% \begin{algorithm} @@ -725,40 +728,40 @@ The critical factor for coupling the LP solver and the B\&B, is the amount of co %% \end{algorithm} \begin{algorithm} -\caption{Branch-and-Bound} +\caption{branch-and-bound} \label{chXXX:algo:bnb} Solution $\leftarrow$ Solve(InitialProblem)\; BranchVar $\leftarrow$ SelectNextVariable(Solution)\; NodeList $\leftarrow$ AddNewNodes(BranchVar)\; \While{NodeList $\neq \emptyset$} { - Node $\leftarrow$ SelectNextNode(NodeList)\; - Solution $\leftarrow$ Solve(Node)\; - \If{exists(Solution)}{ - BranchVar $\leftarrow$ SelectNextVariable(Solution)\; - NodeList $\leftarrow$ AddNewNodes(BranchVar)\; - } + Node $\leftarrow$ SelectNextNode(NodeList)\; + Solution $\leftarrow$ Solve(Node)\; + \If{exists(Solution)}{ + BranchVar $\leftarrow$ SelectNextVariable(Solution)\; + NodeList $\leftarrow$ AddNewNodes(BranchVar)\; + } } \end{algorithm} % Algorithmic choice --> Plunging \subsubsection*{Algorithmic choices} -The first critical choice is to select an efficient tree traversal strategy that also takes advantage of the locality of the problem. At first glance, the best-first search would be a poor choice, since from one node to the next the problem could change substantially. However, when combined with the plunging heuristic, this same strategy becomes really interesting. Indeed, the plunging phase can be completely decoupled from the B\&B. The CPU acting as a decision maker, chooses a promising node and spawns a task that will independently explore the subtree of this node. Once done, this task will report its findings to the decision maker. Moreover, when plunging, the next node to be solved differs only by the new bound on the branching variable. Communications are then minimized and the LP solver is usually able to quickly find the new solution since the problem has only been slightly altered. +The first critical choice is to select an efficient tree traversal strategy that also takes advantage of the locality of the problem. At first glance, the best-first search would be a poor choice, since from one node to the next the problem could change substantially. However, when combined with the plunging heuristic, this same strategy becomes really interesting. Indeed, the plunging phase can be completely decoupled from the B\&B. The CPU, acting as a decision maker, chooses a promising node and spawns a task that will independently explore the subtree of this node. Once done, this task will report its findings to the decision maker. Moreover, when plunging, the next node to be solved differs by only the new bound on the branching variable. Communications are then minimized and the LP solver is usually able to quickly find the new solution since the problem has been only slightly altered. % Technical improvement --> warmstart / hotstart \subsubsection*{Warmstart} -There are two case where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could for example be the one found after solving the root node of the tree. +There are two cases where starting from a fresh problem is required or beneficial. The first one is imposed by the numeric inaccuracy appearing after several iterations of the LP solver. The second is upon the request of a new subtree. To avoid the extensive communication costs of a full restart, the GPU keeps in memory an intermediate stable state of the problem, a \textit{warmstart}. This state could, for example, be the one found after solving the root node of the tree. % Global flow -\subsubsection*{Multi-GPU\index{multi-GPU} exploration} -The organisation, CPU decision maker vs GPU explorer, offers the possibility to use multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The latter assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task. +\subsubsection*{Multi-GPU\index{Multi-GPU} exploration} +Having the CPU act as a decision maker and the GPU as an explorer, allows for the possibility of using multiple GPUs to explore the tree. The global knowledge is maintained by the CPU task. The CPU assigns to the GPUs the task of exploring subtrees of promising nodes. Since each plunging is done locally, no communications are required between the GPUs. Moreover, the amount of nodes processed during a plunging can be used to tune the load of the CPU task. \section{Performance model} \label{chXXX:sec:MODEL} Performance models\index{Performance model} are useful to predict the behaviour of implementations as a function of various parameters. Since the standard simplex algorithm is the easiest to understand, we will focus in this section on its behaviour as a function of the problem size. -CUDA kernels require a different kind of modelling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency. +CUDA kernels require a different kind of modeling than usually encountered in parallelism. The key idea is to capture in the model the decomposition into threads, warps, and blocks. One must also pay a particular attention to global memory accesses and to how the pipelines reduce the associated latency. -In order to model our implementation, we follow the approach given by K. Kothapalli et al.~\cite{MODEL} (see also~\cite{ELSTER09}). Firstly, we examine the different levels of parallelism on a GPU. Then we determine the amount of work done by a single task. By combining both analysis, we establish the kernel models. The final model then consists of the sum of each kernel. +In order to model our implementation, we follow the approach given by K. Kothapalli et al.~\cite{MODEL} (see also~\cite{ELSTER09}). First, we examine the different levels of parallelism on a GPU. Then, we determine the amount of work done by a single task. By combining both analyses, we establish the kernel models. The final model then consists of the sum of each kernel. This model has also been used to model a multi-GPUs version of the standard simplex method~\cite{MEYER2011}. @@ -767,56 +770,59 @@ This model has also been used to model a multi-GPUs version of the standard simp \label{chXXX:subsec:LVLParallelism} A kernel can be decomposed into an initialization phase followed by a processing phase. -During the initialization phase the kernel context is setup. +During the initialization phase the kernel context is set up. Since this operation does not take a lot of time compared to the processing phase, it is needless to incorporate it into the model. The processing phase is more complex and its execution time depends directly on the amount of work it must perform. We shall now focus on this phase and on the various levels of parallelism on a GPU. -At the first level, the total work is broken down into components, the blocks. They are then dispatched on the available multiprocessors on the GPU. The execution time of a kernel depends on the number of blocks $N_B$ per SM (\textit{streaming multiprocessor}) and on the number $N_{SM}$ of SM per GPU. +At the first level, the total work is broken down into components, the blocks. They are then dispatched on the available multiprocessors on the GPU. The execution time of a kernel depends on the number of blocks $N_B$ per SM (streaming multiprocessor) and on the number $N_{SM}$ of SM per GPU. -At a lower level, the work of a block is shared among the various cores of its dedicated SM. This is done by organizing the threads of the block into groups, the warps. A warp is a unit of threads that can be executed in parallel on a SM. The execution time of a block is then linked to the number $N_W$ of warps per block, the number $N_T$ of threads per warp and the number $N_P$ of cores per SM. +At a lower level, the work of a block is shared among the various cores of its dedicated SM. This is done by organizing the threads of the block into groups, the warps. A warp is a unit of threads that can be executed in parallel on an SM. The execution time of a block is then linked to the number $N_W$ of warps per block, the number $N_T$ of threads per warp, and the number $N_P$ of cores per SM. -The third and lowest level of parallelism is a pipeline. This pipeline enables a pseudo-parallel execution of the tasks forming a warp. The gain produced by this pipeline is expressed by its depth $D$. +The third and lowest level of parallelism is a pipeline. This pipeline enables a pseudoparallel execution of the tasks forming a warp. The gain produced by this pipeline is expressed by its depth $D$. \subsection{Amount of work done by a thread} \label{chXXX:subsec:workTask} In the previous section, we defined the different levels of parallelism down to the smallest, namely the thread level. We must now estimate how much work is done by a single thread of a kernel in terms of cycles. It depends on the number and the type of instructions. In CUDA, each arithmetic instruction requires a different number of cycles. For example, a single precision add costs 4 cycles while a single precision division costs 36 cycles. -Moreover, since global memory access instructions are non-blocking operations, they must be counted separately from arithmetic instructions. The number of cycles involved in a global memory access, amounts to a 4 cycle instruction (read/write) followed by a non-blocking latency of 400-600 cycles. To correctly estimate the work due to such accesses, one needs to sum only the latency that is not hidden. Two consecutive read instructions executed by a thread, would cost twice the 4 cycles, but only once the latency due to its non-blocking behaviour. Once the amount of cycles involved in these memory accesses has been determined, it is then necessary to take into account the latency hidden by the scheduler (\textit{warp swapping}). +Moreover, since global memory access instructions are nonblocking operations, they must be counted separately from arithmetic instructions. The number of cycles involved in a global memory access amounts to a 4 cycle instruction (read/write) followed by a nonblocking latency of 400--600 cycles. To correctly estimate the work due to such accesses, one needs to sum only the latency that is not hidden. Two consecutive read instructions executed by a thread would cost twice the 4 cycles, but only once the latency due to its nonblocking behaviour. Once the amount of cycles involved in these memory accesses has been determined, it is then necessary to take into account the latency hidden by the scheduler (warp swapping). The total work $C_T$ done by a thread can be defined either as the sum or as the maximum of the memory access cycles and the arithmetic instructions cycles. Summing both types of cycles means we consider that latency cannot be used to hide arithmetic instructions. The maximum variant represents the opposite situation where arithmetic instructions and memory accesses are concurrent. Then only the biggest of the two represents the total work of a thread. This could occur for example when the latency is entirely hidden by the pipeline. -It is not trivial to define which scenario is appropriate for each kernel. There are many factors involved: the dependency on the instructions, the behaviour of the scheduler, the quantity of data to process, and so on. The number of cycles $C_T$ done by a thread can thus be defined by the sum, respectively the maximum, of the global memory access and the arithmetic instructions. +It is not trivial to define which scenario is appropriate for each kernel. There are many factors involved: the dependency on the instructions, the behaviour of the scheduler, the quantity of data to process, and so on. + +If latency cannot be used to hide arithmetic instructions, the number of cycles $C_T$ done by a thread can be defined as the sum of the global memory access and the arithmetic instructions. +Otherwise, one must consider the maximum instead of the sum. %Moreover, for each kernel the task representing the worst case scenario will be the one modelised. \subsection{Global performance model} \label{chXXX:subsec:MODEL} -We now turn to the global performance model for a kernel. We must find out how many threads are run by a core of an SM. Recall that a kernel decomposes into blocks of threads, with each SM running $N_B$ blocks, each block having itself $N_W$ warps consisting of $N_T$ threads. Also recall that an SM has $N_P$ cores which execute batches of threads in a pipeline of depth $D$. Thus the total work $C_{core}$ done by a core is given by +We now turn to the global performance model for a kernel. We must find out how many threads are run by a core of an SM. Recall that a kernel decomposes into blocks of threads, with each SM running $N_B$ blocks, each block having $N_W$ warps consisting of $N_T$ threads. Also recall that an SM has $N_P$ cores which execute batches of threads in a pipeline of depth $D$. Thus, the total work $C_{core}$ done by a core is given by \begin{equation} \label{chXXX:equ:ccore} C_{core} = N_B \cdot N_W \cdot N_T \cdot C_T \cdot \dfrac{1}{N_P \cdot D} \end{equation} -which represents the total work done by a SM divided by the number of cores per SM and by the depth of the pipeline. +which represents the total work done by an SM divided by the number of cores per SM and by the depth of the pipeline. Finally, the execution time of a kernel is obtained by dividing $C_{core}$ by the core frequency. -\subsection{A kernel example : \textit{steepest-edge}} +\subsection{A kernel example: \textit{steepest-edge}} \label{chXXX:subsec:SE} -The selection of the entering variable is done by the \textit{steepest-edge} method as described in section~\ref{chXXX:subsec:impl-simplex-std}. +The selection of the entering variable is done by the \textit{steepest-edge} method as described in Section~\ref{chXXX:subsec:impl-simplex-std}. The processing of a column is done in a single block. Each thread of a block has to compute $N_{el}=\frac{m}{N_W \cdot N_T}$ coefficients of the column. This first computation consists of $N_{el}$ multiplications and additions. The resulting partial sum of squared variables must then be reduced on a single thread of the block. This requires $N_{red} = \log_2(N_W \cdot N_T)$ additions. Since the shared memory is used optimally, there are no added communications. Finally, the coefficient $c_j$ must be divided by the result of the reduction. -Each block of the kernel computes $N_{col}=\frac{n}{N_B \cdot N_{SM}}$ columns, where $N_{SM}$ is the number of SM per GPU. After processing a column, a block keeps only the minimum of its previously computed steepest-edge coefficients. Thus the number of arithmetic instruction cycles for a given thread is given by +Each block of the kernel computes $N_{col}=\frac{n}{N_B \cdot N_{SM}}$ columns, where $N_{SM}$ is the number of SM per GPU. After processing a column, a block keeps only the minimum of its previously computed steepest-edge coefficients. Thus, the number of arithmetic instruction cycles for a given thread is given by \begin{equation} \label{chXXX:equ:arithmSE} C_{Arithm} = N_{col} \cdot (N_{el} \cdot (C_{add} + C_{mult}) + N_{red} \cdot C_{add} + C_{div} + C_{cmp}) \end{equation} -where $C_{ins}$ denotes the number of cycles to execute instruction $ins$. +where $C_{ins}$ denotes the number of cycles to execute instruction $ins\in \{add, div, mult, cmp\}$. -Each thread has to load $N_{el}$ variables to compute its partial sum of squared variables. The thread computing the division also loads the coefficient $c_j$. This must be done for the $N_{col}$ columns that a block has to deal with. We must also take into account that the scheduler hides some latency by swapping the warps, so the total latency $C_{latency}$ must be divided by the number of warps $N_W$. -Thus the number of cycles relative to memory accesses is given by: +Each thread has to load $N_{el}$ variables to compute its partial sum of squared variables. The thread computing the division also loads the coefficient $c_j$. This must be done for the $N_{col}$ columns with which a block has to deal. We must also take into account that the scheduler hides some latency by swapping the warps, so the total latency $C_{latency}$ must be divided by the number of warps $N_W$. +Thus, the number of cycles relative to memory accesses is given by \begin{equation} \label{chXXX:equ:latencySE} @@ -825,11 +831,11 @@ C_{Accesses} = \dfrac{ N_{col} \cdot (N_{el}+1) \cdot C_{latency}} {N_W} At the end of the execution of this kernel, each block stores in global memory its local minimum. The CPU will then have to retrieve the $N_B\cdot N_{SM}$ local minimums and reduce them. It is then profitable to minimize the number $N_B$ of blocks per SM. With a maximum of two blocks per SM, the cost of this final operation done by the CPU can be neglected when $m$ and $n$ are big. -It now remains to either maximize or sum $C_{Arithm}$ and $C_{Accesses}$ to obtain~$C_T$. The result of equ.~(\ref{chXXX:equ:ccore}) divided by the core frequency yields the time $T_{KSteepestEdge}$ of the steepest-edge kernel. +It now remains to either maximize or sum $C_{Arithm}$ and $C_{Accesses}$ to obtain~$C_T$. The result of Equation~(\ref{chXXX:equ:ccore}) divided by the core frequency yields the time $T_{KSteepestEdge}$ of the steepest-edge kernel. \subsection{Standard simplex GPU implementation model} \label{chXXX:subsec:model-simplex-std} -As seen in section~\ref{chXXX:subsec:impl-simplex-std}, the standard simplex implementation requires only a few communications between the CPU and the GPU. Since each of these communications are constant and small, they will be neglected in the model. For the sake of simplicity, we will consider the second phase of the \textit{two-phase simplex} where we apply iteratively the three main operations: selecting the entering variable, choosing the leaving variable and pivoting. Each of these operations is computed as a kernel. The time of an iteration $T_{Kernels}$ then amounts to the sum of all three kernel times: +As seen in Section~\ref{chXXX:subsec:impl-simplex-std}, the standard simplex implementation requires only a few communications between the CPU and the GPU. Since all of these communications are constant and small, they will be neglected in the model. For the sake of simplicity, we will consider the second phase of the \textit{two-phase simplex} where we apply iteratively the three main operations: selecting the entering variable, choosing the leaving variable, and pivoting. Each of these operations is computed as a kernel. The time of an iteration $T_{Kernels}$ then amounts to the sum of all three kernel times: \begin{equation} \label{chXXX:equ:1GPU} T_{Kernels} = T_{KSteepestEdge} + T_{KExpand} + T_{KPivoting} @@ -846,9 +852,9 @@ Note that research by Dantzig~\cite{DANTZIG80} asserts that $r$ is proportional \section{Measurements and analysis} \label{chXXX:sec:measurements} -Two sets of measurements are presented here. The first one is a validation of the performance model presented in section~\ref{chXXX:sec:MODEL}. The second is a comparison of the performances of the various implementations of the simplex method detailed in section~\ref{chXXX:sec:impl}. +Two sets of measurements are presented here. The first one is a validation of the performance model presented in Section~\ref{chXXX:sec:MODEL}. The second is a comparison of the performances of the various implementations of the simplex method detailed in Section~\ref{chXXX:sec:impl}. -As a preliminary, we ensured that our implementations were functional. For that matter, we used a set of problems available on the \textit{NETLIB}~\cite{NETBLIB} repository. This dataset usually serves as a benchmark for LP solvers. It consists of a vast variety of real and specific problems for testing the stability and robustness of an implementation. For example, some of these represent real-life models of large refineries, industrial production/allocation or fleet assignments problems. Our implementations are able to solve almost all of these problems. +As a preliminary, we ensured that our implementations were functional. For that matter, we used a set of problems available on the NETLIB Repository~\cite{NETBLIB}. This dataset usually serves as a benchmark for LP solvers. It consists of a vast variety of real and specific problems for testing the stability and robustness of an implementation. For example, some of these represent real-life models of large refineries, industrial production/allocation, or fleet assignments problems. Our implementations are able to solve almost all of these problems. % Test 1 : Tesla S1070 => Performance model + custom problems \subsection{Performance model validation} @@ -858,15 +864,15 @@ In order to check that our performance model for the standard simplex implementa %All these problems fall into the same category. Hence, generated problems of the same size will be solved within approximately the same number of iterations and the optimum of the objective function will only slightly fluctuate. The test environment is composed of a CPU server (2 Intel Xeon X5570, 2.93GHz, with 24GB DDR3 RAM) -and a GPU computing system (NVIDIA tesla S1070) with the 3.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 4GB GDDR3 graphics memory, 30 streaming multiprocessors, each holding 8 cores (1.4GHz). +and a GPU computing system (NVIDIA Tesla S1070) with the 3.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 4GB GDDR3 graphics memory and 30 streaming multiprocessors, each holding 8 cores (1.4GHz). %Such a GPU offers at peak performance up to 1TFLOPS for single precision floating point, 80GFLOPS for double precision floating point and 100GB/s memory bandwidth between threads (registers) and global memory. We validated our performance model by showing that it accurately fits our measurements. -The correlation between the measurements and the model is above $0.999$ (see fig.~\ref{chXXX:fig:fitting}). +The correlation between the measurements and the model is above $0.999$ (see Figure~\ref{chXXX:fig:fitting}). \begin{figure}[!h] \centering \includegraphics[width=100mm]{Chapters/chapter10/figures/Model.pdf} -\caption{Performance model and measurements comparison} +\caption{Performance model and measurements comparison.} \label{chXXX:fig:fitting} \end{figure} @@ -875,25 +881,25 @@ The correlation between the measurements and the model is above $0.999$ (see fig \label{chXXX:subsec:implPerf} Four different implementations are compared in this section. We will refer to each of them according to the terminology introduced below. \begin{itemize} - \item Standard simplex method improved ($\mathcal{O}(2mn)$): \texttt{Standard} + \item Standard simplex method improved ($\mathcal{O}(2mn)$): \textit{Standard} \item Revised simplex method using basis kernel \begin{itemize} - \item without improvements ($\mathcal{O}(3mn + 4m^2)$): \texttt{Revised-base} - \item optimized ($\mathcal{O}(2mn + 3m^2)$): \texttt{Revised-opti} - \item optimized with sparse matrix-vector multiplication ($\mathcal{O}(2\theta+3m^2)$): \texttt{Revised-sparse} + \item without improvements ($\mathcal{O}(3mn + 4m^2)$): \textit{Revised-base} + \item optimized ($\mathcal{O}(2mn + 3m^2)$): \textit{Revised-opti} + \item optimized with sparse matrix-vector multiplication ($\mathcal{O}(2\theta+3m^2)$): \textit{Revised-sparse} \end{itemize} \end{itemize} These implementations all use the \textit{steepest-edge} and \textit{expand} methods for the choice of, respectively, the entering and the leaving variables. -We used problems from the \textit{NETLIB} repository to illustrate the improvements discussed in section~\ref{chXXX:sec:impl}. The results expected from the first three methods are quite clear. The \texttt{Standard} method should be the best, followed by the \texttt{Revised-opti} and then the \texttt{Revised-base}. However, the performance of the \texttt{Revised-sparse} implementation remains unclear since the value of $\theta$ is unknown and depends on the density and size of the constraints matrix. This is the main question we shall try to answer with our experiments. +We used problems from the NETLIB Repository to illustrate the improvements discussed in Section~\ref{chXXX:sec:impl}. The results expected from the first three methods are quite clear. The \textit{Standard} method should be the best, followed by the \textit{Revised-opti}, and then the \textit{Revised-base}. However, the performance of the \textit{Revised-sparse} implementation remains unclear since the value of $\theta$ is unknown and depends on the density and size of the constraints matrix. This is the main question we shall try to answer with our experiments. -The test environnement is composed of a CPU server (2 Intel Xeon X5650, 2.67GHz, with 96GB DDR3 RAM) and a GPU computing system (NVIDIA tesla M2090) with the 4.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 6GB GDDR5 graphics memory and 512 cores (1.3GHz). +The test environnement is composed of a CPU server (2 Intel Xeon X5650, 2.67GHz, with 96GB DDR3 RAM) and a GPU computing system (NVIDIA Tesla M2090) with the 4.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 6GB GDDR5 graphics memory and 512 cores (1.3GHz). \begin{table}[!h] \begin{center} \begin{tabular}{|l|r|r|r|r|} \hline -\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline +\multicolumn{1}{|l|}{Problem Name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline etamacro.mps & 401 & 688 & 2489 & 1.7 \\ \hline fffff800.mps & 525 & 854 & 6235 & 1.6\\ \hline finnis.mps & 498 & 614 & 2714 & 1.2\\ \hline @@ -902,26 +908,26 @@ grow15.mps & 301 & 645 & 5665 & 2.1\\ \hline grow22.mps & 441 & 946 & 8318 & 2.1\\ \hline scagr25.mps & 472 & 500 & 2029 & 1.1\\ \hline \end{tabular} -\caption{NETLIB problems solved in less than 1 second} +\caption{NETLIB problems solved in less than 1 second.} \label{chXXX:tab:small} \end{center} \end{table} -Let us begin by discussing the performances on problems solved in less than one second. The name, size, number of non-zero elements (NNZ) and columns to rows ratio (C-to-R) of each problems are reported in table~\ref{chXXX:tab:small}. The performances shown in figure~\ref{chXXX:fig:FastSolve} corroborate our previous observations. On these problems the \texttt{Revised-sparse} method doesn't outperform the \texttt{Standard} one. This can be explained by two factors: the added communications (kernel calls) for the revised method, and the small size and density of the problems. It is likely that sparse operations on a GPU require larger amounts of data to become more efficient than their dense counterpart. +Let us begin by discussing the performances on problems solved in less than one second. The name, size, number of nonzero elements (NNZ), and columns to rows ratio (C-to-R) of each problem are reported in Table~\ref{chXXX:tab:small}. The performances shown in Figure~\ref{chXXX:fig:FastSolve} corroborate our previous observations. On these problems the \textit{Revised-sparse} method doesn't outperform the \textit{Standard} one. This can be explained by two factors: the added communications (kernel calls) for the revised method, and the small size and density of the problems. It is likely that sparse operations on a GPU require larger amounts of data to become more efficient than their dense counterparts. \begin{figure}[!h] \centering -\includegraphics[width=100mm]{Chapters/chapter10/figures/Small2.pdf} -\caption{Time required to solve problems of table~\ref{chXXX:tab:small}} +\includegraphics[width=100mm]{Chapters/chapter10/figures/Small_N.pdf} +\caption{Time required to solve problems of Table~\ref{chXXX:tab:small}.} \label{chXXX:fig:FastSolve} \end{figure} -The problems shown in table~\ref{chXXX:tab:medium} are solved in less than 4 seconds. As we can see in figure~\ref{chXXX:fig:NormalSolve}, the expected trend for the \texttt{Revised-base} and the \texttt{Revised-opti} methods is now confirmed. Let us presently focus on the \texttt{Standard} and \texttt{Revised-sparse} methods. Some of the problems, in particular \textit{czprob.mps} and \textit{nesm.mps}, are solved in a comparable amount of time. The performance gain of the \texttt{Revised-sparse} is related to the C-to-R (columns to rows) ratio of these problems, displaying respectively a 3.8 and a 4.4 ratio. +The problems shown in Table~\ref{chXXX:tab:medium} are solved in less than 4 seconds. As we can see in Figure~\ref{chXXX:fig:NormalSolve}, the expected trend for the \textit{Revised-base} and the \textit{Revised-opti} methods is now confirmed. Let us presently focus on the \textit{Standard} and \textit{Revised-sparse} methods. Some of the problems, in particular czprob.mps and nesm.mps, are solved in a comparable amount of time. The performance gain of the \textit{Revised-sparse} is related to the C-to-R ratio of these problems, displaying, respectively, a 3.8 and a 4.4 ratio. \begin{table}[!h] \begin{center} \begin{tabular}{|l|r|r|r|r|} \hline -\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline +\multicolumn{1}{|l|}{Problem Name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline 25fv47.mps & 822 & 1571 & 11127 & 1.9\\ \hline bnl1.mps & 644 & 1175 & 6129 & 1.8\\ \hline cycle.mps & 1904 & 2857 & 21322 & 1.5\\ \hline @@ -938,18 +944,18 @@ perold.mps & 626 & 1376 & 6026 & 1.0\\ \hline \begin{figure}[!h] \centering -\includegraphics[width=100mm]{Chapters/chapter10/figures/Med2.pdf} -\caption{Time required to solve problems of table~\ref{chXXX:tab:medium}} +\includegraphics[width=100mm]{Chapters/chapter10/figures/Med_N.pdf} +\caption{Time required to solve problems of Table~\ref{chXXX:tab:medium}.} \label{chXXX:fig:NormalSolve} \end{figure} -Finally, the biggest problems, and slowest to solve, are given in table~\ref{chXXX:tab:big}. A new tendency can be observed in figure~\ref{chXXX:fig:SlowSolve}. The \texttt{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 like \textit{bnl2.mps}, \textit{pilot.mps} or \textit{greenbeb.mps}. However, when this ratio is greater, the \texttt{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on \textit{80bau3b.mps} (4.5) and \textit{fit2p.mps} (4.3). Although the C-to-R ratio of \textit{d6cube.mps} (14.9) exceeds the ones previously mentioned, the \texttt{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which does thus not fully benefit from the lower complexity of sparse operations. +Finally, the biggest problems, and slowest to solve, are given in Table~\ref{chXXX:tab:big}. A new tendency can be observed in Figure~\ref{chXXX:fig:SlowSolve}. The \textit{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 such as bnl2.mps, pilot.mps, or greenbeb.mps. However, when this ratio is greater, the \textit{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on 80bau3b.mps (4.5) and fit2p.mps (4.3). Although the C-to-R ratio of d6cube.mps (14.9) exceeds the ones previously mentioned, the \textit{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which doesn't fully benefit from the lower complexity of sparse operations. \begin{table}[!h] \begin{center} \begin{tabular}{|l|r|r|r|r|} \hline -\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline +\multicolumn{1}{|l|}{Problem Name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline 80bau3b.mps & 2263 & 9799 & 29063 & 4.3\\ \hline bnl2.mps & 2325 & 3489 & 16124 & 1.5\\ \hline d2q06c.mps & 2172 & 5167 & 35674 & 2.4\\ \hline @@ -960,30 +966,30 @@ maros-r7.mps & 3137 & 9408 & 151120 & 3.0\\ \hline pilot.mps & 1442 & 3652 & 43220 & 2.5 \\ \hline pilot87.mps & 2031 & 4883 & 73804 & 2.4\\ \hline \end{tabular} -\caption{NETLIB problems solved in more than 5 seconds} +\caption{NETLIB problems solved in more than 5 seconds.} \label{chXXX:tab:big} \end{center} \end{table} \begin{figure}[!h] \centering -\includegraphics[width=100mm]{Chapters/chapter10/figures/Big2.pdf} -\caption{Time required to solve problems of table \ref{chXXX:tab:big}} +\includegraphics[width=100mm]{Chapters/chapter10/figures/Big_N.pdf} +\caption{Time required to solve problems of Table \ref{chXXX:tab:big}.} \label{chXXX:fig:SlowSolve} \end{figure} \section{Conclusion and perspectives} \label{chXXX:sec:persp} -In this chapter, we tried to present the knowledge and understanding necessary in our view to produce an efficient integer linear programming solver on a GPU. We proposed various solutions to implement the standard and revised simplex. We discussed the main concepts behind a Branch-and-Bound algorithm and pointed out some major concerns when it is coupled with a GPU solver. The results obtained with the standard simplex implementation allowed us to validate a detailed performance model we had proposed. Finally, we used problems from the \textit{NETLIB} library to compare the performances of our various implementations. The revised simplex implementation with sparse matrix operations, showed the best performances on time consuming problems, while the standard simplex one was more competitive on easier problems. However, sequential open-source solvers like \textit{CLP} of the \textit{COIN-OR} project still outperform such GPU implementations. +In this chapter, we have tried to present the knowledge and understanding necessary in our view to produce an efficient integer linear programming solver on a GPU. We proposed various solutions to implement the standard and revised simplex. We have discussed the main concepts behind a branch-and-bound algorithm and pointed out some major concerns when it is coupled with a GPU solver. The results obtained with the standard simplex implementation allowed us to validate a detailed performance model we had proposed. Finally, we used problems from the NETLIB library to compare the performances of our various implementations. The revised simplex implementation with sparse matrix operations showed the best performances on time-consuming problems, while the standard simplex one was more competitive on easier problems. However, sequential open-source solvers such as CLP of the COIN-OR project still outperform such GPU implementations. We shall now discuss some remaining potential improvements. A first step towards performance would be to consider the dual revised simplex~\cite{LEMKE54}. While being similar to the methods treated in this chapter, it has the capacity to greatly reduce the time needed to solve problems. -Yet, the main improvement would be to tackle larger problems than the ones considered here. Indeed, problems having hundreds of thousands of variables would technically be solvable on GPU devices with 2GB to 4GB of global memory. Moreover, such amounts of data would fully benefit from the potential of these devices. However, solving problems of this size raises an issue not addressed here: numerical instabilities. This phenomenon is due to the limited precision of mathematical operations on computing devices. +Yet, the main improvement would be seen when tackling larger problems than the ones considered here. Indeed, problems having hundreds of thousands of variables would technically be solvable on GPU devices with 2GB to 4GB of global memory. Moreover, such amounts of data would fully benefit from the potential of these devices. However, solving problems of this size raises an issue not addressed here: numerical instabilities. This phenomenon is due to the limited precision of mathematical operations on computing devices. % Numerical innacuracies/ Example + problem Let us illustrate this problem on the inverse $\mbf{B^{-1}}$ of the basis matrix $\mbf{B}$. At each iteration of the revised simplex, $\mbf{B^{-1}}$ is updated from its previous values. Performing this update adds a small error to the result. Unfortunately, these errors accumulate at each iteration, bringing at some point the $\mbf{B^{-1}}$ matrix into a degenerate state. To avoid this situation, the matrix $\mbf{B^{-1}}$ must be recomputed afresh once in a while by inverting the matrix $\mbf{B}$. % LU decomposition -Instead of directly processing the inverse of the matrix $\mbf{B}$, it is more common to use some specific arithmetical treatment. Most of the simplex implementations use the LU decomposition~\cite{BRAYTON70}. -Since the matrix $\mbf{B}$ is sparse, the sparse version of this decomposition can be used, and the corresponding update performed for $\mbf{B^{-1}}$ at each iteration~\cite{BARTELS69}. The sparse LU decomposition on CPU has been recently subject to a large amount of research, yielding many improvements. Once its GPU counterpart will be available, this might result in improved and competitive simplex implementations on GPU. +Instead of directly processing the inverse of the matrix $\mbf{B}$, it is more common to use some specific arithmetical treatment. Most simplex implementations use the so-called LU decomposition of $\mbf{B}$ as a product of a lower and an upper triangular matrix~\cite{BARTELS69}. +Since the matrix $\mbf{B}$ is sparse, the sparse version of this decomposition can be used and the corresponding update performed for $\mbf{B^{-1}}$ at each iteration~\cite{BRAYTON70}. The sparse LU decomposition on CPU has been recently the subject of a large amount of research, yielding many improvements. Once its GPU counterpart is available, this might result in improved and competitive simplex implementations on GPU. \putbib[Chapters/chapter10/biblio]