-\chapterauthor{Xavier Meyer}{Department of Computer Science, University of Geneva}
-\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, hepia, \\ University of Applied Sciences of Western Switzerland - Geneva}
-\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
+\chapterauthor{Xavier Meyer and Bastien Chopard}{Department of Computer Science, University of Geneva, Switzerland}
+\chapterauthor{Paul Albuquerque}{Institute for Informatics and Telecommunications, Hepia, \\ University of Applied Sciences of Western Switzerland - Geneva, Switzerland}
+%\chapterauthor{Bastien Chopard}{Department of Computer Science, University of Geneva}
%\chapter{Linear programming on a GPU: a study case based on the simplex method and the branch-cut-and bound algorithm}
-\chapter{Linear programming on a GPU: a~study~case}
+\chapter{Linear programming on a GPU: a case study}
\section{Introduction}
\label{chXXX:sec:intro}
The simplex method is a well-known optimization algorithm for solving linear programming (LP) models in the field of operations research. It is part of software often employed by businesses for finding solutions to problems such as airline scheduling problems. The original standard simplex method was proposed by Dantzig in 1947~\cite{VCLP}. A more efficient method named the revised simplex, was later developed. Nowadays its sequential implementation can be found in almost all commercial LP solvers. But the always increasing complexity and size of LP problems from the industry, drives the demand for more computational power. Indeed, current implementations of the revised simplex strive to produce the expected results, if any. In this context, parallelization is the natural idea to investigate~\cite{HALL}. Already in 1996, Thomadakis et al.~\cite{THOMAD} implemented the standard method on a massively parallel computer and obtained an increase in performances when solving dense or large problems.
We then update the constraints matrix $\mbf{A}$, the bound $\mbf{b}$ and the cost $\mbf{c}$ using Gaussian elimination.
More precisely, denoting $\mbf{\tilde{I}_m}$, $\mbf{\tilde{A}_\N}$, $\mbf{\tilde{c}_\B}$ and $\mbf{\tilde{c}_\N}$ the resulting elements after the exchange, Gaussian elimination then transforms the tableau
+\begin{center}
\begin{tabular}{|c|c||c|}
$\mbf{\tilde{A}_\N}$ & $\mbf{\tilde{I}_m}$ & $\mbf{b}$ \\
\hline
$\mbf{\tilde{c}^T_\N}$ & $\mbf{\tilde{c}^T_\B}$ & $z$
\end{tabular}
+\end{center}
+\noindent
into a tableau with updated values for $\mbf{A_\N}$, $\mbf{c_\N}$ and $\mbf{b}$
-
+\begin{center}
\begin{tabular}{|c|c||c|}
$\mbf{A_\N}$ & $\mbf{I_m}$ & $\mbf{b}$ \\
\hline
$\mbf{c^T_\N}$ & $\mbf{c^T_\B}$ & $z-tc_e$
\end{tabular}
-
+\end{center}
The latter is obtained by first dividing the $\ell$-th row by $a_{\ell e}$;
the resulting row multiplied by $a_{ie}$ ($i\not=\ell$), resp. by $c_e$, is then added to the $i$-th row, resp. the last row.
Hence, $\mbf{c_\B=0}$.
\label{chXXX:algo:simplex-rev}
\textit{// Find entering variable $x_e,\, e\in \B$}\;
$\bs\tau \leftarrow \mbf{(B^{-1})^T}\mbf{c_\B} $ \hspace*{5mm} \textit{// cublasDgemv}\;
- $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}
+ $\bs\upsilon \leftarrow \mbf{c_\N} - \mbf{N^T} \bs\tau $ \hspace*{4mm} \textit{// cublasDgemv}\;
$e \leftarrow argmax \left( \bs\upsilon / \sqrt{\bs\gamma} \right) $\;
\If { $e < 0$ } {\Return \texttt{optima\_found} }
\textit{// Find leaving variable $x_\ell,\, \ell\in \N$} \;
The test environnement is composed of a CPU server (2 Intel Xeon X5650, 2.67GHz, with 96GB DDR3 RAM) and a GPU computing system (NVIDIA tesla M2090) with the 4.2 CUDA Toolkit. This system connects 4 GPUs to the server. Each GPU has 6GB GDDR5 graphics memory and 512 cores (1.3GHz).
\begin{table}[!h]
+\begin{center}
\begin{tabular}{|l|r|r|r|r|}
\hline
\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
\end{tabular}
\caption{NETLIB problems solved in less than 1 second}
\label{chXXX:tab:small}
+\end{center}
\end{table}
Let us begin by discussing the performances on problems solved in less than one second. The name, size, number of non-zero elements (NNZ) and columns to rows ratio (C-to-R) of each problems are reported in table~\ref{chXXX:tab:small}. The performances shown in figure~\ref{chXXX:fig:FastSolve} corroborate our previous observations. On these problems the \texttt{Revised-sparse} method doesn't outperform the \texttt{Standard} one. This can be explained by two factors: the added communications (kernel calls) for the revised method, and the small size and density of the problems. It is likely that sparse operations on a GPU require larger amounts of data to become more efficient than their dense counterpart.
The problems shown in table~\ref{chXXX:tab:medium} are solved in less than 4 seconds. As we can see in figure~\ref{chXXX:fig:NormalSolve}, the expected trend for the \texttt{Revised-base} and the \texttt{Revised-opti} methods is now confirmed. Let us presently focus on the \texttt{Standard} and \texttt{Revised-sparse} methods. Some of the problems, in particular \textit{czprob.mps} and \textit{nesm.mps}, are solved in a comparable amount of time. The performance gain of the \texttt{Revised-sparse} is related to the C-to-R (columns to rows) ratio of these problems, displaying respectively a 3.8 and a 4.4 ratio.
\begin{table}[!h]
+\begin{center}
\begin{tabular}{|l|r|r|r|r|}
\hline
\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
\end{tabular}
\caption{NETLIB problems solved in the range of 1 to 4 seconds}
\label{chXXX:tab:medium}
+\end{center}
\end{table}
\begin{figure}[!h]
Finally, the biggest problems, and slowest to solve, are given in table~\ref{chXXX:tab:big}. A new tendency can be observed in figure~\ref{chXXX:fig:SlowSolve}. The \texttt{Revised-sparse} method is the fastest on most of the problems. The performances are still close between the best two methods on problems having a C-to-R ratio close to 2 like \textit{bnl2.mps}, \textit{pilot.mps} or \textit{greenbeb.mps}. However, when this ratio is greater, the \texttt{Revised-sparse} can be nearly twice as fast as the standard method. This is noticeable on \textit{80bau3b.mps} (4.5) and \textit{fit2p.mps} (4.3). Although the C-to-R ratio of \textit{d6cube.mps} (14.9) exceeds the ones previously mentioned, the \texttt{Revised-sparse} method doesn't show an impressive performance, probably due to the small amount of rows and the density of this problem, which does thus not fully benefit from the lower complexity of sparse operations.
\begin{table}[!h]
+\begin{center}
\begin{tabular}{|l|r|r|r|r|}
\hline
\multicolumn{1}{|l|}{Problem name} & \multicolumn{1}{l|}{Rows} & \multicolumn{1}{l|}{Cols} & \multicolumn{1}{l|}{NNZ} & \multicolumn{1}{l|}{C-to-R} \\ \hline
\end{tabular}
\caption{NETLIB problems solved in more than 5 seconds}
\label{chXXX:tab:big}
+\end{center}
\end{table}
\begin{figure}[!h]