X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/87cc8f5850fb71cf7f864b59fd9318c89077e85e..46085c242f7cc02e2b5b04f7673e28fcb7782371:/BookGPU/Chapters/chapter8/ch8.tex?ds=inline

diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex
index 9352b56..ddd58a3 100644
--- a/BookGPU/Chapters/chapter8/ch8.tex
+++ b/BookGPU/Chapters/chapter8/ch8.tex
@@ -1,55 +1,44 @@
-\chapterauthor{Imen Chakroun}{Universit\'e Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique - 59655, Villeneuve d'Ascq cedex, France\\}
-\chapterauthor{Nouredine Melab}{Universit\'e Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique - 59655, Villeneuve d'Ascq cedex, France\\}
-
-\chapter{GPU-accelerated Tree-based Exact Optimization Methods}
+\chapterauthor{Imen Chakroun and Nouredine Melab}{University of Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique, 59655 Villeneuve d'Ascq cedex, France\\}
+%\chapterauthor{Nouredine Melab}{Universit\'e Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique - 59655, Villeneuve d'Ascq cedex, France\\}
+\chapter{GPU-accelerated tree-based exact optimization methods}
+\label{ch8:GPU-accelerated-tree-based-exact-optimization-methods}
 
 \section{Introduction}
 \label{ch8:introduction}
 
-In practice, a wide range of problems can be modeled as NP-hard combinatorial optimization problems (COPs). Those problems consist in choosing the best combination out of a large finite set of possible combinations and are known to be large in size and difficult to solve to optimality. One of the most popular methods for solving exactly a COP (finding a solution having the optimal cost), is the Branch-and-Bound (B\&B) algorithm. This algorithm is based on an implicit enumeration of all the feasible solutions of the tackled problem. Enumerating the solutions of a problem consists in building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such tree and its exploration is performed using four operators: branching, bounding, selection and pruning. Due to the exponentially increasing number of potential solutions, the B\&B algorithm explores only promising nodes of the search tree using an estimated optimal solution called ``lower bound'' of the associated sub-problem.
-
-\vspace{0.3cm}
-
-Although this bounding mechanism allows to considerably reduce the exploration time, often only small or moderately-sized instances of COPs can be practically solved. For this reason, over the last decades, parallel computing has been revealed as an attractive way to deal with larger instances of COPs. However, while many contributions have been proposed for parallel B\&B methods using Massively Parallel Processors \cite{ch8:Allen_1997}, Networks or Clusters of Workstations \cite{ch8:Quinn_1990} and SMP machines \cite{ch8:Casadoa_2008}, very few contributions have been proposed for redesigning B\&B algorithms on Graphical Processing Units (GPUs) \cite{ch8:Carneiro_2011}. For years, the use of GPU accelerators was limited to graphics and video applications. Driven by the demand for high-definition 3D graphics on personal computers, GPUs have evolved into a highly parallel, multi-threaded and many-core environment. Their utilization has recently been extended to other application domains such as scientific computing \cite{ch8:Kurzak_2010}.
+In practice, a wide range of problems can be modeled as NP-hard combinatorial optimization problems (COPs).
Those problems consist of choosing the best combination out of a large finite set of possible combinations and are known to be large in size and difficult to solve to optimality. One of the most popular methods for solving a COP exactly (finding a solution having the optimal cost) is the Branch-and-Bound (B\&B) algorithm. This algorithm is based on an implicit enumeration of all the feasible solutions of the tackled problem. Enumerating the solutions of a problem consists of building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such a tree and its exploration is performed using four operators: branching, bounding, selection, and pruning. Due to the exponentially increasing number of potential solutions, the B\&B algorithm explores only promising nodes of the search tree using an estimated optimal solution called the ``lower bound'' of the associated subproblem.\\
 
-\vspace{0.3cm}
 
+Although this bounding mechanism allows the considerable reduction of the exploration time, often only small or moderately sized instances of COPs can be practically solved. For this reason, over the last decades, parallel computing has been revealed as an attractive way to deal with larger instances of COPs. However, while many contributions have been proposed for parallel B\&B methods using massively parallel processors \cite{ch8:Allen_1997}, networks or clusters of workstations \cite{ch8:Quinn_1990}, and Shared Memory Multiprocessors (SMP) machines \cite{ch8:Casadoa_2008}, very few contributions have been proposed for redesigning B\&B algorithms on Graphical Processing Units (GPUs) \cite{ch8:Carneiro_2011}. For years, the use of GPU accelerators was limited to graphics and video applications. Driven by the demand for high-definition 3D graphics on personal computers, GPUs have evolved into a highly parallel, multithreaded, and many-core environment. Their utilization has recently been extended to other application domains such as scientific computing \cite{ch8:Kurzak_2010}.\\
 
-In this work, we rethink the design and implementation of irregular tree-based algorithms such as B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of new generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists in exploring in parallel different sub-spaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{Flowshop Scheduling Problem} (FSP)(see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.
+In this chapter, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs.
During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to the unstructured and unpredictable nature of its search tree, designing an efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists of exploring in parallel different subspaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{Flowshop Scheduling Problem} (FSP; see Section~\ref{ch8:BB-FSP}), which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.\\
 
-\vspace{0.3cm}
 
-When rethinking those two parallel model for GPU's architectures, our main focus was on the lower bound function. Indeed, preliminary experiments we carried out on some Taillard's problem instances \cite{ch8:Taillard_1993} show that computing the lower bounds takes on average between 98\% and 99\% of the total execution time of the B\&B. The GPU-based lower bound's implementation raises mainly two challenges. On the one hand, having in mind that the execution model of GPUs is SIMD, irregular computations (containing loops and conditional instructions) contained in the lower bound function may lead to a very challenging issue: the thread or branch divergence. This problem drops down the performance and arises when threads of a same warp (the smallest executable unit of parallelism on the GPU) execute different data-dependent instructions. On the other hand, the lower bound computation usually uses large in size and frequently accessed data structures. Since GPU is a many-core co-processor device that provides a hierarchy of memories having different sizes and access latencies, the placement and sharing of these data sets become challenging.
+When rethinking those two parallel models for GPU architectures, our main focus was on the lower bound function. Indeed, preliminary experiments we carried out on some of Taillard's problem instances \cite{ch8:Taillard_1993} show that computing the lower bounds takes on average between 98\% and 99\% of the total execution time of the B\&B. The GPU-based lower bound's implementation raises mainly two challenges. On the one hand, having in mind that the execution model of GPUs is Single Instruction Multiple Data (SIMD), irregular computations (containing loops and conditional instructions) contained in the lower bound function may lead to a very challenging issue: the thread or branch divergence. This problem degrades the performance and arises when threads of the same warp (the smallest executable unit of parallelism on the GPU) execute different data-dependent instructions. On the other hand, the lower bound computation usually uses large and frequently accessed data structures.
Since the GPU is a many-core coprocessor device that provides a hierarchy of memories having different sizes and access latencies, the placement and sharing of these data sets become challenging.\\
 
-\vspace{0.3cm}
 
-The scope of this chapter is to design parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. To do so, our contributions consist in: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performances of each and identifying which best suits the GPU accelerators. (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function. (2) defining an optimal mapping of the data structures of the bounding function on the hierarchy of memories provided in the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency).
+The scope of this chapter is to design parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. To do so, our contributions consist of: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performances of each and identifying which best suits the GPU accelerators, (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function, and (3) defining an optimal mapping of the data structures of the bounding function on the hierarchy of memories provided in the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency). \\
 
-\vspace{0.3cm}
 
-The chapter is organized in seven main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the Flowshop Scheduling permutation Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2}, details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds, are given. In Section \ref{ch8:ThreadDivergence}, the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator is described. In Section \ref{ch8:DataAccessOpt}, the memory access optimization challenge is addressed and an overview of the GPU memory hierarchy and the used memory access pattern is given. In Section~\ref{ch8:Experiments}, we report experimental results showing the performances of each of two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations.
+The chapter is organized into seven main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the permutation Flowshop Scheduling Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2}, details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds, are given.
In Section \ref{ch8:ThreadDivergence}, the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator is described. In Section \ref{ch8:DataAccessOpt}, the memory access optimization challenge is addressed and an overview of the GPU memory hierarchy and the used memory access pattern is given. In Section~\ref{ch8:Experiments}, we report experimental results showing the performances of each of the two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations.
 
-\section{Branch-and-Bound \index{Branch-and-Bound} algorithm}
+\section{Branch-and-bound \index{Branch-and-bound} algorithm}
 \label{ch8:BB}
 
-Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow to find the optimal solution of a problem with proof of optimality.
+Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow the finding of the optimal solution of a problem with proof of optimality. \\
 
-\vspace{0.3cm}
+The basic idea of the B\&B algorithm consists of implicitly enumerating all the solutions of the original problem by only examining a subset of feasible solutions and eliminating the others when they are not likely to lead to a feasible or an optimal solution. Enumerating the solutions of a problem consists of building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such a tree and its exploration are performed using four operators: branching, bounding, selection, and pruning.\\
 
-The basic idea of the B\&B algorithm consists in implicitly enumerating all the solutions of the original problem by only examining a subset of feasible solutions and eliminating the others when they are not likely to lead to a feasible or an optimal solution. Enumerating the solutions of a problem consists in building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such tree and its exploration are performed using four operators: branching, bounding, selection and pruning.
-
-\vspace{0.3cm}
-
-The algorithm proceeds in several iterations during which the best solution found so far is progressively improved. During the exploration process, the search space is described by a pool of unexplored nodes and the best solution found so far. The generated and not yet examined (pending) nodes are kept into a list initialized with the original problem. At each iteration of the algorithm, the following steps are performed:
+The algorithm proceeds in several iterations during which the best solution found so far is progressively improved. During the exploration process, the search space is described by a pool of unexplored nodes and the best solution found so far. The generated and not yet examined (pending) nodes are kept in a list initialized with the original problem. At each iteration of the algorithm, the following steps are performed:
 
 \begin{itemize}
-	\item The {\it selection operator} chooses one node to process among the pending nodes according to a defined strategy. If the selection is based on the depth of the sub-problem in the B\&B tree, we speak about a depth-first exploration strategy.
A selection based on the breadth of the sub-problem is called a breadth-first exploration. A best-first selection strategy could also be used. It is based on the presumed capacity of the node to yield good solutions.
-	\item The {\it branching operator} subdivides a solution space into two or more disjointed sub-spaces to be investigated in a subsequent iteration.
-	\item The {\it bounding operator} computes a bound value of the optimal solution of each generated sub-problem.
-	\item Each sub-problem having a greater bound than the upper-bound, i.e. the cost of the best solution found so far, is eliminated using the {\it pruning operator}.
+	\item The {\it selection operator} chooses one node to process among the pending nodes according to a defined strategy. If the selection is based on the depth of the subproblem in the B\&B tree, we speak about a depth-first exploration strategy. A selection based on the breadth of the subproblem is called a breadth-first exploration. A best-first selection strategy could also be used. It is based on the presumed capacity of the node to yield good solutions.
+	\item The {\it branching operator} subdivides a solution space into two or more disjoint subspaces to be investigated in a subsequent iteration.
+	\item The {\it bounding operator} computes a bound value of the optimal solution of each generated subproblem.
+	\item Each subproblem having a greater bound than the upper bound, i.e., the cost of the best solution found so far, is eliminated using the {\it pruning operator}.
 \end{itemize}
 
-Algorithm \ref{ch8:algoBB} gives the general template of the Branch-and-Bound method.
+Algorithm \ref{ch8:algoBB} gives the general template of the branch-and-bound method.
 
 \begin{algorithm}[H]
@@ -80,7 +69,7 @@ Set the Best\_Solution to $\emptyset$; \\
 \If{ Lower\_Bound $\leq$ Upper\_Bound }
 {
 	Branch(Sub\_Problem); \\
-	Insert child sub problems into the tree;
+	Insert child subproblems into the tree;
 }
 \Else
 {
@@ -89,33 +78,31 @@ Set the Best\_Solution to $\emptyset$; \\
 }
 }
-\caption{General template of the Branch-and-Bound Algorithm.}
+\caption{general template of the branch-and-bound algorithm.}
 \label{ch8:algoBB}
 \end{algorithm}
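+As a complement to Algorithm~\ref{ch8:algoBB}, the following minimal serial sketch (our own illustration; {\it Node}, {\it select()}, {\it branch()}, {\it lower\_bound()}, {\it is\_leaf()}, and {\it cost()} are hypothetical helpers, not taken from the actual implementation) shows how the four operators cooperate in the template:
+\begin{verbatim}
+// Hypothetical serial B&B skeleton following Algorithm 1.
+#include <algorithm>
+#include <deque>
+#include <limits>
+
+int solve(std::deque<Node> &pool) {
+  int upper_bound = std::numeric_limits<int>::max();
+  while (!pool.empty()) {
+    Node n = select(pool);            // selection (removes n from pool)
+    for (Node &child : branch(n)) {   // branching
+      int lb = lower_bound(child);    // bounding
+      if (lb <= upper_bound) {
+        if (is_leaf(child))
+          upper_bound = std::min(upper_bound, cost(child));
+        else
+          pool.push_back(child);
+      }                               // else: the child is pruned
+    }
+  }
+  return upper_bound;                 // cost of the optimal solution
+}
+\end{verbatim}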
-\section{Parallel Branch-and-Bound algorithms}
+\section{Parallel branch-and-bound algorithms}
 \label{ch8:Parallel-BB}
 
-Thanks to the bounding operator, B\&B allows to significantly reduce the computing time needed to explore the whole solution space. However, finding an optimal solution for large instances remains unpractical using a sequential B\&B. Therefore, parallel processing of these algorithms has been widely studied in the literature. In \cite{ch8:MelabHDR_2005}, a taxonomy of the various existing parallel paradigm used to parallelize the B\&B algorithm is presented.
+Thanks to the bounding operator, B\&B allows the significant reduction of the computing time needed to explore the whole solution space. However, finding an optimal solution for large instances remains impractical using a sequential B\&B. Therefore, parallel processing of these algorithms has been widely studied in the literature. In \cite{ch8:MelabHDR_2005}, a taxonomy of the various existing parallel paradigms used to parallelize the B\&B algorithm is presented.\\
 
-\vspace{0.2cm}
 
-This taxonomy based on the classification proposed in \cite{ch8:Gendron_1994} identified several models to accelerate the B\&B search. The first model we consider in this chapter is called ``parallel tree exploration model'' and belongs to the ``Tree-based'' strategies that aim to build and explore the B\&B tree in parallel. The second model called ``parallel evaluation of bounds model'' (evaluation of bounds in parallel) belong to the parallelization approach called ``Node-based''. This strategy aims to accelerate the execution of a particular operation at the node level.
+This taxonomy, based on the classification proposed in \cite{ch8:Gendron_1994}, identified several models to accelerate the B\&B search. The first model we consider in this chapter is called the ``parallel tree exploration model'' and belongs to the ``tree-based'' strategies that aim to build and explore the B\&B tree in parallel. The second model, called the ``parallel evaluation of bounds model'' (evaluation of bounds in parallel), belongs to the parallelization approach called ``node-based''. This strategy aims to accelerate the execution of a particular operation at the node level.
 
-\vspace{0.2cm}
 
 \subsection{The parallel tree exploration model}
 \label{ch8:para_tree}
 
-Tree-based strategies consist in building and/or exploring the solution tree in parallel by performing operations on several sub-problems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\
+Tree-based strategies consist of building and/or exploring the solution tree in parallel by performing operations on several subproblems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\
 
-The parallel tree exploration \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists in visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding and elimination operators on several sub-problems simultaneously.\\
+The parallel tree exploration \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists of visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding, and elimination operators on several subproblems simultaneously.\\
 
 \begin{figure}
 \begin{center}
 \includegraphics[scale=0.5]{Chapters/chapter8/figures/parallel_exploration.eps}%
-\caption{Illustration of the parallel tree exploration model}
+\caption{Illustration of the parallel tree exploration model.}
 \label{ch8:parallel_tree}
 \end{center}
 \end{figure}
@@ -123,38 +110,35 @@ The parallel tree exploration \index{parallel tree exploration} model, illustrat
 \subsection{The parallel evaluation of bounds model}
 \label{ch8:Node_parallel}
 
-Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist in executing the bounding operation in parallel for each sub-problem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\
+Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist of executing the bounding operation in parallel for each subproblem to accelerate the execution.
This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\
 
-The parallel evaluation of bounds \index{parallel evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of sub-problems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. The model does not change the order and the number of explored sub-problems in the parallel B\&B algorithm compared to the sequential B\&B.
+The parallel evaluation of bounds \index{parallel evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order and the number of explored subproblems in the parallel B\&B algorithm.
 
 \begin{figure}
 \begin{center}
 \includegraphics[scale=0.5]{Chapters/chapter8/figures/parallel_bounding.eps}%
-\caption{Illustration of the parallel evaluation of bounds model}
+\caption{Illustration of the parallel evaluation of bounds model.}
 \label{ch8:bounds_parallel}
 \end{center}
 \end{figure}
 
-\section{The Flowshop Scheduling Problem}
+\section{The flowshop scheduling problem}
 \label{ch8:BB-FSP}
 
-\subsection{Definition of the Flowshop Scheduling Problem}
-\label{ch8:LB-FSP}
-
-As a case study for our GPU-based Branch-and-Bound, we considered the NP-hard and well-known problem in the scheduling theory: the "Permutation Flow-shop Scheduling Problem" (FSP).
-In this work, the mono-objective case is considered. The FSP aims to find the optimal schedule of n jobs on m machines so that the overall completion time of all jobs, called {\it makespan}, is minimized.
-
-\vspace{0.3cm}
+\subsection{Definition of the flowshop scheduling problem}
+As a case study for our GPU-based branch-and-bound algorithm, we considered the NP-hard and well-known problem in scheduling theory: the ``Permutation Flow-shop Scheduling Problem'' (FSP). In this work, the mono-objective case is considered. The FSP aims to find the optimal schedule of $n$ jobs on $m$ machines so that the overall completion time of all jobs, called {\it makespan}, is minimized. \\
 
-Let us suppose the set of jobs is represented by J = {$j_1$, $j_2$, . . . $j_n$} and the set of machines is represented by M = {$m_1$,$m_2$, . . .$m_m$} organized in the line. Each job $j_i$ is a sequence of operations ji = { $oi_1$, $oi_2$, . . . $oi_m$ } where oim is the duration required for the job ji on the machine m.
+Let us suppose the set of jobs is represented by $J = \{j_1, j_2, \dots, j_n\}$ and the set of machines is represented by $M=\{m_1,m_2, \dots, m_m\}$ organized in a line. Each job $j_i$ is a sequence of operations $j_i = \{o_{i,1}, o_{i,2}, \dots, o_{i,m}\}$, where $o_{i,j}$ is the duration required for the job $j_i$ on the machine $m_j$.
 A feasible solution of the flowshop permutation should satisfy these constraints:
 \begin{itemize}
-	\item A machine can not start processing a job if all the machines, which are located upstream, did not finish their treatment. Thus, the operation $oi_j$ cannot be processed by the machine $m_j$ if it is not completed on $m_j$ - 1.
-\item An operation can not be interrupted, and the machines are critical resources, because a machine processes one job at a time.
-\item The sequence of jobs should be the same on every machine, e.g. if j3 is treated in position 2 on the first machine, j3 is also executed in position 2 on all machines.
+	\item A machine cannot start processing a job if all the machines, which are located upstream, have not finished their treatment. Thus, the operation $o_{i,j}$ cannot be processed by the machine $m_j$ if it is not completed on $m_{j - 1}$.
+\item An operation cannot be interrupted, and the machines are critical resources, because a machine processes one job at a time.
+\item The sequence of jobs should be the same on every machine, e.g., if $j_3$ is treated in position 2 on the first machine, $j_3$ is also executed in position 2 on all machines.
 \end{itemize}
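+To make these constraints concrete, the following minimal sketch (our own illustration with hypothetical names; {\it p[j][k]} is assumed to hold the processing time of job $j$ on machine $k$) evaluates the makespan of a given permutation:
+\begin{verbatim}
+// Hypothetical makespan evaluation for a permutation flowshop.
+// c[k] is the completion time of the last scheduled job on machine k;
+// the max() encodes the constraint that machine k waits both for
+// machine k-1 and for its own previous job.
+#include <algorithm>
+#include <vector>
+
+int makespan(const std::vector<std::vector<int>> &p,
+             const std::vector<int> &perm, int m) {
+  std::vector<int> c(m, 0);
+  for (int j : perm) {
+    c[0] += p[j][0];
+    for (int k = 1; k < m; k++)
+      c[k] = std::max(c[k - 1], c[k]) + p[j][k];
+  }
+  return c[m - 1];   // overall completion time of all jobs
+}
+\end{verbatim}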
Figure~\ref{flow-shop} illustrates a solution of a flow-shop problem instance defined by 6 jobs and 3 machines.
@@ -166,16 +150,13 @@ Figure~\ref{flow-shop} illustrates a solution of a flow-shop problem instance de
 \label{flow-shop}
 \end{figure}
 
-\vspace{0.3cm}
 
-\subsection{Lower Bound \index{Lower Bound} for the Flowshop Scheduling Problem}
-\label{ch8:LB-FSP}
+\subsection{Lower bound \index{Lower bound} for the flowshop scheduling problem}
 
-The lower bounding technique provides a lower bound (LB) for each sub-problem generated by the branching operator. The more the bound is accurate, the more it allows to eliminate not promising nodes from the search tree. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}.
+The lower bounding technique provides a lower bound (LB) for each subproblem generated by the branching operator. The more accurate the bound is, the more it allows the elimination from the search tree of nodes that are not promising. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra et al.~\cite{ch8:Lenstra_1978} for FSP, based on Johnson's algorithm~\cite{ch8:Johnson_1954}.\\
 
-\vspace{0.2cm}
 
-The Johnson's algorithm allows to solve optimally FSP with two machines ($m=2$) using the following transitive rule $\preceq$:
+Johnson's algorithm allows the optimal solving of FSP with two machines ($m=2$) using the following transitive rule $\preceq$:
 
$$J_i \preceq J_j \Leftrightarrow \min(p_{i,1}\ ;\ p_{j,2}) \leq \min(p_{i,2}\ ;\ p_{j,1})$$
 
@@ -185,16 +166,15 @@ We recall that $p_{k,l}$ designates the processing time of the job $J_k$ on the
 \textbf{Johnson's theorem} \emph{Given $P$ an FSP with $m=2$, if $J_i\preceq J_j$ there exists an optimal schedule for $P$ in which the job $J_i$ precedes the job $J_j$.}\\
 
 According to Johnson's theorem, FSP with $m=2$ is solved with a time complexity of $O(n \log n)$. The optimal solution is obtained by first sorting in increasing order the jobs having a
-processing time shorter on the first machine than on the second one~; Second, sorting in decreasing order the jobs having a shorter processing time on the second machine.
+processing time shorter on the first machine than on the second one; second, sorting in decreasing order the jobs having a shorter processing time on the second machine. \\
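+As an illustration, this two-step construction can be sketched as follows (hypothetical names; {\it p[j][0]} and {\it p[j][1]} are assumed to be the processing times of job $j$ on the first and second machines):
+\begin{verbatim}
+// Hypothetical sketch of Johnson's O(n log n) rule for m = 2: jobs with
+// p[j][0] < p[j][1] come first, in increasing order of p[j][0]; the
+// remaining jobs follow, in decreasing order of p[j][1].
+#include <algorithm>
+#include <array>
+#include <vector>
+
+std::vector<int> johnson_order(const std::vector<std::array<int,2>> &p) {
+  std::vector<int> head, tail;
+  for (int j = 0; j < (int)p.size(); j++)
+    (p[j][0] < p[j][1] ? head : tail).push_back(j);
+  std::sort(head.begin(), head.end(),
+            [&](int a, int b) { return p[a][0] < p[b][0]; });
+  std::sort(tail.begin(), tail.end(),
+            [&](int a, int b) { return p[a][1] > p[b][1]; });
+  head.insert(head.end(), tail.begin(), tail.end());
+  return head;   // an optimal schedule for the two machines
+}
+\end{verbatim}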
-\vspace{0.2cm}
 
 In~\cite{ch8:JRJackson_1956} and~\cite{ch8:LGMitten_1959}, Johnson's rule has been extended by Jackson and Mitten with lags, which later allowed Lenstra {\it et al.} to propose a lower bound for FSP with $m \geq 3$. A lag~$l_j$ designates the minimum duration between the starting time of the job $J_j$ on the second machine and its finishing time on the first machine. Jackson and Mitten demonstrated that the optimal solution for FSP with $m=2$ can be obtained using the following transitive rule $\preceq$:
 
$$J_i \preceq J_j \Leftrightarrow \min(p_{i,1}+l_i\ ;\ l_j+p_{j,2}) \leq \min(l_i+p_{i,2}\ ;\ p_{j,1}+l_j)$$
 
-Based on this rule, Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} have proposed the following lower bound for a sub-problem associated to a partial schedule where a set {\Large $\jmath$} of jobs have to be scheduled on $m$ machines. $P_{Ja}^*(\jmath,M_k,M_l)$ represents the Jackson-Mitten optimal solution for the sub-problem that consists in scheduling the set {\Large $\jmath$} of jobs on the two machines $M_k$ and~$M_l$. The term $r_{i,k} = \sum_{l<k} p_{i,l}$ designates the earliest starting time of the job $J_i$ on the machine $M_k$. The term $q_{j,l} = \sum_{k>l} p_{j,k}$ refers to the latency between the finishing time of $J_j$ on $M_l$ and the finishing time of the schedule.
+Based on this rule, Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} have proposed the following lower bound for a subproblem associated to a partial schedule where a set {\Large $\jmath$} of jobs have to be scheduled on $m$ machines. $P_{Ja}^*(\jmath,M_k,M_l)$ represents the Jackson-Mitten optimal solution for the subproblem that consists in scheduling the set {\Large $\jmath$} of jobs on the two machines $M_k$ and~$M_l$. The term $r_{i,k} = \sum_{l<k} p_{i,l}$ designates the earliest starting time of the job $J_i$ on the machine $M_k$. The term $q_{j,l} = \sum_{k>l} p_{j,k}$ refers to the latency between the finishing time of $J_j$ on $M_l$ and the finishing time of the schedule.
 
$$LB(\jmath)=\max\limits_{1 \leq k < l \leq m}\{P_{Ja}^*(\jmath,M_k,M_l)+\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l}) \}$$
 
@@ -214,9 +194,8 @@ all the machines between~$k$~and~$l$.
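+To give a flavor of how one term of this bound can be evaluated, the following sketch (our own illustration with hypothetical names, not the chapter's actual code) computes $P_{Ja}^*(\jmath,M_k,M_l)$ for one machine pair, assuming the jobs are already ordered by the Jackson-Mitten rule and that {\it lag[j]} holds the lag of job $j$ between $M_k$ and $M_l$:
+\begin{verbatim}
+// Hypothetical two-machine-with-lags makespan (Johnson's rule with
+// lags): c1 advances on machine k; a job becomes available on machine
+// l only lag[j] time units after finishing on machine k.
+#include <algorithm>
+#include <vector>
+
+int jackson_mitten(const std::vector<int> &jobs,  // ordered job set
+                   const std::vector<int> &pk,    // times on machine k
+                   const std::vector<int> &pl,    // times on machine l
+                   const std::vector<int> &lag) {
+  int c1 = 0, c2 = 0;
+  for (int j : jobs) {
+    c1 += pk[j];                               // completion on M_k
+    c2 = std::max(c1 + lag[j], c2) + pl[j];    // completion on M_l
+  }
+  return c2;   // P*_Ja; the LB then adds the min(r + q) term
+}
+\end{verbatim}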
 \section{GPU-accelerated B\&B based on the parallel tree exploration (GPU-PTE-BB)}
 \label{ch8:approach1}
 
-The first approach we investigate for designing B\&B on GPUs consists in exploring in parallel the generated search tree. The idea is to divide the global search space into disjoint sub-spaces that are explored in parallel by the GPU threads. As explained in Section \ref{ch8:BB}, during the execution of a B\&B, the search space is described by a list of unexplored (pending) nodes and the best solution found so far. In the considered GPU-based scheme, a set of parent nodes is selected from this list according to their depth: deepest pending nodes are the first selected. The selected pool of nodes is off loaded to the GPU where each thread builds its own local search tree by applying the {\it branching}, {\it bounding} and {\it pruning} operators to the assigned node.
+The first approach we investigate for designing B\&B on GPUs consists of exploring in parallel the generated search tree. The idea is to divide the global search space into disjoint subspaces that are explored in parallel by the GPU threads. As explained in Section \ref{ch8:BB}, during the execution of a B\&B, the search space is described by a list of unexplored (pending) nodes and the best solution found so far. In the considered GPU-based scheme, a set of parent nodes is selected from this list according to their depth: the deepest pending nodes are selected first. The selected pool of nodes is offloaded to the GPU, where each thread builds its own local search tree by applying the {\it branching}, {\it bounding}, and {\it pruning} operators to the assigned node.\\
 
-\vspace{0.2cm}
 
 \begin{figure}[h!]
 \centering
 \label{tree_approach}
 \end{figure}
 
-\vspace{0.2cm}
 
 According to the CUDA threading model, each thread has a unique identifier used to determine its assigned role, to assign specific input and output positions, and to select work to perform. Therefore, each node (problem) from the pending list is mapped to a thread to ensure that each sub-space of the solution space is evaluated concurrently and is disjoint from others. Figure \ref{tree_approach} illustrates the scheme of the parallel tree exploration-based GPU-accelerated B\&B.
 
 \section{GPU-accelerated B\&B based on the parallel evaluation of bounds (GPU-PEB-BB) }
 \label{ch8:approach2}
 
-In the GPU-accelerated B\&B based on the parallel evaluation of bounds, illustrated in Figure~\ref{ch8:approach}, the generation of the sub-problems (elimination, selection and branching operations) to be solved is performed on CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. The pool of sub-problems generated on CPU is off-loaded to the GPU device to be evaluated by a pool of threads partitioned into blocks. Each thread applies the lower bound function to one sub-problem. Once the evaluation is completed, the lower bound values corresponding to the different sub-problems is returned to the CPU to be used by the elimination operator to decide either to be pruned or to be decomposed. The process is iterated until the exploration is completed and the optimal solution is found.
+In the GPU-accelerated B\&B based on the parallel evaluation of bounds, illustrated in Figure~\ref{ch8:approach}, the generation of the subproblems (elimination, selection, and branching operations) to be solved is performed on the CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. The pool of subproblems generated on the CPU is off-loaded to the GPU device to be evaluated by a pool of threads partitioned into blocks. Each thread applies the lower bound function to one subproblem. Once the evaluation is completed, the lower bound values corresponding to the different subproblems are returned to the CPU to be used by the elimination operator to decide whether each subproblem should be pruned or decomposed. The process is iterated until the exploration is completed and the optimal solution is found.
 
-\vspace{0.2cm}
 
 \begin{figure}[h!]
 \begin{center}
@@ -244,31 +221,26 @@ In the GPU-accelerated B\&B based on the parallel evaluation of bounds, illustra
 \end{center}
 \end{figure}
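+A minimal CUDA sketch of this scheme could look as follows (hypothetical kernel and type names; {\it lower\_bound()} stands for a device version of the bounding function):
+\begin{verbatim}
+// Hypothetical bounding kernel for GPU-PEB-BB: one thread evaluates
+// the lower bound of one subproblem of the pool received from the CPU.
+__global__ void bound_kernel(const Subproblem *pool, int pool_size,
+                             int *lb) {
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (tid < pool_size)
+    lb[tid] = lower_bound(pool[tid]);   // device bounding function
+}
+
+// Host side (sketch): offload the pool, launch one thread per
+// subproblem, and read the bounds back for the elimination operator.
+//   cudaMemcpy(d_pool, pool, n * sizeof(Subproblem),
+//              cudaMemcpyHostToDevice);
+//   bound_kernel<<<(n + 255) / 256, 256>>>(d_pool, n, d_lb);
+//   cudaMemcpy(h_lb, d_lb, n * sizeof(int), cudaMemcpyDeviceToHost);
+\end{verbatim}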
-\vspace{0.2cm}
 
 In both considered approaches, GPU-PEB-BB and GPU-PTE-BB, the GPU-based lower bound's implementation raises mainly two challenges. The first one is related to the ``single instruction multiple data'' (SIMD) model of the GPU and to the implementation of the LB. Indeed, although typically every GPU thread will run the identical lower bound function, the body of the lower bound can contain conditions on thread identifiers and data. This implies that different instructions are executed in some threads. In SIMD architectures like GPUs, this behavior leads to the thread or branch divergence issue. This problem arises when threads of the same warp execute different data-dependent instructions. It might cause a serious performance decline since computation occurs in parallel only when the same instructions are being performed. The second challenge consists in adjusting the pattern of accesses to the GPU device memory. Good placement of data over the different memory hierarchy allows programmers to further improve the throughput of many high-performance CUDA applications. For B\&B applied to FSP, threads of the same block perform concurrent accesses to the six data structures of the problem when they execute the lower bound function. These data structures have different sizes and access frequencies and should be wisely placed on the different memories of the GPUs, which also have different sizes and latencies.
 
-\vspace{0.2cm}
 
 In the following, we present how we dealt with the thread/branch divergence issue and how we mapped the different data structures on the memory hierarchy of the GPU device, taking into account the characteristics of the data structures and those of the different GPU memories.
 
-\vspace{-0.4cm}
 
-\section{Thread divergence \index{Thread divergence}}
+\section{Thread divergence}
+\label{ch8:ThreadDivergence}
 
 \subsection{The thread divergence issue}
 
-During the execution of an application on GPU, to each GPU multiprocessor is assigned one or more thread block(s) to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
+During the execution of an application on the GPU, each GPU multiprocessor is assigned one or more thread blocks to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{Thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\
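+The following toy kernel (our own example, not taken from the chapter's code) makes the phenomenon concrete: within one warp, the two paths of the data-dependent branch below are executed one after the other instead of in parallel:
+\begin{verbatim}
+// Hypothetical illustration of branch divergence: threads of the same
+// 32-thread warp that evaluate the condition differently force the
+// warp to serialize the two paths.
+__global__ void divergent(const int *data, int *out) {
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (data[tid] > 0)            // data-dependent condition
+    out[tid] = 2 * data[tid];   // taken by some threads of the warp
+  else
+    out[tid] = -data[tid];      // taken by the others, serialized
+}
+\end{verbatim}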
-\vspace{0.2cm}
 
-This section discusses thread divergence issue encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely the locations of nodes in the search tree and the control flow instructions within the bounding operator.
+This section discusses the thread divergence issue encountered when computing the bounds on the GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. \\
 
-\vspace{0.3cm}
 
-\textbf{Divergence related to the location of nodes}
+\textbf{Divergence related to the location of nodes}\\
 
-\vspace{0.3cm}
 
 This divergence is related to the positions of the nodes in the B\&B search tree. Below is given an example from the source code of the used LB showing that the execution flow depends on the position of the node in the search tree. In the following piece of code, three methods are used: {\it is\_leaf()}, {\it makespan()}, and {\it lower\_bound()}. {\it is\_leaf()} tests if the node {\it \_node} is a leaf or an internal node. If {\it \_node} is a leaf, {\it makespan()} computes the cost of its makespan. Otherwise, {\it \_node} is an internal node and {\it lower\_bound()} computes the value of its lower bound.
 
@@ -277,13 +249,11 @@ This divergence is related to the positions of the nodes in the B\&B search tree
      return _node.makespan();
   else
      return _node.lower_bound();
-\end{verbatim}
+\end{verbatim}\\
 
-\vspace{0.3cm}
 
-\textbf{Divergence related to the control flow instructions}
+\textbf{Divergence related to the control flow instructions}\\
 
-\vspace{0.2cm}
 
 Control flow refers to the order in which the instructions, statements, or function calls are executed in a program. This flow is determined by instructions such as {\it if-then-else}, {\it for}, {\it while-do}, {\it switch-case}, etc. There are a dozen such instructions in the implementation of our bounding operator. The source code examples given below show two scenarios in which this kind of instruction is used.
 
@@ -305,35 +275,27 @@ Control flow refers to the order in which the instructions, statements or functi
 \end{itemize}
 
 In these two examples, {\it thread\_idx} is the index associated to the current thread. Let us suppose that the code of Example 1 is executed by $32$ threads, that {\it pool[thread\_idx].begin} is equal to $0$ for the first thread, and that {\it pool[thread\_idx].begin} is not equal to $0$ for the other $31$ threads. When the first thread executes the statement {\it ``time = TimeArrival[1];''},
-all the other $31$ threads remain idle. Therefore, the GPU cores on which these $31$ threads are executed remain idle and can not be used during the execution of the statement {\it ``time = TimeArrival[1];``}.
+all the other $31$ threads remain idle. Therefore, the GPU cores on which these $31$ threads are executed remain idle and cannot be used during the execution of the statement {\it ``time = TimeArrival[1];''}. \\
 
-\vspace{0.2cm}
 
-The same scenario occurs during the execution of Example 2. Let us suppose that the instruction is executed by $32$ threads, {\it pool[thread\_idx].begin} is equal to $100$ for the first thread, and {\it pool[thread\_idx].begin} is equal to $0$ for the other $31$ threads. When the first thread executes the loop $for$, all the other $31$ threads remain idle.
+The same scenario occurs during the execution of Example 2. Let us suppose that the instruction is executed by $32$ threads, that {\it pool[thread\_idx].begin} is equal to $100$ for the first thread, and that {\it pool[thread\_idx].begin} is equal to $0$ for the other $31$ threads. When the first thread executes the {\it for} loop, all the other $31$ threads remain idle. \\
-\vspace{0.2cm}
 
 Existing techniques for handling branch divergence either demand hardware support \cite{ch8:Fung} or require host-GPU interaction \cite{ch8:Zhang}, which incurs overhead. Some other works such as \cite{ch8:Han} intervene at the code level. They expose a branch distribution method that aims to reduce the divergent portion of a branch by factoring out structurally similar code from the branch paths. In our work, we have also opted for software-based optimizations like \cite{ch8:Han}. In fact, we figure out how to literally rewrite the branching instructions into basic ones in order to make thread execution paths uniform. We also demonstrate that we could improve performance simply by judiciously reordering the data being assigned to each thread.
 
 \subsection{Mechanisms for reducing branch divergence}
 
-\vspace{0.3cm}
 
-	\textbf{Thread-data reordering}
+	\textbf{Thread-data reordering}\\
 
-\vspace{0.2cm}
 
+At each iteration of our GPU-accelerated B\&B approach, several thousands of subproblems are sent to the GPU. The GPU groups the received subproblems into several warps according to their reception order. The first 32 subproblems belong to the first warp, the following 32 subproblems belong to the second warp, etc. Therefore, the thread-data reordering technique sorts the subproblems before sending them to the GPU. These subproblems are sorted according to their position in the B\&B tree. This sorting yields warps containing more homogeneous subproblems and reduces the number of thread divergences. \\
-At each iteration of our GPU-accelerated B\&B approach, several thousands of sub-problems are sent to the GPU. The GPU groups the received sub-problems into several warps according to their reception order. The first 32 sub-problems belong to the first warp, the following 32 sub-problems belong to the second warp, etc. Therefore, thread-data reordering technique sorts sub-problems before sending them to the GPU. These sub-problems are sorted according to their position in the B\&B tree. This sort of sub-problems allows to have warps containing more homogeneous sub-problems, and reduces the number of thread divergences.
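+A host-side sketch of this reordering can be as simple as the following (hypothetical names; {\it depth()} is assumed to return the level of a subproblem in the B\&B tree):
+\begin{verbatim}
+// Hypothetical thread-data reordering: sorting the pool by tree depth
+// before offloading it makes consecutive subproblems, and hence the
+// warps built from them, more homogeneous.
+#include <algorithm>
+#include <vector>
+
+void reorder_pool(std::vector<Subproblem> &pool) {
+  std::sort(pool.begin(), pool.end(),
+            [](const Subproblem &a, const Subproblem &b) {
+              return depth(a) < depth(b);  // group nodes of equal depth
+            });
+  // the sorted pool is then sent to the GPU as in GPU-PEB-BB
+}
+\end{verbatim}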
-\vspace{0.2cm}
 
+	\textbf{Branch refactoring}\\
-	\textbf{Branch refactoring}
 
-\vspace{0.2cm}
-
-As quoted above, thread or branch divergence occurs when the kernel includes conditional instructions and loops that make the threads performing different control flows leading to their serial execution. In this chapter, we investigate the branch refactoring approach to deal with thread divergence. Branch refactoring consists in rewriting the conditional instructions so that threads of the same warp execute an uniform code avoiding their divergence. To do that, two major ``if" scenarios are studied and some optimizations are proposed accordingly. These two scenarios correspond to the conditional instructions contained in the $LB$ kernel code. In the first scenario, the conditional expression is a comparison of the content of a variable to 0. For instance, the following example extracted from the pseudo-code of the lower bound $LB$ illustrates such scenario.
+As quoted above, thread or branch divergence occurs when the kernel includes conditional instructions and loops that make the threads perform different control flows, leading to their serial execution. In this chapter, we investigate the branch refactoring approach to deal with thread divergence. Branch refactoring consists of rewriting the conditional instructions so that threads of the same warp execute a uniform code, avoiding their divergence. To do that, two major ``if'' scenarios are studied and some optimizations are proposed accordingly. These two scenarios correspond to the conditional instructions contained in the $LB$ kernel code. In the first scenario, the conditional expression is a comparison of the content of a variable to 0. For instance, the following example extracted from the pseudo-code of the lower bound $LB$ illustrates such a scenario.\\
 
-\vspace{0.3cm}
 
 \begin{tabular}{l}
 \\
@@ -343,11 +305,9 @@ As quoted above, thread or branch divergence occurs when the kernel includes con
 \textsf{ else tmp = RM[1] ; }\\
 \\
 \end{tabular}
 
-\vspace{0.2cm}
 
-The refactoring idea is to replace the conditional expression by two functions namely $f$ and $g$ as shown in Equation~\ref{ch8:Eq1}.
+The refactoring idea is to replace the conditional expression by two functions, namely $f$ and $g$, as shown in Equation~\ref{ch8:Eq1}.\\
 
-\vspace{0.2cm}
 
 The behavior of $f$ and $g$ fits the cosine trigonometric function. These functions return values between $0$ and $1$. An integer variable is used to store the result of the cosine function. Its value is $0$ or $1$ since it is rounded to $0$ if it is not equal to~$1$. In order to increase the performance, the CUDA runtime math operations are used: $sinf(x)$, $expf(x)$, and so forth. Those functions are mapped directly to the hardware level~\cite{ch8:cuda}. They are faster but provide lower accuracy, which does not matter in our case because the results are rounded to $int$.
 
 \begin{equation}
 \begin{array}{lllllllll}
@@ -374,13 +334,11 @@ The behavior of $f$ and $g$ fits the cosine trigonometric function. These functi
 \right.}
 \end{array}
 \label{ch8:Eq1}
-\end{equation}
+\end{equation}\\
 
-\vspace{0.3cm}
 
 The throughput of $sinf(x)$, $cosf(x)$, $expf(x)$ is one operation per clock cycle~\cite{ch8:cuda}. The refactoring result for the ``if'' pseudo-code given above is the following:
 
-\vspace{0.3cm}
 
 \begin{tabular}{l}
 \\
@@ -390,11 +348,9 @@ The throughput of $sinf(x)$, $cosf(x)$, $expf(x)$ is one operation per clock cyc
 \textsf{tmp = (1 - coeff) $\times$ MM[1] + coeff $\times$ RM[1];}\\
 \\
 \end{tabular}
 
-\vspace{0.3cm}
 
 The second ``if'' scenario considered in our study compares two values between themselves, as shown in Equation~\ref{ch8:Eq2}.
 
-\vspace{0.2cm}
 
 \begin{equation}
 \begin{array}{lllllllll}
@@ -420,24 +376,20 @@ The second "if" scenario considered in our study compares two values between the
 \label{ch8:Eq2}
 \end{equation}
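+Assuming Equation~\ref{ch8:Eq2} relies on the $\min(1, expf(x-y-1))$ construction explained below, a compilable device sketch of this transformation (our own illustration, for integer-valued operands) is:
+\begin{verbatim}
+// Hypothetical refactoring of "if (x > y) result = a; else result = b;"
+// for integers: coeff is 1 when x > y and 0 otherwise (truncation).
+__device__ int select_gt(int x, int y, int a, int b) {
+  int coeff = (int) fminf(1.0f, expf((float)(x - y - 1)));
+  return coeff * a + (1 - coeff) * b;
+}
+\end{verbatim}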
-\vspace{0.3cm}
 
-For instance, the following example extracted from the pseudo-code of the lower bound $LB$ illustrates such scenario.
+For instance, the following example extracted from the pseudo-code of the lower bound $LB$ illustrates such a scenario.\\
 
-\vspace{0.3cm}
 
 \footnotesize
 \begin{tabular}{ll}
 \\
 \multicolumn{2}{l}{\textsf{if(RM[1] $>$ MIN )}\{}	\textsf{Best\_idx = Current\_idx;}	\textsf{\}}\\\\
 \end{tabular}
-\normalsize
+\normalsize \\
 
-\vspace{0.3cm}
 
 The same transformations as those applied for the first scenario are applied here using the exponential function. Recall that the exponential is a positive function which is equal to $1$ when applied to $0$. Thus, if $x$ is less than $y$, then $expf(x-y-1)$ returns a value between $0$ and $1$. If the result is rounded to an integer, the value $0$ will be obtained. Now, if $x$ is greater than $y$, then $expf(x-y-1)$ returns a value greater than or equal to $1$, and since the minimum between $1$ and the exponential is taken, the returned result is $1$. Such behavior satisfies exactly our prerequisites. The above ``if'' instruction pseudo-code is now equivalent to:
 
-\vspace{0.3cm}
 
 \small
 \begin{tabular}{l}
@@ -450,9 +402,8 @@ The same transformations as those applied for the first scenario are applied her
 \section{Memory access optimization}
 \label{ch8:DataAccessOpt}
 
-Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory grants programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device.
+Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory allows programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast memory and as little slow-access memory as possible. This section discusses how best to place the LB data items on the various kinds of memory on the device. \\
 
-\vspace{0.2cm}
 
 CUDA enabled devices use several memory spaces, which have different characteristics in terms of sizes and access latencies. These memory spaces include global memory, local memory, shared memory, texture memory, and registers. Devices of compute capability 2.0 also have an L1/L2 cache hierarchy that is used to cache local and global memory accesses.
 
@@ -462,20 +413,16 @@ CUDA enabled devices use several memory spaces, which have different characteris
 The data access optimization challenge is to find the best mapping between the data structures of the application at hand (different sizes and access frequencies) and the GPU hierarchy of memories (different sizes and access latencies). For instance, of these different memory spaces, global memory is the most plentiful but the one with the highest access latency. On the contrary, shared memory is smaller in size but has much higher bandwidth and lower latency than the global memory.
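+As an illustration of such a mapping, a hypothetical placement of the bounding function's data (the matrices are described in the next subsection; the sketch below rests on our own assumptions and is not the placement retained in the experiments) could be declared as follows:
+\begin{verbatim}
+// Hypothetical placement: a small read-only matrix in constant memory,
+// a hot per-block buffer in shared memory, the large subproblem pool
+// in global memory.
+#define N_JOBS 20
+
+__constant__ int c_JM[N_JOBS * N_JOBS];   // small, read-only matrix
+
+__global__ void lb_kernel(const int *g_pool, int pool_size, int *g_lb) {
+  __shared__ int s_PTM[N_JOBS];           // frequently accessed data
+  if (threadIdx.x < N_JOBS)
+    s_PTM[threadIdx.x] = g_pool[threadIdx.x];  // cooperative load
+  __syncthreads();
+  int tid = blockIdx.x * blockDim.x + threadIdx.x;
+  if (tid < pool_size)
+    g_lb[tid] = s_PTM[tid % N_JOBS] + c_JM[0]; // placeholder use only
+}
+\end{verbatim}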
-\subsection{Complexity analysis of the memory usage of the Lower Bound }
-\label{ch8:MemComplex}
+\subsection{Complexity analysis of the memory usage of the lower bound }
 
-In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix $PTM$ of the processing times of the jobs, the matrix of lags $LM$, the Johnson's matrix $JM$, the matrix $RM$ of the earliest starting times of jobs, the matrix $QM$ of their lowest latency times and the matrix $MM$ containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent respectively the name of the data structure, its size and the number of times it is accessed.
+In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix $PTM$ of the processing times of the jobs, the matrix of lags $LM$, Johnson's matrix $JM$, the matrix $RM$ of the earliest starting times of jobs, the matrix $QM$ of their lowest latency times, and the matrix $MM$ containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex}, where the columns represent respectively the name of the data structure, its size, and the number of times it is accessed.\\
 
-\vspace{0.2cm}
 
-In the $LB$ expression, the computation of the term $P_{Ja}^*(\jmath,M_k,M_l)$ requires the calculation of the lag of each remaining job to be scheduled on the couple $(M_k,M_l)$ of machines using its processing times on these machines (Johnson's rule with lags). Such computation is repeated for each couple $(M_k,M_l)$ of machines with $1 \leq k,l \leq m$ and $k<l$.