Although this bounding mechanism allows the considerable reduction of the exploration time, often only small or moderatelysized instances of COPs can be practically solved. For this reason, over the last decades, parallel computing has been revealed as an attractive way to deal with larger instances of COPs. However, while many contributions have been proposed for parallel B\&B methods using massively parallel processors \cite{ch8:Allen_1997}, networks or clusters of workstations \cite{ch8:Quinn_1990}, and Shared Memory Multiprocessors (SMP) machines \cite{ch8:Casadoa_2008}, very few contributions have been proposed for redesigning B\&B algorithms on Graphical Processing Units (GPUs) \cite{ch8:Carneiro_2011}. For years, the use of GPU accelerators was limited to graphics and video applications. Driven by the demand for high-definition 3D graphics on personal computers, GPUs have evolved into a highly parallel, multithreaded and many-core environment. Their utilization has recently been extended to other application domains such as scientific computing \cite{ch8:Kurzak_2010}.\\
-In this chapter, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists of exploring in parallel different subspaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{Flowshop Scheduling Problem} (FSP; see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.\\
+In this chapter, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists of exploring in parallel different subspaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{flowshop scheduling problem} (FSP; see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.\\
When rethinking those two parallel models for GPU's architectures, our main focus was on the lower bound function. Indeed, preliminary experiments we carried out on some of Taillard's problem instances \cite{ch8:Taillard_1993} show that computing the lower bounds takes on average between 98\% and 99\% of the total execution time of the B\&B. The GPU-based lower bound's implementation raises mainly two challenges. On the one hand, having in mind that the execution model of GPUs is Single Instruction Multiple Data (SIMD), irregular computations (containing loops and conditional instructions) contained in the lower bound function may lead to a very challenging issue: the thread or branch divergence. This problem drops down the performance and arises when threads of a same warp (the smallest executable unit of parallelism on the GPU) execute different data-dependent instructions. On the other hand, the lower bound computation usually uses large and frequently accessed data structures. Since GPU is a many-core coprocessor device that provides a hierarchy of memories having different sizes and access latencies, the placement and sharing of these data sets become challenging.\\
The chapter is organized into seven main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the Flowshop Scheduling permutation Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2}, details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds, are given. In Section \ref{ch8:ThreadDivergence}, the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator is described. In Section \ref{ch8:DataAccessOpt}, the memory access optimization challenge is addressed and an overview of the GPU memory hierarchy and the used memory access pattern is given. In Section~\ref{ch8:Experiments}, we report experimental results showing the performances of each of two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations.
-\section{Branch-and-bound \index{Branch-and-bound} algorithm}
+\section{Branch-and-bound \index{branch-and-bound} algorithm}
\label{ch8:BB}
Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow the finding of the optimal solution of a problem with proof of optimality. \\
Tree-based strategies consist of building and/or exploring the solution tree in parallel by performing operations on several subproblems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\
-The parallel tree exploration \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists of visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding, and elimination operators on several subproblems simultaneously.\\
+The parallel tree exploration \index{parallel!tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists of visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding, and elimination operators on several subproblems simultaneously.\\
\begin{figure}
\begin{center}
Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist of executing the bounding operation in parallel for each subproblem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\
-The parallel evaluation of bounds \index{parallel evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order and the number of explored subproblems in the parallel B\&B algorithm.
+The parallel evaluation of bounds \index{parallel!evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order and the number of explored subproblems in the parallel B\&B algorithm.
\begin{figure}
\begin{center}
\end{figure}
-\subsection{Lower bound \index{Lower bound} for the flowshop scheduling problem}
+\subsection{Lower bound \index{lower bound} for the flowshop scheduling problem}
The lower bounding technique provides a lower bound (LB) for each subproblem generated by the branching operator. The more the bound is accurate, the more it allows the elimination from the search tree that are not promising. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra et al.~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}.\\
\subsection{The thread divergence issue}
-During the execution of an application on GPU, one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{Thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\
+During the execution of an application on GPU, one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{GPU!thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\
This section discusses thread divergence issues encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. \\
\section{Memory access optimization}
\label{ch8:DataAccessOpt}
-Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory allows programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast-access memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device. \\
+Memory access optimizations \index{memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory allows programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast-access memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device. \\
CUDA-enabled devices use several memory spaces, which have different characteristics in term of sizes and access latencies. These memory spaces include global memory, local memory , shared memory, texture memory , and registers. Devices of compute capability 2.0 also have an L1/L2 cache hierarchy that is used to cache local and global memory accesses.
\subsection{Complexity analysis of the memory usage of the lower bound }
-In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix $PTM$ of the processing times of the jobs, the matrix of lags $LM$, the Johnson's matrix $JM$, the matrix $RM$ of the earliest starting times of jobs, the matrix $QM$ of their lowest latency times, and the matrix $MM$ containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent, respectively, the name of the data structure, its size, and the number of times it is accessed.\\
+In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix PTM of the processing times of the jobs, the matrix of lags LM, the Johnson's matrix JM, the matrix RM of the earliest starting times of jobs, the matrix QM of their lowest latency times, and the matrix MM containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent, respectively, the name of the data structure, its size, and the number of times it is accessed.\\
-In the LB expression, the computation of the term $P_{Ja}^*(\jmath,M_k,M_l)$ requires the calculation of the lag of each remaining job to be scheduled on the couple $(M_k,M_l)$ of machines using its processing times on these machines (Johnson's rule with lags). Such computation is repeated for each couple $(M_k,M_l)$ of machines with $1 \leq k,l \leq m$ and $k<l$. To avoid the repetitive computation of the lags, they are computed once at the beginning of the algorithm and stored in the matrix $LM$. The dimension of $LM$ is $n \times \frac{m\times (m-1)}{2}$, where $n$ and $m$ are respectively the number of jobs to be scheduled and $m$ the number of machines. $LM$ is accessed $n' \times \frac{m \times (m-1)}{2}$ times, $n'$ being the number of remaining jobs to be scheduled in the subproblem for which the lower bound is being calculated. The processing times of all the jobs on all the machines are stored in the matrix $PTM$. This matrix has a dimension of $n \times m$ and is accessed $n' \times m \times (m-1)$ times.\\
+In the LB expression, the computation of the term $P_{Ja}^*(\jmath,M_k,M_l)$ requires the calculation of the lag of each remaining job to be scheduled on the couple $(M_k,M_l)$ of machines using its processing times on these machines (Johnson's rule with lags). Such computation is repeated for each couple $(M_k,M_l)$ of machines with $1 \leq k,l \leq m$ and $k<l$. To avoid the repetitive computation of the lags, they are computed once at the beginning of the algorithm and stored in the matrix $LM$. The dimension of $LM$ is $n \times \frac{m\times (m-1)}{2}$, where $n$ and $m$ are respectively the number of jobs to be scheduled and $m$ the number of machines. $LM$ is accessed $n' \times \frac{m \times (m-1)}{2}$ times, $n'$ being the number of remaining jobs to be scheduled in the subproblem for which the lower bound is being calculated. The processing times of all the jobs on all the machines are stored in the matrix PTM. This matrix has a dimension of $n \times m$ and is accessed $n' \times m \times (m-1)$ times.\\
-In addition, in order to avoid relaunching the Johnson's algorithm for each couple of machines and each subset of jobs, the Johnson's algorithm is computed once to find the optimal solutions on the couples of machines. These optimal solutions are then stored in the Johnson's matrix $JM$. This matrix has the same dimension as $LM$ and is accessed $n \times \frac{m \times (m-1)}{2}$ times during the computation of the lower bound. Finally, the $MM$ matrix that contains all the couples of machines has a dimension and access frequency of $m \times (m-1)$. \\
+In addition, in order to avoid relaunching the Johnson's algorithm for each couple of machines and each subset of jobs, the Johnson's algorithm is computed once to find the optimal solutions on the couples of machines. These optimal solutions are then stored in the Johnson's matrix JM. This matrix has the same dimension as LM and is accessed $n \times \frac{m \times (m-1)}{2}$ times during the computation of the lower bound. Finally, the MM matrix that contains all the couples of machines has a dimension and access frequency of $m \times (m-1)$. \\
-To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l})$ in the LB expression, two matrices are defined, namely $RM$ and $QM$. They are used to store, respectively, the lowest starting and latency times of all the jobs on each machine. Their dimension is $m$ and, are accessed $ m \times (m-1)$ times and $ \frac{m \times (m-1)}{2}$ times, respectively.
+To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l})$ in the LB expression, two matrices are defined, namely RM and QM. They are used to store, respectively, the lowest starting and latency times of all the jobs on each machine. Their dimension is $m$ and, are accessed $ m \times (m-1)$ times and $ \frac{m \times (m-1)}{2}$ times, respectively.
\begin{table}
\centering
\end{table}
-Taking into consideration the sizes of each data structure presented in Table \ref{ch8:tabMemSizes}, our challenge is to find which data structure has to be mapped onto which memory and in some cases how to split the data structures onto different memories and efficiently manage their accesses. The sizes in bytes reported in Table \ref{ch8:tabMemSizes} are computed knowing that in our implementation the elements of $JM$ and $PTM$ are unsigned chars (one byte) and that the elements of $LM$, $RM$, $QM$, and $MM$ are unsigned short ints (2 bytes). It is important here to highlight that the types of the data of the used matrices impact the size of each matrix. For instance, a matrix of $100$ integers has a size of $400$ octets while the same matrix with $100$ unsigned chars has a size of $100$ octets. In order to minimize the size of each of the used matrices, we analyzed the ranges of their values and defined their data types accordingly. For instance, in PTM all the processing times have positive values varying between $0$ and $100$. Therefore, we defined PTM as a matrix of \verb|unsigned char| having values in the range $[0, 255]$. Using the \verb|unsigned char| type instead of the integer type allows us to reduce by $4$ times the memory space occupied by PTM.\\
+Taking into consideration the sizes of each data structure presented in Table \ref{ch8:tabMemSizes}, our challenge is to find which data structure has to be mapped onto which memory and in some cases how to split the data structures onto different memories and efficiently manage their accesses. The sizes in bytes reported in Table \ref{ch8:tabMemSizes} are computed knowing that in our implementation the elements of JM and PTM are unsigned chars (one byte) and that the elements of LM, RM, QM, and MM are unsigned short ints (2 bytes). It is important here to highlight that the types of the data of the used matrices impact the size of each matrix. For instance, a matrix of $100$ integers has a size of $400$ octets while the same matrix with $100$ unsigned chars has a size of $100$ octets. In order to minimize the size of each of the used matrices, we analyzed the ranges of their values and defined their data types accordingly. For instance, in PTM all the processing times have positive values varying between $0$ and $100$. Therefore, we defined PTM as a matrix of \verb|unsigned char| having values in the range $[0, 255]$. Using the \verb|unsigned char| type instead of the integer type allows us to reduce by $4$ times the memory space occupied by PTM.\\
According to the Table \ref{ch8:tabMemSizes} :
\begin{itemize}
- \item The data structures $RM$, $QM$ and $MM$ are small sized-matrices. Therefore, their impact on the performances is not significant whatever is the memory to which they are off-loaded. In particular, preliminary experiments prove that putting them on the shared memory would allows a very poor performance improvement.
-\item The $LM$ data structure is the double of the $JM$ in memory size but with a much lower access frequency. It is thus better to map $JM$ on the shared memory.
-\item The $PTM$ has almost the same access frequency than $JM$ but requires less memory space.
+ \item The data structures RM, QM and MM are small sized-matrices. Therefore, their impact on the performances is not significant whatever is the memory to which they are off-loaded. In particular, preliminary experiments prove that putting them on the shared memory would allows a very poor performance improvement.
+\item The LM data structure is the double of the JM in memory size but with a much lower access frequency. It is thus better to map JM on the shared memory.
+\item The PTM has almost the same access frequency than JM but requires less memory space.
\end{itemize}
-Consequently, the focus is put on the study of the performance impact of the placement of $JM$ and $PTM$ on the shared memory. Three placement scenarios of $JM$ and $PTM$ are experimented and studied: (1) Only $PTM$ is stored in shared memory and all others are placed in global memory~; (2) Only $JM$ is stored in shared memory and all others are placed on global memory~; (3) $PTM$ and $JM$ are stored together in shared memory and all others are placed on global memory. \\
+Consequently, the focus is put on the study of the performance impact of the placement of JM and PTM on the shared memory. Three placement scenarios of JM and PTM are experimented and studied: (1) Only PTM is stored in shared memory and all others are placed in global memory~; (2) Only JM is stored in shared memory and all others are placed on global memory~; (3) PTM and JM are stored together in shared memory and all others are placed on global memory. \\
Taking profit from the configurable storage space provided in the new Fermi-based devices, the $64$ KB of local storage was split between the shared memory and the L1 cache according to the experimented scenario.
\subsection{Experimental protocol: computing the speedup}
\label{ch8:Protocol}
-We need to compute the speed up of our approach to evaluate its performances. This speed up is obtained by comparing our GPU B\&B version to a sequential B\&B version deployed on one CPU core. However, all the instances used in our experiments are extremely hard to solve. Indeed, the resolution of each of these instances requires several months of computation on one CPU core. For example, the optimal solution of one of these instances defined by $50$ jobs and $20$ machines is obtained after $25$ days of computation using an average of $328$ CPU cores \cite{ch8:Mezmaz_2007}. \\
+We need to compute the speedup of our approach to evaluate its performances. This speedup is obtained by comparing our GPU B\&B version to a sequential B\&B version deployed on one CPU core. However, all the instances used in our experiments are extremely hard to solve. Indeed, the resolution of each of these instances requires several months of computation on one CPU core. For example, the optimal solution of one of these instances defined by $50$ jobs and $20$ machines is obtained after $25$ days of computation using an average of $328$ CPU cores \cite{ch8:Mezmaz_2007}. \\
-Using the approach defined in \cite{ch8:Mezmaz_2007}, it is possible to obtain a random list $L$ of subproblems such that the resolution of $L$ lasts $T$ minutes with a sequential B\&B. So by initializing the pool of our sequential B\&B with the subproblems of this list $L$, we are sure that the resolution of the sequential B\&B will last $T{cpu}$ minutes such as $T{cpu}$ will be approximately equal to $T$. Therefore, it will be possible to initialize the pool of our GPU B\&B with the same list $L$ of subproblems in order to compute the speed up. Let us suppose that the resolution of the GPU B\&B will last $T{gpu}$ minutes. So the speed up of our GPU algorithm will be equal to $Tcpu/Tgpu$. With this experimental protocol, the subproblems explored by the GPU and CPU B\&B versions will be exactly the same. So to find the speed up associated to an instance, we:
+Using the approach defined in \cite{ch8:Mezmaz_2007}, it is possible to obtain a random list $L$ of subproblems such that the resolution of $L$ lasts $T$ minutes with a sequential B\&B. So by initializing the pool of our sequential B\&B with the subproblems of this list $L$, we are sure that the resolution of the sequential B\&B will last $T{cpu}$ minutes such as $T{cpu}$ will be approximately equal to $T$. Therefore, it will be possible to initialize the pool of our GPU B\&B with the same list $L$ of subproblems in order to compute the speedup. Let us suppose that the resolution of the GPU B\&B will last $T{gpu}$ minutes. So the speedup of our GPU algorithm will be equal to $Tcpu/Tgpu$. With this experimental protocol, the subproblems explored by the GPU and CPU B\&B versions will be exactly the same. So to find the speedup associated to an instance, we:
\begin{itemize}
\item compute, using the approach defined in \cite{ch8:Mezmaz_2007}, a list $L$ of subproblems such as the resolution of $L$ lasts $T$ minutes with a sequential B\&B;
\item solve the subproblems of this pool with our GPU B\&B;
\item get the GPU resolution time $T{gpu}$ and the number of explored subproblems $N{gpu}$;
\item check that $N{gpu}$ is exactly equal to $N{cpu}$;
-\item and finally compute the speed up associated to this instance by dividing $T{cpu}$ by $T{gpu}$ (i.e., $Tcpu/Tgpu$).
+\item and finally compute the speedup associated to this instance by dividing $T{cpu}$ by $T{gpu}$ (i.e., $Tcpu/Tgpu$).
\end{itemize}
\footnotesize
\begin{tabular}{|r|r|r|r|r|}
\hline
-Instance (No. of jobs x No. of machines) & 20$\times$20 & 50$\times$20 & 100$\times$20 & 200$\times$20 \\
+Instance (No. of jobs $\times$ No. of machines) & 20$\times$20 & 50$\times$20 & 100$\times$20 & 200$\times$20 \\
\hline
Sequential resolution time (minutes) & 10 & 50 & 150 & 300 \\
\hline
The objective of the experimental study presented in this section is to compared the performances of both proposed approaches for designing B\&B on top of GPUs.
-Table \ref{ch8:ParaGPU1} and Table~\ref{ch8:ParaGPU2} report respectively the speedups obtained with the GPU-PTE-BB and GPU-PEB-BB approaches for different problem instances. The first part of both tables gives the size of the pool generated and evaluated on the GPU. The second part of the tables gives the average speedup for each group of instances and for each pool size. Each line corresponds to a group of $10$ instances defined by the same number of jobs and the same number of machines.
+Table \ref{ch8:ParaGPU1} and Table~\ref{ch8:ParaGPU2} report the speedups obtained with the GPU-PTE-BB and GPU-PEB-BB approaches, respectively, for different problem instances. The first part of both tables gives the size of the pool generated and evaluated on the GPU. The second part of the tables gives the average speedup for each group of instances and for each pool size. Each line corresponds to a group of $10$ instances defined by the same number of jobs and the same number of machines. \\
-The results obtained with the GPU-PTE-BB approach (see Table \ref{ch8:ParaGPU1}) show that exploring in parallel the tree search allows to speedup the execution of the B\&B compared to a CPU-based execution. Indeed, an acceleration factor up to 40.50 is obtained for the 20 $\times$ 20 problem instances using a pool of 262144 subproblems.
+The results obtained with the GPU-PTE-BB approach (see Table \ref{ch8:ParaGPU1}) show that exploring the tree search in parallel allows the speedup of the execution of the B\&B compared to a CPU-based execution. Indeed, an acceleration factor up to 40.50 is obtained for the 20 $\times$ 20 problem instances using a pool of 262144 subproblems. \\
-The results show also that the parallel efficiency decreases with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup decline accordingly with the number of jobs. For instance for a pool size of 262144, the acceleration factor obtained with 200 jobs (13.4) while it is (40.50) for the instances with 20 jobs. This behavior is mainly due to the overhead induced by the transfer of the pool of resulting subproblems between the CPU and the GPU. For example, for the instances with 200 jobs the size of the pool to exchange between the CPU and the GPU is ten times bigger than the size of the pool for the instances with 20 jobs.
-
-\begin{table}[htbp]
+\begin{table}[h]
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
\hline
\hline
-(NJobs $\times$ NMachines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
+($N$ Jobs $\times$ $N$ Machines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
\hline
$200 \times $20 & 1.12 & 2.89 & 3.57 & 4.23 & 6.442 & 8.32 & 13.4\\
\hline
\label{ch8:ParaGPU1}
\end{table}
-The results obtained with the GPU-PEB-BB approach (see Table \ref{ch8:ParaGPU2}) show that evaluating in parallel the bounds of a selected pool, allow to significantly speedup the execution of the B\&B. Indeed, an acceleration factor up to 71.69 is obtained for the 200 $\times$ 20 problem instances using a pool of 262144 subproblems. The results show also that the parallel efficiency grows with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup grows accordingly with the number of jobs. For instance for a pool size of 262144, the acceleration factor obtained with 200 jobs (71.69) is almost the double of the one obtained with 20 jobs (38.40).
-As far the pool size tuning is considered, we could notice that this parameter depends strongly on the problem instance being solved. Indeed, while the best acceleration is obtained with a pool size of 8192 subproblems for the instances 50 $\times$ 20 and 20 $\times$ 20, the best speedups are obtained with a pool size of 262144 subproblems with the instances 200 $\times$ 20 and 100 $\times$ 20.\\
+The results show also that the parallel efficiency decreases with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup declines accordingly with the number of jobs. For instance for a pool size of 262144, the acceleration factor obtained with 200 jobs is 13.4 while it is 40.50 for the instances with 20 jobs. This behavior is mainly due to the overhead induced by the transfer of the pool of resulting subproblems between the CPU and the GPU. For example, for the instances with 200 jobs the size of the pool to exchange between the CPU and the GPU is ten times bigger than the size of the pool for the instances with 20 jobs.\\
-\begin{table}
+
+The results obtained with the GPU-PEB-BB approach (see Table \ref{ch8:ParaGPU2}) show that evaluating the bounds of a selected pool in parallel, allows significants speedup of the execution of the B\&B. Indeed, an acceleration factor up to 71.69 is obtained for the 200 $\times$ 20 problem instances using a pool of 262144 subproblems. The results show also that the parallel efficiency grows with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup grows accordingly with the number of jobs. For instance for a pool size of 262144, the acceleration factor obtained with 200 jobs (71.69) is almost double obtained with 20 jobs (38.40). \\
+
+\begin{table}[h]
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
\label{ch8:ParaGPU2}
\end{table}
-Compared to the parallel tree exploration-based GPU-accelerated B\&B approach, the parallel evaluation of bounds approach is by far much more efficient wherever the instance is. For example, while the GPU-PEB-BB approach reaches speedup of $\times$71.69 for the instance with 200 jobs on 20 machines, a speedup of a $\times$13.4 is measured with the parallel tree exploration-based approach which corresponds to an acceleration of $\times$5.56 . Moreover, on the contrary to the GPU-PEB-BB approach, in the GPU-PTE-BB the speedups decrease when the problem instance becomes higher. Remember here that while in the GPU-PEB-BB approach all threads evaluate only one node each whatever the permutation size is. In the GPU-PTE-BB, each thread branches all the children of its assigned parent node. Therefore, the bigger the size of the permutation is, the bigger the amount of work performed by each thread is and the bigger the difference between the workload is. Indeed, let us suppose that for the instance with $200$ jobs, the thread $0$ handles a node from the level $2$ of the tree and the thread $100$ handles a node from the level $170$ of the tree. In this case, the thread $0$ generates and evaluates $198$ nodes while the thread $100$ decomposes and bounds only $30$ nodes. The problem in this example is that the kernel execution would last until the thread $0$ finishes its work while the other threads might have ended their works and stayed idle.
+As far as the pool size tuning is considered, we could notice that this parameter depends strongly on the problem instance being solved. Indeed, while the best acceleration is obtained with a pool size of 8192 subproblems for the instances 50 $\times$ 20 and 20 $\times$ 20 (Table\~ref{ch8:ParaGPU2} in bold), the best speedups are obtained with a pool size of 262144 subproblems with the instances 200 $\times$ 20 and 100 $\times$ 20 (Table\~ref{ch8:ParaGPU2} in bold).\\
+
+
+
+Compared to the parallel tree exploration-based GPU-accelerated B\&B approach, the parallel evaluation of bounds approach is by far much more efficient wherever the instance is. For example, while the GPU-PEB-BB approach reaches speedup of $\times$71.69 for the instance with 200 jobs on 20 machines, a speedup of a $\times$13.4 is measured with the parallel tree exploration-based approach which corresponds to an acceleration of $\times$5.56. Moreover, to the contrary of the GPU-PEB-BB approach, in the GPU-PTE-BB the speedups decrease when the problem instance becomes higher. Remember here that while in the GPU-PEB-BB approach all threads evaluate only one node each whatever the permutation size is. In the GPU-PTE-BB, each thread branches all the children of its assigned parent node. Therefore, the bigger the size of the permutation, the bigger the amount of work performed by each thread is and the bigger the difference between the workload. Indeed, let us suppose that for the instance with $200$ jobs, the thread $0$ handles a node from the level $2$ of the tree and the thread $100$ handles a node from the level $170$ of the tree. In this case, the thread $0$ generates and evaluates $198$ nodes while the thread $100$ decomposes and bounds only $30$ nodes. The problem in this example is that the kernel execution would last until the thread $0$ finishes its work while the other threads might have completed their work and remains idle.
\subsection{Thread divergence reduction}
-The objective of this section is to demonstrate that the thread divergence reduction mechanisms we propose has an impact on the performance of the GPU accelerated B\&B and to evaluate how this impact is significant.
+The objective of this section is to demonstrate that the thread divergence reduction mechanisms we propose have an impact on the performance of the GPU accelerated B\&B and to evaluate how this impact is significant.
In the following, the reported results are obtained with the GPU-accelerated B\&B based on the parallel evaluation of bounds.
-\begin{table}[!h]
+\begin{table}[h]
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
\hline
\hline
-(NJobs $\times$ NMachines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
+($N$ Jobs $\times$ $N$ Machines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
\hline
\hline
$200 \times $20 & 46.63 & 60.88 & 63.80 & 67.51 & 73.47 & 75.94 & \textbf{77.46}\\
\label{ch8:ParaDivergence}
\end{table}
-Table~\ref{ch8:ParaDivergence} shows the experimental results obtained using the sorting process and the refactoring approach presented in Section \ref{ch8:ThreadDivergence}. Results show that the proposed optimizations emphasize the GPU acceleration reported in Table~\ref{ch8:ParaGPU2} and obtained without thread divergence reduction. For example, for the instances of 200 jobs over 20 machines and a pool size of 262144, the average reported speedup is 77.46 while the average acceleration factor obtained without thread divergence management for the same instances and the same pool size is 71.69 which corresponds to an improvement of 7.68\%. Such considerable but not outstanding improvement is predictable, as claimed in \cite{ch8:Han}, since the factorized part of the branches in the FSP lower bound is very small.
+Table~\ref{ch8:ParaDivergence} shows the experimental results obtained using the sorting process and the refactoring approach presented in Section \ref{ch8:ThreadDivergence}. Results show that the proposed optimizations emphasize the GPU acceleration reported in Table~\ref{ch8:ParaGPU2} obtained without thread divergence reduction. For example, for the instances of 200 jobs over 20 machines and a pool size of 262144, the average reported speedup is 77.46 (Table~\ref{ch8:ParaDivergence} in bold) while the average acceleration factor obtained without thread divergence management for the same instances and the same pool size is 71.69 which corresponds to an improvement of 7.68\%. Such considerable but not outstanding improvement is predictable, as claimed in \cite{ch8:Han}, since the factorized part of the branches in the FSP lower bound is very small.
\subsection{Data access optimization}
-The objective of the experimental study presented in this section is to find the best mapping of the six data structures of the lower bound LB kernel on the memories of the GPU device. In the following, the reported results are obtained with the GPU-accelerated B\&B based on the parallel evaluation of bounds.
+The objective of the experimental study presented in this section is to find the best mapping of the six data structures of the LB kernel on the memories of the GPU device. In the following, the reported results are obtained with the GPU-accelerated B\&B based on the parallel evaluation of bounds.
-Table~\ref{ch8:PTM-on-SM} reports the speedups obtained for the first experimented scenario where only the matrix $PTM$ is put on the shared memory. Results show that the speedup grows on average with the growing of the pool size in the same way as in Table~\ref{ch8:ParaDivergence}. For the largest problem instance and pool size, putting the PTM matrix on the shared memory improves the speedups up to ($14\%$) compared to those obtained when $PTM$ is on global memory reaching an acceleration of $\times 90.51$ for the problem instances $200 \times 20$ and a pool size of $262144$ subproblems .
+Table~\ref{ch8:PTM-on-SM} reports the speedups obtained for the first experimental scenario where only the matrix PTM is put on the shared memory. Results show that the speedup grows on average with the growing of the pool size in the same way as in Table~\ref{ch8:ParaDivergence}. For the largest problem instance and pool size, putting the PTM matrix on the shared memory improves the speedups up ($14\%$) compared to those obtained when PTM is on global memory reaching an acceleration of $\times 90.51$ for the problem instances $200 \times 20$ and a pool size of $262144$ subproblems .
\begin{table}
\centering
Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
\hline
\hline
-(NJobs $\times$ NMachines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
+($N$ Jobs $\times$ $N$ Machines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
\hline
\hline
$200 \times $20 & 54.03 & 67.75 & 68.43 & 72.17 & 82.01 & 88.35 & \textbf{90.51}\\
% \hline
% \hline
\end{tabular}
- \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ is placed in shared memory and all others are placed in global memory.}
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. PTM is placed in shared memory and all others are placed in global memory.}
\label{ch8:PTM-on-SM}
\end{table}
-Table~\ref{ch8:JM-on-SM} reports the behavior of the speedup averaged on the different problem instances (sizes) as a function of the pool size for the scenario where the Johnson's matrix is put on the shared memory. Results show that putting the $JM$ matrix on the shared matrix improves more the performances comparing to the first scenario where $PTM$ is put on the shared memory. Indeed, according to Table~\ref{ch8:tabMemComplex}, matrix $JM$ is accessed more frequently than matrix $PTM$. Putting $JM$ matrix on the shared memory allows accelerations up to $\times 97.83$ for the problem instances $200 \times 20$.
+Table~\ref{ch8:JM-on-SM} reports the behavior of the speedup averaged on the different problem instances (sizes) as a function of the pool size for the scenario where the Johnson's matrix is put on the shared memory. Results show that putting the JM matrix on the shared memory improves the performances more than in the first scenario where PTM is put on the shared memory. Indeed, according to Table~\ref{ch8:tabMemComplex}, matrix JM is accessed more frequently than matrix PTM. Putting JM matrix on the shared memory allows accelerations up to $\times 97.83$ for the problem instances $200 \times 20$.
\begin{table}
\centering
Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
\hline
\hline
-(NJobs $\times$ NMachines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
+($N$ Jobs $\times$ $N$ Machines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
\hline
\hline
$200 \times $20 & 63.01 & 79.40 & 81.40 & 84.02 & 93.61 & 96.56 & \textbf{97.83}\\
% \hline
\end{tabular}
\caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization.
-$JM$ is placed in shared memory and all others are placed in global memory.}
+JM is placed in shared memory and all others are placed in global memory.}
\label{ch8:JM-on-SM}
\end{table}
-Table~\ref{ch8:JM-PTM-on-SM} reports the behavior of the average speedup for the different problem instances (sizes) with $20$ machines for the data placement scenario where both $PTM$ and $JM$ are put on shared memory. According to the underlying Table, the scenarios~(3) ($JM$ together or without $PTM$ in shared memory) is clearly better than the scenarii~(1)and~(2) (respectively $PTM$ in shared memory and $JM$ in shared memory) whatever is the problem instance (size).
+Table~\ref{ch8:JM-PTM-on-SM} reports the behavior of the average speedup for the different problem instances (sizes) with $20$ machines for the data placement scenario where both PTM and JM are put on shared memory. According to the underlying tables, scenarios~3 (JM together with PTM in shared memory) is clearly better than the scenarios~1 and~2 (respectively PTM in shared memory and JM in shared memory) whatever the problem instance (size).
\begin{table}
\centering
% \hline
% \hline
\end{tabular}
- \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ and $JM$ are placed together in shared memory and all others are placed in global memory.}
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. PTM and JM are placed together in shared memory and all others are placed in global memory.}
\label{ch8:JM-PTM-on-SM}
\end{table}
-By carefully analyzing each of the scenarii of data placement on the memory hierarchies of the GPU, the recommendation is to put in the shared memory the Johnson's and the processing time matrices ($JM$ and $PTM$) if they fit in together. Otherwise, the whole or a part of the Johnson's matrix has to be put in priority in the shared memory. The other data structures are mapped to the global memory.
+By carefully analyzing each of the scenarios of data placement on the memory hierarchies of the GPU, the recommendation is to put in the shared memory the Johnson's and the processing time matrices (JM and PTM) if they fit in together. Otherwise, the whole or a part of the Johnson's matrix has to be given in priority in the shared memory. The other data structures are mapped to the global memory.
\section{Conclusion and future work}
\label{ch8:Conclusion}
-In this chapter, we have revisited the design of parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. To do so, our contributions consist in: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performances of each and identifying which best suits the GPU accelerators. (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function. (3) defining an optimal mapping of the data structures of the bounding function on the hierarchy of memories provided in the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency).
+In this chapter, we have revisited the design of parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. To do so, our contributions consisted of: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performances of each and identifying which best suits the GPU accelerators; (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function; and (3) defining an optimal mapping of the data structures of the bounding function on the hierarchy of memories provided in the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency).
-In the first parallel tree-exploration-based B\&B, a set of pending nodes is selected from this list according to their depth and off-loaded to the GPU where each thread builds its own local search tree by applying
-the branching, bounding and pruning operators to the assigned node. In the GPU-accelerated B\&B based on the parallel evaluation of bounds, the generation of the subproblems (branching, selection and pruning operations) is performed on CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. Pools of subproblems are off-loaded from CPU to GPU to be evaluated by blocks of threads. After evaluation, the lower bounds are returned to the CPU.
+In the first parallel treeexploration-based B\&B, a set of pending nodes is selected from this list according to their depth and off-loaded to the GPU where each thread builds its own local search tree by applying
+the branching, bounding, and pruning operators to the assigned node. In the GPU-accelerated B\&B based on the parallel evaluation of bounds, the generation of the subproblems (branching, selection, and pruning operations) is performed on CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. Pools of subproblems are off-loaded from CPU to GPU to be evaluated by blocks of threads. After evaluation, the lower bounds are returned to the CPU.
-In both considered approaches, our focus is on the GPU-based lower bound's implementation and the associated thread divergence and data placement challenges. The proposed mechanisms for reducing the thread divergence issue are based on a thorough analysis of the different loops and conditional instructions of the lower bound function. On the one hand, the sorting process aims to homogenize the data of the subproblems off-loaded to the GPU to minimize the number of threads that diverge on loop instructions. On the other hand, the technique of branch refactoring rewrite the conditional instructions into uniform instructions so that threads of the same warp execute a same code. The proposed data access optimization is based on a preliminary analysis of the lower bound function. Such analysis allowed us to identify six data structures for which we have proposed a complexity analysis in terms of memory size and access frequency. Due to the limited size of the shared memory the matrices do not fit in all together. According to the complexity study, the recommendation is to put in the shared memory the Johnson's and the processing time matrices ($JM$ and $PTM$) if they fit in together. Otherwise, the whole or a part of the Johnson's matrix has to be put in priority in the shared memory. The other data structures are mapped to the global memory. Such recommendation has been confirmed through extensive experiments using a recent C2050 Tesla GPU card.
+In both considered approaches, our focus is on the GPU-based lower bound's implementation and the associated thread divergence and data placement challenges. The proposed mechanisms for reducing the thread divergence issue are based on a thorough analysis of the different loops and conditional instructions of the lower bound function. On the one hand, the sorting process aims to homogenize the data of the subproblems off-loaded to the GPU to minimize the number of threads that diverge on loop instructions. On the other hand, the technique of branch refactoring rewrite the conditional instructions into uniform instructions so that threads of the same warp execute the same code. The proposed data access optimization is based on a preliminary analysis of the lower bound function. Such analysis allowed us to identify six data structures for which we have proposed a complexity analysis in terms of memory size and access frequency. Due to the limited size of the shared memory the matrices do not fit in all together. According to the complexity study, the recommendation is to put the Johnson's and the processing time matrices (JM and PTM) in the shared memory if they fit in together. Otherwise, the whole or a part of the Johnson's matrix should be put in priority in the shared memory. The other data structures are mapped to the global memory. Such recommendation has been confirmed through extensive experiments using a recent C2050 Tesla GPU card.
-The Flowshop Scheduling Problem has been considered as a case study. The proposed approaches have been experimented using a Tesla C2050 GPU card on different classes of FSP instances. The experimental results show that the parallel evaluation of bounds is the parallelization paradigm that performs better on top of GPU accelerators. Compared to the parallel tree-exploration model, accelerations up to $\times$5.56 are achieved.
+The flowshop scheduling problem has been considered as a case study. The proposed approaches have been experimented using a Tesla C2050 GPU card on different classes of FSP instances. The experimental results show that the parallel evaluation of bounds is the parallelization paradigm that performs better on top of GPU accelerators. Compared to the parallel treeexploration model, accelerations up to $\times$5.56 are achieved.
-Experiments show also that the proposed refactoring approach improves the parallel efficiency whatever the FSP instance and the pool size are. However, the improvement was not significant because the factorized part of the branches in the FSP lower bound is very small. The optimizations obtained with the proposed thread reduction mechanisms allowed us to achieve accelerations up to $\times$77.46 compared to a sequential B\&B. The data access optimizations grant accelerations up to $\times 100$ compared to a single CPU-based B\&B.
+Experiments show also that the proposed refactoring approach improves the parallel efficiency whatever the FSP instance and pool size. However, the improvement was not significant because the factorized part of the branches in the FSP lower bound is very small. The optimizations obtained with the proposed thread reduction mechanisms allowed us to achieve accelerations up to $\times$77.46 compared to a sequential B\&B. The data access optimizations allow accelerations up to $\times 100$ compared to a single CPU-based B\&B.
-In the near future, we plan to extend this work to a cluster of GPU-accelerated multi-core processors. From the application point of view, the objective is to optimally solve challenging and unsolved Flow-Shop instances as we did it for one 50$\times$20 problem instance with grid computing \cite{ch8:Mezmaz_2007}. Finally, we plan to investigate other lower bound functions to deal with other combinatorial optimization problems.
+In the near future, we plan to extend this work to a cluster of GPU-accelerated multicore processors. From the application point of view, the objective is to optimally solve challenging and unsolved flowshop instances as we did it for one 50$\times$20 problem instance with grid computing \cite{ch8:Mezmaz_2007}. Finally, we plan to investigate other lower bound functions to deal with other combinatorial optimization problems.
\putbib[Chapters/chapter8/biblio8]