   1
   2 \chapterauthor{Imen Chakroun and Nouredine Melab}{University of Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique, 59655 Villeneuve d'Ascq cedex, France\\}
   3 %\chapterauthor{Nouredine Melab}{Universit\'e Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique - 59655, Villeneuve d'Ascq cedex, France\\}
   4
   5 \chapter{GPU-accelerated Tree-based Exact Optimization Methods}
   6 \label{ch8:GPU-accelerated-tree-based-exact-optimization-methods}
   7 \section{Introduction}
   8 \label{ch8:introduction}
   9
In practice, a wide range of problems can be modeled as NP-hard combinatorial optimization problems (COPs). These problems consist in choosing the best combination out of a large finite set of possible combinations; they are typically large and difficult to solve to optimality. One of the most popular methods for solving a COP exactly (finding a solution having the optimal cost) is the Branch-and-Bound (B\&B) algorithm. This algorithm is based on an implicit enumeration of all the feasible solutions of the tackled problem. Enumerating the solutions of a problem consists in building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such a tree and its exploration are performed using four operators: branching, bounding, selection and pruning. Because the number of potential solutions grows exponentially, the B\&B algorithm explores only promising nodes of the search tree, using an estimate of the optimal cost of the associated sub-problem called the ``lower bound''.
  11
  12 \vspace{0.3cm}
  13
Although this bounding mechanism considerably reduces the exploration time, often only small or moderately-sized instances of COPs can be solved in practice. For this reason, over the last decades, parallel computing has emerged as an attractive way to deal with larger instances of COPs. However, while many contributions have been proposed for parallel B\&B methods using Massively Parallel Processors \cite{ch8:Allen_1997}, Networks or Clusters of Workstations \cite{ch8:Quinn_1990} and SMP machines \cite{ch8:Casadoa_2008}, very few contributions have been proposed for redesigning B\&B algorithms on Graphics Processing Units (GPUs) \cite{ch8:Carneiro_2011}. For years, the use of GPU accelerators was limited to graphics and video applications. Driven by the demand for high-definition 3D graphics on personal computers, GPUs have evolved into a highly parallel, multi-threaded and many-core environment. Their utilization has recently been extended to other application domains such as scientific computing \cite{ch8:Kurzak_2010}.
  15
  16 \vspace{0.3cm}
  17
In this work, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes vary and depend on the level of the tree being explored and on the best solution found so far. Because of this unstructured and unpredictable nature of the search tree, designing an efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B, starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm, which consists in exploring different sub-spaces of the tree in parallel. The second one is based on the ``parallel evaluation of bounds'' paradigm. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{Flowshop Scheduling Problem} (FSP) (see Section~\ref{ch8:BB-FSP}), which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.
  19
  20 \vspace{0.3cm}
  21
When rethinking those two parallel models for GPU architectures, our main focus was on the lower bound function. Indeed, preliminary experiments we carried out on some of Taillard's problem instances \cite{ch8:Taillard_1993} show that computing the lower bounds takes on average between 98\% and 99\% of the total execution time of the B\&B. The GPU-based implementation of the lower bound raises two main challenges. On the one hand, since the execution model of GPUs is SIMD, the irregular computations (loops and conditional instructions) contained in the lower bound function may lead to thread or branch divergence. This problem degrades performance and arises when threads of the same warp (the smallest executable unit of parallelism on the GPU) execute different data-dependent instructions. On the other hand, the lower bound computation usually uses large and frequently accessed data structures. Since the GPU is a many-core co-processor device that provides a hierarchy of memories having different sizes and access latencies, the placement and sharing of these data sets become challenging.
  23
  24 \vspace{0.3cm}
  25
The scope of this chapter is to design parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. To do so, our contributions consist in: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performance of each and identifying which one best suits GPU accelerators; (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function; (3) defining an optimal mapping of the data structures of the bounding function on the hierarchy of memories provided in the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency).
  27
  28 \vspace{0.3cm}
  29
The chapter is organized in eight main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the permutation Flowshop Scheduling Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2}, we give details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds. In Section \ref{ch8:ThreadDivergence}, we describe the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator. In Section \ref{ch8:DataAccessOpt}, we address the memory access optimization challenge and give an overview of the GPU memory hierarchy and of the memory access pattern used. In Section~\ref{ch8:Experiments}, we report experimental results showing the performance of each of the two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations.
  31
  32 \section{Branch-and-Bound \index{Branch-and-Bound} algorithm}
  33 \label{ch8:BB}
  34
Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they find the optimal solution of a problem together with a proof of optimality.
  36
  37 \vspace{0.3cm}
  38
The basic idea of the B\&B algorithm consists in implicitly enumerating all the solutions of the original problem by only examining a subset of feasible solutions and eliminating the others when they are not likely to lead to a feasible or an optimal solution. Enumerating the solutions of a problem consists in building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such a tree and its exploration are performed using four operators: branching, bounding, selection and pruning.
  40
  41 \vspace{0.3cm}
  42
The algorithm proceeds in several iterations during which the best solution found so far is progressively improved. During the exploration process, the search space is described by a pool of unexplored nodes and the best solution found so far. The generated and not yet examined (pending) nodes are kept in a list initialized with the original problem. At each iteration of the algorithm, the following steps are performed:
  44
  45 \begin{itemize}
 \item The {\it selection operator} chooses one node to process among the pending nodes according to a defined strategy. If the selection is based on the depth of the sub-problem in the B\&B tree, the strategy is called depth-first exploration. A selection based on the breadth of the sub-problem is called breadth-first exploration. A best-first selection strategy can also be used. It is based on the presumed capacity of the node to yield good solutions.
 \item The {\it branching operator} subdivides a solution space into two or more disjoint sub-spaces to be investigated in a subsequent iteration.
  48  \item The {\it bounding operator} computes a bound value of the optimal solution of each generated sub-problem.
  49  \item Each sub-problem having a greater bound than the upper-bound, i.e. the cost of the best solution found so far, is eliminated using the {\it pruning operator}.
  50 \end{itemize}
  51
  52 Algorithm \ref{ch8:algoBB} gives the general template of the Branch-and-Bound method.
  53
  54 \begin{algorithm}[H]
  55
  56 \SetAlgoLined
  57
  58 \vspace{0.2cm}
  59
  60 Create the initial problem; \\
Insert the initial problem into the tree; \\
Set the Upper\_Bound to $\infty$;  \\
  63 Set the Best\_Solution to $\emptyset$; \\
  64
  65 \While{ not\_empty\_tree() }
  66 {
  67     \vspace{0.2cm}
  68
  69     Sub\_Problem = Take\_sub\_problem();
  70
    \If{ Is\_leaf ( Sub\_Problem ) }
    {
        \If{ Cost\_Of( Sub\_Problem ) $<$ Upper\_Bound }
        {
            Upper\_Bound = Cost\_Of( Sub\_Problem );\\
            Best\_Solution = Sub\_Problem;
        }
    }
  76     \Else
  77     {
  78        Lower\_Bound = compute\_lower\_bound(Sub\_Problem);
  79
  80        \If{ Lower\_Bound $\leq$ Upper\_Bound }
  81        {
  82           Branch(Sub\_Problem); \\
  83           Insert child sub problems into the tree;
  84        }
  85        \Else
  86        {
  87           Prune (Sub\_Problem);
  88        }
  89     }
  90 }
  91
  92 \caption{General template of the Branch-and-Bound Algorithm.}
  93 \label{ch8:algoBB}
  94 \end{algorithm}
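
The template of Algorithm~\ref{ch8:algoBB} can be sketched in C++ as follows. This is only an illustration under stated assumptions: the problem is a hypothetical toy assignment problem (one distinct column per row of a cost matrix), not FSP, and the names \texttt{Node}, \texttt{lower\_bound} and \texttt{branch\_and\_bound} are ours. The pending nodes are kept in a stack, which gives a depth-first selection.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

using Matrix = std::vector<std::vector<int>>;

// Hypothetical toy sub-problem: rows 0..cols.size()-1 are already assigned.
struct Node {
    std::vector<int> cols;  // column chosen for each assigned row
    int cost = 0;           // cost of the partial assignment
};

// Bounding: partial cost plus the cheapest entry of every unassigned row.
// Column conflicts are ignored, so the value never overestimates.
int lower_bound(const Node& node, const Matrix& c) {
    int lb = node.cost;
    for (std::size_t row = node.cols.size(); row < c.size(); ++row)
        lb += *std::min_element(c[row].begin(), c[row].end());
    return lb;
}

int branch_and_bound(const Matrix& c) {
    assert(!c.empty());
    const int n = static_cast<int>(c.size());
    int upper_bound = std::numeric_limits<int>::max();
    std::vector<Node> pending{Node{}};  // the tree, seeded with the root
    while (!pending.empty()) {
        Node node = pending.back();     // selection (depth-first)
        pending.pop_back();
        if (static_cast<int>(node.cols.size()) == n) {  // leaf
            upper_bound = std::min(upper_bound, node.cost);
            continue;
        }
        if (lower_bound(node, c) > upper_bound)  // pruning
            continue;
        const int row = static_cast<int>(node.cols.size());
        for (int col = 0; col < n; ++col) {      // branching
            if (std::find(node.cols.begin(), node.cols.end(), col)
                != node.cols.end())
                continue;  // column already used
            Node child = node;
            child.cols.push_back(col);
            child.cost += c[row][col];
            pending.push_back(child);
        }
    }
    return upper_bound;  // cost of the best solution found
}
```

Replacing the stack by a priority queue ordered on the lower bound would turn the same loop into a best-first exploration.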
  95
  96 \section{Parallel Branch-and-Bound algorithms}
  97 \label{ch8:Parallel-BB}
  98
Thanks to the bounding operator, B\&B significantly reduces the computing time needed to explore the whole solution space. However, finding an optimal solution for large instances remains impractical using a sequential B\&B. Therefore, parallel processing of these algorithms has been widely studied in the literature. In \cite{ch8:MelabHDR_2005}, a taxonomy of the various existing parallel paradigms used to parallelize the B\&B algorithm is presented.
 100
 101 \vspace{0.2cm}
 102
This taxonomy, based on the classification proposed in \cite{ch8:Gendron_1994}, identifies several models to accelerate the B\&B search. The first model we consider in this chapter is called ``parallel tree exploration model'' and belongs to the ``Tree-based'' strategies that aim to build and explore the B\&B tree in parallel. The second model, called ``parallel evaluation of bounds model'' (evaluation of bounds in parallel), belongs to the parallelization approach called ``Node-based''. This strategy aims to accelerate the execution of a particular operation at the node level.
 104
 105 \vspace{0.2cm}
 106
 107 \subsection{The parallel tree exploration model}
 108 \label{ch8:para_tree}
 109
 110 Tree-based strategies consist in building and/or exploring the solution tree in parallel by performing operations on several sub-problems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\
 111
 112 The parallel tree exploration \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists in visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding and elimination operators on several sub-problems simultaneously.\\
 113
 114 \begin{figure}
 115   \begin{center}
 116 \includegraphics[scale=0.5]{Chapters/chapter8/figures/parallel_exploration.eps}%
 117
 118 \caption{Illustration of the parallel tree exploration model}
 119 \label{ch8:parallel_tree}
 120   \end{center}
 121 \end{figure}
 122
 123 \subsection{The parallel evaluation of bounds model}
 124 \label{ch8:Node_parallel}
 125
 126 Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist in executing the bounding operation in parallel for each sub-problem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\
 127
 128 The parallel evaluation of bounds \index{parallel evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of sub-problems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. The model does not change the order and the number of explored sub-problems in the parallel B\&B algorithm compared to the sequential B\&B.
 129
 130 \begin{figure}
 131   \begin{center}
 132 \includegraphics[scale=0.5]{Chapters/chapter8/figures/parallel_bounding.eps}%
 133
 134 \caption{Illustration of the parallel evaluation of bounds model}
 135 \label{ch8:bounds_parallel}
 136   \end{center}
 137 \end{figure}
 138
 139 \section{The Flowshop Scheduling Problem}
 140 \label{ch8:BB-FSP}
 141
 142 \subsection{Definition of the Flowshop Scheduling Problem}
\label{ch8:Def-FSP}
 144
As a case study for our GPU-based Branch-and-Bound, we considered a well-known NP-hard problem in scheduling theory: the ``Permutation Flow-shop Scheduling Problem'' (FSP).
In this work, the mono-objective case is considered. The FSP aims to find the optimal schedule of $n$ jobs on $m$ machines so that the overall completion time of all jobs, called {\it makespan}, is minimized.
 147
 148 \vspace{0.3cm}
 149
Let us suppose the set of jobs is represented by $J = \{j_1, j_2, \ldots, j_n\}$ and the set of machines is represented by $M = \{m_1, m_2, \ldots, m_m\}$, arranged in a line. Each job $j_i$ is a sequence of operations $j_i = \{o_{i1}, o_{i2}, \ldots, o_{im}\}$, where $o_{ik}$ is the duration required for the job $j_i$ on the machine $m_k$.
A feasible solution of the permutation flowshop should satisfy these constraints:
 153
 154 \begin{itemize}
 \item A machine cannot start processing a job until all the upstream machines have finished processing it. Thus, the operation $o_{ij}$ cannot be processed by the machine $m_j$ until the job is completed on $m_{j-1}$.
\item An operation cannot be interrupted, and the machines are critical resources: a machine processes one job at a time.
\item The sequence of jobs should be the same on every machine, e.g. if $j_3$ is treated in position 2 on the first machine, $j_3$ is also executed in position 2 on all machines.
 158 \end{itemize}
 159
 160 Figure~\ref{flow-shop} illustrates a solution of a flow-shop problem instance defined by 6 jobs and 3 machines.
 161
 162 \begin{figure}[h!]
 163 \centering
 164 \includegraphics[height=1.7cm,width=6.8cm]{Chapters/chapter8/figures/FlowShop.eps}
\caption{Flow-shop problem instance with 6 jobs and 3 machines.}
 166 \label{flow-shop}
 167 \end{figure}
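
Once a permutation is fixed, the constraints above determine the completion time of every operation, and hence the makespan. A minimal C++ sketch of this evaluation is given below; the function name \texttt{makespan} and the layout \texttt{p[job][machine]} of the processing times are assumptions made for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// p[job][machine]: processing time of a job on a machine.
// perm: job indices in the order shared by all machines.
int makespan(const std::vector<std::vector<int>>& p,
             const std::vector<int>& perm) {
    assert(!p.empty() && !p[0].empty());
    const std::size_t m = p[0].size();
    std::vector<int> done(m, 0);  // done[k]: time at which machine k is free
    for (int job : perm) {
        int prev = 0;  // completion time of this job on the upstream machine
        for (std::size_t k = 0; k < m; ++k) {
            // Wait for both the machine and the upstream operation.
            done[k] = std::max(done[k], prev) + p[job][k];
            prev = done[k];
        }
    }
    return done[m - 1];  // completion time of the last job on the last machine
}
```

For two jobs with processing times $(3,2)$ and $(1,4)$ on two machines, the order $(j_1, j_2)$ gives a makespan of 9 while the order $(j_2, j_1)$ gives 7.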
 168
 169 \vspace{0.3cm}
 170
 171 \subsection{Lower Bound \index{Lower Bound} for the Flowshop Scheduling Problem}
 172 \label{ch8:LB-FSP}
 173
The lower bounding technique provides a lower bound (LB) for each sub-problem generated by the branching operator. The more accurate the bound, the more unpromising nodes it eliminates from the search tree. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} for FSP, based on Johnson's algorithm~\cite{ch8:Johnson_1954}.
 175
 176 \vspace{0.2cm}
 177
Johnson's algorithm optimally solves FSP with two machines ($m=2$) using the following transitive rule $\preceq$:
 179
 180 $$J_i \preceq J_j \Leftrightarrow \min(p_{i,1}\ ;\ p_{j,2}) \leq
 181 \min(p_{i,2}\ ;\ p_{j,1})$$
 182
Here, $p_{k,l}$ designates the processing time of the job $J_k$ on the machine $M_l$. Johnson's theorem follows from the above rule: \\
 184
\textbf{Johnson's theorem} \emph{Given $P$ an FSP with $m=2$, if $J_i\preceq J_j$ there exists an optimal schedule for $P$ in which the job $J_i$ precedes the job $J_j$.}\\
 186
According to Johnson's theorem, FSP with $m=2$ is solved with a time complexity of $O(n \log n)$. The optimal solution is obtained by first scheduling the jobs having a shorter processing time on the first machine than on the second one, sorted in increasing order of their processing time on the first machine, and then scheduling the remaining jobs, sorted in decreasing order of their processing time on the second machine.
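
The two-step construction can be sketched in C++ as follows. The \texttt{Job} structure and the function name \texttt{johnson\_order} are assumptions made for the example; jobs with equal processing times on both machines are placed in the second group, which is one of the valid choices.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

struct Job {
    int id;
    int p1;  // processing time on the first machine
    int p2;  // processing time on the second machine
};

// Returns the job identifiers in Johnson order.
std::vector<int> johnson_order(std::vector<Job> jobs) {
    assert(!jobs.empty());
    // Group 1: jobs that are shorter on the first machine.
    auto mid = std::stable_partition(
        jobs.begin(), jobs.end(),
        [](const Job& j) { return j.p1 < j.p2; });
    // Group 1 by increasing p1, group 2 by decreasing p2.
    std::stable_sort(jobs.begin(), mid,
                     [](const Job& a, const Job& b) { return a.p1 < b.p1; });
    std::stable_sort(mid, jobs.end(),
                     [](const Job& a, const Job& b) { return a.p2 > b.p2; });
    std::vector<int> order;
    for (const Job& j : jobs) order.push_back(j.id);
    return order;
}
```

The partition is linear and the two sorts dominate the cost, which is consistent with the $O(n \log n)$ complexity stated above.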
 189
 190 \vspace{0.2cm}
 191
In~\cite{ch8:JRJackson_1956} and~\cite{ch8:LGMitten_1959}, Jackson and Mitten extended Johnson's rule with lags, which later allowed Lenstra {\it et al.} to propose a lower bound for FSP with $m \geq 3$. A lag~$l_j$ designates the minimum duration between the starting time of the job $J_j$ on the second machine and its finishing time on the first machine. Jackson and Mitten demonstrated that the optimal solution for FSP with $m=2$ can be obtained using the following transitive rule $\preceq$:
 193
 194 $$J_i \preceq J_j \Leftrightarrow \min(p_{i,1}+l_i\ ;\ l_j+p_{j,2})
 195 \leq \min(l_i+p_{i,2}\ ;\ p_{j,1}+l_j)$$
 196
 197 Based on this rule, Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} have proposed the following lower bound for a sub-problem associated to a partial schedule where a set {\Large $\jmath$} of jobs have to be scheduled on $m$ machines. $P_{Ja}^*(\jmath,M_k,M_l)$ represents the Jackson-Mitten optimal solution for the sub-problem that consists in scheduling the set {\Large $\jmath$} of jobs on the two machines $M_k$ and~$M_l$. The term  $r_{i,k} = \sum_{l<k} p_{i,l}$ designates the starting time of the job $J_i$ on the machine $M_k$. The other term $q_{j,l} = \sum_{k>l} p_{j,k}$ refers to the latency between the finishing time of $J_j$ on $M_l$ and the finishing time of the schedule.
 198
 199 $$LB(\jmath)=\max\limits_{1 \leq k < l \leq m}\{P_{Ja}^*(\jmath,M_k,M_l)+\min\limits_{(i,j)\in \jmath^2, i \neq
 200 j}(r_{i,k}+q_{j,l}) \}$$
 201
According to this $LB$ expression, the lower bound for the scheduling of a subset {\Large $\jmath$} of jobs is calculated by applying Johnson's rule with lags to all the couples
$(k,l)$ for $1 \leq k,l \leq m$ and $k<l$. As illustrated in Figure~\ref{LagKLExample}, the lag $l_j$ of a job $J_j$ for a couple $(k,l)$ of machines is the sum of the processing times of the job on
all the machines between~$k$~and~$l$.
 205
 206 \begin{figure}
 207   \begin{center}
 208 \includegraphics[width=8cm]{Chapters/chapter8/figures/johnson_with_lags.eps}%
 209 \caption{The lag $l_j$ of a job $J_j$ for a couple $(k,l)$ of machines is the sum of the processing times of the job on all the machines between~$k$~and~$l$.}
 210 \label{LagKLExample}
 211   \end{center}
 212 \end{figure}
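
The $LB$ expression can be followed term by term in a short C++ sketch. It is given under explicit assumptions and is not the implementation evaluated in this chapter: machines are indexed from 0, $r_{i,k}$ and $q_{j,l}$ are computed directly from the processing times as in the definitions above, the set of jobs contains at least two distinct jobs, and the names \texttt{span} and \texttt{lower\_bound\_lags} are ours.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <numeric>
#include <vector>

using Matrix = std::vector<std::vector<int>>;  // p[job][machine]

// Sum of the processing times of a job on machines a, ..., b-1.
static int span(const Matrix& p, int job, int a, int b) {
    return std::accumulate(p[job].begin() + a, p[job].begin() + b, 0);
}

int lower_bound_lags(const Matrix& p, const std::vector<int>& jobs) {
    assert(jobs.size() >= 2);
    const int m = static_cast<int>(p[0].size());
    int best = 0;
    for (int k = 0; k < m; ++k) {
        for (int l = k + 1; l < m; ++l) {
            auto lag = [&](int j) { return span(p, j, k + 1, l); };
            auto a = [&](int j) { return p[j][k] + lag(j); };
            auto b = [&](int j) { return lag(j) + p[j][l]; };
            // Johnson's rule applied to the lag-augmented times (a, b).
            std::vector<int> order = jobs;
            auto mid = std::stable_partition(
                order.begin(), order.end(),
                [&](int j) { return a(j) < b(j); });
            std::stable_sort(order.begin(), mid,
                             [&](int x, int y) { return a(x) < a(y); });
            std::stable_sort(mid, order.end(),
                             [&](int x, int y) { return b(x) > b(y); });
            // Two-machine makespan with lags for that order.
            int t1 = 0, t2 = 0;
            for (int j : order) {
                t1 += p[j][k];
                t2 = std::max(t2, t1 + lag(j)) + p[j][l];
            }
            // min over i != j of r_{i,k} + q_{j,l}.
            int slack = std::numeric_limits<int>::max();
            for (int i : jobs)
                for (int j : jobs)
                    if (i != j)
                        slack = std::min(slack, span(p, i, 0, k)
                                                + span(p, j, l + 1, m));
            best = std::max(best, t2 + slack);
        }
    }
    return best;
}
```

For two jobs with processing times $(2,3,4)$ and $(3,1,2)$ on three machines, the three couples of machines contribute 8, 11 and 10, so the sketch returns 11, which equals the optimal makespan of this small instance.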
 213
 214 \section{GPU-accelerated B\&B based on the parallel tree exploration (GPU-PTE-BB)}
 215 \label{ch8:approach1}
 216
The first approach we investigate for designing B\&B on GPUs consists in exploring the generated search tree in parallel. The idea is to divide the global search space into disjoint sub-spaces that are explored in parallel by the GPU threads. As explained in Section \ref{ch8:BB}, during the execution of a B\&B, the search space is described by a list of unexplored (pending) nodes and the best solution found so far. In the considered GPU-based scheme, a set of parent nodes is selected from this list according to their depth: the deepest pending nodes are selected first. The selected pool of nodes is off-loaded to the GPU, where each thread builds its own local search tree by applying the {\it branching}, {\it bounding} and {\it pruning} operators to the assigned node.
 218
 219 \vspace{0.2cm}
 220
 221 \begin{figure}[h!]
 222 \centering
 223 \includegraphics[height=8cm, width=8.1cm]{Chapters/chapter8/figures/Diagram1.eps}
 224 \caption{The overall architecture of the parallel tree exploration-based GPU-accelerated Branch-and-Bound algorithm.}
 225 \label{tree_approach}
 226 \end{figure}
 227
 228 \vspace{0.2cm}
 229
According to the CUDA threading model, each thread has a unique identifier used to determine its assigned role, assign specific input and output positions and select the work to perform. Therefore, each node (problem) from the pending list is mapped to a thread to ensure that each sub-space of the solution space is evaluated concurrently and is disjoint from the others. Figure \ref{tree_approach} illustrates the scheme of the parallel tree exploration-based GPU-accelerated B\&B.
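
On the host side, the depth-based selection of the pool can be sketched in C++ as follows. The \texttt{PendingNode} structure, the function name \texttt{select\_pool} and the \texttt{pool\_size} parameter are hypothetical; element $i$ of the returned pool would then be handled by the thread whose identifier is $i$.

```cpp
#include <algorithm>
#include <vector>

// Hypothetical descriptor of a pending node.
struct PendingNode {
    int id;
    int depth;  // level of the node in the B&B tree
};

// Removes the pool_size deepest nodes from the pending list and returns them.
std::vector<PendingNode> select_pool(std::vector<PendingNode>& pending,
                                     std::size_t pool_size) {
    const std::size_t k = std::min(pool_size, pending.size());
    // Move the k deepest nodes to the front of the list.
    std::partial_sort(pending.begin(), pending.begin() + k, pending.end(),
                      [](const PendingNode& a, const PendingNode& b) {
                          return a.depth > b.depth;
                      });
    std::vector<PendingNode> pool(pending.begin(), pending.begin() + k);
    pending.erase(pending.begin(), pending.begin() + k);
    return pool;
}
```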
 231
 232 \section{GPU-accelerated B\&B based on the parallel evaluation of bounds (GPU-PEB-BB) }
 233 \label{ch8:approach2}
 234
In the GPU-accelerated B\&B based on the parallel evaluation of bounds, illustrated in Figure~\ref{ch8:approach}, the generation of the sub-problems to be solved (elimination, selection and branching operations) is performed on the CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. The pool of sub-problems generated on the CPU is off-loaded to the GPU device to be evaluated by a pool of threads partitioned into blocks. Each thread applies the lower bound function to one sub-problem. Once the evaluation is completed, the lower bound values corresponding to the different sub-problems are returned to the CPU, where the elimination operator uses them to decide whether each sub-problem is pruned or decomposed. The process is iterated until the exploration is completed and the optimal solution is found.
 236
 237 \vspace{0.2cm}
 238
 239 \begin{figure}[h!]
 240   \begin{center}
 241 \includegraphics[scale=0.3]{Chapters/chapter8/figures/approach.eps}%
 242 \caption{The overall architecture of the GPU-accelerated Branch-and-Bound algorithm based on the parallel evaluation of bounds.}
 243 \label{ch8:approach}
 244   \end{center}
 245 \end{figure}
 246
 247 \vspace{0.2cm}
 248
In both considered approaches, GPU-PEB-BB and GPU-PTE-BB, the GPU-based implementation of the lower bound raises two main challenges. The first one is related to the ``single instruction multiple data'' (SIMD) model of the GPU and to the implementation of the LB. Indeed, although every GPU thread typically runs the same lower bound function, the body of the lower bound can contain conditions on thread identifiers and data. This implies that different instructions are executed by some threads. In SIMD architectures like GPUs this behavior leads to the thread or branch divergence issue. This problem arises when threads of the same warp execute different data-dependent instructions. It may cause a serious performance decline, since computation occurs in parallel only when the same instructions are being performed. The second challenge consists in adjusting the pattern of accesses to the GPU device memory. A good placement of data over the memory hierarchy allows programmers to further improve the throughput of many high-performance CUDA applications. For B\&B applied to FSP, threads of the same block perform concurrent accesses to the six data structures of the problem when they execute the lower bound function. These data structures have different sizes and access frequencies and should be wisely placed on the different memories of the GPU, which also have different sizes and latencies.
 250
 251 \vspace{0.2cm}
 252
In the following, we present how we dealt with the thread/branch divergence issue and how we mapped the different data structures on the memory hierarchy of the GPU device, taking into account the characteristics of the data structures and those of the different GPU memories.
 254
 255 \vspace{-0.4cm}
 256
 257 \section{Thread divergence}
 258 \label{ch8:ThreadDivergence}
 259
 260 \subsection{The thread divergence issue}
 261
During the execution of an application on a GPU, each GPU multiprocessor is assigned one or more thread blocks to execute. The threads of a block are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{Thread divergence} and often causes serious performance degradation. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjoint code paths.
 263
 264 \vspace{0.2cm}
 265
This section discusses the thread divergence issue encountered when computing the bounds on the GPU. Thread divergence occurs for two main reasons, namely the locations of nodes in the search tree and the control flow instructions within the bounding operator.
 267
 268 \vspace{0.3cm}
 269
 270 \textbf{Divergence related to the location of nodes}
 271
 272 \vspace{0.3cm}
 273
This divergence is related to the positions of the nodes in the B\&B search tree. The following example, taken from the source code of the LB used, shows that the execution flow depends on the position of the node in the search tree. Three methods are used in this piece of code: {\it is\_leaf()}, {\it makespan()} and {\it lower\_bound()}. {\it is\_leaf()} tests whether the node {\it \_node} is a leaf or an internal node. If {\it \_node} is a leaf, {\it makespan()} computes its makespan. Otherwise, {\it \_node} is an internal node and {\it lower\_bound()} computes the value of its lower bound.
 275
\begin{verbatim}
        // A leaf is a complete schedule: evaluate its exact cost.
        if (_node.is_leaf())
           return _node.makespan();
        // An internal node is a partial schedule: evaluate its bound.
        else
           return _node.lower_bound();
\end{verbatim}
 282
 283 \vspace{0.3cm}
 284
 285 \textbf{Divergence related to the control flow instructions}
 286
 287 \vspace{0.2cm}
 288
Control flow refers to the order in which the instructions, statements or function calls are executed in a program. This flow is determined by instructions such as {\it if-then-else}, {\it for}, {\it while-do}, {\it switch-case}, etc. There are about a dozen such instructions in the implementation of our bounding operator. The source code examples given below show two scenarios in which this kind of instruction is used.
 290
 291 \begin{itemize}
 292 \item Example 1:\\ \vspace{-0.4cm}
\begin{verbatim}
   // The branch taken depends on per-thread data.
   if( pool[thread_idx].begin != 0 )
      time = TimeMachines[1] ;
   else
      time = TimeArrival[1] ;
\end{verbatim}
 299
 300 \item Example 2:\\ \vspace{-0.4cm}
\begin{verbatim}
   // The trip count depends on per-thread data.
   for(int k = 0 ; k < pool[thread_idx].begin; k++)
      jobTime = jobEnd[k] ;
\end{verbatim}
 305
 306 \end{itemize}
 307
In these two examples, {\it thread\_idx} is the index associated with the current thread. Let us suppose that the code of Example 1 is executed by $32$ threads, that {\it pool[thread\_idx].begin} is equal to $0$ for the first thread, and that {\it pool[thread\_idx].begin} is not equal to $0$ for the other $31$ threads. When the first thread executes the statement {\it ``time = TimeArrival[1];''},
all the other $31$ threads remain idle. Therefore, the GPU cores on which these $31$ threads are executed remain idle and cannot be used during the execution of the statement {\it ``time = TimeArrival[1];''}.
 310
 311 \vspace{0.2cm}
 312
The same scenario occurs during the execution of Example 2. Let us suppose that the instruction is executed by $32$ threads, that {\it pool[thread\_idx].begin} is equal to $100$ for the first thread, and that {\it pool[thread\_idx].begin} is equal to $0$ for the other $31$ threads. While the first thread executes the {\it for} loop, all the other $31$ threads remain idle.
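
The cost of this scenario can be quantified with a deliberately simplified model, sketched here in C++: the warp is assumed to stay in the loop as long as at least one of its threads still iterates, and every other overhead is ignored. The function name \texttt{warp\_loop\_efficiency} is ours.

```cpp
#include <algorithm>
#include <cassert>
#include <numeric>
#include <vector>

// begin[t]: trip count of the loop of Example 2 for thread t of one warp.
// Returns the fraction of thread-iterations that perform useful work.
double warp_loop_efficiency(const std::vector<int>& begin) {
    assert(!begin.empty());
    // The warp is busy for as long as its slowest thread.
    const int longest = *std::max_element(begin.begin(), begin.end());
    if (longest == 0) return 1.0;  // nobody iterates, nothing is wasted
    const long useful = std::accumulate(begin.begin(), begin.end(), 0L);
    return static_cast<double>(useful)
           / (static_cast<double>(longest) * begin.size());
}
```

With the values of the example (one trip count of 100 and 31 trip counts of 0), the model gives an efficiency of $1/32$; with 32 equal trip counts it gives 1.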
 314
 315 \vspace{0.2cm}
 316
Existing techniques for handling branch divergence either demand hardware support \cite{ch8:Fung} or require host-GPU interaction \cite{ch8:Zhang}, which incurs overhead. Other works such as \cite{ch8:Han} intervene at the code level. They present a branch distribution method that aims to reduce the divergent portion of a branch by factoring out structurally similar code from the branch paths. In our work, we have also opted for software-based optimizations like \cite{ch8:Han}. We rewrite the branching instructions into basic ones in order to make thread execution paths uniform. We also demonstrate that performance can be improved simply by judiciously reordering the data assigned to each thread.
 318
 319 \subsection{Mechanisms for reducing branch divergence}
 320
 321 \vspace{0.3cm}
 322
 323  \textbf{Thread-data reordering}
 324
 325 \vspace{0.2cm}
 326
At each iteration of our GPU-accelerated B\&B approach, several thousands of sub-problems are sent to the GPU. The GPU groups the received sub-problems into warps according to their reception order: the first 32 sub-problems belong to the first warp, the following 32 sub-problems belong to the second warp, etc. The thread-data reordering technique therefore sorts the sub-problems according to their position in the B\&B tree before sending them to the GPU. This sorting yields warps containing more homogeneous sub-problems and reduces the number of thread divergences.
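
The effect of the reordering on the composition of the warps can be illustrated with the following C++ sketch. It assumes that the position of a sub-problem is summarized by its depth in the tree, which is a simplification made for the example; the names \texttt{mixed\_warps} and \texttt{reorder\_by\_depth} are ours.

```cpp
#include <algorithm>
#include <vector>

constexpr std::size_t kWarpSize = 32;

// Number of warps whose sub-problems do not all have the same depth,
// when consecutive groups of 32 sub-problems form the warps.
int mixed_warps(const std::vector<int>& depth) {
    int mixed = 0;
    for (std::size_t start = 0; start < depth.size(); start += kWarpSize) {
        const std::size_t end = std::min(start + kWarpSize, depth.size());
        const auto range = std::minmax_element(depth.begin() + start,
                                               depth.begin() + end);
        if (*range.first != *range.second) ++mixed;
    }
    return mixed;
}

// Host-side reordering performed before the pool is sent to the device.
std::vector<int> reorder_by_depth(std::vector<int> depth) {
    std::stable_sort(depth.begin(), depth.end());
    return depth;
}
```

For 64 sub-problems whose depths alternate between 3 and 4, both warps are heterogeneous before the reordering and none is afterwards.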
 328
 329 \vspace{0.2cm}
 330
 331  \textbf{Branch refactoring}
 332
 333 \vspace{0.2cm}
 334
As noted above, thread or branch divergence occurs when the kernel includes conditional instructions and loops that make the threads follow different control flows, leading to their serial execution. In this chapter, we investigate the branch refactoring approach to deal with thread divergence. Branch refactoring consists in rewriting the conditional instructions so that threads of the same warp execute uniform code, which avoids their divergence. To do that, two major ``if'' scenarios are studied and some optimizations are proposed accordingly. These two scenarios correspond to the conditional instructions contained in the $LB$ kernel code. In the first scenario, the conditional expression is a comparison of the content of a variable to 0. The following example, extracted from the pseudo-code of the lower bound $LB$, illustrates such a scenario.
 336
 337 \vspace{0.3cm}
 338
 339 \begin{tabular}{l}
 340 \\
 341 \small
 342 \textsf{ if ( pool[thread\_idx].limit1 $\neq$ 0 ) tmp = MM[1];  }\\
 343 \small
 344 \textsf{ else  tmp = RM[1] ; }\\  \\
 345 \end{tabular}
 346
 347 \vspace{0.2cm}
 348
 349 The refactoring idea is to replace the conditional expression by two functions namely $f$ and $g$ as shown in Equation~\ref{ch8:Eq1}.
 350
 351 \vspace{0.2cm}
 352
The behavior of $f$ and $g$ can be reproduced with the cosine trigonometric function. These functions only take the values $0$ and $1$, and the cosine is equal to $1$ when applied to $0$ and strictly smaller than $1$ in absolute value for any other integer. An integer variable is used to store the result of the cosine function. Its value is $0$ or $1$, since the result is truncated to $0$ whenever it is not equal to~$1$. In order to increase performance, the CUDA intrinsic math functions are used: \verb|__cosf(x)|, \verb|__expf(x)| and so forth. Those functions are mapped directly to the hardware level~\cite{ch8:cuda}. They are faster but provide lower accuracy, which does not matter in our case because the results are converted to $int$.
 354
\begin{equation}
\begin{array}{l}
\textrm{if}~(x \neq 0)~~a = b[1];~~\textrm{else}~~a = c[1]; \\ \\
\Rightarrow~\textrm{if}~(x \neq 0)~~a = b[1] + 0 \times c[1];~~\textrm{else}~~a = 0 \times b[1] + c[1]; \\ \\
\Rightarrow~a = f(x) \times b[1] + g(x) \times c[1]; \\ \\
\textrm{where}~~
f(x)=\left\{
    \begin{array}{ll}
        0 & \textrm{if}~x = 0\\
        1 & \textrm{else}
    \end{array}
\right.
~~\textrm{and}~~
g(x)=\left\{
    \begin{array}{ll}
        1 & \textrm{if}~x = 0\\
        0 & \textrm{else}
    \end{array}
\right.
\end{array}
\label{ch8:Eq1}
\end{equation}
 379
 380 \vspace{0.3cm}
 381
The throughput of \verb|__sinf(x)|, \verb|__cosf(x)| and \verb|__expf(x)| is one operation per clock cycle~\cite{ch8:cuda}. The refactoring result for the ``if'' pseudo-code given above is the following:
 383
 384 \vspace{0.3cm}
 385
 386 \begin{tabular}{l}
 387 \\
 388 \small
 389 \textsf{int coeff = \_\_cosf (pool[thread\_idx].limit1);}\\
 390 \small
 391 \textsf{tmp = (1 - coeff) $\times$ MM[1] +  coeff $\times$ RM[1];}\\ \\
 392 \end{tabular}
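
The equivalence between the branching and the refactored forms of this first scenario can be checked on the host with the following C++ sketch. It is an illustration under stated assumptions and not the kernel code itself: \verb|std::cos| stands in for the device intrinsic \verb|__cosf|, the function names are hypothetical, and the tested variable is assumed to be a small non-negative integer.

```cpp
#include <cassert>
#include <cmath>

// Original form: a data-dependent branch on limit1.
int select_branching(int limit1, int mm1, int rm1) {
    if (limit1 != 0) return mm1;
    return rm1;
}

// Refactored form: coeff plays the role of g(x) and (1 - coeff) of f(x).
// std::cos is a host-side stand-in for the device intrinsic __cosf.
int select_branch_free(int limit1, int mm1, int rm1) {
    // cos(0) == 1; for the nonzero integers assumed here |cos| < 1,
    // so the conversion to int truncates the result to 0.
    int coeff = static_cast<int>(std::cos(static_cast<float>(limit1)));
    assert(coeff == 0 || coeff == 1);
    return (1 - coeff) * mm1 + coeff * rm1;
}
```

The trick relies on the cosine of a nonzero input never being rounded to $1$ or $-1$ in single precision. It is therefore advisable to check the equivalence over the whole range of values that the tested variable can take before relying on it.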
 393
 394 \vspace{0.3cm}
 395
The second ``if'' scenario considered in our study compares two values with each other, as shown in Equation~\ref{ch8:Eq2}.
 397
 398 \vspace{0.2cm}
 399
\begin{equation}
\begin{array}{l}
\textrm{if}~(x > y)~~a = b[1]; \\ \\
\Rightarrow~\textrm{if}~(x - y \geq 1)~~a = b[1]; \\ \\
\Rightarrow~\textrm{if}~(x - y - 1 \geq 0)~~a = b[1];~~~~~x, y \in N \\ \\
\Rightarrow~a = f(x, y) \times b[1] + g(x,y) \times a; \\ \\
\textrm{where}~~
f(x,y)=\left\{
    \begin{array}{ll}
        1 & \textrm{if}~x - y - 1 \geq 0\\
        0 & \textrm{if}~x - y - 1 < 0
    \end{array}
\right.
~~\textrm{and}~~
g(x,y)=\left\{
    \begin{array}{ll}
        0 & \textrm{if}~x - y - 1 \geq 0\\
        1 & \textrm{if}~x - y - 1 < 0
    \end{array}
\right.
\end{array}
\label{ch8:Eq2}
\end{equation}
 423
 424 \vspace{0.3cm}
 425
 426 For instance, the following example extracted from the pseudo-code of the lower bound $LB$ illustrates such scenario.
 427
 428 \vspace{0.3cm}
 429
 430 \footnotesize
 431 \begin{tabular}{ll}
 432 \\
\multicolumn{2}{l}{\textsf{if ( RM[1] $>$ MIN ) \{ Best\_idx = Current\_idx; \}}}\\\\
 434 \end{tabular}
 435 \normalsize
 436
 437 \vspace{0.3cm}
 438
The same transformations as those applied for the first scenario are applied here using the exponential function. Recall that the exponential is a positive function which is equal to $1$ when applied to $0$. Thus, if $x$ is greater than $y$, then $x-y-1 \geq 0$ and $expf(x-y-1)$ returns a value greater than or equal to $1$; since the minimum between $1$ and the exponential is taken, the result is $1$. Conversely, if $x$ is less than or equal to $y$, then $expf(x-y-1)$ returns a value strictly between $0$ and $1$, which yields $0$ once converted to an integer. Such behavior satisfies exactly our prerequisites. The above ``if'' instruction pseudo-code is now equivalent to:
 440
 441 \vspace{0.3cm}
 442
 443 \small
 444 \begin{tabular}{l}
 445 \\
 446 \textsf{int coeff = min(1, \_\_expf(RM[1] - MIN - 1)); }\\
 447 \textsf{Best\_idx = coeff $\times$ Current\_idx + ( 1 - coeff ) $\times$ Best\_idx ;}\\
 448 \end{tabular}
 449 \normalsize
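
As for the first scenario, the second refactoring can be checked on the host with a short C++ sketch. It is given under stated assumptions: \verb|std::exp| stands in for the device intrinsic \verb|__expf|, the function and parameter names are hypothetical, and both compared values are assumed to be integers.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

// Original form: Best_idx is updated only when RM[1] > MIN.
int update_branching(int rm1, int min_val, int current_idx, int best_idx) {
    if (rm1 > min_val) best_idx = current_idx;
    return best_idx;
}

// Refactored form: coeff plays the role of f(x, y) and (1 - coeff) of g(x, y).
// std::exp is a host-side stand-in for the device intrinsic __expf.
int update_branch_free(int rm1, int min_val, int current_idx, int best_idx) {
    // For integers, rm1 > min_val  <=>  rm1 - min_val - 1 >= 0  <=>  exp(...) >= 1.
    float e = std::exp(static_cast<float>(rm1 - min_val - 1));
    // min(1, e) is 1 when e >= 1; otherwise e lies in (0, 1) and truncates to 0.
    int coeff = static_cast<int>(std::min(1.0f, e));
    assert(coeff == 0 || coeff == 1);
    return coeff * current_idx + (1 - coeff) * best_idx;
}
```

Taking the minimum with $1$ before the conversion to an integer also covers large differences, for which the exponential overflows to infinity.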
 450
 451 \section{Memory access optimization}
 452 \label{ch8:DataAccessOpt}
 453
Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving the performance of GPU-based applications. Indeed, adjusting the pattern of accesses to the GPU device memory allows programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast memory and as little slow-access memory as possible. This section discusses how best to place the data structures of the $LB$ on the various kinds of memory of the device.
 455
 456 \vspace{0.2cm}
 457
CUDA-enabled devices use several memory spaces, which have different characteristics in terms of sizes and access latencies. These memory spaces include global memory, local memory, shared memory, texture memory, and registers. Devices of compute capability 2.0 also have an L1/L2 cache hierarchy that is used to cache local and global memory accesses.
 459
 460 \begin{itemize}
\item At the thread-level, each thread has its own allocated registers and a private local memory. CUDA uses this local memory for thread-private variables that do not fit in the thread's registers, as well as for stack frames and register spilling.
\item At the thread block-level, each thread block has a shared memory visible to all its associated threads.
\item At the grid-level, all threads have access to the same global memory. Texture and constant cached memories are two other memories accessible by all threads.
 462 \end{itemize}
 463
The data access optimization challenge is to find the best mapping between the data structures of the application at hand (different sizes and access frequencies) and the GPU hierarchy of memories (different sizes and access latencies). For instance, of these different memory spaces, global memory is the most plentiful but the one with the highest access latency. In contrast, shared memory is smaller in size but has much higher bandwidth and lower latency than global memory.
 465
 466 \subsection{Complexity analysis of the memory usage of the Lower Bound }
 467 \label{ch8:MemComplex}
 468
In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix $PTM$ of the processing times of the jobs, the matrix of lags $LM$, Johnson's matrix $JM$, the matrix $RM$ of the earliest starting times of jobs, the matrix $QM$ of their lowest latency times and the matrix $MM$ containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex}, where the columns give respectively the name of the data structure, its size and the number of times it is accessed.
 470
 471 \vspace{0.2cm}
 472
In the $LB$ expression, the computation of the term $P_{Ja}^*(\jmath,M_k,M_l)$ requires the calculation of the lag of each remaining job to be scheduled on the couple $(M_k,M_l)$ of machines using its processing times on these machines (Johnson's rule with lags). Such computation is repeated for each couple $(M_k,M_l)$ of machines with $1 \leq k,l \leq m$ and $k<l$. To avoid the repetitive computation of the lags, they are computed once at the beginning of the algorithm and stored in the matrix $LM$. The dimension of $LM$ is $n \times \frac{m\times (m-1)}{2}$, where $n$ and $m$ are respectively the number of jobs to be scheduled and the number of machines. $LM$ is accessed $n' \times \frac{m \times (m-1)}{2}$ times, $n'$ being the number of remaining jobs to be scheduled in the sub-problem for which the lower bound is being calculated. The processing times of all the jobs on all the machines are stored in the matrix $PTM$. This matrix has a dimension of $n \times m$ and is accessed $n' \times m \times (m-1)$ times.
 474
 475 \vspace{0.2cm}
 476
In addition, in order to avoid rerunning Johnson's algorithm for each couple of machines and each subset of jobs, the algorithm is run once to find the optimal solutions on the couples of machines. These optimal solutions are then stored in Johnson's matrix $JM$. This matrix has the same dimension as $LM$ and is accessed $n \times \frac{m \times (m-1)}{2}$ times during the computation of the lower bound. Finally, the $MM$ matrix, which contains all the couples of machines, has a dimension and an access frequency of $m \times (m-1)$.
 478
 479 \vspace{0.2cm}
 480
To reduce the computation time of the term $\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l})$ in the $LB$ expression, two matrices are defined, namely $RM$ and $QM$. They are used to store respectively the lowest starting and latency times of all the jobs on each machine. Each of them has a dimension of $m$; they are accessed $m \times (m-1)$ times and $\frac{m \times (m-1)}{2}$ times respectively.
 482
 483 \begin{table}
 484   \centering
 485 \begin{tabular}{|c|c|c|}
 486 \hline
 487   \textbf{Matrix} & \textbf{Size} & \textbf{Number of accesses} \\
 488  \hline
 489  \hline
 490    PTM &  $n \times m$ & $n' \times m \times (m-1)$ \\
 491  \hline
 492    LM & $n \times \frac{m \times (m-1)}{2}$ & $n' \times \frac{m \times (m-1)}{2}$ \\
 493  \hline
 494    JM & $n \times \frac{m \times (m-1)}{2}$ & $n \times \frac{m \times (m-1)}{2}$ \\
 495  \hline
 496    RM &  $m$ & $m \times (m-1)$ \\
 497  \hline
 498    QM &  $m$ & $\frac{m \times (m-1)}{2}$ \\
 499  \hline
 500    MM &  $m \times (m-1)$ & $m \times (m-1)$ \\
 501  \hline
 502 \end{tabular}
 \caption[The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses.]{The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses. The parameters $n$, $m$ and $n'$ designate respectively the total number of jobs, the total number of machines and the number of remaining jobs to be scheduled in the sub-problem for which the lower bound is being computed.}
 504 \label{ch8:tabMemComplex}
 505 \end{table}
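
The formulas of Table~\ref{ch8:tabMemComplex} can be evaluated for a given instance with the following C++ sketch, which simply transcribes the size and access-count expressions of the table. The structure and function names are hypothetical and only serve the illustration.

```cpp
#include <cassert>
#include <cstddef>

// Sizes (in elements) and access counts of the six LB data structures.
// Hypothetical container, named after the matrices of the table.
struct LbFootprint {
    std::size_t ptm_size, lm_size, jm_size, rm_size, qm_size, mm_size;
    std::size_t ptm_acc, lm_acc, jm_acc, rm_acc, qm_acc, mm_acc;
};

// n: total number of jobs, m: number of machines,
// n_rem: number of remaining jobs to be scheduled (n' in the text).
LbFootprint lb_footprint(std::size_t n, std::size_t m, std::size_t n_rem) {
    assert(m >= 1 && n_rem <= n);
    const std::size_t couples = m * (m - 1) / 2;  // couples (M_k, M_l), k < l
    LbFootprint f;
    f.ptm_size = n * m;        f.ptm_acc = n_rem * m * (m - 1);
    f.lm_size  = n * couples;  f.lm_acc  = n_rem * couples;
    f.jm_size  = n * couples;  f.jm_acc  = n * couples;
    f.rm_size  = m;            f.rm_acc  = m * (m - 1);
    f.qm_size  = m;            f.qm_acc  = couples;
    f.mm_size  = m * (m - 1);  f.mm_acc  = m * (m - 1);
    return f;
}
```

For instance, with $n = 200$ and $m = 20$ the sketch gives $38000$ elements for $LM$ and $JM$, $4000$ elements for $PTM$ and $380$ elements for $MM$.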
 506
 507 \subsection{Data placement pattern of the Lower Bound on GPU}
 508 \label{ch8:MemComplex}
 509
 510 This section discusses how best to map the six data structures identified above on the various kinds of memories of the GPU device.
 511
 512 \vspace{0.2cm}
 513
The focus is put on the shared memory, which is a key enabler for many high-performance CUDA applications. Indeed, because it is on-chip, shared memory has much higher bandwidth and lower latency than local and global memory. However, for large problem instances (large $n$ and $m$), the data structures, especially $JM$ and $LM$ (see Table~\ref{ch8:tabMemSizes}), do not fit in the shared memory for some GPU configurations.
 515
 516 \vspace{0.2cm}
 517
In order to further improve performance, we also take care to use the global memory adequately by judiciously configuring the L1 cache, which greatly improves performance over direct access to global memory. Indeed, the GPU device we are using in our experiments is based on the NVIDIA Fermi architecture, which introduced two new levels in the memory hierarchy (L1/L2 caches) compared to older architectures.
 520
 521 \begin{table}
 522   \centering
 523   \footnotesize
 524   \begin{tabular}{|r|r|r|r|r|r|}
 525     \hline
Problem instance & JM & LM & PTM & RM, QM & MM \\
    \hline
    \hline
$200 \times 20$ & 38,000 (38 KB) & 38,000 (76 KB) & 4,000 (4 KB) & 20 (0.04 KB) & 380 (0.76 KB) \\
    \hline
$100 \times 20$ & 19,000 (19 KB) & 19,000 (38 KB) & 2,000 (2 KB) & 20 (0.04 KB) & 380 (0.76 KB) \\
    \hline
$50 \times 20$ & 9,500 (9.5 KB) & 9,500 (19 KB) & 1,000 (1 KB) & 20 (0.04 KB) & 380 (0.76 KB) \\
    \hline
$20 \times 20$ & 3,800 (3.8 KB) & 3,800 (7.6 KB) & 400 (0.4 KB) & 20 (0.04 KB) & 380 (0.76 KB) \\
 536    \hline
 537   \end{tabular}
\caption[The sizes of each data structure for the different experimented problem instances.]{The sizes of each data structure for the different experimented problem instances. The sizes are given in number of elements and in kilobytes (in parentheses).}
 539 \label{ch8:tabMemSizes}
 540 \end{table}
 541
 542 \vspace{0.2cm}
 543
Taking into consideration the sizes of each data structure presented in Table~\ref{ch8:tabMemSizes}, our challenge is to find which data structure has to be mapped on which memory and, in some cases, how to split the data structures over different memories and efficiently manage their accesses. The sizes in bytes reported in Table~\ref{ch8:tabMemSizes} are computed knowing that, in our implementation, the elements of $JM$ and $PTM$ are unsigned chars (1 byte) and the elements of $LM$, $RM$, $QM$ and $MM$ are unsigned short ints (2 bytes). It is important here to highlight that the data types of the matrices impact their sizes. For instance, a matrix of $100$ integers has a size of $400$ bytes, while the same matrix with $100$ unsigned chars has a size of $100$ bytes. In order to minimize the size of each matrix, we analyzed the ranges of their values and defined their data types accordingly. For instance, in $PTM$ all the processing times have values between $0$ and $100$. Therefore, we defined $PTM$ as a matrix of \verb|unsigned char|, whose values lie in the range $[0, 255]$. Using the \verb|unsigned char| type instead of the integer type reduces the memory space occupied by $PTM$ by a factor of $4$.
 545
 546 \vspace{0.2cm}
 547
According to Table~\ref{ch8:tabMemSizes}:
 549
 550 \begin{itemize}
 \item The data structures $RM$, $QM$ and $MM$ are small matrices. Therefore, their impact on performance is not significant, whatever the memory in which they are placed. In particular, preliminary experiments show that putting them in shared memory yields only a marginal performance improvement.
\item The $LM$ data structure is twice as large as $JM$ in memory but has a much lower access frequency. It is thus better to map $JM$ on the shared memory.
\item The $PTM$ has almost the same access frequency as $JM$ but requires less memory space.
 554 \end{itemize}
 555
 556 \vspace{0.2cm}
 557
Consequently, the focus is put on the study of the performance impact of the placement of $JM$ and $PTM$ in shared memory. Three placement scenarios of $JM$ and $PTM$ are experimented and studied: (1) only $PTM$ is stored in shared memory and all others are placed in global memory; (2) only $JM$ is stored in shared memory and all others are placed in global memory; (3) $PTM$ and $JM$ are stored together in shared memory and all others are placed in global memory.
 559
 560 \vspace{0.2cm}
 561
Taking advantage of the configurable storage space provided by the Fermi-based devices, the $64$ KB of on-chip storage was split between the shared memory and the L1 cache according to the experimented scenario.
 563
 564 \begin{itemize}
\item For the scenarios where data structures are placed in shared memory, the $64$ KB of available storage is split into $48$ KB for shared memory and $16$ KB for L1 cache.
\item For the scenario where the data structures are placed in global memory, we used $16$ KB for shared memory and $48$ KB for L1 cache.
 567 \end{itemize}
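
Whether a placement scenario fits in the shared memory can be checked with simple arithmetic, as in the following C++ sketch. It uses the element sizes stated above (1 byte for $JM$ and $PTM$, 2 bytes for $LM$) and assumes a shared-memory budget of $48 \times 1024$ bytes; the function and constant names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>

// Element sizes stated in the text: unsigned char = 1 byte, unsigned short = 2 bytes.
const std::size_t kCharBytes = 1;
const std::size_t kShortBytes = 2;
// Assumed shared-memory budget for the 48 KB / 16 KB split (48 * 1024 bytes).
const std::size_t kSharedBudget = 48 * 1024;

// Number of couples of machines (M_k, M_l) with k < l.
std::size_t couples(std::size_t m) {
    assert(m >= 1);
    return m * (m - 1) / 2;
}

// Footprints in bytes for an instance with n jobs and m machines.
std::size_t ptm_bytes(std::size_t n, std::size_t m) { return n * m * kCharBytes; }
std::size_t jm_bytes(std::size_t n, std::size_t m) { return n * couples(m) * kCharBytes; }
std::size_t lm_bytes(std::size_t n, std::size_t m) { return n * couples(m) * kShortBytes; }

// True when the given number of bytes fits in the shared-memory budget.
bool fits_in_shared(std::size_t bytes) { return bytes <= kSharedBudget; }
```

For the $200 \times 20$ instances, $PTM$ and $JM$ together occupy $42000$ bytes and therefore fit in the assumed budget, whereas $LM$ alone ($76000$ bytes) does not.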
 568
 569 \section{Experiments}
 570 \label{ch8:Experiments}
 571
In the following, we present the experimental study we have performed in order to evaluate the performance impact of the GPU-accelerated bounding, of the techniques for reducing thread divergence and of the proposed approach for data placement on the GPU memories.
 573
 574 \subsection{Parameters settings}
 575
In our experiments, we used the flow-shop instances defined by Taillard \cite{ch8:Taillard_1993}. These standard instances are often used in the literature to evaluate the performance of methods that minimize the makespan. Optimal solutions of some of these instances are still not known. These instances are divided into groups of $10$ instances. In each group, the $10$ instances are defined by the same number of jobs and the same number of machines. The groups have different numbers of jobs, namely $20$, $50$, $100$, $200$ and $500$, and different numbers of machines, namely $5$, $10$ and $20$. For example, there are $10$ instances with $200$ jobs and $20$ machines belonging to the same group of instances.
 577
 578 \vspace{0.2cm}
 579
In this work, we used only the instances where the number of machines is equal to $20$. Indeed, instances where the number of machines is equal to $5$ or $10$ are easy to solve. For these instances, the bounding operator gives such good lower bounds that they can be solved in a few minutes using a sequential B\&B. Therefore, these instances do not require the use of a GPU.
 581
 582 \vspace{0.2cm}
 583
Our approach has been implemented using C-CUDA 4.0. The experiments have been carried out using an Intel Xeon E5520 bi-processor coupled with a GPU device. The bi-processor is 64-bit, quad-core and has a clock speed of 2.27GHz. The GPU device is an Nvidia Tesla C2050 with 448 CUDA cores (14 multiprocessors with 32 cores each), a clock speed of 1.15GHz, a 2.8GB global memory, a 49.15KB shared memory, and a warp size of 32.
 585
 586 \subsection{Experimental protocol: computing the speed up}
 587 \label{ch8:Protocol}
 588
We need to compute the speed up of our approach to evaluate its performance. This speed up is obtained by comparing our GPU B\&B version to a sequential B\&B version deployed on one CPU core. However, all the instances used in our experiments are extremely hard to solve. Indeed, the resolution of each of these instances requires several months of computation on one CPU core. For example, the optimal solution of one of these instances defined by $50$ jobs and $20$ machines is obtained after $25$ days of computation using an average of $328$ CPU cores \cite{ch8:Mezmaz_2007}.
 590
 591 \vspace{0.2cm}
 592
Using the approach defined in \cite{ch8:Mezmaz_2007}, it is possible to obtain a random list $L$ of sub-problems such that the resolution of $L$ lasts $T$ minutes with a sequential B\&B. By initializing the pool of our sequential B\&B with the sub-problems of this list $L$, we are sure that the sequential resolution will last $T_{cpu}$ minutes, with $T_{cpu}$ approximately equal to $T$. It is then possible to initialize the pool of our GPU B\&B with the same list $L$ of sub-problems in order to compute the speed up. Let us suppose that the resolution with the GPU B\&B lasts $T_{gpu}$ minutes. The speed up of our GPU algorithm is then equal to $T_{cpu}/T_{gpu}$. With this experimental protocol, the sub-problems explored by the GPU and CPU B\&B versions are exactly the same. So, to find the speed up associated to an instance, we:
 594
 595 \begin{itemize}
\item compute, using the approach defined in \cite{ch8:Mezmaz_2007}, a list $L$ of sub-problems such that the resolution of $L$ lasts $T$ minutes with a sequential B\&B,
\item initialize the pool of our sequential B\&B with the sub-problems of this list $L$,
\item solve the sub-problems of this pool with our sequential B\&B,
\item get the sequential resolution time $T_{cpu}$ and the number of explored sub-problems $N_{cpu}$,
\item check that $T_{cpu}$ is approximately equal to $T$,
\item initialize the pool of our GPU B\&B with the sub-problems of the list $L$,
\item solve the sub-problems of this pool with our GPU B\&B,
\item get the GPU resolution time $T_{gpu}$ and the number of explored sub-problems $N_{gpu}$,
\item check that $N_{gpu}$ is exactly equal to $N_{cpu}$,
\item and finally compute the speed up associated to this instance by dividing $T_{cpu}$ by $T_{gpu}$ (i.e. $T_{cpu}/T_{gpu}$).
 606 \end{itemize}
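
The last steps of this protocol can be summarized by the following C++ sketch. It only illustrates the checks and the final division; the \verb|RunResult| structure, the function names and the relative tolerance used to compare $T_{cpu}$ with $T$ are assumptions made for the illustration.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical record of one resolution: duration and explored sub-problems.
struct RunResult {
    double minutes;  // resolution time (T_cpu or T_gpu)
    long explored;   // number of explored sub-problems (N_cpu or N_gpu)
};

// Check that the sequential time is close to the target time T.
// The relative tolerance is an assumption; the text only says "approximately".
bool close_to_target(double t_cpu, double t_target, double rel_tol) {
    return std::fabs(t_cpu - t_target) <= rel_tol * t_target;
}

// Speed up T_cpu / T_gpu, valid only if both runs explored the same work.
double speedup(const RunResult& cpu, const RunResult& gpu) {
    assert(cpu.explored == gpu.explored);  // N_gpu must equal N_cpu
    assert(gpu.minutes > 0.0);
    return cpu.minutes / gpu.minutes;
}
```

For example, a sequential run of $300$ minutes and a GPU run of $4$ minutes over the same sub-problems would give a speed up of $75$.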
 607
 608 \vspace{0.2cm}
 609
Table~\ref{ch8:instance_time} gives, for each instance according to its number of jobs and its number of machines, the resolution time used with a sequential B\&B. For example, the sequential resolution time of each instance defined with $20$ jobs and $20$ machines is approximately 10 minutes. Of course, the computation time of the lower bound of a sub-problem defined with $20$ jobs and $20$ machines is on average lower than the computation time of the lower bound of a sub-problem defined with $50$ jobs and $20$ machines. Therefore, as shown in this table, the sequential resolution time is increased with the size of the instance in order to be sure that the number of sub-problems explored is significant for all instances.
 611
 612 \begin{table}
 613 \setlength{\tabcolsep}{0.2cm}
 614 \renewcommand{\arraystretch}{1.2}
 615 \centering
 616 \footnotesize
 617 \begin{tabular}{|r|r|r|r|r|}
 618 \hline
 619 Instance (No. of jobs x No. of machines) & 20$\times$20 & 50$\times$20  & 100$\times$20 & 200$\times$20 \\
 620 \hline
 621 Sequential resolution time (minutes) & 10 & 50  & 150 & 300 \\
 622 \hline
 623 \end{tabular}
 624 \caption{The sequential resolution time of each instance according to its number of jobs and machines}
 625 \label{ch8:instance_time}
 626 \end{table}
 627
 628 \subsection{Performance impact of GPU-based parallelism}
 629
The objective of the experimental study presented in this section is to compare the performance of both proposed approaches for designing B\&B on top of GPUs.
 631
 632 Table \ref{ch8:ParaGPU1} and Table~\ref{ch8:ParaGPU2} report respectively the speedups obtained with the GPU-PTE-BB and GPU-PEB-BB approaches for different problem instances. The first part of both tables gives the size of the pool generated and evaluated on the GPU. The second part of the tables gives the average speedup for each group of instances and for each pool size. Each line corresponds to a group of $10$ instances defined by the same number of jobs and the same number of machines.
 633
The results obtained with the GPU-PTE-BB approach (see Table~\ref{ch8:ParaGPU1}) show that exploring the search tree in parallel speeds up the execution of the B\&B compared to a CPU-based execution. Indeed, an acceleration factor of up to 40.50 is obtained for the 20 $\times$ 20 problem instances using a pool of 262144 sub-problems.
 635
The results also show that the parallel efficiency decreases with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup declines as the number of jobs grows. For instance, for a pool size of 262144, the acceleration factor obtained with 200 jobs is 13.4, while it is 40.50 for the instances with 20 jobs. This behavior is mainly due to the overhead induced by the transfer of the pool of resulting sub-problems between the CPU and the GPU. For example, for the instances with 200 jobs, the size of the pool to exchange between the CPU and the GPU is ten times bigger than the size of the pool for the instances with 20 jobs.
 637
 638 \begin{table}
 639 \setlength{\tabcolsep}{0.2cm}
 640 \renewcommand{\arraystretch}{1.2}
 641   \centering
 642   \footnotesize
 643  \begin{tabular}{|r|r|r|r|r|r|r|r|}
 644     \hline
 645 Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
 646     \hline
 647     \hline
 648 (NJobs $\times$ NMachines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
 649     \hline
 650 $200 \times $20 & 1.12 & 2.89 & 3.57 & 4.23 & 6.442 & 8.32 & 13.4\\
 651     \hline
 652 $100 \times $20 & 1.33 & 1.88 & 3.45 & 6.45 & 12.38 & 20.40 & 28.76 \\
 653     \hline
 654 $50 \times $20 & 2.70 & 3.80 & 6.82 & 13.04 & 23.53 & 30.94 & 37.66\\
 655     \hline
 656 $20 \times $20 & 6.43 & 11.43 & 20.14 & 27.78 & 30.12 & 35.74 & 40.50\\
 657     \hline
 658     \hline
 659 % Total average speedup & 2.895 & 5.0 & 8.495 & 14.625 & 22.61 & 30.6 & 41.65\\
 660 %     \hline
 661 %     \hline
 662   \end{tabular}
 663   \caption{Speedups for different problem instances and pool sizes with the GPU-PTE-BB approach.}
 664 \label{ch8:ParaGPU1}
 665 \end{table}
 666
The results obtained with the GPU-PEB-BB approach (see Table~\ref{ch8:ParaGPU2}) show that evaluating the bounds of a selected pool in parallel significantly speeds up the execution of the B\&B. Indeed, an acceleration factor of up to 71.69 is obtained for the 200 $\times$ 20 problem instances using a pool of 262144 sub-problems. The results also show that the parallel efficiency grows with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup grows with the number of jobs. For instance, for a pool size of 262144, the acceleration factor obtained with 200 jobs (71.69) is almost double the one obtained with 20 jobs (38.40).
 668
As far as pool size tuning is concerned, we notice that this parameter depends strongly on the problem instance being solved. Indeed, while the best acceleration is obtained with a pool size of 8192 sub-problems for the instances 50 $\times$ 20 and 20 $\times$ 20, the best speedups are obtained with a pool size of 262144 sub-problems for the instances 200 $\times$ 20 and 100 $\times$ 20.
 670
 671 \begin{table}
 672 \setlength{\tabcolsep}{0.2cm}
 673 \renewcommand{\arraystretch}{1.2}
 674   \centering
 675   \footnotesize
 676  \begin{tabular}{|r|r|r|r|r|r|r|r|}
 677     \hline
 678 Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
 679     \hline
 680     \hline
 681 (NJobs $\times$ NMachines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
 682     \hline
 683 $200 \times $20 & 42.83 & 56.23 & 57.68 & 61.21 & 66.75 & 68.30 & \textbf{71.69}\\
 684     \hline
 685 $100 \times $20 & 42.59 & 56.18 & 57.53 & 60.95 & 65.52 & 65.70 & \textbf{65.97}\\
 686     \hline
 687 $50 \times $20 & 42.57 & \textbf{56.15} & 55.69 & 55.49 & 55.39 & 55.27 & 55.14\\
 688     \hline
 689 $20 \times $20 & 38.74 & \textbf{46.47} & 45.37 & 41.92 & 39.55 & 38.90 & 38.40\\
 690     \hline
 691     \hline
 692 % Total average speedup & 41.68 & 53.76 & 54.07 & 54.89 & 56.80 & 57.04 & 57.80\\
 693 %     \hline
 694 %     \hline
 695   \end{tabular}
 696   \caption{Speedups for different problem instances and pool sizes with the GPU-PEB-BB approach.}
 697 \label{ch8:ParaGPU2}
 698 \end{table}
 699
Compared to the parallel tree exploration-based GPU-accelerated B\&B approach, the parallel evaluation of bounds approach is by far more efficient whatever the instance. For example, while the GPU-PEB-BB approach reaches a speedup of $\times$71.69 for the instances with 200 jobs on 20 machines, a speedup of $\times$13.4 is measured with the parallel tree exploration-based approach, which corresponds to an acceleration of $\times$5.35. Moreover, contrary to the GPU-PEB-BB approach, in the GPU-PTE-BB approach the speedups decrease when the problem instance becomes larger. Recall that in the GPU-PEB-BB approach all threads evaluate only one node each, whatever the permutation size is, whereas in the GPU-PTE-BB approach each thread branches all the children of its assigned parent node. Therefore, the bigger the size of the permutation, the bigger the amount of work performed by each thread and the bigger the difference between the workloads. Indeed, let us suppose that for the instance with $200$ jobs, the thread $0$ handles a node from the level $2$ of the tree and the thread $100$ handles a node from the level $170$ of the tree. In this case, the thread $0$ generates and evaluates $198$ nodes, while the thread $100$ decomposes and bounds only $30$ nodes. The problem in this example is that the kernel execution lasts until the thread $0$ finishes its work, while the other threads may have finished theirs and remain idle.
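
The workload imbalance described in this example can be quantified with the following C++ sketch. It assumes, in line with the figures above, that a node at level $\ell$ of the tree of an instance with $n$ jobs has $n - \ell$ children, and it uses the ratio between the largest and the mean per-thread workload as a hypothetical imbalance measure; the function names are illustrative only.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Number of children generated by a thread whose parent node is at `level`
// (assumption consistent with the example: n_jobs - level).
int children_at_level(int n_jobs, int level) {
    assert(level >= 0 && level <= n_jobs);
    return n_jobs - level;
}

// Ratio between the largest per-thread workload and the mean workload.
// A value of 1 means perfectly balanced threads; since the kernel lasts as
// long as its slowest thread, larger values mean more idle time.
double imbalance(int n_jobs, const std::vector<int>& levels) {
    assert(!levels.empty());
    int max_work = 0;
    long total = 0;
    for (std::size_t i = 0; i < levels.size(); ++i) {
        int work = children_at_level(n_jobs, levels[i]);
        max_work = std::max(max_work, work);
        total += work;
    }
    assert(total > 0);
    return static_cast<double>(max_work) * static_cast<double>(levels.size())
           / static_cast<double>(total);
}
```

With the two threads of the example (levels $2$ and $170$ for $200$ jobs), the workloads are $198$ and $30$ nodes, which gives a ratio of about $1.74$, whereas threads that each evaluate a single node, as in the GPU-PEB-BB approach, would give a ratio of $1$.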
 701
 702 \subsection{Thread divergence reduction}
 703
The objective of this section is to demonstrate that the thread divergence reduction mechanisms we propose have an impact on the performance of the GPU-accelerated B\&B and to evaluate how significant this impact is.
In the following, the reported results are obtained with the GPU-accelerated B\&B based on the parallel evaluation of bounds.
 706
 707 \begin{table}[!h]
 708 \setlength{\tabcolsep}{0.2cm}
 709 \renewcommand{\arraystretch}{1.2}
 710   \centering
 711   \footnotesize
 712  \begin{tabular}{|r|r|r|r|r|r|r|r|}
 713     \hline
 714 Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
 715     \hline
 716     \hline
 717 (NJobs $\times$ NMachines)  & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
 718     \hline
 719     \hline
 720 $200 \times $20 & 46.63 & 60.88 & 63.80 & 67.51 & 73.47 & 75.94 & \textbf{77.46}\\
 721     \hline
 722 $100 \times $20 & 45.35 & 58.49 & 60.15 & 62.75 & 66.49 & 66.64 & \textbf{67.01}\\
 723     \hline
 724 $50 \times $20 & 44.39 & \textbf{58.30} & 57.72 & 57.68 & 57.37 & 57.01 & 56.42\\
 725     \hline
 726 $20 \times $20 & 41.71 & \textbf{50.28} & 49.19 & 45.90 & 42.03 & 41.80 & 41.65\\
 727     \hline
 728     \hline
 729 % Total average speedup & 44.52 & 56.99 & 57.72 & 58.46 & 59.84 & 60.35 & 60.64\\
 730 %     \hline
 731 %     \hline
 732   \end{tabular}
 733   \caption{Speedups for different instances and pool sizes using thread divergence management.}
 734 \label{ch8:ParaDivergence}
 735 \end{table}
 736
Table~\ref{ch8:ParaDivergence} shows the experimental results obtained using the sorting process and the refactoring approach presented in Section~\ref{ch8:ThreadDivergence}. Results show that the proposed optimizations improve on the GPU acceleration reported in Table~\ref{ch8:ParaGPU2}, which was obtained without thread divergence reduction. For example, for the instances of 200 jobs over 20 machines and a pool size of 262144, the average reported speedup is 77.46, while the average acceleration factor obtained without thread divergence management for the same instances and the same pool size is 71.69, which corresponds to an improvement of 8.05\%. Such a noticeable but not outstanding improvement is predictable, as claimed in \cite{ch8:Han}, since the factorized part of the branches in the FSP lower bound is very small.
 738
 739 \subsection{Data access optimization}
 740
 741 The objective of the experimental study presented in this section is to find the best mapping of the six data structures of the lower bound LB kernel on the memories of the GPU device. In the following, the reported results are obtained with the GPU-accelerated B\&B based on the parallel evaluation of bounds.
 742
Table~\ref{ch8:PTM-on-SM} reports the speedups obtained for the first experimented scenario, where only the matrix $PTM$ is placed in shared memory. Results show that the speedup grows on average with the pool size, in the same way as in Table~\ref{ch8:ParaDivergence}. For the largest problem instance and pool size, placing the $PTM$ matrix in shared memory improves the speedups by up to $14\%$ compared to those obtained when $PTM$ is in global memory, reaching an acceleration of $\times 90.51$ for the problem instances $200 \times 20$ and a pool size of $262144$ sub-problems.
 744
 745 \begin{table}
 746   \centering
 747   \footnotesize
 748   \begin{tabular}{|r|r|r|r|r|r|r|r|}
 749     \hline
 750 Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
 751     \hline
 752     \hline
 753 (NJobs $\times$ NMachines)  & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
 754     \hline
 755     \hline
 756 $200 \times 20$ & 54.03 & 67.75 & 68.43 & 72.17 & 82.01 & 88.35 & \textbf{90.51}\\
 757     \hline
 758 $100 \times 20$ & 52.92 & 66.57 & 66.25 & 71.21 & 76.63 & 79.76 & \textbf{83.01}\\
 759     \hline
 760 $50 \times 20$ & 49.85 & \textbf{65.68} & 64.40 & 59.91 & 58.57 & 57.36 & 55.09\\
 761     \hline
 762 $20 \times 20$ & 41.94 & \textbf{60.10} & 48.28 & 39.86 & 39.61 & 38.93 & 37.79 \\
 763      \hline
 764     \hline
 765 % Average Speedup& 49.69 & 65.03 & 61.84 & 60.79 & 64.21 & 66.10 & 66.60 \\
 766 %     \hline
 767 %     \hline
 768   \end{tabular}
 769  \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ is placed in shared memory and all other data structures are placed in global memory.}
 770 \label{ch8:PTM-on-SM}
 771 \end{table}
 772
 773 Table~\ref{ch8:JM-on-SM} reports the speedup averaged over the different problem instances (sizes) as a function of the pool size for the scenario where Johnson's matrix $JM$ is placed in shared memory. The results show that placing the $JM$ matrix in shared memory improves performance more than the first scenario, where $PTM$ is placed in shared memory. Indeed, according to Table~\ref{ch8:tabMemComplex}, matrix $JM$ is accessed more frequently than matrix $PTM$. Placing the $JM$ matrix in shared memory allows accelerations up to $\times 97.83$ for the problem instances $200 \times 20$.
 774
 775 \begin{table}
 776   \centering
 777   \footnotesize
 778   \begin{tabular}{|r|r|r|r|r|r|r|r|}
 779     \hline
 780 Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
 781     \hline
 782     \hline
 783 (NJobs $\times$ NMachines) & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
 784     \hline
 785     \hline
 786 $200 \times 20$ & 63.01 & 79.40 & 81.40 & 84.02 & 93.61 & 96.56 & \textbf{97.83}\\
 787     \hline
 788 $100 \times 20$ & 61.70 & 77.79 & 79.32 & 81.25 & 86.73 & 87.81 & \textbf{88.69}\\
 789     \hline
 790 $50 \times 20$ & 59.79 & \textbf{75.32} & 72.20 & 71.04 & 70.12 & 68.74 & 68.07 \\
 791     \hline
 792 $20 \times 20$ & 49.00 & \textbf{60.25} & 55.50 & 45.88 & 44.47 & 43.11 & 42.82 \\
 793      \hline
 794     \hline
 795 % Average Speedup& 58.37 & 73.19 & 72.11 & 70.55 & 73.73 & 74.06 & 74.35 \\
 796 %     \hline
 797 %     \hline
 798   \end{tabular}
 799  \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization.
 800 $JM$ is placed in shared memory and all other data structures are placed in global memory.}
 801 \label{ch8:JM-on-SM}
 802 \end{table}
 803
 804 Table~\ref{ch8:JM-PTM-on-SM} reports the average speedup for the different problem instances (sizes) with $20$ machines for the data placement scenario where both $PTM$ and $JM$ are placed in shared memory. According to this table, scenario~(3) ($JM$ together with $PTM$ in shared memory) is clearly better than scenarios~(1) and~(2) (respectively $PTM$ alone and $JM$ alone in shared memory), whatever the problem instance (size).
 805
 806 \begin{table}
 807   \centering
 808   \footnotesize
 809   \begin{tabular}{|r|r|r|r|r|r|r|r|}
 810     \hline
 811 Pool size & 4096 & 8192 & 16384 & 32768 & 65536 & 131072 & 262144\\
 812     \hline
 813     \hline
 814 (NJobs $\times$ NMachines)  & \multicolumn{7}{|c|}{Average speedup for each group of 10 instances}\\
 815     \hline
 816     \hline
 817 $200 \times 20$ & 66.13 & 87.34 & 88.86 & 95.23 & 98.83 & 99.89 & \textbf{100.48}\\
 818     \hline
 819 $100 \times 20$ & 65.85 & 86.33 & 87.60 & 89.18 & 91.41 & 92.02 & \textbf{92.39}\\
 820     \hline
 821 $50 \times 20$ & 64.91 & \textbf{81.50} & 78.02 & 74.16 & 73.83 & 73.25 & 72.71\\
 822     \hline
 823 $20 \times 20$ & 53.64 & \textbf{61.47} & 59.55 & 51.39 & 47.40 & 46.53 & 46.37\\
 824      \hline
 825     \hline
 826 % Average Speedup & 62.63 & 79.16 & 78.51 & 77.49 & 77.87 & 77.92 & 77.99\\
 827 %     \hline
 828 %     \hline
 829   \end{tabular}
 830  \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ and $JM$ are placed together in shared memory and all other data structures are placed in global memory.}
 831 \label{ch8:JM-PTM-on-SM}
 832 \end{table}
 833
 834 Based on a careful analysis of each data placement scenario on the memory hierarchy of the GPU, the recommendation is to place Johnson's matrix and the processing time matrix ($JM$ and $PTM$) in shared memory if they fit together. Otherwise, all or part of Johnson's matrix should be given priority in shared memory. The other data structures are mapped to global memory.
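
To make the condition ``if they fit together'' concrete, the following back-of-the-envelope estimate may help. It is only a sketch under assumptions that are not established by the experiments above: $PTM$ is taken to hold $n \times m$ entries, $JM$ is taken to hold $n \times \frac{m(m-1)}{2}$ entries (one Johnson ordering of the $n$ jobs per pair of machines), each entry is assumed to be stored on one byte, and the shared memory is assumed to be configured to 48~KB per multiprocessor, the larger of the two settings available on Fermi-class devices such as the Tesla C2050. For $n = 200$ jobs and $m = 20$ machines:

\[
\begin{array}{rcl}
|PTM| & = & n \times m = 200 \times 20 = 4000 \mbox{ bytes} \approx 3.9 \mbox{ KB},\\
|JM| & = & n \times \frac{m(m-1)}{2} = 200 \times 190 = 38000 \mbox{ bytes} \approx 37.1 \mbox{ KB},\\
|PTM| + |JM| & = & 42000 \mbox{ bytes} \approx 41.0 \mbox{ KB} < 48 \mbox{ KB}.
\end{array}
\]

Under these assumptions, both matrices fit together even for the largest instances considered. With the 16~KB shared memory configuration, or with entries stored on four bytes (which would give $|JM| = 152000$ bytes), $JM$ would no longer fit entirely, which corresponds to the case where only a part of it can be placed in shared memory.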
 835
 836 \section{Conclusion and Future Work}
 837 \label{ch8:Conclusion}
 838
 839 In this chapter, we have revisited the design of parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. Our contributions consist of: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performance of each and identifying which one best suits GPU accelerators; (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function; and (3) defining an optimal mapping of the data structures of the bounding function onto the hierarchy of memories provided by the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency).
 840
 841 In the first approach, the parallel tree-exploration-based B\&B, a set of nodes is selected from the list of pending nodes according to their depth and off-loaded to the GPU, where each thread builds its own local search tree by applying
 842 the branching, bounding and pruning operators to the assigned node. In the second approach, the GPU-accelerated B\&B based on the parallel evaluation of bounds, the generation of the sub-problems (branching, selection and pruning operations) is performed on the CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. Pools of sub-problems are off-loaded from the CPU to the GPU to be evaluated by blocks of threads. After evaluation, the lower bounds are returned to the CPU.
 843
 844 In both approaches, our focus is on the GPU-based implementation of the lower bound and the associated thread divergence and data placement challenges. The proposed mechanisms for reducing thread divergence are based on a thorough analysis of the different loops and conditional instructions of the lower bound function. On the one hand, the sorting process aims to homogenize the data of the sub-problems off-loaded to the GPU in order to minimize the number of threads that diverge on loop instructions. On the other hand, the branch refactoring technique rewrites the conditional instructions into uniform instructions so that threads of the same warp execute the same code. The proposed data access optimization is based on a preliminary analysis of the lower bound function. This analysis allowed us to identify six data structures, for which we have proposed a complexity analysis in terms of memory size and access frequency. Due to the limited size of the shared memory, the matrices do not all fit in it together. According to the complexity study, the recommendation is to place Johnson's matrix and the processing time matrix ($JM$ and $PTM$) in shared memory if they fit together. Otherwise, all or part of Johnson's matrix should be given priority in shared memory. The other data structures are mapped to global memory. This recommendation has been confirmed through extensive experiments using a Tesla C2050 GPU card.
 845
 846 The Flowshop Scheduling Problem (FSP) has been considered as a case study. The proposed approaches have been evaluated using a Tesla C2050 GPU card on different classes of FSP instances. The experimental results show that the parallel evaluation of bounds is the parallelization paradigm that performs better on GPU accelerators. Compared to the parallel tree-exploration model, accelerations up to $\times$5.56 are achieved.
 847
 848 Experiments also show that the proposed refactoring approach improves the parallel efficiency whatever the FSP instance and the pool size. However, the improvement is moderate because the factorized part of the branches in the FSP lower bound is very small. The proposed thread divergence reduction mechanisms allowed us to achieve accelerations up to $\times$77.46 compared to a sequential B\&B. The data access optimizations yield accelerations up to $\times 100$ compared to a single CPU-based B\&B.
 849
 850 In the near future, we plan to extend this work to a cluster of GPU-accelerated multi-core processors. From the application point of view, the objective is to optimally solve challenging and still unsolved Flow-Shop instances, as we did for one 50$\times$20 problem instance with grid computing \cite{ch8:Mezmaz_2007}. Finally, we plan to investigate other lower bound functions to deal with other combinatorial optimization problems.
 851
 852 \putbib[Chapters/chapter8/biblio8]