From f58a76268a01dff4b9427557dba6f052787415d5 Mon Sep 17 00:00:00 2001 From: couturie Date: Mon, 16 Sep 2013 20:48:41 +0200 Subject: [PATCH] new --- BookGPU/Chapters/chapter6/PartieAsync.tex | 2 +- BookGPU/Chapters/chapter6/PartieSync.tex | 4 +- BookGPU/Chapters/chapter8/biblio8.bib | 2 +- BookGPU/Chapters/chapter8/ch8.tex | 83 ++++++++++++----------- 4 files changed, 46 insertions(+), 45 deletions(-) diff --git a/BookGPU/Chapters/chapter6/PartieAsync.tex b/BookGPU/Chapters/chapter6/PartieAsync.tex index 1623f8c..d8da9a4 100644 --- a/BookGPU/Chapters/chapter6/PartieAsync.tex +++ b/BookGPU/Chapters/chapter6/PartieAsync.tex @@ -621,7 +621,7 @@ execution. They are similar to the mechanism used for managing the end messages at the end of the entire process. Line~23 directly updates the number of other nodes that are in local convergence by adding the received state of the source node. This is possible due to the encoding that is used to -represent the local convergence (1) and the non convergence (0). +represent the local convergence (1) and the nonconvergence (0). %\begin{algorithm}[H] % \caption{Reception function in the synchronized scheme.} diff --git a/BookGPU/Chapters/chapter6/PartieSync.tex b/BookGPU/Chapters/chapter6/PartieSync.tex index d8d281c..ce21565 100755 --- a/BookGPU/Chapters/chapter6/PartieSync.tex +++ b/BookGPU/Chapters/chapter6/PartieSync.tex @@ -97,7 +97,7 @@ parallel programming schemes on a GPU cluster: Using CUDA\index{CUDA}, GPU kernel executions are nonblocking, and GPU/CPU data transfers\index{CUDA!data transfer} are blocking or nonblocking operations. All GPU kernel executions and CPU/GPU -data transfers are associated to "streams,"\index{CUDA!stream} and all operations on a same stream +data transfers are associated to ``streams'',\index{CUDA!stream} and all operations on a same stream are serialized. When transferring data from the CPU to the GPU, then running GPU computations, and finally transferring results from the GPU to the CPU, there is a natural synchronization and serialization if these operations are achieved on @@ -489,7 +489,7 @@ working on independent subsets of data. \Lst{algo:ch6p1overlapstreamsequence} is not so generic as \Lst{algo:ch6p1overlapseqsequence}. -\subsection{Interleaved communications-transfers-computations\\overlapping} +\subsection{Interleaved communications-transfers-computations overlapping} Many algorithms do not support splitting data transfers and kernel calls, and cannot exploit CUDA streams, for example, when each GPU thread requires access to diff --git a/BookGPU/Chapters/chapter8/biblio8.bib b/BookGPU/Chapters/chapter8/biblio8.bib index a915ef3..623c025 100644 --- a/BookGPU/Chapters/chapter8/biblio8.bib +++ b/BookGPU/Chapters/chapter8/biblio8.bib @@ -165,6 +165,6 @@ pages = "1-9", @misc{ch8:cuda, author = {{NVIDIA Corporation}}, keywords = {CUDA}, - title = {{NVIDIA CUDA C} Programming Guide, Version 4.0 }, + title = {{NVIDIA CUDA C} {P}rogramming {G}uide, Version 4.0 }, year = 2011 } diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex index 7570035..d131b53 100644 --- a/BookGPU/Chapters/chapter8/ch8.tex +++ b/BookGPU/Chapters/chapter8/ch8.tex @@ -7,27 +7,27 @@ \section{Introduction} \label{ch8:introduction} -In practice, a wide range of problems can be modeled as NP-hard combinatorial optimization problems (COPs). Those problems consist of choosing the best combination out of a large finite set of possible combinations and are known to be large in size and difficult to solve optimality. One of the most popular methods for solving exactly a COP (finding a solution having the optimal cost, is the Branch-and-Bound (B\&B) algorithm. This algorithm is based on an implicit enumeration of all the feasible solutions of the tackled problem. Enumerating the solutions of a problem consists of building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such a tree and its exploration is performed using four operators: branching, bounding, selection, and pruning. Due to the exponentially increasing number of potential solutions, the B\&B algorithm explores only promising nodes of the search tree using an estimated optimal solution called ``lower bound'' of the associated subproblem.\\ +In practice, a wide range of problems can be modeled as NP-hard combinatorial optimization problems (COPs). Those problems consist of choosing the best combination out of a large finite set of possible combinations and are known to be large in size and difficult to solve optimality. One of the most popular methods for solving exactly a COP (finding a solution having the optimal cost, is the Branch-and-Bound (B\&B) algorithm. This algorithm is based on an implicit enumeration of all the feasible solutions of the tackled problem. Enumerating the solutions of a problem consists of building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such a tree and its exploration is performed using four operators: branching, bounding, selection, and pruning. Due to the exponentially increasing number of potential solutions, the B\&B algorithm explores only promising nodes of the search tree using an estimated optimal solution called ``lower bound'' of the associated subproblem. -Although this bounding mechanism allows the considerable reduction of the exploration time, often only small or moderatelysized instances of COPs can be practically solved. For this reason, over the last decades, parallel computing has been revealed as an attractive way to deal with larger instances of COPs. However, while many contributions have been proposed for parallel B\&B methods using massively parallel processors \cite{ch8:Allen_1997}, networks or clusters of workstations \cite{ch8:Quinn_1990}, and Shared Memory Multiprocessors (SMP) machines \cite{ch8:Casadoa_2008}, very few contributions have been proposed for redesigning B\&B algorithms on Graphical Processing Units (GPUs) \cite{ch8:Carneiro_2011}. For years, the use of GPU accelerators was limited to graphics and video applications. Driven by the demand for high-definition 3D graphics on personal computers, GPUs have evolved into a highly parallel, multithreaded and many-core environment. Their utilization has recently been extended to other application domains such as scientific computing \cite{ch8:Kurzak_2010}.\\ +Although this bounding mechanism allows the considerable reduction of the exploration time, often only small or moderatelysized instances of COPs can be practically solved. For this reason, over the last decades, parallel computing has been revealed as an attractive way to deal with larger instances of COPs. However, while many contributions have been proposed for parallel B\&B methods using massively parallel processors \cite{ch8:Allen_1997}, networks or clusters of workstations \cite{ch8:Quinn_1990}, and Shared Memory Multiprocessors (SMP) machines \cite{ch8:Casadoa_2008}, very few contributions have been proposed for redesigning B\&B algorithms on Graphical Processing Units (GPUs) \cite{ch8:Carneiro_2011}. For years, the use of GPU accelerators was limited to graphics and video applications. Driven by the demand for high-definition 3D graphics on personal computers, GPUs have evolved into a highly parallel, multithreaded and many-core environment. Their utilization has recently been extended to other application domains such as scientific computing \cite{ch8:Kurzak_2010}. -In this chapter, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists of exploring in parallel different subspaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{flowshop scheduling problem} (FSP; see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.\\ +In this chapter, we rethink the design and implementation of irregular tree-based algorithms such as the B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of newly generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists of exploring in parallel different subspaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{flowshop scheduling problem} (FSP; see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines. -When rethinking those two parallel models for GPU's architectures, our main focus was on the lower bound function. Indeed, preliminary experiments we carried out on some of Taillard's problem instances \cite{ch8:Taillard_1993} show that computing the lower bounds takes on average between 98\% and 99\% of the total execution time of the B\&B. The GPU-based lower bound's implementation raises mainly two challenges. On the one hand, having in mind that the execution model of GPUs is Single Instruction Multiple Data (SIMD), irregular computations (containing loops and conditional instructions) contained in the lower bound function may lead to a very challenging issue: the thread or branch divergence. This problem drops down the performance and arises when threads of a same warp (the smallest executable unit of parallelism on the GPU) execute different data-dependent instructions. On the other hand, the lower bound computation usually uses large and frequently accessed data structures. Since GPU is a many-core coprocessor device that provides a hierarchy of memories having different sizes and access latencies, the placement and sharing of these data sets become challenging.\\ +When rethinking those two parallel models for GPU's architectures, our main focus was on the lower bound function. Indeed, preliminary experiments we carried out on some of Taillard's problem instances \cite{ch8:Taillard_1993} show that computing the lower bounds takes on average between 98\% and 99\% of the total execution time of the B\&B. The GPU-based lower bound's implementation raises mainly two challenges. On the one hand, having in mind that the execution model of GPUs is Single Instruction Multiple Data (SIMD), irregular computations (containing loops and conditional instructions) contained in the lower bound function may lead to a very challenging issue: the thread or branch divergence. This problem drops down the performance and arises when threads of a same warp (the smallest executable unit of parallelism on the GPU) execute different data-dependent instructions. On the other hand, the lower bound computation usually uses large and frequently accessed data structures. Since GPU is a many-core coprocessor device that provides a hierarchy of memories having different sizes and access latencies, the placement and sharing of these data sets become challenging. -The scope of this chapter is to design parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. To do so, our contributions consist of: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performances of each and identifying which best suits the GPU accelerators, (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function, and (3) defining an optimal mapping of the data structures of the bounding function on the hierarchy of memories provided in the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency). \\ +The scope of this chapter is to design parallel B\&B algorithms on GPU accelerators to allow highly efficient solving of permutation-based COPs. To do so, our contributions consist of: (1) rethinking two approaches for parallel B\&B on top of GPUs, discussing the performances of each and identifying which best suits the GPU accelerators, (2) proposing a new approach for thread/branch divergence reduction through a thorough analysis of the different loops and conditional instructions of the bounding function, and (3) defining an optimal mapping of the data structures of the bounding function on the hierarchy of memories provided in the GPU device through a careful analysis of both the data structures (size and access frequency) and the GPU memories (size and access latency). The chapter is organized into seven main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the Flowshop Scheduling permutation Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2}, details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds, are given. In Section \ref{ch8:ThreadDivergence}, the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator is described. In Section \ref{ch8:DataAccessOpt}, the memory access optimization challenge is addressed and an overview of the GPU memory hierarchy and the used memory access pattern is given. In Section~\ref{ch8:Experiments}, we report experimental results showing the performances of each of two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations. - +\clearpage \section{Branch-and-bound \index{branch-and-bound} algorithm} \label{ch8:BB} -Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow the finding of the optimal solution of a problem with proof of optimality. \\ +Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow the finding of the optimal solution of a problem with proof of optimality. -The basic idea of the B\&B algorithm consists in implicitly enumerating all the solutions of the original problem by only examining a subset of feasible solutions and eliminating the others when they are not likely to lead to a feasible or an optimal solution. Enumerating the solutions of a problem consists of building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such tree and its exploration are performed using four operators: branching, bounding, selection and pruning.\\ +The basic idea of the B\&B algorithm consists in implicitly enumerating all the solutions of the original problem by only examining a subset of feasible solutions and eliminating the others when they are not likely to lead to a feasible or an optimal solution. Enumerating the solutions of a problem consists of building a dynamically generated search tree whose nodes are subsets of solutions of the considered problem. The construction of such tree and its exploration are performed using four operators: branching, bounding, selection and pruning. The algorithm proceeds in several iterations during which the best solution found so far is progressively improved. During the exploration process, the search space is analyzed by a pool of unexplored nodes and the best solution found so far. The generated and not yet examined (pending) nodes are kept in a list initialized with the original problem. At each iteration of the algorithm, the following steps are performed: @@ -85,7 +85,7 @@ Set the Best\_Solution to $\emptyset$; \\ \section{Parallel branch-and-bound algorithms} \label{ch8:Parallel-BB} -Thanks to the bounding operator, B\&B allows the significant reduction of the computing time needed to explore the whole solution space. However, finding an optimal solution for large instances remains impractical using a sequential B\&B. Therefore, parallel processing of these algorithms has been widely studied in the literature. In \cite{ch8:MelabHDR_2005}, a taxonomy of the various existing parallel paradigms used to parallelize the B\&B algorithm is presented.\\ +Thanks to the bounding operator, B\&B allows the significant reduction of the computing time needed to explore the whole solution space. However, finding an optimal solution for large instances remains impractical using a sequential B\&B. Therefore, parallel processing of these algorithms has been widely studied in the literature. In \cite{ch8:MelabHDR_2005}, a taxonomy of the various existing parallel paradigms used to parallelize the B\&B algorithm is presented. This taxonomy based on the classification proposed in \cite{ch8:Gendron_1994} identified several models to accelerate the B\&B search. The first model we consider in this chapter is called ``parallel tree exploration model'' and belongs to the ``tree-based'' strategies that aim to build and explore the B\&B tree in parallel. The second model called ``parallel evaluation of bounds model'' (evaluation of bounds in parallel) belong to the parallelization approach called ``node-based''. This strategy aims to accelerate the execution of a particular operation at the node level. @@ -94,9 +94,9 @@ This taxonomy based on the classification proposed in \cite{ch8:Gendron_1994} id \subsection{The parallel tree exploration model} \label{ch8:para_tree} -Tree-based strategies consist of building and/or exploring the solution tree in parallel by performing operations on several subproblems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\ +Tree-based strategies consist of building and/or exploring the solution tree in parallel by performing operations on several subproblems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular. -The parallel tree exploration \index{parallel!tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists of visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding, and elimination operators on several subproblems simultaneously.\\ +The parallel tree exploration \index{parallel!tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists of visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding, and elimination operators on several subproblems simultaneously. \begin{figure} \begin{center} @@ -107,10 +107,11 @@ The parallel tree exploration \index{parallel!tree exploration} model, illustrat \end{center} \end{figure} + \subsection{The parallel evaluation of bounds model} \label{ch8:Node_parallel} -Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist of executing the bounding operation in parallel for each subproblem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\ +Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist of executing the bounding operation in parallel for each subproblem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved. The parallel evaluation of bounds \index{parallel!evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order and the number of explored subproblems in the parallel B\&B algorithm. @@ -129,7 +130,7 @@ The parallel evaluation of bounds \index{parallel!evaluation of bounds} model, a \subsection{Definition of the flowshop scheduling problem} As a case study for our GPU-based branch-and-bound algorithm, we considered the NP-hard and well-known problem in the scheduling theory: the ``Permutation Flow-shop Scheduling Problem'' (FSP). -In this work, the mono-objective case is considered. The FSP aims to find the optimal schedule of $n$ jobs on $m$ machines so that the overall completion time of all jobs, called {\it makespan}, is minimized. \\ +In this work, the mono-objective case is considered. The FSP aims to find the optimal schedule of $n$ jobs on $m$ machines so that the overall completion time of all jobs, called {\it makespan}, is minimized. Let us suppose the set of jobs is represented by $J = \{j_1, j_2, \dots, j_n\}$ and the set of machines is represented by $M=\{m_1,m_2, \dots, m_m\}$ organized in the line. Each job $j_i$ is a sequence of operations $j_i = { oi_1, oi_2, \dots, oi_m }$ where $oi_m$ is the duration required for the job $j_i$ on the machine $m$. @@ -153,7 +154,7 @@ Figure~\ref{flow-shop} illustrates a solution of a flow-shop problem instance de \subsection{Lower bound \index{lower bound} for the flowshop scheduling problem} -The lower bounding technique provides a lower bound (LB) for each subproblem generated by the branching operator. The more the bound is accurate, the more it allows the elimination from the search tree that are not promising. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra et al.~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}.\\ +The lower bounding technique provides a lower bound (LB) for each subproblem generated by the branching operator. The more the bound is accurate, the more it allows the elimination from the search tree that are not promising. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra et al.~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}. The Johnson's algorithm allows the optimal solution of FSP with two machines ($m=2$) using the following transitive rule $\preceq$: @@ -161,12 +162,12 @@ The Johnson's algorithm allows the optimal solution of FSP with two machines ($ $$J_i \preceq J_j \Leftrightarrow \min(p_{i,1}\ ;\ p_{j,2}) \leq \min(p_{i,2}\ ;\ p_{j,1})$$ -We recall that $p_{k,l}$ designates the processing time of the job $J_k$ on the machine $M_l$. From the above rule, follows the Johnson's theorem: \\ +We recall that $p_{k,l}$ designates the processing time of the job $J_k$ on the machine $M_l$. From the above rule, follows the Johnson's theorem: -\textbf{Jonhson's theorem} \emph{Given $P$ and FSP with $m=2$, if $J_i\preceq J_j$, there exists an optimal schedule for $P$ in which job $J_i$ precedes job $J_j$.}\\ +\textbf{Johnson's theorem} \emph{Given $P$ and FSP with $m=2$, if $J_i\preceq J_j$, there exists an optimal schedule for $P$ in which job $J_i$ precedes job $J_j$.} According to Johnson's theorem, FSP with $m=2$ is solved with a time complexity of $O(n.log n)$. The optimal solution is obtained by first sorting in increasing order the jobs having a -processing time shorter on the first machine than on the second one, and second, sorting in decreasing order the jobs having a shorter processing time on the second machine. \\ +processing time shorter on the first machine than on the second one, and second, sorting in decreasing order the jobs having a shorter processing time on the second machine. In~\cite{ch8:JRJackson_1956} and~\cite{ch8:LGMitten_1959}, the Johnson's rule extended by Jackson and Mitten with lags which further allowed Lenstra et al. to propose a lower bound for FSP with $m \geq 3$. A lag~$l_j$ designates the minimum duration between the starting time of the job $J_j$ on the second machine and its finishing time on the first machine. Jackson and Mitten demonstrated that the optimal solution for FSP with $m=2$ can be obtained using the following transitive rule $\preceq$: @@ -194,7 +195,7 @@ all the machines between~$k$~and~$l$. \section{GPU-accelerated B\&B based on the parallel tree exploration (GPU-PTE-BB)} \label{ch8:approach1} -The first approach we investigate for designing B\&B on GPUs consists of exploring in parallel the generated search tree. The idea is to divide the global search space into disjoint sub-spaces that are explored in parallel by the GPU threads. As explained in Section \ref{ch8:BB}, during the execution of a B\&B, the search space is described by a list of unexplored (pending) nodes and the best solution found so far. In the considered GPU-based scheme, a set of parent nodes is selected from this list according to their depth: deepest pending nodes are the first selected. The selected pool of nodes is off-loaded to the GPU where each thread builds its own local search tree by applying the {\it branching}, {\it bounding}, and {\it pruning} operators to the assigned node.\\ +The first approach we investigate for designing B\&B on GPUs consists of exploring in parallel the generated search tree. The idea is to divide the global search space into disjoint sub-spaces that are explored in parallel by the GPU threads. As explained in Section \ref{ch8:BB}, during the execution of a B\&B, the search space is described by a list of unexplored (pending) nodes and the best solution found so far. In the considered GPU-based scheme, a set of parent nodes is selected from this list according to their depth: deepest pending nodes are the first selected. The selected pool of nodes is off-loaded to the GPU where each thread builds its own local search tree by applying the {\it branching}, {\it bounding}, and {\it pruning} operators to the assigned node. \begin{figure}[h!] @@ -233,10 +234,10 @@ In the following, we present how we dealt with the thread/branch divergence issu \subsection{The thread divergence issue} -During the execution of an application on GPU, one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{GPU!thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\ +During the execution of an application on GPU, one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{GPU!thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths. -This section discusses thread divergence issues encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. \\ +This section discusses thread divergence issues encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. \noindent \textbf{Divergence related to the location of nodes}\\ @@ -275,10 +276,10 @@ Control flow refers to the order in which the instructions, statements, or funct \end{itemize} In these two examples, thread\_idx is the index associated to the current thread. Let suppose that the code of Example 1 is executed by $32$ threads, pool[thread\_idx].begin is equal to $0$ for the first thread, and pool[thread\_idx].begin is not equal to $0$ for the other $31$ threads. When the first thread executes the statement time = TimeArrival[1];, -all the other $31$ threads remain idle. Therefore, the GPU cores on which these $31$ threads are executed remain idle and cannot be used during the execution of the statement time = TimeArrival[1];.\\ +all the other $31$ threads remain idle. Therefore, the GPU cores on which these $31$ threads are executed remain idle and cannot be used during the execution of the statement time = TimeArrival[1];. -The same scenario occurs during the execution of Example 2. Let us suppose that the instruction is executed by $32$ threads, pool[thread\_idx].begin is equal to $100$ for the first thread, and pool[thread\_idx].begin is equal to $0$ for the other $31$ threads. When the first thread executes the loop $for$, all the other $31$ threads remain idle. \\ +The same scenario occurs during the execution of Example 2. Let us suppose that the instruction is executed by $32$ threads, pool[thread\_idx].begin is equal to $100$ for the first thread, and pool[thread\_idx].begin is equal to $0$ for the other $31$ threads. When the first thread executes the loop $for$, all the other $31$ threads remain idle. Existing techniques for handling branch divergence either demand hardware support \cite{ch8:Fung} or require host-GPU interaction \cite{ch8:Zhang}, which incurs overhead. Some other works such as \cite{ch8:Han} intervene at the code level. They expose a branch distribution method that aims to reduce the divergent portion of a branch by factoring out structurally similar code from the branch paths. In our work, we have also opted for software-based optimizations as in \cite{ch8:Han}. In fact, we figure out how to literally rewrite the branching instructions into basic ones in order to make thread execution paths uniform. We also demonstrate that we could ameliorate performances only by judiciously reordering data being assigned to each thread. @@ -294,7 +295,7 @@ At each iteration of our GPU-accelerated B\&B approach, several thousands of sub \noindent \textbf{Branch refactoring}\\ -As stated above, thread or branch divergence occurs when the kernel includes conditional instructions and loops that make the threads performing different control flows lead to their serial execution. In this chapter, we investigate the branch refactoring approach to deal with thread divergence. Branch refactoring consists of rewriting the conditional instructions so that threads of the same warp execute a uniform code avoiding their divergence. To do that, two major if scenarios are studied and some optimizations are proposed accordingly. These two scenarios correspond to the conditional instructions contained in the LB kernel code. In the first scenario, the conditional expression is a comparison of the content of a variable to 0. For instance, the following example extracted from the pseudocode of the lower bound LB illustrates such a scenario.\\ +As stated above, thread or branch divergence occurs when the kernel includes conditional instructions and loops that make the threads performing different control flows lead to their serial execution. In this chapter, we investigate the branch refactoring approach to deal with thread divergence. Branch refactoring consists of rewriting the conditional instructions so that threads of the same warp execute a uniform code avoiding their divergence. To do that, two major if scenarios are studied and some optimizations are proposed accordingly. These two scenarios correspond to the conditional instructions contained in the LB kernel code. In the first scenario, the conditional expression is a comparison of the content of a variable to 0. For instance, the following example extracted from the pseudocode of the lower bound LB illustrates such a scenario. \begin{tabular}{l} @@ -306,7 +307,7 @@ As stated above, thread or branch divergence occurs when the kernel includes con \end{tabular} -The refactoring idea is to replace the conditional expression by two functions namely $f$ and $g$ as shown in Equation~\ref{ch8:Eq1}.\\ +The refactoring idea is to replace the conditional expression by two functions namely $f$ and $g$ as shown in Equation~\ref{ch8:Eq1}. The behavior of $f$ and $g$ fits the cosine trigonometric function. These functions return values between $0$ and $1$. An integer variable is used to store the result of the cosine function. Its value is $0$ or $1$ since it is rounded to $0$ if it is not equal to~$1$. In order to increase the performance the CUDA runtime math operations are used: sinf(x), expf(x), and so forth. Those functions are mapped directly to the hardware level~\cite{ch8:cuda}. They are faster but provide lower accuracy which does not matter in our case because the results are rounded to int. @@ -415,13 +416,13 @@ The data access optimization challenge is to find the best mapping of the data s \subsection{Complexity analysis of the memory usage of the lower bound } -In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix PTM of the processing times of the jobs, the matrix of lags LM, the Johnson's matrix JM, the matrix RM of the earliest starting times of jobs, the matrix QM of their lowest latency times, and the matrix MM containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent, respectively, the name of the data structure, its size, and the number of times it is accessed.\\ +In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix PTM of the processing times of the jobs, the matrix of lags LM, the Johnson's matrix JM, the matrix RM of the earliest starting times of jobs, the matrix QM of their lowest latency times, and the matrix MM containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent, respectively, the name of the data structure, its size, and the number of times it is accessed. -In the LB expression, the computation of the term $P_{Ja}^*(\jmath,M_k,M_l)$ requires the calculation of the lag of each remaining job to be scheduled on the couple $(M_k,M_l)$ of machines using its processing times on these machines (Johnson's rule with lags). Such computation is repeated for each couple $(M_k,M_l)$ of machines with $1 \leq k,l \leq m$ and $k