add ch17

[book_gpu.git] / BookGPU / Chapters / chapter8 / ch8.tex
diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex

index a4aaac3b429e515bd6c0c220da6c3589be417170..9352b567457b4a6b9bf16d2e5669450ea640ebb2 100644 (file)
--- a/BookGPU/Chapters/chapter8/ch8.tex
+++ b/BookGPU/Chapters/chapter8/ch8.tex
@@ -15,7 +15,7 @@ Although this bounding mechanism allows to considerably reduce the exploration t
  
  \vspace{0.3cm}
  
  
  \vspace{0.3cm}
  
-In this work, we rethink the design and implementation of irregular tree-based algorithms such as B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of new generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists in exploring in parallel different sub-spaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation \index{Flowshop Scheduling Problem} (FSP)(see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.
+In this work, we rethink the design and implementation of irregular tree-based algorithms such as B\&B algorithm on top of GPUs. During the execution of the B\&B algorithm, the number of new generated nodes and the number of not yet explored but promising nodes are variable and depend on the level of the tree being explored and on the best solution found so far. Therefore, due to such unstructured and unpredictable nature of its search tree, designing efficient B\&B on top of GPUs is not straightforward. We investigate two different approaches for designing GPU-based B\&B starting from the parallel models for B\&B identified in \cite{ch8:MelabHDR_2005}. The first one is based on the ``parallel tree exploration'' paradigm. This approach consists in exploring in parallel different sub-spaces of the tree. The second approach is based on the ``parallel evaluation of bounds'' approach. The two approaches have been applied to the permutation Flowshop Scheduling Problem \index{Flowshop Scheduling Problem} (FSP)(see Section~\ref{ch8:BB-FSP}) which is an NP-hard combinatorial optimization problem. The lower bound function used in this work for FSP is the one proposed in~\cite{ch8:Johnson_1954} for two machines and generalized in~\cite{ch8:Lenstra_1978} to more than two machines.
  
  \vspace{0.3cm}
  
  
  \vspace{0.3cm}
  
@@ -29,7 +29,7 @@ The scope of this chapter is to design parallel B\&B algorithms on GPU accelerat
  
  The chapter is organized in seven main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the Flowshop Scheduling permutation Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2},  details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds, are given. In Section \ref{ch8:ThreadDivergence}, the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator is described. In Section \ref{ch8:DataAccessOpt}, the memory access optimization challenge is addressed and an overview of the GPU memory hierarchy and the used memory access pattern is given. In Section~\ref{ch8:Experiments}, we report experimental results showing the performances of each of two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations.
  
  
  The chapter is organized in seven main sections. Section \ref{ch8:BB} presents the B\&B algorithm. Section \ref{ch8:Parallel-BB} introduces the different models used to parallelize B\&B algorithms. Section \ref{ch8:BB-FSP} briefly describes the Flowshop Scheduling permutation Problem. In Section~\ref{ch8:approach1}, we describe the GPU-accelerated B\&B based on the parallel tree exploration. In Section~\ref{ch8:approach2},  details about the second approach, the GPU-accelerated B\&B based on the parallel evaluation of lower bounds, are given. In Section \ref{ch8:ThreadDivergence}, the thread divergence issue related to the location of nodes in the B\&B tree and to the control flow instructions within the bounding operator is described. In Section \ref{ch8:DataAccessOpt}, the memory access optimization challenge is addressed and an overview of the GPU memory hierarchy and the used memory access pattern is given. In Section~\ref{ch8:Experiments}, we report experimental results showing the performances of each of two studied approaches compared to a sequential CPU-based execution of the B\&B and demonstrating the efficiency of the proposed optimizations.
  
-\section{\index{Branch-and-Bound} algorithm}
+\section{Branch-and-Bound \index{Branch-and-Bound} algorithm}
  \label{ch8:BB}
  
  Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow to find the optimal solution of a problem with proof of optimality. 
  \label{ch8:BB}
  
  Branch-and-bound algorithms are by far the most widely used methods for exactly solving large scale NP-hard combinatorial optimization problems. Indeed, they allow to find the optimal solution of a problem with proof of optimality. 
@@ -109,7 +109,7 @@ This taxonomy based on the classification proposed in \cite{ch8:Gendron_1994} id
  
  Tree-based strategies consist in building and/or exploring the solution tree in parallel by performing operations on several sub-problems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\
  
  
  Tree-based strategies consist in building and/or exploring the solution tree in parallel by performing operations on several sub-problems simultaneously. This coarse-grained type of parallelism affects the general structure of the B\&B algorithm and makes it highly irregular.\\
  
-The \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists in visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding and elimination operators on several sub-problems simultaneously.\\
+The parallel tree exploration \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:parallel_tree}, consists in visiting in parallel different paths of the same tree. The search tree is explored in parallel by performing the branching, selection, bounding and elimination operators on several sub-problems simultaneously.\\
  
  \begin{figure}
    \begin{center}
  
  \begin{figure}
    \begin{center}
@@ -125,7 +125,7 @@ The \index{parallel tree exploration} model, illustrated in Figure \ref{ch8:para
  
  Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist in executing the bounding operation in parallel for each sub-problem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\
  
  
  Node-based strategies introduce parallelism when performing the operations on a single problem. For instance, they consist in executing the bounding operation in parallel for each sub-problem to accelerate the execution. This type of parallelism has no influence on the general structure of the B\&B algorithm and is particular to the problem being solved.\\
  
-The \index{parallel evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of sub-problems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. The model does not change the order and the number of explored sub-problems in the parallel B\&B algorithm compared to the sequential B\&B.
+The parallel evaluation of bounds \index{parallel evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of sub-problems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. The model does not change the order and the number of explored sub-problems in the parallel B\&B algorithm compared to the sequential B\&B.
  
  \begin{figure}
    \begin{center}
  
  \begin{figure}
    \begin{center}
@@ -168,7 +168,7 @@ Figure~\ref{flow-shop} illustrates a solution of a flow-shop problem instance de
  
  \vspace{0.3cm}
  
  
  \vspace{0.3cm}
  
-\subsection{\index{Lower Bound} for the Flowshop Scheduling Problem}
+\subsection{Lower Bound \index{Lower Bound} for the Flowshop Scheduling Problem}
  \label{ch8:LB-FSP}
  
  The lower bounding technique provides a lower bound (LB) for each sub-problem generated by the branching operator. The more the bound is accurate, the more it allows to eliminate not promising nodes from the search tree. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}.
  \label{ch8:LB-FSP}
  
  The lower bounding technique provides a lower bound (LB) for each sub-problem generated by the branching operator. The more the bound is accurate, the more it allows to eliminate not promising nodes from the search tree. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}.
@@ -254,8 +254,7 @@ In the following, we present how we dealt with the thread/branch divergence issu
  
  \vspace{-0.4cm}
  
  
  \vspace{-0.4cm}
  
-\section{\index{Thread divergence}}
-\label{ch8:ThreadDivergence}
+\section{Thread divergence \index{Thread divergence}}
  
  \subsection{The thread divergence issue}
  
  
  \subsection{The thread divergence issue}
  
@@ -451,7 +450,7 @@ The same transformations as those applied for the first scenario are applied her
  \section{Memory access optimization}
  \label{ch8:DataAccessOpt}
  
  \section{Memory access optimization}
  \label{ch8:DataAccessOpt}
  
-\index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory grants programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device.   
+Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory grants programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device.   
  
  \vspace{0.2cm} 
  
  
  \vspace{0.2cm}