new

[book_gpu.git] / BookGPU / Chapters / chapter8 / ch8.tex
diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex

index d131b53d1e8970aebf18e705f669ea18aa66acb4..ff9c80d5774630653a3d720cd998bea26b3d2494 100644 (file)
--- a/BookGPU/Chapters/chapter8/ch8.tex
+++ b/BookGPU/Chapters/chapter8/ch8.tex
@@ -1,5 +1,5 @@
  
  
-\chapterauthor{Imen Chakroun and Nouredine Melab}{University of Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique, 59655 Villeneuve d'Ascq cedex, France\\}
+\chapterauthor{Imen Chakroun and Nouredine Melab}{University of Lille 1, CNRS/LIFL/INRIA, France}
  %\chapterauthor{Nouredine Melab}{Universit\'e Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique - 59655, Villeneuve d'Ascq cedex, France\\}
  
  \chapter{GPU-accelerated tree-based exact optimization methods}
  %\chapterauthor{Nouredine Melab}{Universit\'e Lille 1 CNRS/LIFL, INRIA Lille Nord Europe, Cit\'e scientifique - 59655, Villeneuve d'Ascq cedex, France\\}
  
  \chapter{GPU-accelerated tree-based exact optimization methods}
@@ -78,7 +78,7 @@ Set the Best\_Solution to $\emptyset$; \\
      }
  }
  
      }
  }
  
-\caption{general template of the branch-and-bound algorithm.}
+\caption{general template of the branch-and-bound algorithm}
  \label{ch8:algoBB}
  \end{algorithm}
  
  \label{ch8:algoBB}
  \end{algorithm}
  
@@ -88,7 +88,7 @@ Set the Best\_Solution to $\emptyset$; \\
  Thanks to the bounding operator, B\&B allows the significant reduction of the computing time needed to explore the whole solution space. However, finding an optimal solution for large instances remains impractical using a sequential B\&B. Therefore, parallel processing of these algorithms has been widely studied in the literature. In \cite{ch8:MelabHDR_2005}, a taxonomy of the various existing parallel paradigms used to parallelize the B\&B algorithm is presented.
  
  
  Thanks to the bounding operator, B\&B allows the significant reduction of the computing time needed to explore the whole solution space. However, finding an optimal solution for large instances remains impractical using a sequential B\&B. Therefore, parallel processing of these algorithms has been widely studied in the literature. In \cite{ch8:MelabHDR_2005}, a taxonomy of the various existing parallel paradigms used to parallelize the B\&B algorithm is presented.
  
  
-This taxonomy based on the classification proposed in \cite{ch8:Gendron_1994} identified several models to accelerate the B\&B search. The first model we consider in this chapter is called ``parallel tree exploration model'' and belongs to the ``tree-based'' strategies that aim to build and explore the B\&B tree in parallel. The second model called ``parallel evaluation of bounds model'' (evaluation of bounds in parallel) belong to the parallelization approach called ``node-based''. This strategy aims to accelerate the execution of a particular operation at the node level.
+This taxonomy based on the classification proposed in \cite{ch8:Gendron_1994} identified several models to accelerate the B\&B search. The first model we consider in this chapter is called ``parallel tree exploration model'' and belongs to the ``tree-based'' strategies that aim to build and explore the B\&B tree in parallel. The second model called ``parallel evaluation of bounds model'' (evaluation of bounds in parallel) belong to the parallelization approach called ``node-based''. This strategy aims to accelerate the execution of a particular operation at the node level.\eject
  
  
  \subsection{The parallel tree exploration model}
  
  
  \subsection{The parallel tree exploration model}
@@ -115,7 +115,8 @@ Node-based strategies introduce parallelism when performing the operations on a
  
  The parallel evaluation of bounds \index{parallel!evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order and the number of explored subproblems in the parallel B\&B algorithm.
  
  
  The parallel evaluation of bounds \index{parallel!evaluation of bounds} model, as shown in Figure \ref{ch8:bounds_parallel}, allows the parallelization of the bounding of subproblems generated by the branching operator. This model is used in the case where the bounding operator is performed several times after the branching operator. Compared to the sequential B\&B, the model does not change the order and the number of explored subproblems in the parallel B\&B algorithm.
  
-\begin{figure}
+\clearpage
+\begin{figure}[t!]
    \begin{center}
  \includegraphics[scale=0.5]{Chapters/chapter8/figures/parallel_bounding.eps}%
  
    \begin{center}
  \includegraphics[scale=0.5]{Chapters/chapter8/figures/parallel_bounding.eps}%
  
@@ -237,7 +238,7 @@ In the following, we present how we dealt with the thread/branch divergence issu
  During the execution of an application on GPU,  one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each  instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{GPU!thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
  
  
  During the execution of an application on GPU,  one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each  instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{GPU!thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.
  
  
-This section discusses thread divergence issues encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. 
+This section discusses thread divergence issues encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. \\
  
  
  \noindent \textbf{Divergence related to the location of nodes}\\
  
  
  \noindent \textbf{Divergence related to the location of nodes}\\
@@ -427,6 +428,7 @@ In addition, in order to avoid relaunching the Johnson's algorithm for each coup
  
  To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l})$ in the LB expression, two matrices are defined, namely RM and QM. They are used to store, respectively, the lowest starting and latency times of all the jobs on each machine. Their dimension is $m$ and, are accessed $ m \times (m-1)$ times and $ \frac{m \times (m-1)}{2}$ times, respectively.
  
  
  To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l})$ in the LB expression, two matrices are defined, namely RM and QM. They are used to store, respectively, the lowest starting and latency times of all the jobs on each machine. Their dimension is $m$ and, are accessed $ m \times (m-1)$ times and $ \frac{m \times (m-1)}{2}$ times, respectively.
  
+\clearpage
  \begin{table}
    \centering
  \begin{tabular}{|c|c|c|}
  \begin{table}
    \centering
  \begin{tabular}{|c|c|c|}
@@ -531,7 +533,7 @@ Using the approach defined in \cite{ch8:Mezmaz_2007}, it is possible to obtain a
  \begin{itemize}
  \item compute, using the approach defined in \cite{ch8:Mezmaz_2007}, a list $L$ of subproblems such as the resolution of $L$ lasts $T$ minutes with a sequential B\&B;
  \item initialize the pool of our sequential B\&B with the subproblems of this list $L$;
  \begin{itemize}
  \item compute, using the approach defined in \cite{ch8:Mezmaz_2007}, a list $L$ of subproblems such as the resolution of $L$ lasts $T$ minutes with a sequential B\&B;
  \item initialize the pool of our sequential B\&B with the subproblems of this list $L$;
-\item solve the subproblems of this pool with our sequential B\&B ,
+\item solve the subproblems of this pool with our sequential B\&B;
  \item get the sequential resolution time $T{cpu}$ and the number of explored subproblems $N{cpu}$;
  \item check that $T{cpu}$ is approximately equal to $T$;
  \item initialize the pool of our GPU B\&B with the subproblems of the list $L$;
  \item get the sequential resolution time $T{cpu}$ and the number of explored subproblems $N{cpu}$;
  \item check that $T{cpu}$ is approximately equal to $T$;
  \item initialize the pool of our GPU B\&B with the subproblems of the list $L$;