MM & $m \times (m-1)$ & $m \times (m-1)$ \\
\hline
\end{tabular}
-\vspace{0.5cm}
- \caption{The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses. The parameters $n$, $m$ and $n'$ designate respectively the total number of jobs, the total number of machines and the number of remaining jobs to be scheduled for the sub-problems the lower bound is being computed.}
+ \caption[The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses.]{The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses. The parameters $n$, $m$ and $n'$ designate respectively the total number of jobs, the total number of machines and the number of remaining jobs to be scheduled in the sub-problem for which the lower bound is being computed.}
\label{ch8:tabMemComplex}
\end{table}
To further improve performance, we also take care to use the global memory adequately by judiciously configuring the L1 cache, which greatly improves performance over direct access to global memory. Indeed, the GPU device used in our experiments is based on the NVIDIA Fermi architecture, which introduced two new levels in the memory hierarchy (L1 $/$ L2 caches)
compared to older architectures.
-\begin{table*}
+\begin{table}
\centering
\footnotesize
\begin{tabular}{|r|r|r|r|r|r|}
$20 \times 20$ & 3.800 (3.8KB) & 3.800 (7.6KB) & 400 (0.4KB) & 20 (0.04KB) & 380 (0.76KB) \\
\hline
\end{tabular}
-\vspace{0.5cm}
-\caption{The sizes of each data structure for the different experimented problem instances. The sizes are given in number of elements and in bytes (between brackets).}
+\caption[The sizes of each data structure for the different experimented problem instances.]{The sizes of each data structure for the different experimented problem instances. The sizes are given in number of elements and in bytes (between brackets).}
\label{ch8:tabMemSizes}
-\end{table*}
+\end{table}
\vspace{0.2cm}
Table \ref{ch8:instance_time} gives, for each instance according to its number of jobs and its number of machines, the resolution time used with a sequential B\&B. For example, the sequential resolution time of each instance defined with $20$ jobs and $20$ machines is approximately 10 minutes. Of course, the computation time of the lower bound of a sub-problem defined with $20$ jobs and $20$ machines is on average smaller than the computation time of the lower bound of a sub-problem defined with $50$ jobs and $20$ machines. Therefore, as shown in this table, the sequential resolution time is increased with the size of the instance, so as to ensure that the number of sub-problems explored is significant for all instances.
-\begin{table*}
+\begin{table}
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
Sequential resolution time (minutes) & 10 & 50 & 150 & 300 \\
\hline
\end{tabular}
-\vspace{0.3cm}
\caption{The sequential resolution time of each instance according to its number of jobs and machines}
\label{ch8:instance_time}
-\end{table*}
+\end{table}
\subsection{Performance impact of GPU-based parallelism}
The results also show that the parallel efficiency decreases with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup declines as the number of jobs increases. For instance, for a pool size of 262144, the acceleration factor obtained with 200 jobs is 13.4, while it is 40.50 for the instances with 20 jobs. This behavior is mainly due to the overhead induced by the transfer of the pool of resulting sub-problems between the CPU and the GPU. For example, for the instances with 200 jobs, the size of the pool to exchange between the CPU and the GPU is ten times bigger than for the instances with 20 jobs.
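As a rough check of this factor, assume (a simplification introduced here) that each sub-problem is stored as a permutation of the $n$ jobs, so that a pool of $P$ sub-problems occupies a number of elements proportional to $P \times n$. The symbol $S(P,n)$ for this pool size is used for illustration only:
\[
\frac{S(P,200)}{S(P,20)} = \frac{P \times 200}{P \times 20} = 10 .
\]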
-\begin{table*}
+\begin{table}
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
% \hline
% \hline
\end{tabular}
-\vspace{0.3cm}
\caption{Speedups for different problem instances and pool sizes with the GPU-PTE-BB approach.}
\label{ch8:ParaGPU1}
-\end{table*}
+\end{table}
The results obtained with the GPU-PEB-BB approach (see Table \ref{ch8:ParaGPU2}) show that evaluating in parallel the bounds of a selected pool significantly speeds up the execution of the B\&B. Indeed, an acceleration factor of up to 71.69 is obtained for the 200 $\times$ 20 problem instances using a pool of 262144 sub-problems. The results also show that the parallel efficiency grows with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup grows with the number of jobs. For instance, for a pool size of 262144, the acceleration factor obtained with 200 jobs (71.69) is almost double the one obtained with 20 jobs (38.40).
As far as pool size tuning is concerned, we notice that this parameter depends strongly on the problem instance being solved. Indeed, while the best acceleration is obtained with a pool size of 8192 sub-problems for the instances 50 $\times$ 20 and 20 $\times$ 20, the best speedups are obtained with a pool size of 262144 sub-problems for the instances 200 $\times$ 20 and 100 $\times$ 20.\\
-\begin{table*}
+\begin{table}
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
% \hline
% \hline
\end{tabular}
-\vspace{0.3cm}
\caption{Speedups for different problem instances and pool sizes with the GPU-PEB-BB approach.}
\label{ch8:ParaGPU2}
-\end{table*}
+\end{table}
Compared to the parallel tree exploration-based GPU-accelerated B\&B approach, the parallel evaluation of bounds approach is by far more efficient whatever the instance. For example, while the GPU-PEB-BB approach reaches a speedup of $\times$71.69 for the instance with 200 jobs on 20 machines, a speedup of $\times$13.4 is measured with the parallel tree exploration-based approach, which corresponds to an acceleration of $\times$5.56. Moreover, contrary to the GPU-PEB-BB approach, in the GPU-PTE-BB approach the speedups decrease when the problem instance becomes larger. Recall that in the GPU-PEB-BB approach all threads evaluate only one node each, whatever the permutation size, whereas in the GPU-PTE-BB approach each thread branches all the children of its assigned parent node. Therefore, the bigger the permutation, the bigger the amount of work performed by each thread and the bigger the difference between the workloads. Indeed, let us suppose that for the instance with $200$ jobs, the thread $0$ handles a node from the level $2$ of the tree and the thread $100$ handles a node from the level $170$ of the tree. In this case, the thread $0$ generates and evaluates $198$ nodes while the thread $100$ decomposes and bounds only $30$ nodes. The problem in this example is that the kernel execution lasts until the thread $0$ finishes its work, while the other threads may have already completed theirs and remain idle.
The objective of this section is to demonstrate that the thread divergence reduction mechanisms we propose have an impact on the performance of the GPU-accelerated B\&B and to evaluate how significant this impact is.
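The imbalance in this example can be quantified with a rough estimate, assuming (as a simplification not stated above) that the work of a thread is proportional to the number of children it generates and bounds. With $n$ the number of jobs and $d$ the level of the parent node, the quantity $c(n,d)$, a notation introduced here for illustration only, denotes that number of children:
\[
c(n,d) = n - d, \qquad c(200,2) = 198, \qquad c(200,170) = 30, \qquad \frac{c(200,2)}{c(200,170)} = 6.6 .
\]
Under this assumption, the thread $100$ would be busy for only about $30/198 \approx 15\%$ of the kernel duration.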
In the following, the reported results are obtained with the GPU-accelerated B\&B based on the parallel evaluation of bounds.
-\begin{table*}[!h]
+\begin{table}[!h]
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
% \hline
% \hline
\end{tabular}
-\vspace{0.3cm}
\caption{Speedups for different instances and pool sizes using thread divergence management.}
\label{ch8:ParaDivergence}
-\end{table*}
+\end{table}
Table~\ref{ch8:ParaDivergence} shows the experimental results obtained using the sorting process and the refactoring approach presented in Section \ref{ch8:ThreadDivergence}. Results show that the proposed optimizations improve on the GPU acceleration reported in Table~\ref{ch8:ParaGPU2}, which was obtained without thread divergence reduction. For example, for the instances of 200 jobs over 20 machines and a pool size of 262144, the average reported speedup is 77.46, while the average acceleration factor obtained without thread divergence management for the same instances and the same pool size is 71.69, which corresponds to an improvement of 7.68\%. Such a considerable but not outstanding improvement was to be expected, as claimed in \cite{ch8:Han}, since the factorized part of the branches in the FSP lower bound is very small.
Table~\ref{ch8:PTM-on-SM} reports the speedups obtained for the first experimented scenario, where only the matrix $PTM$ is put in shared memory. Results show that the speedup grows on average with the pool size, in the same way as in Table~\ref{ch8:ParaDivergence}. For the largest problem instance and pool size, putting the $PTM$ matrix in shared memory improves the speedups by up to $14\%$ compared to those obtained when $PTM$ is in global memory, reaching an acceleration of $\times 90.51$ for the problem instances $200 \times 20$ and a pool size of $262144$ sub-problems.
-\begin{table*}
+\begin{table}
\centering
\footnotesize
\begin{tabular}{|r|r|r|r|r|r|r|r|}
% \hline
% \hline
\end{tabular}
-\vspace{0.3cm}
- \caption{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ is placed in shared memory and all others are placed in global memory.}
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ is placed in shared memory and all others are placed in global memory.}
\label{ch8:PTM-on-SM}
-\end{table*}
+\end{table}
Table~\ref{ch8:JM-on-SM} reports the behavior of the speedup averaged over the different problem instances (sizes) as a function of the pool size for the scenario where the Johnson's matrix is put in shared memory. Results show that putting the $JM$ matrix in shared memory improves performance more than the first scenario, where $PTM$ is put in shared memory. Indeed, according to Table~\ref{ch8:tabMemComplex}, matrix $JM$ is accessed more frequently than matrix $PTM$. Putting the $JM$ matrix in shared memory allows accelerations of up to $\times 97.83$ for the problem instances $200 \times 20$.
-\begin{table*}
+\begin{table}
\centering
\footnotesize
\begin{tabular}{|r|r|r|r|r|r|r|r|}
% \hline
% \hline
\end{tabular}
-\vspace{0.3cm}
- \caption{Speedup for different FSP instances and pool sizes obtained with data access optimization.
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization.
$JM$ is placed in shared memory and all others are placed in global memory.}
\label{ch8:JM-on-SM}
-\end{table*}
+\end{table}
Table~\ref{ch8:JM-PTM-on-SM} reports the behavior of the average speedup for the different problem instances (sizes) with $20$ machines for the data placement scenario where both $PTM$ and $JM$ are put in shared memory. According to this table, scenario~(3) ($PTM$ and $JM$ together in shared memory) is clearly better than scenarios~(1) and~(2) (respectively $PTM$ alone and $JM$ alone in shared memory), whatever the problem instance (size).
-\begin{table*}
+\begin{table}
\centering
\footnotesize
\begin{tabular}{|r|r|r|r|r|r|r|r|}
% \hline
% \hline
\end{tabular}
-\vspace{0.3cm}
- \caption{Speedup for different FSP instances and pool sizes obtained with data access optimization.
-$PTM$ and $JM$ are placed together in shared memory and all others are placed in global memory.}
+ \caption[Speedup for different FSP instances and pool sizes obtained with data access optimization.]{Speedup for different FSP instances and pool sizes obtained with data access optimization. $PTM$ and $JM$ are placed together in shared memory and all others are placed in global memory.}
\label{ch8:JM-PTM-on-SM}
-\end{table*}
+\end{table}
From a careful analysis of each data placement scenario on the memory hierarchy of the GPU, the recommendation is to put both the Johnson's and the processing time matrices ($JM$ and $PTM$) in shared memory if they fit together. Otherwise, the whole or a part of the Johnson's matrix should be placed in shared memory in priority. The other data structures are mapped to the global memory.
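This recommendation can be summarized as a simple rule. The symbols $|JM|$ and $|PTM|$ (sizes in bytes of the two matrices) and $C$ (shared memory capacity available to a thread block) are notations introduced here for illustration only:
\[
\mbox{shared memory content} = \left\{
\begin{array}{ll}
JM \mbox{ and } PTM & \mbox{if } |JM| + |PTM| \leq C, \\
JM \mbox{ (whole or part)} & \mbox{otherwise.}
\end{array}
\right.
\]
On Fermi devices, $C$ is at most 48KB per multiprocessor, depending on how the on-chip memory is split between L1 cache and shared memory.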