\label{ch8:BB-FSP}
\subsection{Definition of the Flowshop Scheduling Problem}
-\label{ch8:LB-FSP}
As a case study for our GPU-based Branch-and-Bound, we considered the NP-hard and well-known problem in the scheduling theory: the "Permutation Flow-shop Scheduling Problem" (FSP).
In this work, the mono-objective case is considered. The FSP aims to find the optimal schedule of n jobs on m machines so that the overall completion time of all jobs, called {\it makespan}, is minimized.
\vspace{0.3cm}
\subsection{Lower Bound \index{Lower Bound} for the Flowshop Scheduling Problem}
-\label{ch8:LB-FSP}
The lower bounding technique provides a lower bound (LB) for each sub-problem generated by the branching operator. The more the bound is accurate, the more it allows to eliminate not promising nodes from the search tree. Therefore, the efficiency of a B\&B algorithm depends strongly on the quality of its lower bound function. In this chapter, we use the lower bound proposed by Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} for FSP, based on the Johnson's algorithm~\cite{ch8:Johnson_1954}.
The data access optimization challenge is to find the best mapping of the data structures of the application at hand (different sizes and access frequencies) and the GPU hierarchy of memories (different sizes and access latencies). For instance, of these different memory spaces, global memory is the most plentiful but the one with the highest access latency. On the contrary, shared memory is smaller in size but has much higher bandwidth and lower latency than the global memory.
\subsection{Complexity analysis of the memory usage of the Lower Bound }
-\label{ch8:MemComplex}
In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the matrix $PTM$ of the processing times of the jobs, the matrix of lags $LM$, the Johnson's matrix $JM$, the matrix $RM$ of the earliest starting times of jobs, the matrix $QM$ of their lowest latency times and the matrix $MM$ containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent respectively the name of the data structure, its size and the number of times it is accessed.
\end{table}
\subsection{Data placement pattern of the Lower Bound on GPU}
-\label{ch8:MemComplex}
This section discusses how best to map the six data structures identified above on the various kinds of memories of the GPU device.