From: couturie
Date: Fri, 2 Aug 2013 19:13:52 +0000 (+0200)
Subject: new
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/cdcc8368664a64aed4169db27a06df9d0c9d6527?ds=inline;hp=30d1d29de747eecaa732000749e6aaf46ed2a5a3

new
---

diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex
index 3caa749..f79178a 100644
--- a/BookGPU/Chapters/chapter8/ch8.tex
+++ b/BookGPU/Chapters/chapter8/ch8.tex
@@ -455,10 +455,10 @@ To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2,
This section discusses how best to map the six data structures identified above on the various kinds of memories of the GPU device.\\
-The focus is put on the shared memory which is a key enabler for many high-performance CUDA applications. Indeed, because it is on-chip, shared memory has much higher bandwidth and lower latency than local and global memory. However, for large problem instances (large $n$ and $m$) the data structures especially JM and LM (see Table \ref{ch8:tabMemSizes}), do not fit in the shared memory for some GPU configurations. \\
+The focus is put on the shared memory, which is a key enabler for many high-performance CUDA applications. Indeed, because it is on-chip, shared memory has much higher bandwidth and lower latency than local and global memory. However, for large problem instances (large $n$ and $m$) the data structures, especially JM and LM (see Table \ref{ch8:tabMemSizes}), do not fit in the shared memory of some GPU configurations. \\
-In order to achieve further performances, we also take care of adequately use the global memory by judiciously configuring the L1 cache which greatly enables improving performance over direct access to global memory. Indeed, the GPU device we are using in our experiments is based on the NVIDIA Fermi architecture which introduced two new hierarchies of memories (L1 $/$ L2 cache)
+In order to achieve further performance gains, we also take care to use the global memory adequately by judiciously configuring the L1 cache, which greatly improves performance over direct access to global memory. Indeed, the GPU device we are using in our experiments is based on the NVIDIA Fermi architecture, which introduced two new levels of memory hierarchy (L1/L2 cache)
compared to older architectures.
\begin{table}
\centering
\footnotesize
\begin{tabular}{|r|r|r|r|r|r|}
\hline
-Prob. instance & JM & LM & PTM & RM, QM & MM \\
+Prob.
& \raisebox{-1.5ex}{JM} & \raisebox{-1.5ex}{LM} & \raisebox{-1.5ex}{PTM} & \raisebox{-1.5ex}{RM, QM} & \raisebox{-1.5ex}{MM} \\
+instance & & & & & \\
\hline
\hline
-$200 \times 20$ & 38.000 (38KB) & 38.000 (76KB) & 4.000 (4KB) & 20 (0.04KB) & 380 (0.76KB) \\
+$200 \times 20$ & 38,000 (38KB) & 38,000 (76KB) & 4,000 (4KB) & 20 (0.04KB) & 380 (0.76KB) \\
\hline
-$100 \times 20$ & 19.000 (19KB) & 19.000 (38KB) & 2.000 (2KB) & 20 (0.04KB) & 380 (0.76KB) \\
+$100 \times 20$ & 19,000 (19KB) & 19,000 (38KB) & 2,000 (2KB) & 20 (0.04KB) & 380 (0.76KB) \\
\hline
-$50 \times 20$ & 9.500 (9.5KB) & 9.500 (19KB) & 1.000 (1KB) & 20 (0.04KB) & 380 (0.76KB) \\
+$50 \times 20$ & 9,500 (9.5KB) & 9,500 (19KB) & 1,000 (1KB) & 20 (0.04KB) & 380 (0.76KB) \\
\hline
-$20 \times 20$ & 3.800 (3.8KB) & 3.800 (7.6KB) & 400 (0.4KB) & 20 (0.04KB) & 380 (0.76KB) \\
+$20 \times 20$ & 3,800 (3.8KB) & 3,800 (7.6KB) & 400 (0.4KB) & 20 (0.04KB) & 380 (0.76KB) \\
\hline
\end{tabular}
-\caption[The sizes of each data structure for the different experimented problem instances.]{The sizes of each data structure for the different experimented problem instances. The sizes are given in number of elements and in bytes (between brackets).}
+\caption[The sizes of each data structure for the different experimented problem instances.]{The sizes of each data structure for the different experimented problem instances. The sizes are given in number of elements and in bytes (between parentheses).}
\label{ch8:tabMemSizes}
\end{table}
-Taking into consideration the sizes of each data structure presented in Table \ref{ch8:tabMemSizes}, our challenge is to find which data structure has to be mapped on which memory and in some cases how to split the data structures on different memories and efficiently manage their accesses. The sizes in bytes reported in Table \ref{ch8:tabMemSizes}, are computed knowing that in our implementation the elements of $JM$ and $PTM$ are unsigned chars (one byte) and that the elements of $LM$, $RM$, $QM$ and $MM$ are unsigned short ints (2 bytes). It is important here to highlight that the types of the data of the used matrices impact the size of each matrix. For instance, a matrix of $100$ integers has a size of $400$ octets while the same matrix with $100$ unsigned chars has a size of $100$ octets. In order to minimize the size of each of the used matrices, we analyzed the ranges of their values and defined their data types accordingly. For instance, in PTM all the processing times have positive values varying between $0$ and $100$. Therefore, we defined PTM as a matrix of \verb|unsigned char| having values in the range $[0, 255]$. Using the \verb|unsigned char| type instead of the integer type allows us to reduce by $4$ times the memory space occupied by PTM.\\
+Taking into consideration the sizes of each data structure presented in Table \ref{ch8:tabMemSizes}, our challenge is to find which data structure has to be mapped onto which memory and, in some cases, how to split the data structures onto different memories and efficiently manage their accesses. The sizes in bytes reported in Table \ref{ch8:tabMemSizes} are computed knowing that in our implementation the elements of $JM$ and $PTM$ are unsigned chars (one byte) and that the elements of $LM$, $RM$, $QM$, and $MM$ are unsigned short ints (2 bytes). It is important here to highlight that the data types of the matrices impact the size of each matrix.
For instance, a matrix of $100$ integers has a size of $400$ bytes while the same matrix with $100$ unsigned chars has a size of $100$ bytes. In order to minimize the size of each of the used matrices, we analyzed the ranges of their values and defined their data types accordingly. For instance, in PTM all the processing times have positive values varying between $0$ and $100$. Therefore, we defined PTM as a matrix of \verb|unsigned char| having values in the range $[0, 255]$. Using the \verb|unsigned char| type instead of the integer type allows us to reduce the memory space occupied by PTM by a factor of $4$.\\
According to Table \ref{ch8:tabMemSizes}:
\begin{itemize}
- \item The data structures $RM$, $QM$ and $MM$ are small sized matrices. Therefore, their impact on the performances is not significant whatever is the memory to which they are off-loaded. In particular, preliminary experiments proves that putting them on the shared memory would allows a very poor performance improvement.
+ \item The data structures $RM$, $QM$, and $MM$ are small-sized matrices. Therefore, their impact on the performance is not significant whatever memory they are off-loaded to. In particular, preliminary experiments prove that putting them in the shared memory yields only a very small performance improvement.
\item The $LM$ data structure is double the memory size of $JM$ but has a much lower access frequency. It is thus better to map $JM$ on the shared memory.
\item The $PTM$ has almost the same access frequency as $JM$ but requires less memory space.
\end{itemize}
@@ -497,27 +498,27 @@ According to the Table \ref{ch8:tabMemSizes} :
Consequently, the focus is put on the study of the performance impact of the placement of $JM$ and $PTM$ on the shared memory. Three placement scenarios of $JM$ and $PTM$ are studied experimentally: (1) Only $PTM$ is stored in shared memory and all others are placed in global memory; (2) Only $JM$ is stored in shared memory and all others are placed in global memory; (3) $PTM$ and $JM$ are stored together in shared memory and all others are placed in global memory. \\
-Taking profit from the configurable storage space provided in the new Fermi-based devices, the $64$ KB of local storage was spitted between the shared memory and the L1 cache according to the experimented scenario.
+Taking advantage of the configurable storage space provided by the new Fermi-based devices, the $64$ KB of local storage was split between the shared memory and the L1 cache according to the experimented scenario (a configuration sketch is given after the description of the experimental hardware in the next section).
\begin{itemize}
-\item For the scenario were the data structures are put on the shared memory the $64$ KB of available storage are split on $48$ KB for shared memory and $16$ KB for L1 cache.
-\item For the scenario where the data sets are put on global memory we used $16$ KB for shared memory and $48$ KB for L1 cache.
+\item For the scenario where the data structures are put in the shared memory, the $64$ KB of available storage are split into $48$ KB for shared memory and $16$ KB for L1 cache.
+\item For the scenario where the data sets are put in global memory, we used $16$ KB for shared memory and $48$ KB for L1 cache.
\end{itemize}
\section{Experiments}
\label{ch8:Experiments}
-In the following, we present the experimental study we have performed with the aim to evaluate the performance impact of the GPU-accelerated bounding, the techniques for reducing the thread divergence and the proposed approach for data placement on the GPU memories.
+In the following, we present the experimental study we have performed with the aim of evaluating the performance impact of the GPU-accelerated bounding, the techniques for reducing the thread divergence, and the proposed approach for data placement on the GPU memories.
\subsection{Parameter settings}
-In our experiments, we used the flow-shop instances defined by Taillard \cite{ch8:Taillard_1993}. These standard instances are often used in the literature to evaluate the performance of methods that minimize the makespan. Optimal solutions of some of these instances are still not known. These instances are divided into groups of $10$ instances. In each group, the $10$ instances are defined by the same number of jobs and the same number of machines. The groups of 10 instances have different numbers of jobs, namely $20$, $50$, $10$, $200$ and $500$, and different numbers of machines, namely $5$, $10$ and $20$. For example, there are $10$ instances with $200$ jobs and $20$ machines belonging to the same group of instances.\\
+In our experiments, we used the flow-shop instances defined by Taillard \cite{ch8:Taillard_1993}. These standard instances are often used in the literature to evaluate the performance of methods that minimize the makespan. Optimal solutions of some of these instances are still not known. These instances are divided into groups of $10$ instances. In each group, the $10$ instances are defined by the same number of jobs and the same number of machines. The groups of 10 instances have different numbers of jobs, namely, $20$, $50$, $100$, $200$, and $500$, and different numbers of machines, namely, $5$, $10$, and $20$. For example, there are $10$ instances with $200$ jobs and $20$ machines belonging to the same group of instances.\\
-In this work, we used only the instances where the number of machines is equal to $20$. Indeed, instances where the number of machines is equal to $5$ or $10$ are easy to solve. For these instances, the used bounding operator gives so good lower bounds that it is possible to solve them in few minutes using a sequential B\&B. Therefore, these instances do not require the use of a GPU.\\
+In this work, we used only the instances where the number of machines is equal to $20$. Indeed, instances where the number of machines is equal to $5$ or $10$ are easy to solve. For these instances, the used bounding operator gives such good lower bounds that it is possible to solve them in a few minutes using a sequential B\&B. Therefore, these instances do not require the use of a GPU.\\
-Our approach has been implemented using C-CUDA 4.0. The experiments have been carried out using a an Intel Xeon E5520 bi-processor coupled with a GPU device. The bi-processor is 64-bit, quad-core and has a clock speed of 2.27GHz. The GPU device is an Nvidia Tesla C2050 with 448 CUDA cores (14 multiprocessors with 32 cores each), a clock speed of 1.15GHz, a 2.8GB global memory, a 49.15KB shared memory, and a warp size of 32.
+Our approach has been implemented using C-CUDA 4.0. The experiments have been carried out using an Intel Xeon E5520 biprocessor coupled with a GPU device. The biprocessor is 64-bit, quad-core, and has a clock speed of 2.27GHz. The GPU device is an NVIDIA Tesla C2050 with 448 CUDA cores (14 multiprocessors with 32 cores each), a clock speed of 1.15GHz, a 2.8GB global memory, a 49.15KB shared memory, and a warp size of 32.
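On the Fermi-based device described above, the shared memory/L1 cache split discussed in the previous section is selected through the CUDA runtime API. The following minimal sketch is only an illustration of this configuration step, not the code of our implementation; in particular, the kernel name \verb|bound_kernel| is a placeholder for the lower-bound kernel.
\begin{verbatim}
/* Illustrative sketch: choosing the shared memory / L1 split on a
 * Fermi GPU. bound_kernel is a placeholder for the lower-bound kernel. */
#include <stdio.h>
#include <cuda_runtime.h>

__global__ void bound_kernel(void)
{
    /* evaluation of the lower bounds of the subproblems would go here */
}

int main(void)
{
    /* Check the characteristics of the device (here a Tesla C2050). */
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    printf("%s: %d multiprocessors, %zu B shared memory per block, warp %d\n",
           prop.name, prop.multiProcessorCount,
           prop.sharedMemPerBlock, prop.warpSize);

    /* Scenario with JM and/or PTM in shared memory:
       request 48 KB of shared memory and 16 KB of L1 cache. */
    cudaFuncSetCacheConfig(bound_kernel, cudaFuncCachePreferShared);

    /* Scenario with all data structures in global memory:
       request 16 KB of shared memory and 48 KB of L1 cache.
       cudaFuncSetCacheConfig(bound_kernel, cudaFuncCachePreferL1); */

    bound_kernel<<<14, 32>>>();  /* one block per multiprocessor */
    cudaDeviceSynchronize();
    return 0;
}
\end{verbatim}
With \verb|cudaFuncCachePreferShared| the Fermi device reserves $48$ KB of the $64$ KB of on-chip storage for shared memory and $16$ KB for L1 cache, while \verb|cudaFuncCachePreferL1| reverses the split, which corresponds to the two configurations used in our experiments.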
\subsection{Experimental protocol: computing the speedup}
\label{ch8:Protocol}
@@ -525,25 +526,25 @@ Our approach has been implemented using C-CUDA 4.0. The experiments have been ca
We need to compute the speedup of our approach to evaluate its performance. This speedup is obtained by comparing our GPU B\&B version to a sequential B\&B version deployed on one CPU core. However, all the instances used in our experiments are extremely hard to solve. Indeed, the resolution of each of these instances requires several months of computation on one CPU core. For example, the optimal solution of one of these instances defined by $50$ jobs and $20$ machines is obtained after $25$ days of computation using an average of $328$ CPU cores \cite{ch8:Mezmaz_2007}. \\
-Using the approach defined in \cite{ch8:Mezmaz_2007}, it is possible to obtain a random list $L$ of subproblems such as the resolution of $L$ lasts $T$ minutes with a sequential B\&B. So by initializing the pool of our sequential B\&B with the subproblems of this list $L$, we are sure that the resolution of the sequential B\&B will last $T{cpu}$ minutes such as $T{cpu}$ will be approximately equal to $T$. Therefore, it will be possible to initialize the pool of our GPU B\&B with the same list $L$ of subproblems in order to compute the speed up. Let suppose that the resolution of the GPU B\&B will last $T{gpu}$ minutes. So the speed up of our GPU algorithm will be equal to $Tcpu/Tgpu$. With this experimental protocol, the subproblems explored by the GPU and CPU B\&B versions will be exactly the same. So to find the speed up associated to an instance, we:
+Using the approach defined in \cite{ch8:Mezmaz_2007}, it is possible to obtain a random list $L$ of subproblems such that the resolution of $L$ lasts $T$ minutes with a sequential B\&B. So by initializing the pool of our sequential B\&B with the subproblems of this list $L$, we are sure that the resolution of the sequential B\&B will last $T_{cpu}$ minutes, where $T_{cpu}$ is approximately equal to $T$. Therefore, it will be possible to initialize the pool of our GPU B\&B with the same list $L$ of subproblems in order to compute the speedup. Let us suppose that the resolution of the GPU B\&B will last $T_{gpu}$ minutes. The speedup of our GPU algorithm will thus be equal to $T_{cpu}/T_{gpu}$. With this experimental protocol, the subproblems explored by the GPU and CPU B\&B versions will be exactly the same.
So to find the speedup associated with an instance, we:
\begin{itemize}
-\item compute, using the approach defined in \cite{ch8:Mezmaz_2007}, a list $L$ of subproblems such as the resolution of $L$ lasts $T$ minutes with a sequential B\&B,
-\item initialize the pool of our sequential B\&B with the subproblems of this list $L$,
+\item compute, using the approach defined in \cite{ch8:Mezmaz_2007}, a list $L$ of subproblems such that the resolution of $L$ lasts $T$ minutes with a sequential B\&B;
+\item initialize the pool of our sequential B\&B with the subproblems of this list $L$;
\item solve the subproblems of this pool with our sequential B\&B;
-\item get the sequential resolution time $T{cpu}$ and the number of explored subproblems $N{cpu}$,
-\item check that $T{cpu}$ is approximately equal to $T$,
-\item initialize the pool of our GPU B\&B with the subproblems of the list $L$,
-\item solve the subproblems of this pool with our GPU B\&B,
-\item get the GPU resolution time $T{gpu}$ and the number of explored subproblems $N{gpu}$,
-\item check that $N{gpu}$ is exactly equal to $N{cpu}$,
-\item and finally compute the speed up associated to this instance by dividing $T{cpu}$ on $T{gpu}$ (i.e. $Tcpu/Tgpu$).
+\item get the sequential resolution time $T_{cpu}$ and the number of explored subproblems $N_{cpu}$;
+\item check that $T_{cpu}$ is approximately equal to $T$;
+\item initialize the pool of our GPU B\&B with the subproblems of the list $L$;
+\item solve the subproblems of this pool with our GPU B\&B;
+\item get the GPU resolution time $T_{gpu}$ and the number of explored subproblems $N_{gpu}$;
+\item check that $N_{gpu}$ is exactly equal to $N_{cpu}$;
+\item and finally compute the speedup associated with this instance by dividing $T_{cpu}$ by $T_{gpu}$ (i.e., $T_{cpu}/T_{gpu}$).
\end{itemize}
Table \ref{ch8:instance_time} gives, for each instance according to its number of jobs and its number of machines, the resolution time used with a sequential B\&B. For example, the sequential resolution time of each instance defined with $20$ jobs and $20$ machines is approximately 10 minutes. Of course, the computation time of the lower bound of a subproblem defined with $20$ jobs and $20$ machines is on average smaller than the computation time of the lower bound of a subproblem defined with $50$ jobs and $20$ machines. Therefore, as shown in this table, the sequential resolution time increases with the size of the instance in order to be sure that the number of subproblems explored is significant for all instances.
-\begin{table}
+\begin{table}[htbp]
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering
@@ -555,7 +556,7 @@ Instance (No. of jobs x No. of machines) & 20$\times$20 & 50$\times$20 & 100$\ti
Sequential resolution time (minutes) & 10 & 50 & 150 & 300 \\
\hline
\end{tabular}
-\caption{The sequential resolution time of each instance according to its number of jobs and machines}
+\caption{The sequential resolution time of each instance according to its number of jobs and machines.}
\label{ch8:instance_time}
\end{table}
@@ -569,7 +570,7 @@ The results obtained with the GPU-PTE-BB approach (see Table \ref{ch8:ParaGPU1})
The results also show that the parallel efficiency decreases with the size of the problem instance. For a fixed number of machines (here 20 machines) and a fixed pool size, the obtained speedup declines as the number of jobs increases.
For instance, for a pool size of 262144, the acceleration factor obtained with 200 jobs is 13.4, while it is 40.50 for the instances with 20 jobs. This behavior is mainly due to the overhead induced by the transfer of the pool of resulting subproblems between the CPU and the GPU. For example, for the instances with 200 jobs, the size of the pool to exchange between the CPU and the GPU is ten times bigger than the size of the pool for the instances with 20 jobs.
-\begin{table}
+\begin{table}[htbp]
\setlength{\tabcolsep}{0.2cm}
\renewcommand{\arraystretch}{1.2}
\centering