From: couturie <couturie@extinction>
Date: Fri, 2 Aug 2013 14:11:09 +0000 (+0200)
Subject: new
X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/commitdiff_plain/30d1d29de747eecaa732000749e6aaf46ed2a5a3

new
---

diff --git a/BookGPU/Chapters/chapter8/ch8.tex b/BookGPU/Chapters/chapter8/ch8.tex
index ddd58a3..3caa749 100644
--- a/BookGPU/Chapters/chapter8/ch8.tex
+++ b/BookGPU/Chapters/chapter8/ch8.tex
@@ -161,20 +161,20 @@ The Johnson's algorithm allows the  optimal solution of FSP with two machines ($
 $$J_i \preceq J_j \Leftrightarrow \min(p_{i,1}\ ;\ p_{j,2}) \leq
 \min(p_{i,2}\ ;\ p_{j,1})$$
 
-We recall that $p_{k,l}$ designates the processing time of the job $J_k$ on the machine $M_l$. From the above rule, it follows the Johnson's theorem: \\
+We recall that $p_{k,l}$ designates the processing time of the job $J_k$ on the machine $M_l$. From the above rule, follows the Johnson's theorem: \\
 
-\textbf{Jonhson's theorem} \emph{Given $P$ an FSP with $m=2$, if $J_i\preceq J_j$ there exists an optimal schedule for $P$ in which the job $J_i$ precedes the job $J_j$.}\\
+\textbf{Jonhson's theorem} \emph{Given $P$ and FSP with $m=2$, if $J_i\preceq J_j$, there exists an optimal schedule for $P$ in which  job $J_i$ precedes job $J_j$.}\\
 
 According to Johnson's theorem, FSP with $m=2$ is solved with a time complexity of $O(n.log n)$. The optimal solution is obtained by first sorting in increasing order the jobs having a
-processing time shorter on the first machine than on the second one~; Second, sorting in decreasing order the jobs having a shorter processing time on the second machine. \\
+processing time shorter on the first machine than on the second one, and second, sorting in decreasing order the jobs having a shorter processing time on the second machine. \\
 
 
-In~\cite{ch8:JRJackson_1956} and~\cite{ch8:LGMitten_1959}, the Johnson's rule has been extended by Jackson and Mitten with lags which allowed further Lenstra {\it et al.} to propose a lower bound for FSP with $m \geq 3$. A lag~$l_j$ designates the minimum duration between the starting time of the job $J_j$ on the second machine and its finishing time on the first machine. Jackson and Mitten demonstrated that the optimal solution for FSP with $m=2$ can be obtained using the following transitive rule $\preceq$:
+In~\cite{ch8:JRJackson_1956} and~\cite{ch8:LGMitten_1959}, the Johnson's rule  extended by Jackson and Mitten with lags which  further allowed Lenstra et al. to propose a lower bound for FSP with $m \geq 3$. A lag~$l_j$ designates the minimum duration between the starting time of the job $J_j$ on the second machine and its finishing time on the first machine. Jackson and Mitten demonstrated that the optimal solution for FSP with $m=2$ can be obtained using the following transitive rule $\preceq$:
 
 $$J_i \preceq J_j \Leftrightarrow \min(p_{i,1}+l_i\ ;\ l_j+p_{j,2})
 \leq \min(l_i+p_{i,2}\ ;\ p_{j,1}+l_j)$$
 
-Based on this rule, Lenstra {\it et al.}~\cite{ch8:Lenstra_1978} have proposed the following lower bound for a subproblem associated to a partial schedule where a set {\Large $\jmath$} of jobs have to be scheduled on $m$ machines. $P_{Ja}^*(\jmath,M_k,M_l)$ represents the Jackson-Mitten optimal solution for the subproblem that consists in scheduling the set {\Large $\jmath$} of jobs on the two machines $M_k$ and~$M_l$. The term  $r_{i,k} = \sum_{l<k} p_{i,l}$ designates the starting time of the job $J_i$ on the machine $M_k$. The other term $q_{j,l} = \sum_{k>l} p_{j,k}$ refers to the latency between the finishing time of $J_j$ on $M_l$ and the finishing time of the schedule.
+Based on this rule, Lenstra et al.~\cite{ch8:Lenstra_1978} have proposed the following lower bound for a subproblem associated to a partial schedule where a set {\Large $\jmath$} of jobs have to be scheduled on $m$ machines. $P_{Ja}^*(\jmath,M_k,M_l)$ represents the Jackson-Mitten optimal solution for the subproblem that consists in scheduling the set {\Large $\jmath$} of jobs on the two machines $M_k$ and~$M_l$. The term  $r_{i,k} = \sum_{l<k} p_{i,l}$ designates the starting time of the job $J_i$ on the machine $M_k$. The other term $q_{j,l} = \sum_{k>l} p_{j,k}$ refers to the latency between the finishing time of $J_j$ on $M_l$ and the finishing time of the schedule.
 
 $$LB(\jmath)=\max\limits_{1 \leq k < l \leq m}\{P_{Ja}^*(\jmath,M_k,M_l)+\min\limits_{(i,j)\in \jmath^2, i \neq
 j}(r_{i,k}+q_{j,l}) \}$$
@@ -194,38 +194,38 @@ all the machines between~$k$~and~$l$.
 \section{GPU-accelerated B\&B based on the parallel tree exploration (GPU-PTE-BB)}
 \label{ch8:approach1} 
 
-The first approach we investigate for designing B\&B on GPUs consists in exploring in parallel the generated search tree. The idea is to divide the global search space into disjoint sub-spaces that are explored in parallel by the GPU threads. As explained in Section \ref{ch8:BB}, during the execution of a B\&B, the search space is described by a list of unexplored (pending) nodes and the best solution found so far. In the considered GPU-based scheme, a set of parent nodes is selected from this list according to their depth: deepest pending nodes are the first selected. The selected pool of nodes is off loaded to the GPU where each thread builds its own local search tree by applying the {\it branching}, {\it bounding} and {\it pruning} operators to the assigned node.\\
+The first approach we investigate for designing B\&B on GPUs consists of exploring in parallel the generated search tree. The idea is to divide the global search space into disjoint sub-spaces that are explored in parallel by the GPU threads. As explained in Section \ref{ch8:BB}, during the execution of a B\&B, the search space is described by a list of unexplored (pending) nodes and the best solution found so far. In the considered GPU-based scheme, a set of parent nodes is selected from this list according to their depth: deepest pending nodes are the first selected. The selected pool of nodes is off-loaded to the GPU where each thread builds its own local search tree by applying the {\it branching}, {\it bounding}, and {\it pruning} operators to the assigned node.\\
 
 
 \begin{figure}[h!]
 \centering
 \includegraphics[height=8cm, width=8.1cm]{Chapters/chapter8/figures/Diagram1.eps}
-\caption{The overall architecture of the parallel tree exploration-based GPU-accelerated Branch-and-Bound algorithm.} 
+\caption{The overall architecture of the parallel tree exploration-based GPU-accelerated branch-and-bound algorithm.} 
 \label{tree_approach}
 \end{figure}
 
 
-According to the CUDA threading model, each thread has a unique identifier used to determine its assigned role, assigns specific input and output positions and selects work to perform. Therefore, each node (problem) from the pending list is mapped to a thread to ensure that each sub-space of the solution space is evaluated concurrently and is disjoint from others. Figure \ref{tree_approach} illustrates the scheme of the parallel tree exploration-based GPU-accelerated B\&B.
+According to the CUDA threading model, each thread has a unique identifier used to determine its assigned role, which assigns specific input and output positions and selects work to perform. Therefore, each node (problem) from the pending list is mapped to a thread to ensure that each sub-space of the solution space is evaluated concurrently and is disjoint from others. Figure \ref{tree_approach} illustrates the scheme of the parallel tree exploration-based GPU-accelerated B\&B.
 
 \section{GPU-accelerated B\&B based on the parallel evaluation of bounds (GPU-PEB-BB) }
 \label{ch8:approach2} 
 
-In the GPU-accelerated B\&B based on the parallel evaluation of bounds, illustrated in Figure~\ref{ch8:approach}, the generation of the subproblems (elimination, selection and branching operations) to be solved is performed on CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. The pool of subproblems generated on CPU is off-loaded to the GPU device to be evaluated by a pool of threads partitioned into blocks. Each thread applies the lower bound function to one subproblem. Once the evaluation is completed, the lower bound values corresponding to the different subproblems is returned to the CPU to be used by the elimination operator to decide either to be pruned or to be decomposed. The process is iterated until the exploration is completed and the optimal solution is found. 
+In the GPU-accelerated B\&B based on the parallel evaluation of bounds, illustrated in Figure~\ref{ch8:approach}, the generation of the subproblems (elimination, selection and branching operations) to be solved is performed on CPU and the evaluation of their lower bounds (bounding operation) is executed on the GPU device. The pool of subproblems generated on CPU is off-loaded to the GPU device to be evaluated by a pool of threads partitioned into blocks. Each thread applies the lower bound function to one subproblem. Once the evaluation is completed, the lower bound values corresponding to the different subproblems are returned to the CPU to be used by the elimination operator to decide either to be pruned or to be decomposed. The process is iterated until the exploration is completed and the optimal solution is found. 
 
 
 \begin{figure}[h!]
   \begin{center}
 \includegraphics[scale=0.3]{Chapters/chapter8/figures/approach.eps}%
-\caption{The overall architecture of the GPU-accelerated Branch-and-Bound algorithm based on the parallel evaluation of bounds.}
+\caption{The overall architecture of the GPU-accelerated branch-and-bound algorithm based on the parallel evaluation of bounds.}
 \label{ch8:approach}
   \end{center}
 \end{figure}
 
 
-In both considered approaches, GPU-PEB-BB and GPU-PTE-BB, the GPU-based lower bound's implementation raises mainly two challenges. The first one is related to the ``single instruction multiple data'' (SIMD) model of the GPU and to the implementation of the LB. Indeed, although typically every GPU thread will run the identical lower bound function, the body of the lower bound can contains conditions on thread identifiers and data. This implies that different instructions are executed in some threads. In SIMD architectures like GPUs this behavior leads to the thread or branch divergence issue. This problem arises when threads of a same warp execute different data-dependent instructions. It might causes serious performance declining since computation occurs in parallel only when the same instructions are being performed. The second challenge consists in adjusting the pattern of accesses to the GPU device memory. Good placement of data over the different memory hierarchy grants programmers to further improve the throughput of many high-performance CUDA applications. For B\&B applied to FSP, threads of the same block perform concurrent accesses to the six data structures of the problem when they execute the lower bound function. These data structures have different sizes and access frequencies and should be wisely placed on the different memories of the GPUs that also have different sizes and latencies. 
+In both approaches, GPU-PEB-BB and GPU-PTE-BB, the GPU-based lower bound's implementation raises mainly two challenges. The first one is related to the SIMD model of the GPU and to the implementation of the LB. Indeed, although typically every GPU thread will run the identical lower bound function, the body of the lower bound can contain conditions on thread identifiers and data. This implies that different instructions are executed in some threads. In SIMD architectures such as GPUs this behavior leads to the thread or branch divergence issue. This problem arises when threads of a same warp execute different data-dependent instructions. It might causes serious performance declining since computation occurs in parallel only when the same instructions are being performed. The second challenge consists of adjusting the pattern of accesses to the GPU device memory. Good placement of data over the different memory hierarchy allows programmers to further improve the throughput of many high-performance CUDA applications. For B\&B applied to FSP, threads of the same block perform concurrent accesses to the six data structures of the problem when they execute the lower bound function. These data structures have different sizes and access frequencies and should be wisely placed on the different memories of the GPUs that also have different sizes and latencies. 
 
 
-In the following, we present how we dealt with the thread/branch divergence issue and maps the different data structures on the memory hierarchy of the GPU device taking into account the characteristics of the data structures and those of the different GPU memories.
+In the following, we present how we dealt with the thread/branch divergence issue and map the different data structures on the memory hierarchy of the GPU device taking into account the characteristics of the data structures and those of the different GPU memories.
 
 
 \section{Thread divergence}
@@ -233,13 +233,13 @@ In the following, we present how we dealt with the thread/branch divergence issu
 
 \subsection{The thread divergence issue}
 
-During the execution of an application on GPU, to each GPU multiprocessor is assigned one or more thread block(s) to execute. Those threads are partitioned into warps that get scheduled for execution. For each  instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{Thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\
+During the execution of an application on GPU,  one or more thread block(s) are assigned to each GPU multiprocessor to execute. Those threads are partitioned into warps that get scheduled for execution. For each  instruction of the flow, the multiprocessor selects a warp that is ready to be run. A warp executes one common instruction at a time, so full efficiency is realized when all threads of a warp agree on their execution path. In this chapter, the G80 model, in which a warp is a pool of 32 threads, is used. If threads of a warp diverge via a data-dependent conditional branch, the warp serially executes each branch path taken. Threads that are not on the taken path are disabled, and when all paths are complete, the threads converge back to the same execution path. This phenomenon is called thread/branch divergence\index{Thread divergence} and often causes serious performance degradations. Branch divergence occurs only within a warp; different warps execute independently regardless of whether they are executing common or disjointed code paths.\\
 
 
-This section discusses thread divergence issue encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely the locations of nodes in the search tree and the control flow instructions within the bounding operator. \\
+This section discusses thread divergence issues encountered when computing the bounds by GPU. The thread divergence occurs for two main reasons, namely, the locations of nodes in the search tree and the control flow instructions within the bounding operator. \\
 
 
-\textbf{Divergence related to the location of nodes}\\
+\noindent \textbf{Divergence related to the location of nodes}\\
 
 
 This divergence is related to the positions of the nodes in the B\&B search tree. Below is given an example from the source code of the used LB showing that the execution flow depends on the position of the node in the search tree. In the following piece of code, three methods are used {\it is\_leaf()}, {\it makespan()} and {\it lower\_bound()}. {\it is\_leaf()} tests if the node {\it \_node} is a leaf or an internal node. If {\it \_node} is a leaf, {\it makespan()} computes the cost of its makespan. Otherwise, {\it \_node} is an internal node and {\it lower\_bound()} computes the value of its lower bound. 
@@ -252,10 +252,10 @@ This divergence is related to the positions of the nodes in the B\&B search tree
 \end{verbatim}\\
 
 
-\textbf{Divergence related to the control flow instructions}\\
+\noindent \textbf{Divergence related to the control flow instructions}\\
 
 
-Control flow refers to the order in which the instructions, statements or function calls are executed in a program. This flow is determined by instructions such as {\it if-then-else}, {\it for}, {\it while-do}, {\it switch-case}, etc. There are a dozen of such instructions in the implementation of our bounding operator. The source code examples given below show two scenarios in which this kind of instructions is used.
+Control flow refers to the order in which the instructions, statements, or function calls are executed in a program. This flow is determined by instructions such as if-then-else, for,  while-do,  switch-case. There are a dozen of such instructions in the implementation of our bounding operator. The source code examples given below show two scenarios in which this kind of instructions is used.
 
 \begin{itemize}
 \item Example 1:\\ \vspace{-0.4cm}
@@ -274,27 +274,27 @@ Control flow refers to the order in which the instructions, statements or functi
 
 \end{itemize}
 
-In these two examples, {\it thread\_idx} is the index associated to the current thread. Let suppose that the code of Example 1 is executed by $32$ threads, {\it pool[thread\_idx].begin} is equal to $0$ for the first thread, and {\it pool[thread\_idx].begin} is not equal to $0$ for the other $31$ threads. When the first thread executes the statement {\it ``time = TimeArrival[1];''}, 
-all the other $31$ threads remain idle. Therefore, the GPU cores on which these $31$ threads are executed remain idle and can not be used during the execution of the statement {\it ``time = TimeArrival[1];``}. \\
+In these two examples, thread\_idx is the index associated to the current thread. Let suppose that the code of Example 1 is executed by $32$ threads,  pool[thread\_idx].begin is equal to $0$ for the first thread, and  pool[thread\_idx].begin is not equal to $0$ for the other $31$ threads. When the first thread executes the statement time = TimeArrival[1];, 
+all the other $31$ threads remain idle. Therefore, the GPU cores on which these $31$ threads are executed remain idle and cannot be used during the execution of the statement time = TimeArrival[1];.\\
 
 
-The same scenario occurs during the execution of Example 2. Let us suppose that the instruction is executed by $32$ threads, {\it pool[thread\_idx].begin} is equal to $100$ for the first thread, and {\it pool[thread\_idx].begin} is equal to $0$ for the other $31$ threads. When the first thread executes the loop $for$, all the other $31$ threads remain idle. \\
+The same scenario occurs during the execution of Example 2. Let us suppose that the instruction is executed by $32$ threads,  pool[thread\_idx].begin is equal to $100$ for the first thread, and pool[thread\_idx].begin is equal to $0$ for the other $31$ threads. When the first thread executes the loop $for$, all the other $31$ threads remain idle. \\
 
 
-Existing techniques for handling branch divergence either demand hardware support \cite{ch8:Fung} or require host-GPU interaction \cite{ch8:Zhang}, which incurs overhead. Some other works such as \cite{ch8:Han} intervene at the code level. They expose a branch distribution method that aims to reduce the divergent portion of a branch by factoring out structurally similar code from the branch paths. In our work, we have also opted for software-based optimizations like \cite{ch8:Han}. In fact, we figure out how to literally rewrite the branching instructions into basic ones in order to make thread execution paths uniform. We also demonstrate that we could ameliorate performances only by judiciously reordering data being assigned to each thread.
+Existing techniques for handling branch divergence either demand hardware support \cite{ch8:Fung} or require host-GPU interaction \cite{ch8:Zhang}, which incurs overhead. Some other works such as \cite{ch8:Han} intervene at the code level. They expose a branch distribution method that aims to reduce the divergent portion of a branch by factoring out structurally similar code from the branch paths. In our work, we have also opted for software-based optimizations as in \cite{ch8:Han}. In fact, we figure out how to literally rewrite the branching instructions into basic ones in order to make thread execution paths uniform. We also demonstrate that we could ameliorate performances only by judiciously reordering data being assigned to each thread.
 
 \subsection{Mechanisms for reducing branch divergence}
 
 
- \textbf{Thread-data reordering}\\
+\noindent \textbf{Thread-data reordering}\\
 
-At each iteration of our GPU-accelerated B\&B approach, several thousands of subproblems are sent to the GPU. The GPU groups the received subproblems into several warps according to their reception order. The first 32 subproblems belong to the first warp, the following 32 subproblems belong to the second warp, etc. Therefore, thread-data reordering technique sorts subproblems before sending them to the GPU. These subproblems are sorted according to their position in the B\&B tree. This sort of subproblems allows to have warps containing more homogeneous subproblems, and reduces the number of thread divergences. \\
+At each iteration of our GPU-accelerated B\&B approach, several thousands of subproblems are sent to the GPU. The GPU groups the received subproblems into several warps according to their reception order. The first 32 subproblems belong to the first warp, the following 32 subproblems belong to the second warp, etc. Therefore, thread-data reordering technique sorts subproblems before sending them to the GPU. These subproblems are sorted according to their position in the B\&B tree. These sorts of subproblems allow  warps containing more homogeneous subproblems and reduce the number of thread divergences. \\
 
 
- \textbf{Branch refactoring}\\
+\noindent \textbf{Branch refactoring}\\
 
 
-As quoted above, thread or branch divergence occurs when the kernel includes conditional instructions and loops that make the threads performing different control flows leading to their serial execution. In this chapter, we investigate the branch refactoring approach to deal with thread divergence. Branch refactoring consists in rewriting the conditional instructions so that threads of the same warp execute an uniform code avoiding their divergence. To do that, two major ``if" scenarios are studied and some optimizations are proposed accordingly. These two scenarios correspond to the conditional instructions contained in the $LB$ kernel code. In the first scenario, the conditional expression is a comparison of the content of a variable to 0. For instance, the following example extracted from the pseudo-code of the lower bound $LB$ illustrates such scenario.\\
+As stated above, thread or branch divergence occurs when the kernel includes conditional instructions and loops that make the threads performing different control flows lead to their serial execution. In this chapter, we investigate the branch refactoring approach to deal with thread divergence. Branch refactoring consists of rewriting the conditional instructions so that threads of the same warp execute a uniform code avoiding their divergence. To do that, two major if scenarios are studied and some optimizations are proposed accordingly. These two scenarios correspond to the conditional instructions contained in the LB kernel code. In the first scenario, the conditional expression is a comparison of the content of a variable to 0. For instance, the following example extracted from the pseudocode of the lower bound LB illustrates such a scenario.\\
 
 
 \begin{tabular}{l}
@@ -309,7 +309,7 @@ As quoted above, thread or branch divergence occurs when the kernel includes con
 The refactoring idea is to replace the conditional expression by two functions namely $f$ and $g$ as shown in Equation~\ref{ch8:Eq1}.\\
 
 
-The behavior of $f$ and $g$ fits the cosine trigonometric function. These functions return values between $0$ and $1$. An integer variable is used to store the result of the cosine function. Its value is $0$ or $1$ since it is rounded to $0$ if it is not equal to~$1$. In order to increase the performance the CUDA runtime math operations are used: $sinf(x)$, $expf(x)$ and so forth. Those functions are mapped directly to the hardware level~\cite{ch8:cuda}. They are faster but provide lower accuracy which does not matter in our case because the results are rounded to $int$. 
+The behavior of $f$ and $g$ fits the cosine trigonometric function. These functions return values between $0$ and $1$. An integer variable is used to store the result of the cosine function. Its value is $0$ or $1$ since it is rounded to $0$ if it is not equal to~$1$. In order to increase the performance the CUDA runtime math operations are used: sinf(x), expf(x), and so forth. Those functions are mapped directly to the hardware level~\cite{ch8:cuda}. They are faster but provide lower accuracy which does not matter in our case because the results are rounded to int. 
 
 \begin{equation}
 \begin{array}{lllllllll}
@@ -318,7 +318,7 @@ The behavior of $f$ and $g$ fits the cosine trigonometric function. These functi
     & \multicolumn{2}{l}{} & ~~~~~~\Rightarrow & \multicolumn{2}{l}{} \\
     &\multicolumn{2}{l}{\emph{else}}    $a = c[1];$ &     &\multicolumn{2}{l}{\emph{else}} $a = 0 \times b[1] + c[1];$ \\\\
     & \multicolumn{6}{l}{\Rightarrow a = f(x) \times b[1] + g(x) \times c[1];}\\ \\
-    &\multicolumn{6}{l}{\emph{where:}}\\
+    &\multicolumn{6}{l}{\emph{where}}\\
     &&\multicolumn{6}{l}{ f(x)=\left\{
                     \begin{array}{lll}
                         f(x) = 0    & if &x = 0\\
@@ -337,7 +337,7 @@ The behavior of $f$ and $g$ fits the cosine trigonometric function. These functi
 \end{equation}\\
 
 
-The throughput of $sinf(x)$, $cosf(x)$, $expf(x)$ is one operation per clock cycle~\cite{ch8:cuda}. The refactoring result for the ``if" pseudo-code given above is the following:
+The throughput of sinf(x), cosf(x), expf(x) is one operation per clock cycle~\cite{ch8:cuda}. The refactoring result for the if pseudocode given above is the following:
 
 
 \begin{tabular}{l}
@@ -349,7 +349,7 @@ The throughput of $sinf(x)$, $cosf(x)$, $expf(x)$ is one operation per clock cyc
 \end{tabular}
 
 
-The second "if" scenario considered in our study compares two values between themselves as shown in Equation~\ref{ch8:Eq2}.
+The second if scenario considered in our study compares two values between themselves as shown in Equation~\ref{ch8:Eq2}.
 
 
 \begin{equation}
@@ -377,7 +377,7 @@ The second "if" scenario considered in our study compares two values between the
 \end{equation}
 
 
-For instance, the following example extracted from the pseudo-code of the lower bound $LB$ illustrates such scenario.\\
+For instance, the following example extracted from the pseudocode of the lower bound $LB$ illustrates such a scenario.\\
 
 
 \footnotesize
@@ -388,7 +388,7 @@ For instance, the following example extracted from the pseudo-code of the lower
 \normalsize \\
 
 
-The same transformations as those applied for the first scenario are applied here using the exponential function. Recall that the exponential is a positive function which is equal to $1$ when applied to $0$. Thus, if $x$ is greater than $y$ then $expf(x-y-1)$ returns a value between $0$ and $1$. If the result is rounded to an integer value $0$ will be obtained. Now, if $x$ is less than $y$ then $expf(x-y-1)$ returns a value greater than $1$ and since the minimum between $1$ and the exponential is get, the returned result would be $1$. Such behavior satisfies exactly our prerequisites. The above ``if" instruction pseudo-code is now equivalent to:
+The same transformations as those applied for the first scenario are applied here using the exponential function. Recall that the exponential is a positive function which is equal to $1$ when applied to $0$. Thus, if $x$ is greater than $y$ then expf$(x-y-1)$ returns a value between $0$ and $1$. If the result is rounded to an integer value $0$ will be obtained. Now, if $x$ is less than $y$ then expf$(x-y-1)$ returns a value greater than $1$ and since the minimum between $1$ and the exponential is get, the returned result would be $1$. Such behavior satisfies exactly our prerequisites. The above if instruction pseudocode is now equivalent to
 
 
 \small
@@ -402,10 +402,10 @@ The same transformations as those applied for the first scenario are applied her
 \section{Memory access optimization}
 \label{ch8:DataAccessOpt}
 
-Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory grants programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device.   \\
+Memory access optimizations \index{Memory access optimizations} are by far the most studied area for improving GPU-based application performances. Indeed, adjusting the pattern of accesses to the GPU device memory allows programmers to further improve the throughput of many high-performance CUDA applications. The goal of memory access optimizations is generally to use as much fast-access memory and as little slow-access memory as possible. This section discusses how best to set up data LB items on the various kinds of memory on the device.   \\
 
 
-CUDA enabled devices use several memory spaces, which have different characteristics in term of sizes and access latencies. These memory spaces include global memory, local memory , shared memory, texture memory , and registers. Devices of compute capability 2.0 have also an L1 $/$ L2 cache hierarchy that is used to cache local and global memory accesses.
+CUDA-enabled devices use several memory spaces, which have different characteristics in term of sizes and access latencies. These memory spaces include global memory, local memory , shared memory, texture memory , and registers. Devices of compute capability 2.0 also have an L1/L2 cache hierarchy that is used to cache local and global memory accesses.
 
 \begin{itemize}
 \item At the thread-level, each thread has its own allocated registers and a private local memory. CUDA uses this local memory for thread-private variables that do not fit in the threads registers, as well as for stack frames and register spilling. \item At the thread block-level, each thread block has a shared memory visible to all its associated threads. \item At the grid-level, all threads have access to the same global memory. Texture and constant cached memories are two other memories accessible by all threads. 
@@ -415,16 +415,16 @@ The data access optimization challenge is to find the best mapping of the data s
 
 \subsection{Complexity analysis of the memory usage of the lower bound }
 
-In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the  matrix $PTM$ of the processing times of the jobs, the matrix of lags $LM$, the Johnson's matrix $JM$, the matrix $RM$ of the earliest starting times of jobs, the matrix $QM$ of their lowest latency times and the matrix $MM$ containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent respectively the name of the data structure, its size and the number of times it is accessed.\\
+In this section, the characteristics of the data structures used by the lower bound function are studied in terms of sizes and access frequencies. For an efficient implementation of the LB, six data structures are required: the  matrix $PTM$ of the processing times of the jobs, the matrix of lags $LM$, the Johnson's matrix $JM$, the matrix $RM$ of the earliest starting times of jobs, the matrix $QM$ of their lowest latency times, and the matrix $MM$ containing the couples of machines. The complexities of the different data structures are summarized in Table~\ref{ch8:tabMemComplex} where the columns represent, respectively, the name of the data structure, its size, and the number of times it is accessed.\\
 
 
-In the $LB$ expression, the computation of the term $P_{Ja}^*(\jmath,M_k,M_l)$ requires the calculation of the lag of each remaining job to be scheduled on the couple $(M_k,M_l)$ of machines using its processing times on these machines (Johnson's rule with lags). Such computation is repeated for each couple $(M_k,M_l)$ of machines with $1 \leq k,l \leq m$ and $k<l$. To avoid the repetitive computation of the lags, they are computed once at the beginning of the algorithm and stored in the matrix $LM$. The dimension of $LM$ is $n \times \frac{m\times (m-1)}{2}$, where $n$ and $m$ are respectively the number of jobs to be scheduled and $m$ the number of machines. $LM$ is accessed $n' \times \frac{m \times (m-1)}{2}$ times, $n'$ being the number of remaining jobs to be scheduled in the subproblem for which the lower bound is being calculated. The processing times of all the jobs on all the machines are stored in the matrix $PTM$. This matrix has a dimension of $n \times m$ and is accessed $n' \times m \times (m-1)$ times.\\
+In the LB expression, the computation of the term $P_{Ja}^*(\jmath,M_k,M_l)$ requires the calculation of the lag of each remaining job to be scheduled on the couple $(M_k,M_l)$ of machines using its processing times on these machines (Johnson's rule with lags). Such computation is repeated for each couple $(M_k,M_l)$ of machines with $1 \leq k,l \leq m$ and $k<l$. To avoid the repetitive computation of the lags, they are computed once at the beginning of the algorithm and stored in the matrix $LM$. The dimension of $LM$ is $n \times \frac{m\times (m-1)}{2}$, where $n$ and $m$ are respectively the number of jobs to be scheduled and $m$ the number of machines. $LM$ is accessed $n' \times \frac{m \times (m-1)}{2}$ times, $n'$ being the number of remaining jobs to be scheduled in the subproblem for which the lower bound is being calculated. The processing times of all the jobs on all the machines are stored in the matrix $PTM$. This matrix has a dimension of $n \times m$ and is accessed $n' \times m \times (m-1)$ times.\\
 
 
 In addition, in order to avoid relaunching the Johnson's algorithm for each couple of machines and each subset of jobs, the Johnson's algorithm is computed once to find the optimal solutions on the couples of machines. These optimal solutions are then stored in the Johnson's matrix $JM$. This matrix has the same dimension as $LM$ and is accessed $n \times \frac{m \times (m-1)}{2}$ times during the computation of the lower bound. Finally, the $MM$ matrix that contains all the couples of machines has a dimension and access frequency of $m \times (m-1)$.  \\
 
 
-To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l})$ in the $LB$ expression, two matrices are defined, namely $RM$ and $QM$. They are used to store respectively the lowest starting and latency times of all the jobs on each machine. Their dimension is $m$ and are accessed $ m \times (m-1)$ times and $ \frac{m \times (m-1)}{2}$ times respectively.
+To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2, i \neq j}(r_{i,k}+q_{j,l})$ in the LB expression, two matrices are defined, namely $RM$ and $QM$. They are used to store, respectively, the lowest starting and latency times of all the jobs on each machine. Their dimension is $m$ and, are accessed $ m \times (m-1)$ times and $ \frac{m \times (m-1)}{2}$ times, respectively.
 
 \begin{table}
   \centering
@@ -446,7 +446,7 @@ To reduce the computation time cost of the term $\min\limits_{(i,j)\in \jmath^2,
    MM &  $m \times (m-1)$ & $m \times (m-1)$ \\
  \hline
 \end{tabular}
- \caption[The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses.]{The different data structures of the $LB$ algorithm and their associated complexities in memory size and numbers of accesses. The parameters $n$, $m$ and $n'$ designate respectively the total number of jobs, the total number of machines and the number of remaining jobs to be scheduled for the subproblems the lower bound is being computed.}
+ \caption[The different data structures of the LB algorithm and their associated complexities in memory size and numbers of accesses.]{The different data structures of the LB algorithm and their associated complexities in memory size and numbers of accesses. The parameters $n$, $m$, and $n'$ designate, respectively, the total number of jobs, the total number of machines and the number of remaining jobs to be scheduled for the subproblems the lower bound is being computed.}
 \label{ch8:tabMemComplex}
 \end{table}