new

[book_gpu.git] / BookGPU / Chapters / chapter9 / ch9.tex
diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex

index a78e63396984732a861dc76d756c739ca43cffc8..0ebc4464899aa36c9f29dc45822ff385b20a2adb 100644 (file)
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -6,28 +6,27 @@
  
  \chapterauthor{Nouredine Melab}{Université Lille 1, LIFL/UMR CNRS 8022, 59655-Villeneuve d'Ascq cedex, France}
  
-\chapter{Parallel GPU-accelerated Metaheuristics}
+\chapter{Parallel GPU-accelerated metaheuristics}
  \label{chapter9}
  \section{Introduction}
  This chapter presents GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics},
+metaheuristics\index{metaheuristics!parallel metaheuristics},
  challenges, and issues related to the particularities of the GPU
  architecture and a synthesis on the different implementation
  strategies used in the literature. The implementation of parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
+metaheuristics on
  GPUs is not straightforward. The traditional models used in CPUs
  must be rethought to meet the new requirements of GPU
  architectures. This chapter is organized as follows. Combinatorial
-optimization\index{Combinatorial~optimization} and  resolution
+optimization\index{combinatorial optimization} and  resolution
  methods are introduced in Section~\ref{ch8:sec:optim}. The main
  traditional parallel models used for metaheuristics are recalled
  in Section~\ref{ch8:sec:paraMeta}.
  Section~\ref{ch8:sec:challenges} highlights the  main challenges
  related to the GPU implementation of metaheuristics. A
  state-of-the-art of GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
-summarized in Section~\ref{ch8:sec:state}. In Section
-Section~\ref{ch8:sec:GPU_APIs}, the main developed GPU-based
+metaheuristics is
+summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based
  frameworks for metaheuristics are described. Finally, a case study
  is presented in Section~\ref{ch8:sec:case} and some concluding
  remarks are given in Section~\ref{ch8:conclusion}
@@ -35,9 +34,9 @@ remarks are given in Section~\ref{ch8:conclusion}
  \section{Combinatorial optimization}
  \label{ch8:sec:optim}
  
-Combinatorial optimization\index{Combinatorial~optimization} (CO) is a branch of applied and discrete mathematics.
+Combinatorial optimization (CO) is a branch of applied and discrete mathematics.
  It consists in finding optimal configuration(s) among a finite set of possible configurations
-(or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\  %(Definition~\ref{def:cops})
+(or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:  %(Definition~\ref{def:cops})
  
  
  %\begin{minipage}{0.5\linewidth}
@@ -45,7 +44,7 @@ A combinatorial problem $P=(S,f)$ can be defined by
  \begin{itemize}
  \item a set of decision variables $X$,
  \item an objective function $f$ to optimize (minimize or maximize) over the set $S$,
-\item subject to constraints on the decision variables.\\
+\item subject to constraints on the decision variables.
  \end{itemize}
  %\end{minipage}
  
@@ -56,7 +55,7 @@ for solving an optimization problem, two classes of optimization
  methods can be distinguished: \emph{exact methods} and
  \emph{approximate methods}. Exact methods allow one to reach
  optimal solution(s) of the handled optimization problem with a
-proof of its or their optimality. The most known methods of this
+proof of its or their optimality. The known methods of this
  class are the \emph{branch and bound technique}, \emph{dynamic
  programming}, \emph{constraint programming}, and \emph{A*
  algorithm}. However, optimization problems, whether practical or
@@ -73,8 +72,8 @@ these problems. Indeed, these methods allow one to reach good
  quality solutions in reasonable computation time compared to exact
  methods but with no guarantee to find optimal or even bounded
  solutions. This is due to the nature of the search process adopted
-by these approaches which consists in performing a
-search in a subset of the whole search space.\\
+by these approaches which consists of performing a
+search in a subset of the whole search space.
  
  Regarding the number of solutions considered at each iteration in
  the search process, two classes of metaheuristics can be
@@ -87,11 +86,11 @@ process starts with a single solution (generally set at random)
  and iteratively improves it by exploring its neighborhood in the
  search space. The most known methods in this class are local
  search methods that include \emph{simulated
-annealing}\index{Metaheuristics!simulated~annealing}~\cite{Kirkpatrick1983SA},
+annealing}\index{metaheuristics!simulated annealing}~\cite{Kirkpatrick1983SA},
  \emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local
-search\index{Metaheuristics!iterated local
+search\index{metaheuristics!iterated local
  search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
-neighborhood search}~\cite{HansenMladenovic1997VNS}.\\
+neighborhood search}~\cite{HansenMladenovic1997VNS}.
  
  Unlike s-metaheuristics, p-metaheuristics start with a population
  of solutions and implement an iterative process that evolves the
@@ -106,7 +105,7 @@ Optimization problems, whether real-life or academic, are more
  often NP-hard and CPU time and/or memory consuming. Metaheuristics
  allow the significant reduction of the computational time of the search
  process but remain time-consuming particularly when it comes
-dealing with large-sized problems. \\
+dealing with large-sized problems. 
  
  The use of parallelism in the design of metaheuristics is a relevant
  approach that is widely adopted by the combinatorial optimization
@@ -120,7 +119,7 @@ methods and the solving of large-sized optimization problems.
    \item Sequential processor architectures have reached their
  physical limit which prevents creating faster processors. The
  current trend of microprocessor manufacturers consists of placing
-multiple cores on a single chip. Nowadays, laptops and
+multicores on a single chip. Nowadays, laptops and
  workstations are multicore processors. In addition, the evolution
  of network technologies and the proliferation of broadband
  networks have made possible the emergence of clusters of
@@ -130,8 +129,8 @@ high-performance computing.
  \end{itemize}
  
  From the granularity of parallelism point of view, three major parallel
-models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
-\emph{iteration-level} \index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
+models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{metaheuristics!algorithmic-level parallelism},
+\emph{iteration-level}\index{metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. 
  
  \begin{figure}[h!]
  \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
@@ -142,61 +141,58 @@ models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algo
  \begin{itemize}
  
  \item{In the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} parallel model, several self-contained metaheuristics
+algorithmic-level parallel model, several self-contained metaheuristics
  are launched in parallel. The parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} may
+metaheuristics\index{metaheuristics!parallel metaheuristics} may
  start with identical or different solutions (s-metaheuristics
  case) or populations (p-metaheuristics case). Their parameter
  settings such as the size of tabu list for tabu
-search\index{Metaheuristics!tabu~search}, transition probabilities
+search\index{metaheuristics!tabu search}, transition probabilities
  for ant colonies, mutation and crossover probabilities for
  evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s may be the
+algorithm\index{metaheuristics!evolutionary algorithm}s may be the
  same or different. The parallel processes may evolve independently
  or in a cooperative manner. In cooperative parallel models, the
  algorithms exchange information related to the search during
  evolution in order to find better and more robust solutions.}
  
-\item{In the iteration-level\index{Metaheuristics!iteration-level
-parallelism} parallel model, the focus is on the parallelization
+\item{In the iteration-level parallel model, the focus is on the parallelization
  of each iteration of the metaheuristic. Indeed, metaheuristics are
  generally iterative search processes. Moreover, the most
  resource-consuming part of a metaheuristic is the evaluation of
  the generated solutions at each iteration. For s-metaheuristics
-(e.g., tabu search\index{Metaheuristics!tabu~search}, simulated
+(e.g., tabu search\index{metaheuristics!tabu search}, simulated
  annealing, variable neighborhood search), the evaluation and
  generation of the neighborhood is the most time-consuming step of
  the algorithm particularly when it comes to dealing with large
  neighborhood sets. In this parallel model, the neighborhood is
  decomposed into partitions, and each partition is evaluated in a
  parallel way. For p-metaheuristics (e.g., evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s, ant
+algorithms, ant
  colonies, swarm optimization), the
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
  parallel model arises naturally since these metaheuristics deal
  with a population of independent solutions. In evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s, for
-instance, the iteration-level\index{Metaheuristics!iteration-level
-parallelism} model consists of decomposing the whole population
+algorithms, for
+instance, the iteration-level model consists of decomposing the whole population
  into several partitions where each partition is evaluated in
  parallel.}
  
  \item{In the
-solution-level\index{Metaheuristics!solution-level~parallelism}
+solution-level
  parallel model, the focus is on the parallelization of the
  evaluation of a single solution. This model is useful when the
  objective function and/or the constraints are time and/or memory
  consuming. Unlike the two previous parallel models, the
-solution-level\index{Metaheuristics!solution-level~parallelism}
+solution-level\index{metaheuristics!solution-level parallelism}
  parallel model is problem-dependent.}
  \end{itemize}
-
+\clearpage
  \section{Challenges for the design of GPU-based metaheuristics}
  \label{ch8:sec:challenges}
  
  Developing GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
+metaheuristics\index{metaheuristics!parallel metaheuristics} is
  not straightforward. The parallel models have to be rethought to
  meet the new requirements of the GPU architecture. Several major
  issues have to be taken into account both at design and
@@ -207,7 +203,7 @@ of tasks, and data transfer between the CPU and
  GPU~\cite{luongMultiStart}.
  
  \subsection{Data placement on a hierarchical memory}
-\index{GPU Challenges!data~placement} During the execution of
+\index{GPU!data placement} During the execution of
  metaheuristics on GPU, the different threads may access multiple
  data structures from multiple memory spaces. These memories have
  different sizes and access latencies. Nevertheless, faster
@@ -230,12 +226,12 @@ threads and the corresponding metaheuristic elements (one neighbor
  per thread, one individual per thread, single population per
  threads block, one ant per thread, etc.) must be defined to ensure
  a maximum occupancy of the GPU and
-to cover CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} and memory access times.\\
+to cover CPU/GPU communication\index{GPU!CPU/GPU communication} and memory access times.
  
  According to the used metaheuristic and to the handled problem, the data
  values may have different types and different ranges of their values. The data
  types should then be chosen carefully and the ranges of the data values should
-be analyzed to minimize the amount of occupied memory space.\\
+be analyzed to minimize the amount of occupied memory space.
  
  In addition to the size and latency of GPU memory issues, the
  memory access pattern is another important issue to be dealt with
@@ -257,7 +253,7 @@ data structures (e.g., population of individuals for a CUDA thread
  block) on the shared memory.
  
  \subsection{Threads synchronization}
-\index{GPU Challenges!threads~synchronization} The thread
+\index{GPU!threads synchronization} The thread
  synchronization issue is caused by both the GPU architecture and
  the synchronization requirements of the implemented method.
  Indeed, GPUs are based on a multicore architecture organized into
@@ -271,14 +267,14 @@ GPU is achieved when launching a large amount of threads
  (thousands of threads)~\cite{CUDA}. However, the execution order
  of these thousands of threads is unknown by the programmer which
  makes the prediction of their execution order a challenging issue.
-On an other hand, the developer has to control explicitly the
+Plus, the developer has to control explicitly the
  threads through the insertion of barrier synchronizations in the
  codes to avoid concurrent accesses to data structures and to meet
  some requirements related to data-dependent synchronizations.
  
  \subsection{Thread divergence}
  
-Thread divergence\index{GPU Challenges!thread~divergence} is
+Thread divergence\index{GPU!thread divergence} is
  another challenging issue in GPU-based
  metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally,
  metaheuristics contain irregular loops and conditional
@@ -286,11 +282,11 @@ instructions when generating and evaluating the neighborhood
  (s-metaheuristics), and the population (p-metaheuristics) in the
  same block. In addition,  the decision to apply a crossover or a
  mutation on an individual in a genetic
-algorithm\index{Metaheuristics!genetic~algorithm} and the
+algorithm and the
  exploration of different paths using an ant
-colony\index{Metaheuristics!ant~colony~optimization} are random
+colony\index{metaheuristics!ant colony optimization} are random
  operations. Threads of the same warp have to execute
-simultaneously instructions leading to different branches whereas
+instructions simultaneously  leading to different branches whereas
  in an SIMD model the threads of a same warp execute the same
  instruction. Consequently, the different branches of a conditional
  instruction which is data-dependent lead to a serial execution of
@@ -298,30 +294,28 @@ the different threads degrading the performance of the application
  in terms of execution time. The challenge here is then to revisit
  the traditional irregular metaheuristic codes to eliminate these
  divergences.
-
+\clearpage
  \subsection{Task distribution and CPU/GPU communication}
  
  The performance of GPU-based metaheuristics in terms of execution
  time could be improved by choosing the most appropriate parallel
-model (algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}, instruction-level,
-solution-level\index{Metaheuristics!solution-level~parallelism}).
+model (algorithmic-level, instruction-level,
+solution-level).
  Moreover, an efficient decomposition of the metaheuristic and an
  efficient assignment of code portions between the CPU and GPU
  should be adopted. The objective is to take benefit from the GPU
  computing power without affecting the efficiency and the behavior
  of the metaheuristic and without losing performance in CPU/GPU
-communication\index{GPU Challenges!CPU/GPU~communication} and
+communication\index{GPU!CPU/GPU communication} and
  memory accesses. In order to decide which part of the
  metaheuristic will be executed on which component, one should
  perform a careful analysis on the serial code of the
  metaheuristic, identify the compute-intensive tasks (e.g.,
  exploration of the neighborhood, evaluation of individuals), and
  then offload them to the GPU, while the remaining
-tasks still run on the CPU in a serial way. \\
+tasks still run on the CPU in a serial way. 
  
-The CPU/GPU communication\index{GPU
-Challenges!CPU/GPU~communication} is done through the global
+The CPU/GPU communication is done through the global
  memory which is a slow memory making the memory transfer between
  the CPU and GPU time-consuming which can significantly degrade the
  performance of the application. Accesses to this memory should be
@@ -332,21 +326,20 @@ therefore, the whole execution time of the metaheuristic.
  \section{State-of-the-art parallel metaheuristics on GPUs}
  \label{ch8:sec:state}
  After more than two decades of research by the combinatorial optimisation
-community devoted to developing adequate parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} for different types of
+community devoted to developing adequate parallel metaheuristics for different types of
  parallel architectures (clusters, supercomputers and grids), the actual developement
-of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on SIMD architectures.\\
+of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics on SIMD architectures.
  
  The first works on metaheuristic algorithms implemented on GPUs
  started on old graphics cards before the appearance of modern GPUs
  equipped with high-level programming interfaces such as CUDA and
-OpenCL. Among these pioneering works we cite the work of Wong
-\emph{et al.}~\cite{wongOldGPU2006} dealing with the
+OpenCL. Among these pioneering works we cite the work of Wong et al.~\cite{wongOldGPU2006} dealing with the
  implementation
-of EAs on graphics processing cards and the work by Catala \emph{et al.} in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
-is implemented on old GPU architectures. Yu \emph{et al.}~\cite{yu2005} and
-Li \emph{et al.}~\cite{li2007} proposed a full parallelization of genetic
-algorithm\index{Metaheuristics!genetic~algorithm}s  on old GPU architectures using
-shader libraries based on Direct3D and OpenGL.\\
+of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{metaheuristics!ant colony optimization} algorithm
+is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
+Li et al.~\cite{li2007} proposed a full parallelization of genetic
+algorithms  on old GPU architectures using
+shader libraries based on Direct3D and OpenGL.
  
  Such architectures are based on preconfigured pipelined stages
  used to accelerate the transformation of 3D geometric primitives
@@ -384,14 +377,14 @@ Nevertheless, small neighborhoods may lead to  nonoptimal
  occupation of the CUDA threads which may lead in turn to an
  overhead due to the communication and memory latencies. Therefore,
  large neighborhoods are necessary for efficient implementation of
-local searches on GPUs.\\
+local searches on GPUs.
  
-Luong \emph{et al.}~\cite{luong2010large} proposed efficient
+Luong et al.~\cite{luong2010large} proposed efficient
  mappings for large neighborhood structures on GPUs. In this work,
  three different neighborhoods are studied and mapped to the
  hierarchical GPU for binary problems. The three neighborhoods are
-based on the \emph{Hamming} distance. The move operators used in
-the three neighborhoods consider \emph{Hamming} distances of 1, 
+based on the Hamming distance. The move operators used in
+the three neighborhoods consider Hamming distances of 1, 
  2, and 3 (this consists on flipping the binary value of one, two,
  or three positions at a time in the candidate binary vector).
  In~\cite{luong2010large}, each thread is associated to a unique
@@ -400,12 +393,12 @@ efficiently map the different neighborhoods on the device memory,
  more explicitly, how to calculate the memory index of each
  solution associated to each CUDA thread's \textit{id}.
  %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
-The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
+The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes. % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
  
-In the same context, Deevacq \emph{et al.}~\cite{audreyANT}
-proposed two parallelization strategies inspired by the multi-walk
+In the same context, Deevacq et al.~\cite{audreyANT}
+proposed two parallelization strategies inspired by the multiwalk
  parallelization strategy, of a 3-opt iterated local
-search\index{Metaheuristics!iterated local search} algorithm (ILS)
+search algorithm (ILS)
  over a CPU/GPU architecture. In the first strategy, each Local
  Search (LS) is associated to a unique CUDA thread and improves a
  unique solution by generating its neighborhood. The second
@@ -417,105 +410,98 @@ be stored on the global memory, while the exploration of a single
  solution at a time in the second strategy allows the use of the
  shared memory to store the related data structures. The two
  strategies have been experimented on standard benchmarks of
-the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.\\
+the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cities. The results indicate that increasing the number of solutions to be explored simultaneously improves the speedup in the two strategies, but at the same time it decreases final solution quality. The greater speedup factor reached by the second strategy is $6 \times$.
  
-The same strategy is used by Luong \emph{et al.}
+The same strategy is used by Luong et al.
  in~\cite{luongMultiStart} to implement multistart parallel local
  search algorithms (a special case of the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} parallel model where several homogeneous LS
+algorithmic-level parallel model where several homogeneous LS
  algorithms are used). The multistart model is combined with
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
  parallelism: several LS algorithms are managed by the CPU and the
  neighborhood evaluation step of each algorithm is parallelized on
  the GPU (each GPU thread is associated with one neighbor and
  executes the same evaluation function kernel). The advantage of
  such a model is that it allows a  high occupancy of the GPU
-threads. Nevertheless, memory management\index{GPU
-Challenges!memory~management} causes new issues due to the
+threads. Nevertheless, memory management causes new issues due to the
  quantity of data to store and to communicate between CPU     and
  GPU. A second proposition for implementing the same model on GPU
  consists of implementing the whole LS processes on GPU with each
  GPU thread being associated to a unique LS algorithm. This solves
  the communication issue encountered in the first model. In
-addition, a memory management\index{GPU
-Challenges!memory~management} strategy is proposed to improve the
+addition, a memory management strategy is proposed to improve the
  efficiency of the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} model: texture memory is used to avoid memory latency
+algorithmic-level model: texture memory is used to avoid memory latency
  due to uncoalesced memory accesses. The proposed approaches are
  implemented on the quadratic assignment problem (QAP) using CUDA.
  The acceleration rates obtained for the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} with usage of texture memory rise from $7.8\times$ to
+algorithmic-level with usage of texture memory rise from $7.8\times$ to
  $12\times$ (for different
-QAP benchmark sizes). \\
+QAP benchmark sizes). 
  
-Janiak \emph{et al.}~\cite{Janiak_et_al_2008} implemented two
+Janiak et al.~\cite{Janiak_et_al_2008} implemented two
  algorithms for TSP and the flow-shop scheduling problem (FSP).
  These algorithms are based on a multistart tabu
-search\index{Metaheuristics!tabu~search} model. Both of the two
+search model. Both of the 
  algorithms exploit multicore CPU and GPU. A full parallelization
  on GPU is adopted using shader libraries where each thread is
-mapped with one tabu search\index{Metaheuristics!tabu~search}.
+mapped with one tabu search.
  However, even though their experiments report that the use of GPU
  speedups the serial execution almost $16 \times$, the mapping of
-one thread with one tabu search\index{Metaheuristics!tabu~search}
+one thread with one tabu search
  requires a large number of local search algorithms to
-cover the memory access latency. The same mapping policy is adopted by Zhu \emph{et al.} in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
+cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.
  
-Luong \emph{et al.}~\cite{luong2012ppsn} proposed a GPU-based
+Luong et al.~\cite{luong2012ppsn} proposed a GPU-based
  implementation of hybrid metaheuristics on heterogeneous parallel
  architectures (multicore CPU  coupled to one GPU).  The challenge
  of using such a heterogeneous architecture is how to distribute
  tasks between the CPU cores and the GPU in such a way to have
  optimal performances. Among the three traditional parallel models
-(solution-level\index{Metaheuristics!solution-level~parallelism},
-iteration-level\index{Metaheuristics!iteration-level parallelism},
-and algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}), the authors pointed out that the most convenient
+(solution-level,
+iteration-level
+and algorithmic-level), the authors point out that the most convenient
  model for the considered heterogeneous architecture is a hybrid
  model combining
-iteration-level\index{Metaheuristics!iteration-level parallelism}
-and algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} models. Several CPU threads execute several instances
+iteration-level
+and algorithmic-level models. Several CPU threads execute several instances
  of the same S-metaheuristic in parallel while the GPU device is
  associated to one CPU core and  used to accelerate the
  neighborhood calculation of several S-metaheuristics at the same
  time.
   In order to efficiently exploit the remaining CPU cores, a load-balancing heuristic is also proposed in order to decide on the number of additional
-S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the fast ant colony algorithm (FANT)~\cite{taillardFant}. \\
+S-metaheuristics to launch on the remaining CPU cores relative to the efficiency of the GPU calculations. The proposed approach is applied to the QAP using several instances of the Fast Ant Colony Algorithm (FANT)~\cite{taillardFant}. 
  
  All the previously noted works  exploit the same parallel models
  used on CPUs based on the task parallelism. A different
  implementation approach is proposed by Paul in~\cite{gerald2012}
  to implement a simulated
-annealing\index{Metaheuristics!simulated~annealing} (SA) algorithm
+annealing (SA) algorithm
  for the QAP on GPUs. Indeed, the author used a preinitialized
  matrix \emph{delta} in which the incremental evaluation of simple
  swap moves are calculated and stored relative to the initial
  permutation $p$. For the GPU implementation, the authors used the
  parallel implementation of neighborhood exploration. The
  time-consuming tasks in the SA-matrix are identified by the
-authors as: updating the matrix and passing through it to select
+authors as updating the matrix and passing through it to select
  the next accepted move. To initialize the delta matrix, several
  threads from different blocks explore different segments of the
  matrix (different moves) at the same time.  In order to select the
-next accepted, swap several threads are also used. Starting from
+next accepted swap, several threads are also used. Starting from
  the last move, a group of threads explores different subsets of
  the delta matrix. The shared memory is used to preload all the
  necessary elements for a given group of threads responsible for
  the updating of the delta matrix. The main difference in this work
  compared to the previous works resides in the introduction of a
  data parallelism using the precalculated delta matrix. The use of
-this matrix allows the increasing of the number of threads
+this matrix allows the increase in the number of threads
  involved in the evaluation of a single move. Experimentations are
  done on standard QAP instances from the
  QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
  achieved by the GPU implementation compared
-to the same sequential implementation on CPU using SA-matrix.\\
+to the same sequential implementation on CPU using SA-matrix.
  
-\subsection{Implementing population-based metaheuristics on GPUs}
+\subsection[Implementing population-based metaheuristics\hfill\break on GPUs]{Implementing population-based metaheuristics on GPUs}
  
  State-of-the-art works dealing with the implementation of
  p-metaheuristics on GPUs generally rely on parallel models and
@@ -524,28 +510,28 @@ p-metaheuristics over different types of parallel architectures:
  supercomputers, clusters, and computational grids. Three main
  classes of p-metaheuristics are considered in this section:
  evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant
-colony\index{Metaheuristics!ant~colony~optimization} optimization
-(ACO\index{Metaheuristics!ant~colony~optimization}), and particle
-swarm optimization\index{Metaheuristics!particle swarm
+algorithms (EAs), ant
+colony optimization
+(ACO), and particle
+swarm optimization\index{metaheuristics!particle swarm
  optimization} (PSO).
-
-\subsubsection*{Evolutionary Algorithms}
+\clearpage
+\subsubsection*{Evolutionary algorithms}
  
  Traditional parallel models for EAs are classified in three main
-classes: coarse grain models using several sub-populations
+classes: coarse-grain models using several subpopulations
  (islands), master-slave models used for the parallelization of CPU
  intensive steps (evaluation and transformation), and cellular
  models based on the use of one population disposed (generally) on
  a toroidal grid.
  
  The three traditional models have been implemented on GPUs by several researchers for different
-optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.\\
+optimization problems. The main chalenges to be raised when implementing the traditional models on GPUs concern (1) the saturation of the GPU in order to cover memory latency by calculations, and (2) efficent usage of the hierarchical GPU  memories.
  
  In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
  of the drug discovery application Autodock (molecular docking
  application). Autodock uses a genetic
-algorithm\index{Metaheuristics!genetic~algorithm} to find optimal
+algorithm to find optimal
  docking positions of a ligand to a protein. The most
  time-consuming task in Autodock is the fitness function
  evaluation. The fitness function used for a docking problem
@@ -564,17 +550,17 @@ advantage of the per block approach resides in the use of shared
  memory instead of global memory to store all the information
  related to each individual. The obtained speedups range from  $10
  \times$ to $47 \times$ for population sizes
-ranging from $50$ to $10000$. \\
+ranging from $50$ to $10000$. 
  
-Maitre \emph{et al.}~\cite{maitre2009} also exploited the
+Maitre et al.~\cite{maitre2009} also exploited the
  master-slave parallelism of EAs on GPUs using EASEA. EASEA is a
  C-like metalanguage for easy development of EAs. The user writes a
-description  of the problem specific components (fitness function,
+description  of the problem-specific components (fitness function,
  problem representation, etc) in EASEA. The code  is then compiled
  to obtain a ready-to-use evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
+algorithm. The EASEA
  compiler  uses genetic
-algorithm\index{Metaheuristics!genetic~algorithm}LIB and EO
+algorithm LIB and EO
  Libraries to produce C++ or JAVA written EA codes.
  In~\cite{maitre2009}, the authors  proposed an extension of EASEA
  to produce CUDA code from the EASEA files. This extension has been
@@ -585,24 +571,24 @@ during the experiments: a benchmark mathematical function and a
  real problem (molecular structure prediction). In order to
  maximize the GPU occupation, very large populations are used (from
  $2000$ to $20000$). Even though transferring such large
-populations from the CPU to the GPU device memory at every generation is very costly, the authors reported important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.
  
  The master-slave  model is efficient when the fitness function is
  highly time intensive. Nevertheless, it requires the use of
  large-sized populations in order to saturate the GPU unless the
  per-block is used (one individual perblock) when the acceleration
  of the fitness function itself is possible. The use of many
-sub-populations of medium sizes is another way to obtain a maximum
+subpopulations of medium sizes is another way to obtain a maximum
  occupation of the GPU. This is coarse-grained parallelism (island
-model).\\
+model).
  
-The coarse grained model is  used by Pospichal \emph{et al.}
+The coarse-grained model is  used by Pospichal et al.
  in~\cite{pospichal10} to implement a parallel genetic
-algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
+algorithm on GPU. In this
  work the entire genetic
-algorithm\index{Metaheuristics!genetic~algorithm} is implemented
+algorithm is implemented
  on GPU. This choice is motivated by the overhead engendered by the
-CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication}
+CPU/GPU communication
  when only population evaluation is performed on GPU. Each
  population island is mapped with a CUDA thread block and each
  thread is responsible for a unique individual. Subpopulations are
@@ -610,11 +596,11 @@ stored on shared memory of each block. Nevertheless, because
  interblock communications are not possible on the CUDA
  architecture, the islands evolve independently in each block, and
  migrations are performed
-asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.\\
+asynchronously through the global memory. That is, after a given number of generations, selected individuals for migration from each island are copied to the GPU global memory part of the neighbor island and then to its shared memory to replace the worst individuals in the local population. The experiments are performed on three benchmark mathematical functions. During these experiments, the island sizes are varied from $2$ to $256$ individuals and island numbers from $1$ to $1024$. The maximum performance is achieved for high number of islands and increasing population sizes.
  
  The same strategy is also adopted by Tsutsui and Fujimoto
  in~\cite{tsutsuiGAQAP}  to implement a coarse-grained genetic
-algorithm\index{Metaheuristics!genetic~algorithm} on GPU to solve
+algorithm on GPU to solve
  the QAP. Initially, several subpopulations are created on CPU and
  transferred to the global memory. The subpopulations are organized
  in the global memory into blocks of $8$ individuals in such a way
@@ -627,24 +613,23 @@ through the global memory. Experiments are performed on standard
  QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
  implementation reached speedups of $2.9\times$ to $12.6 \times$
  compared to a single core implementation of a coarse-grained
-genetic algorithm\index{Metaheuristics!genetic~algorithm} on a
-Intel i7 processor.\\
+genetic algorithm on a
+Intel i7 processor.
  
-Nowotniak \emph{et al.}~\cite{nowotniak} proposed a GPU-based
-implementation of a quantum inspired genetic
-algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The used
-parallel model is a hierarchical model based on two levels: each
+Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
+implementation of a Quantum Inspired Genetic Algorithm (QIGA). The 
+parallel model used is a hierarchical model based on two levels: each
  thread in a block transforms a unique individual and a different
  population is assigned to each block. The algorithm is run
  entirely on GPU. A different instance of the QIGA is run on each
  block and the computations have been shared between 8 GPUs. This
  approach is very convenient to speed up the experimental process
  on metaheuristics that require several independent runs of the
-same algorithm in order to asses statistical efficiency. The
+same algorithm in order to assess statistical efficiency. The
  populations are stored in the shared memory, the data matrix used
  for fitness evaluation is placed in read only constant memory, and
  finally seeds for random numbers generated on the GPU are stored
-in the global memory.\\
+in the global memory.
  
  In coarse-grained parallelism, the use of the per-block approach
  to implement the islands (one subpopulation per thread block) is
@@ -652,11 +637,11 @@ almost natural  and it allows the use of shared memory to store
  the subpopulations. Fine-grained parallel models for EAs (cellular
  EAs) have been used by many authors to implement parallel EAs on
  GPUs. Indeed, the fine-grained parallelism of EAs fits perfectly
-to the SIMD architecture of the GPU. \\
+to the SIMD architecture of the GPU. 
  
-Pinel \emph{et al.} in~\cite{pinel2012JPDC} developed a highly
+Pinel et al. in~\cite{pinel2012JPDC} developed a highly
  parallel synchronous cellular genetic
-algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
+algorithm (CGA), called
  GraphCell, to solve the independent task scheduling problem on GPU
  architectures. In CGAs, the population is arranged into a
  two-dimensional toroidal grid where only neighboring solutions are
@@ -678,7 +663,7 @@ threads used to accelerate the recombination operators especially
  when dealing with large instances of the problem. In addition to
  the recombination operators, the rest of the CGA steps are  also
  parallelized on GPU (fitness evaluation, mutation, and
-replacement).\\
+replacement).
  
  A similar work is proposed by Vidal and Alba  in~\cite{albaCGAGPU}
  where a  CGA using a toroidal grid is completely implemented on
@@ -693,31 +678,31 @@ performed on a set of standard benchmark functions with different
  grid sizes ranging from $32^2$ to $512^2$. The speedups reached by
  the GPU implementation against the CPU version range from
  $5\times$ to $24\times$ and increase as the size of the population
-increases. However, the CPU implementation run faster than the GPU
+increases. However, the CPU implementation runs faster than the GPU
  version for all the tested  benchmarks when the size of the
  population is set to $32^2$. When the size of the population is
  too small, there are not enough computations to cover the overhead
  created by the call of kernel functions, CPU/GPU
-communication\index{GPU Challenges!CPU/GPU~communication}s,
+communications,
  synchronization, and access to global memory. Finally, an
  interesting review on GPU parallel computation in bio-inspired
-algorithms is proposed by Arenas \emph{et al.}
-in~\cite{arenas2011}. \\
+algorithms is proposed by Arenas et al.
+in~\cite{arenas2011}. 
  
-\subsubsection*{Ant Colony Optimization}
+\subsubsection*{Ant colony optimization}
  
  Ant colony optimization
-(ACO\index{Metaheuristics!ant~colony~optimization}) is another
+(ACO) is another
  p-metaheuristic subject to parallelization on GPUs.
  State-of-the-art works on parallelizing
-ACO\index{Metaheuristics!ant~colony~optimization} focus on
+ACO focus on
  accelerating the tour construction step performed by each ant by
  taking a task-based parallelism approach, with pheromone
-deposition on the CPU.\\
+deposition on the CPU.
  
-In~\cite{cecilia}, Cecilia \emph{et al.}  present a GPU-based
+In~\cite{cecilia}, Cecilia et al.  present a GPU-based
  implementation of
-ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
+ACO for TSP where
  the two steps (tour construction and pheromone update) are
  parallelized on the GPU. A data parallelism approach is used to
  enhance the performance of the tour  construction step. The
@@ -735,32 +720,31 @@ update, several GPU techniques are also used to increase the data
  bandwidth of the application mainly by the use of precalculated
  matrices that are easily updated by several threads (one thread
  per matrix entry). The achieved speedups are $21 \times$ for tour
-construction and  $20 \times$ for pheromone updates.\\
+construction and  $20 \times$ for pheromone updates.
  
-In another work, Tsutsui \emph{et al.}~\cite{tsutsui} propose  a
+In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose  a
  hybrid algorithm combining
-ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic
-and tabu search (TS)\index{Metaheuristics!tabu~search} implemented
+ACO metaheuristic
+and Tabu Search (TS) implemented
  on GPU to solve the QAP. A solution of QAP is represented as a
  permutation of ${1,2,..,n}$ with $n$ being the size of the
  problem. The TS algorithm is based on the 2-opt neighborhood
  (swapping of two elements $(i,j)$ in the permutation). The authors
  point out that the move cost of each neighbor depends on the
  couple $(i,j)$. Two groups of moves are formed according to the
-move cost. In order to avoid thread divergence\index{GPU
-Challenges!thread~divergence} within the same warp, the
+move cost. In order to avoid thread divergence\index{GPU!thread divergence} within the same warp, the
  neighborhood evaluation is parallelized in such a way to assign
  only moves of the same cost to each thread warp. This strategy is
  called MATA for Move-cost Adjusted Thread Assignment. Concerning
-the memory management\index{GPU Challenges!memory~management}, all
-the data of the ACO\index{Metaheuristics! ant~colony~optimization}
+the memory management\index{GPU!memory management}, all
+the data of the ACO\index{metaheuristics!ant colony optimization}
  (population, pheromone matrix), QAP matrices, and tabu list are
  placed on the global memory of the GPU. Shared memory is used only
  for working data common to all threads in a given block.
   All the
  steps of the hybrid algorithm
-ACO\index{Metaheuristics!ant~colony~optimization}-TS
-(ACO\index{Metaheuristics!ant~colony~optimization} initialization,
+ACO-TS
+(ACO initialization,
  pheromone update, construct solutions, applying TS) are
  implemented as kernel functions on the GPU. The GPU/CPU
  communications are only used to transfer the best-so-far solution
@@ -769,9 +753,9 @@ performed on standard QAP benchmarks showed that the GPU
  implementation using MATA obtained a  speedup of $19 \times$
  compared to the CPU implementation,  compared with a speedup of
  only $5 \times$
-when MATA is not used. \\
+when MATA is not used. 
  
-\subsubsection*{Particle Swarm Optimization}
+\subsubsection*{Particle swarm optimization}
  In~\cite{zhou2009} Zhou and Tan propose a full GPU implementation
  of a standard PSO algorithm. All the data is stored in global
  memory (velocities, positions, swarm population, etc). Only
@@ -792,7 +776,7 @@ best of each particle), the GPU threads are mapped to the \emph{N}
  particles of the swarm one to one. Experiments done on four
  benchmark functions show speedups ranging from $3.7 \times$ to
  $9.0 \times$ for swarm sizes
-ranging from $400$ to $2800$.\\
+ranging from $400$ to $2800$.
  
  \subsection{Synthesis of the implementation strategies}
  \label{ch8:sec:synthesis} After reviewing some works dealing with
@@ -806,7 +790,7 @@ Indeed, even though the parallelization models used in most works
  for GPUs are derived from the traditional parallel models of each
  metaheuristic (on CPU), their implementation could take a
  different way and sometimes it may result in new parallel models
-customized for GPUs.\\
+customized for GPUs.
  
  Traditional parallel models for metaheuristics are based on an
  intuitive task parallelism: the independent tasks of the
@@ -818,18 +802,17 @@ the evaluation step. Nevertheless, because of the particularity of
  the GPU architecture, some authors have used new implementation
  techniques to enhance the data parallelism in the sequential
  algorithms in order to increase the data throughput of the
-application.\\
+application.
  
  From this observation we propose the following classification
  based on 2 levels: design level and implementation level as
  illustrated in Figure~\ref{ch8:fig:classification}. The design
  level regroups the three classes of parallel models used in
  metaheuristics
-(solution-level\index{Metaheuristics!solution-level~parallelism},
-iteration-level\index{Metaheuristics!iteration-level parallelism},
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}) with examples for s-metaheuristics, EAs,
-ACO\index{Metaheuristics!ant~colony~optimization} and PCO. This
+(solution-level,
+iteration-level,
+algorithmic-level) with examples for s-metaheuristics, EAs,
+ACO and PCO. This
  classification  is principally built from the reviewed
  state-of-the-art works  in the previous section. The
  implementation level refers to the way these parallel design
@@ -837,7 +820,7 @@ models are implemented on GPU. This classification focuses only on
  the mapping strategies between the GPU threads and the
  parallelized tasks (neighborhood evaluation, solution
  construction, and so on). The different implementation strategies
-are explained in the following sections.\\
+are explained in the following sections.
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=1\textwidth]{Chapters/chapter9/figures/classification.pdf}}
@@ -848,7 +831,7 @@ are explained in the following sections.\\
  
  
  \subsubsection*{GPU thread mapping for solution-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping} Parallel
+\index{GPU!thread mapping} Parallel
  models at solution level consist of parallelizing a time intensive
  atomic task of the algorithm. Generally, it consists of the
  fitness evaluation~\cite{kannan}. Nevertheless, crossover
@@ -860,18 +843,18 @@ associated to a block of threads. A second level of parallelism is
  used inside each block to parallelize the fitness evaluation of a
  single solution. This kind of mapping allows the use of shared
  memory to store the data structures of the solution and does not
-require the use of very large neighborhoods or populations.\\
+require the use of very large neighborhoods or populations.
  %data parallelism in SA-matrix to parallelize
  
  \subsubsection*{GPU thread mapping for iteration-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping}
+\index{GPU!thread mapping}
  Iteration-level parallelism consists of parallelizing the tasks
  performed independently on different solutions. Different mapping
  strategies are adopted in the reviewed works to implement these
-models.\\
+models.
  
  In Figure \ref{ch8:fig:classification}, the first example of
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
  parallelism is the parallel evaluation of neighborhoods in
  s-metaheuristics. In most of the reviewed works,  a per-thread
  mapping approach is used: each solution of the neighborhood is
@@ -892,12 +875,11 @@ memory.
  %pheromone update data parallelism matrices
  
  \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping}
-Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models have been used: the multistart model and the island model (parallel EAs).\\
+Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level models have been used: the multistart model and the island model (parallel EAs).
  
  The implementation of the multistart model is based on two
  different mapping strategies~\cite{luongMultiStart, audreyANT}:
-(1) each local search (LS) is associated to a unique thread and
+(1) each Local Search (LS) is associated to a unique thread and
  (2) each solution (from multiple neighborhoods) is associated to a
  unique thread. The first mapping strategy (one thread per LS)
  presents a big drawback: the number of LS to use should be very
@@ -909,10 +891,9 @@ neighborhoods. In the second mapping strategy, the LS algorithms
  are placed on CPU and the neighborhood evaluation of each LS is
  parallelized on GPU using per-thread mapping strategy (one thread
  per solution). This consists of a hierarchical parallel model
-combining algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} parallelism (multistart) with
-iteration-level\index{Metaheuristics!iteration-level parallelism}
-parallelism (master-worker).\\
+combining algorithmic-level parallelism (multistart) with
+iteration-level
+parallelism (master-worker).
  
  In the island model, the same mapping is used in all the reviewed
  works~\cite{tsutsuiGAQAP, nowotniak, maitre2009}: each
@@ -925,7 +906,7 @@ advantage of this hierarchical implementation is that it allows
  the occupation of a large number of threads even for medium
  population sizes. The second advantage consists of using shared
  memory to store subpopulations to benefit from the low latency of
-shared memory.\\
+shared memory.
  
  \section{Frameworks for metaheuristics on GPUs}
  \label{ch8:sec:frameworks}
@@ -937,40 +918,39 @@ on GPUs. Although the works on this subject are not yet mature and
  do not cover the main metaheuristic algorithms, we will present
  the only three works to our knowledge, which propose open source
  frameworks for developing
-metaheuristics on GPUs.\\
+metaheuristics on GPUs.
  
  The three frameworks reviewed in this section are
  PUGACE\index{GPU-based frameworks!PUGACE}~\cite{pugace},
  ParadisEO-MO-GPU\index{GPU-based
  frameworks!ParadisEO-MO-GPU}~\cite{paradiseoGPU}, and
-libCudaOptimize\index{GPU-based
-frameworks!libCudaOptimize}~\cite{libcuda}. PUGACE\index{GPU-based
+libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize}~\cite{libcuda}. PUGACE\index{GPU-based
  frameworks!PUGACE} is a framework for implementing EAs on GPUs.
  ParadisEO-MO-GPU is an extension of the framework
  ParadisEO~\cite{paradiseo} for implementing s-metaheuristics on
-GPU. Finally, libCudaOptimize\index{GPU-based
-frameworks!libCudaOptimize} is a  library intended for the
+GPU. Finally, libCUDAOptimize\index{GPU-based
+frameworks!libCUDAOptimize} is a  library intended for the
  implementation of p-metaheuristics on GPU. The three frameworks
  are presented in more detail in the following.
  
-\subsection{PUGACE\index{GPU-based frameworks!PUGACE}: framework for implementing evolutionary computation on GPUs}
-PUGACE\index{GPU-based frameworks!PUGACE} is a generic  framework
+\subsection{PUGACE: framework for implementing evolutionary computation on GPUs}
+PUGACE is a generic  framework
  for easy implementation of cellular evolutionary algorithms on
  GPUs implemented using C and CUDA. It is based on the frameworks
  MALLBA and JCell (a framework for cellular algorithms). The
  authors justified the choice of cellular evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm} by the good
+algorithm by the good
  feedback found in the literature concerning its efficient
  implementation on GPUs compared to other parallel models for EAs
  (island, master-slave). The main standard evolutionary operators
-are already implemented in PUGACE\index{GPU-based
-frameworks!PUGACE}: different selection strategies, standard
-crossover, and mutation operators (\emph{PMX, swap, 2-exchange},
+are already implemented in PUGACE: different selection strategies, standard
+crossover, and mutation operators (PMX, swap, 2-exchange,
  etc.). Different problem encoding is also supported. The framework
  is organized as a set of modules in which the different
  functionalities  are separated as much as possible in order to
  facilitate the extension of the framework. All
-the functions and procedures that execute on GPU are implemented in the same file kernel.cu. \\
+the functions and procedures that execute on GPU are implemented in the same file kernel.cu. 
  
  The implementation strategy adopted on the GPU is as follows.
  Population initialization is done on the CPU side and the
@@ -978,16 +958,15 @@ population is transferred to the GPU. On the GPU side, each
  individual is associated to a unique CUDA thread. The function
  evaluation and mutation are done on the GPU while selection and
  replacement are maintained on the CPU. In order to avoid thread
-divergence\index{GPU Challenges!thread~divergence} appearing in
+divergence\index{GPU!thread divergence} appearing in
  the same CUDA thread block at the crossover step (because of the
  probability of application which may give different results from
  one thread to the other), the decision of whether to apply a
  crossover is taken at the block level and applied to all the
  individuals within the block. It is the decision on the choose
-of the cutting point for the crossover.\\
+of the cutting point for the crossover.
  
-The framework is validated on standard benchmarks of the quadratic
-assignment problem (QAP). Speedups of $15.44 \times$  to $18.41
+The framework is validated on standard benchmarks of the QAP. Speedups of $15.44 \times$  to $18.41
  \times$ are achieved compared to a CPU implementation of a cEA
  using population sizes rising from 512 to 1024 and from 1024 to
  2048.
@@ -1005,11 +984,11 @@ using population sizes rising from 512 to 1024 and from 1024 to
  
  \subsection{ParadisEO-MO-GPU}
  
-Melab \emph{et al.}~\cite{paradiseoGPU} developed a reusable
+Melab et al.~\cite{paradiseoGPU} developed a reusable
  framework ParadisEO-MO-GPU\index{GPU-based
  frameworks!ParadisEO-MO-GPU} for parallel local search
  metaheuristics (s-metaheuristics) on GPUs. It focuses on the
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
  parallel model of s-metaheuristics which consists of exploring in
  parallel the neighborhood of a problem solution on GPU. The
  framework, implemented using C++ and CUDA, is an extension of the
@@ -1023,7 +1002,7 @@ parallelization techniques and deployment on GPUs. The framework
  allows one to efficiently manage the hierarchical organization of
  the memories (latencies and sizes) of the GPU and its
  communication with the CPU as well as the minimizing of the user
-involvement in its management.\\
+involvement in its management.
  
  \begin{figure}[h]
  \centerline{\includegraphics[width=0.8\textwidth]{Chapters/chapter9/figures/paradiseoGPU.pdf}}
@@ -1046,13 +1025,13 @@ evaluation. After all the neighborhood is generated and evaluated,
  it is sent back to the CPU which selects the best solution (See
  Figure~\ref{ch1:fig:paradiseoGPU}).
  
-\subsection{libCudaOptimize: an open source library of GPU-based metaheuristics}
-\index{GPU-based frameworks!libCudaOptimize}
-LibCudaOptimize~\cite{libcuda} is a C++/Cuda open source library
+\subsection{libCUDAOptimize: an open source library of GPU-based metaheuristics}
+\index{GPU-based frameworks!libCUDAOptimize}
+LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library
  for the design and implementation of metaheuristics on GPUs. Until
-now, the metaheuristics supported by LibCudaOptimize are: scatter
+now, the metaheuristics supported by LibCUDAOptimize are: scatter
  search, differential evolution, and particle swarm
-optimization\index{Metaheuristics!particle swarm optimization}.
+optimization\index{metaheuristics!particle swarm optimization}.
  Nevertheless, the library is designed in such a way to allow
  further extensions for other metaheuristics and it is still in
  development phase by the authors. The parallelization strategy
@@ -1064,18 +1043,18 @@ on CPU.
  \label{ch8:sec:case}
  
  In this case study, a large neighborhood GPU-based local
-search\index{GPU-based Metaheuristics!GPU-based~local~search}
-method for solving the quadratic 3-dimensional assignment problem
-(Q3AP) will be presented. The local search method is an iterated
-local search\index{Metaheuristics!iterated local search}
+search\index{GPU!based local search}
+method for solving the Quadratic 3-dimensional Assignment Problem
+(Q3AP) will be presented. The local search method is an Iterated
+Local Search\index{metaheuristics!iterated local search}
  (ILS)~\cite{stutzle2006ILSforQAP} using an embedded
-TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
+TS algorithm. The ILS principle
  consists of executing iteratively the embedded local search, each
  iteration which starts from a disrupted local optima reached by
  the previous local search process. The disruption heuristic is a
  performance parameter of an ILS algorithm and should be
  judiciously defined. A template of an
-ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
+ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.
  
  \begin{algorithm}[H]
      \SetAlgoLined
@@ -1089,7 +1068,7 @@ ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
       $s^*$=AcceptationCriteria($s^*,s^{*'},history$)\;
  
      }
-\caption{Iterated local search template}
+\caption{iterated local search template}
  \label{ch8:ils_algorithm_template}
  \end{algorithm}
  
@@ -1117,17 +1096,17 @@ ILS algorithm is given by the Algorithm~\ref{ch8:ils_algorithm_template}.\\
  \subsection{The quadratic 3-dimensional assignment problem}
  
  The Q3AP is an extension of the well-known NP-hard QAP. The latter
-is one of the most studied problem by the combinatorial
+is one of the most studied problems by the combinatorial
  optimization community due to its wide range of practical
  applications (site planning, schedule problem, computer-aided
  design, etc.) and to its computational challenges since it is
  considered as one of the most computationally difficult
-optimization problems.\\
+optimization problems.
  
  The Q3AP was first introduced by William P. Pierskalla in
  1967~\cite{Pierskalla1967Q3AP} and, unlike the QAP, the Q3AP is a
  less studied problem. Indeed, the Q3AP was revisited only this
-past years and has recently been used to model some advanced
+past year and has recently been used to model some advanced
  assignment problems such as the symbol-mapping problem encountered
  in wireless communication systems and described
  in~\cite{Hahn2008q3ap}. The Q3AP optimization problem can be
@@ -1146,7 +1125,7 @@ where
  
  \begin{eqnarray}
    X=(x_{ijl})\in I \cap J \cap L, \label{Q3APConstraints1}\\
-  x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1. \label{Q3APConstraints2}
+  x_{ijl}\in \{0,1\}, \quad i,j,l=0,1,...,n-1 \label{Q3APConstraints2}
  \end{eqnarray} $I$, $J$, and $L$ sets are defined as follows:
  \begin{displaymath} I=\lbrace X=(x_{ijl}):
  \sum_{j=0}^{n-1}\sum_{l=0}^{n-1}x_{ijl}=1, \quad
@@ -1156,7 +1135,7 @@ i=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  j=0,1,...,n-1\rbrace \mathrm{;} \end{displaymath}
  \begin{displaymath} L=\lbrace X=(x_{ijl}):
  \sum_{i=0}^{n-1}\sum_{j=0}^{n-1}x_{ijl}=1, \quad
-l=0,1,...,n-1\rbrace \mathrm{.} \end{displaymath}
+l=0,1,...,n-1\rbrace  \end{displaymath}
  % fin------------------------------------------------------------------------------
  
  Other equivalent formulations of the Q3AP can be found in the literature. A particularly useful one is the
@@ -1176,7 +1155,7 @@ where $\pi_1$ and $\pi_2$ are permutations over the set
  $\left\lbrace 0,1,\ldots,n-1\right\rbrace$. According to this
  formulation, minimizing the Q3AP consists of finding a double
  permutation $(\pi_1,\pi_2)$ which minimizes the objective function
-(\ref{Q3APPermutationBasedFormulation}).\\
+(\ref{Q3APPermutationBasedFormulation}).
  
  The Q3AP is proven to be an NP-hard problem since it is an
  extension of the quadratic assignment problem and of the axial
@@ -1186,19 +1165,18 @@ number of feasible solutions of an instance of size $n$ is $n! \times n!$.
  \subsection{Iterated tabu search algorithm for the Q3AP}
  
  To tackle large-sized instances of the Q3AP and speed up the
-search process, a parallel ILS\index{Metaheuristics!iterated local
-search} algorithm has been designed. The local search embedded in
-the ILS is a TS\index{Metaheuristics!tabu~search}. A TS
+search process, a parallel ILS algorithm has been designed. The local search embedded in
+the ILS is a TS. A TS
  procedure~\cite{Glover1989TS} starts from an initial feasible
  solution and tries, at each step, to move to a neighboring
  solution that minimizes the fitness (for a minimization case). If
  no such move exists, the neighbor solution that less degrades the
-fitness is chosen as a next move. This enables, TS process to
+fitness is chosen as a next move. This enables the TS process to
  escape local optima. However, this strategy may generate cycles,
  i.e., previous moves can be selected again. To avoid cycles, the
  TS manages a short-term memory that contains the moves that have
-been recently performed. A TS\index{Metaheuristics!tabu~search}
-template is given by the Algorithm \ref{TS_pseudo_code}.\\
+been recently performed. A TS
+template is given by Algorithm \ref{TS_pseudo_code}.
  %
  %  \begin{algorithm}
  %  \label{TS_pseudo_code}
@@ -1231,7 +1209,7 @@ template is given by the Algorithm \ref{TS_pseudo_code}.\\
            $t = t + 1$\;
  
      }
-\caption{Tabu search template}
+\caption{tabu search template}
  \label{TS_pseudo_code}
  \end{algorithm}
  
@@ -1242,7 +1220,7 @@ The neighborhood function is a crucial parameter in any local
  search algorithm. Indeed, if the neighborhood function is not
  adequate to the problem and/or does not consider the targeted
  computing framework, any local search algorithm will fail to reach
-good quality solutions of the search space.\\
+good quality solutions of the search space.
  
  Regarding the Q3AP, many neighborhood structures can be
  considered. A basic neighborhood was proposed and investigated in
@@ -1284,7 +1262,7 @@ algorithms~\cite{Ahuja:2007:VLN:1528422.1528438}. However, for
  such a neighborhood, the generation/evaluation step of an LS
  becomes a time-consuming task and may dramatically increase the
  computational time of the LS process. This
-justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. \\
+justifies the use of intensive data-parallelism provided by GPUs where all neighboring solutions may be concurrently evaluated. 
  
  The considered large-sized neighborhood consists of swapping two
  positions in both permutations $\pi_1$ and $\pi_2$. This
@@ -1317,17 +1295,17 @@ methods. The main challenges that must be faced when designing a
  local search algorithm are the efficient distribution of the
  search process between the CPU and the GPU minimizing the data
  transfer between them, the hierarchical memory
-management\index{GPU Challenges!memory~management} and the
+management\index{GPU!memory management} and the
  capacity constraints of GPU memories, and the thread
  synchronization. All these issues must be regarded when designing
  parallel LS models to allow
-solving of large scale optimization problems on GPU architectures.\\
+solving of large scale optimization problems on GPU architectures.
  
  To go back to our problem (i.e., Q3AP), we propose in
  Algorithm~\ref{ch8:algoITS} an iterated tabu
-search\index{Metaheuristics!tabu~search} on GPU (GPU-ITS). The
+search on GPU (GPU-ITS). The
  parallel model is in agreement with the
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
  parallel model of LS methods presented in Section
  \ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This
  algorithm can be seen as a cooperative model between the CPU and
@@ -1346,7 +1324,7 @@ We notice that the input matrices are read-only structures and do
  not change during all the execution of the LS algorithm.
  Therefore, their associated memory is copied only once during all
  the execution. Third, comes the parallel
-iteration-level\index{Metaheuristics!iteration-level parallelism},
+iteration-level,
  in which each neighboring solution is generated, evaluated, and
  copied into the neighborhood fitnesses structure (from lines 10 to
  14). Fourth, since the order in which candidate neighbors are
@@ -1405,7 +1383,7 @@ given number of iterations has been reached.
          Update the tabu list\;
          Copy the chosen solution on GPU device memory\;
      }
-\caption{Template of an iterated tabu search on GPU for solving the Q3AP}
+\caption{template of an iterated tabu search on GPU for solving the Q3AP}
  \label{ch8:algoITS}
  \end{algorithm}
  
@@ -1415,9 +1393,9 @@ given number of iterations has been reached.
  In this section, some experimental results related to the approach
  presented in Section \ref{ch8:ITS-Q3APSection} are reported. We
  recall that the approach is a GPU-based iterated tabu
-search\index{Metaheuristics!tabu~search} (GPU-ITS) method
+search (GPU-ITS) method
  consisting in an iterated local search (ILS) embedding a tabu
-search\index{Metaheuristics!tabu~search} (TS) and where the
+search (TS) and where the
  generation/evaluation step of the TS process is executed on GPU.
  The ILS is used to improve the quality of successive local optima
  provided by TS methods. This is achieved by perturbing the local
@@ -1425,7 +1403,7 @@ optima reached by the current TS process and reconsidering it as
  initial solution of the following TS process. Regarding our
  algorithm, the applied perturbation is a random number $\mu $ of
  swaps in either the first or the second permutation where $\mu \in
-[2:n]$ ($n$ is the instance size).\\
+[2:n]$ ($n$ is the instance size).
  
  Experiments have been carried out on a node of the Chirloute
  cluster of the Lille site. This is one of the 10 sites that
@@ -1435,7 +1413,7 @@ of an Intel Xeon E5620 CPU and a NVIDIA Tesla Fermi M2050 (448
  cores) GPU type. The number of ILS iterations and the number of TS
  iterations were set respectively to 20 and 500. The tabu list size
  has been initalized to $\frac{m}{4}$, $m$ being the size of the
-neighborhood set.\\
+neighborhood set.
  
  Table \ref{ch8:ITSQ3APResults} reports the obtained results for
  the GPU-ITS using our large-sized neighborhood structure. The
@@ -1450,7 +1428,7 @@ evaluation function of the parallel GPU version have been
  measured. The number of successful tries (hits) and the average
  number of ILS iterations to converge to the optimal/best known
  value are also represented. The associated standard deviation for
-each average measurement is shown in small type characters.\\
+each average measurement is shown in small type characters.
  
  \begin{table}
  %\tiny
@@ -1458,7 +1436,7 @@ each average measurement is shown in small type characters.\\
  \small
  \begin{tabular}{|l|r|r|r|r|r|r|r|r|} \hline
  \multicolumn{1}{|c|}{Instance }& \multicolumn{1}{|c|}{Optimal}& \multicolumn{1}{|c|}{Average}  & \multicolumn{1}{|c|}{Maximal}  & \multicolumn{1}{|c|}{Hits} & \multicolumn{1}{|c|}{CPU} & \multicolumn{1}{|c|}{GPU} & \multicolumn{1}{|c|}{Speed-} & \multicolumn{1}{|c|}{ITS} \\
-& \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iteration}\\ \hline
+& \multicolumn{1}{|c|}{/BKV\footnotemark }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{value }& \multicolumn{1}{|c|}{} & \multicolumn{1}{|c|}{time }& \multicolumn{1}{|c|}{time }&  \multicolumn{1}{|c|}{up}& \multicolumn{1}{|c|}{iter.}\\ \hline
  Nug12-d & $580$   & $615.4$    & $744$   & $35$\% & $87.7$ & $2.5$ & $34.7 \times$& $16$ \\
      &   & \tiny{$41.7$}    & && \tiny{$40.9$} & \tiny{$0.9$} & & \tiny{$6.3$} \\ \hline
  Nug13-d & $1912$  & $1985.4$   & $2100$  & $20$\% & $209.2$ & $3.3$ & $63.5 \times$ & $17$ \\
@@ -1471,7 +1449,7 @@ Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \t
  & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
  \end{tabular}
  \caption{Results of the GPU-based iterated tabu search for
-different Q3AP instances.} \label{ch8:ITSQ3APResults} \center
+different Q3AP instances.} \label{ch8:ITSQ3APResults} % \center %Shashi
  \end{table}
  
  %\begin{table}
@@ -1510,18 +1488,17 @@ the robustness/quality of provided solutions.
  \label{ch8:conclusion}
  
  This chapter has presented state-of-the-art GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} and
+metaheuristics and
  a case study on implementing large neighborhood local search
  methods on GPUs for solving large benchmarks of the quadratic
-three-dimensional assignment problem (Q3AP). \\
+three-dimensional assignment problem (Q3AP). 
  
  Implementing parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
+metaheuristics on
  GPU architectures poses new issues and challenges such as memory
-management\index{GPU Challenges!memory~management},  finding
+management;  finding
  efficient mapping strategies between tasks to parallelize; and the
-GPU threads, thread divergence\index{GPU
-Challenges!thread~divergence}, and synchronization. Actually, most
+GPU threads, thread divergence, and synchronization. Actually, most
  of metaheuristics have been implemented on GPU using different
  implementation strategies. In this chapter, a two-level
  classification of the reviewed works has been proposed: design
@@ -1532,6 +1509,6 @@ This classification focuses mainly on  the mapping between the
  metaheuristic tasks to parallelize and the GPU threads. Indeed,
  the choice of a given mapping strategy strongly influences the
  other challenges (memory usage, communication, thread
-divergence\index{GPU Challenges!thread~divergence}).
+divergence).
  
  \putbib[Chapters/chapter9/biblio9]