X-Git-Url: https://bilbo.iut-bm.univ-fcomte.fr/and/gitweb/book_gpu.git/blobdiff_plain/aa5e4797e8e8585b78ee799ff26314839eac0d13..f58a76268a01dff4b9427557dba6f052787415d5:/BookGPU/Chapters/chapter9/ch9.tex?ds=sidebyside

diff --git a/BookGPU/Chapters/chapter9/ch9.tex b/BookGPU/Chapters/chapter9/ch9.tex
index caf17c4..4ad0d6f 100644
--- a/BookGPU/Chapters/chapter9/ch9.tex
+++ b/BookGPU/Chapters/chapter9/ch9.tex
@@ -10,22 +10,22 @@
 \label{chapter9}
 \section{Introduction}
 This chapter presents GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics},
+metaheuristics\index{metaheuristics!parallel metaheuristics},
 challenges, and issues related to the particularities of the GPU
 architecture and a synthesis on the different implementation
 strategies used in the literature. The implementation of parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
+metaheuristics on
 GPUs is not straightforward. The traditional models used in CPUs
 must be rethought to meet the new requirements of GPU
 architectures. This chapter is organized as follows. Combinatorial
-optimization\index{Combinatorial~optimization} and  resolution
+optimization\index{combinatorial optimization} and  resolution
 methods are introduced in Section~\ref{ch8:sec:optim}. The main
 traditional parallel models used for metaheuristics are recalled
 in Section~\ref{ch8:sec:paraMeta}.
 Section~\ref{ch8:sec:challenges} highlights the  main challenges
 related to the GPU implementation of metaheuristics. A
 state-of-the-art of GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
+metaheuristics is
 summarized in Section~\ref{ch8:sec:state}. In Section~\ref{ch8:sec:frameworks}, the main developed GPU-based
 frameworks for metaheuristics are described. Finally, a case study
 is presented in Section~\ref{ch8:sec:case} and some concluding
@@ -34,7 +34,7 @@ remarks are given in Section~\ref{ch8:conclusion}
 \section{Combinatorial optimization}
 \label{ch8:sec:optim}
 
-Combinatorial optimization\index{Combinatorial~optimization} (CO) is a branch of applied and discrete mathematics.
+Combinatorial optimization (CO) is a branch of applied and discrete mathematics.
 It consists in finding optimal configuration(s) among a finite set of possible configurations
 (or solutions) of a given combinatorial optimization problem (COP).  The set of all possible solutions noted $S$ is called solution space or search space. Each solution  in $S$ is defined by its real cost calculated by an objective function. COPs are generally defined as follows~\cite{blumMeta}:\\  %(Definition~\ref{def:cops})
 
@@ -86,9 +86,9 @@ process starts with a single solution (generally set at random)
 and iteratively improves it by exploring its neighborhood in the
 search space. The most known methods in this class are local
 search methods that include \emph{simulated
-annealing}\index{Metaheuristics!simulated~annealing}~\cite{Kirkpatrick1983SA},
+annealing}\index{metaheuristics!simulated annealing}~\cite{Kirkpatrick1983SA},
 \emph{tabu search}~\cite{Glover1989TS}, \emph{iterated local
-search\index{Metaheuristics!iterated local
+search\index{metaheuristics!iterated local
 search}}~\cite{stutzle2006ILSforQAP}, and \emph{variable
 neighborhood search}~\cite{HansenMladenovic1997VNS}.\\
 
@@ -129,8 +129,8 @@ high-performance computing.
 \end{itemize}
 
 From the granularity of parallelism point of view, three major parallel
-models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{Metaheuristics!algorithmic-level parallelism},
-\emph{iteration-level}\index{Metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
+models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algorithmic-level}\index{metaheuristics!algorithmic-level parallelism},
+\emph{iteration-level}\index{metaheuristics!iteration-level parallelism}, and \emph{solution-level} as illustrated in Figure~\ref{ch8:fig:paraMeta}. \\
 
 \begin{figure}[h!]
 \centerline{\includegraphics[width=0.6\textwidth]{Chapters/chapter9/figures/paraMeta.pdf}}
@@ -141,53 +141,50 @@ models for metaheuristics can be distinguished~\cite{talbi2009mfdti}: \emph{algo
 \begin{itemize}
 
 \item{In the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} parallel model, several self-contained metaheuristics
+algorithmic-level parallel model, several self-contained metaheuristics
 are launched in parallel. The parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} may
+metaheuristics\index{metaheuristics!parallel metaheuristics} may
 start with identical or different solutions (s-metaheuristics
 case) or populations (p-metaheuristics case). Their parameter
 settings such as the size of tabu list for tabu
-search\index{Metaheuristics!tabu~search}, transition probabilities
+search\index{metaheuristics!tabu search}, transition probabilities
 for ant colonies, mutation and crossover probabilities for
 evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s may be the
+algorithm\index{metaheuristics!evolutionary algorithm}s may be the
 same or different. The parallel processes may evolve independently
 or in a cooperative manner. In cooperative parallel models, the
 algorithms exchange information related to the search during
 evolution in order to find better and more robust solutions.}
 
-\item{In the iteration-level\index{Metaheuristics!iteration-level
-parallelism} parallel model, the focus is on the parallelization
+\item{In the iteration-level parallel model, the focus is on the parallelization
 of each iteration of the metaheuristic. Indeed, metaheuristics are
 generally iterative search processes. Moreover, the most
 resource-consuming part of a metaheuristic is the evaluation of
 the generated solutions at each iteration. For s-metaheuristics
-(e.g., tabu search\index{Metaheuristics!tabu~search}, simulated
+(e.g., tabu search\index{metaheuristics!tabu search}, simulated
 annealing, variable neighborhood search), the evaluation and
 generation of the neighborhood is the most time-consuming step of
 the algorithm particularly when it comes to dealing with large
 neighborhood sets. In this parallel model, the neighborhood is
 decomposed into partitions, and each partition is evaluated in a
 parallel way. For p-metaheuristics (e.g., evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s, ant
+algorithms, ant
 colonies, swarm optimization), the
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
 parallel model arises naturally since these metaheuristics deal
 with a population of independent solutions. In evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s, for
-instance, the iteration-level\index{Metaheuristics!iteration-level
-parallelism} model consists of decomposing the whole population
+algorithms, for
+instance, the iteration-level model consists of decomposing the whole population
 into several partitions where each partition is evaluated in
 parallel.}
 
 \item{In the
-solution-level\index{Metaheuristics!solution-level~parallelism}
+solution-level
 parallel model, the focus is on the parallelization of the
 evaluation of a single solution. This model is useful when the
 objective function and/or the constraints are time and/or memory
 consuming. Unlike the two previous parallel models, the
-solution-level\index{Metaheuristics!solution-level~parallelism}
+solution-level\index{metaheuristics!solution-level parallelism}
 parallel model is problem-dependent.}
 \end{itemize}
 
@@ -195,7 +192,7 @@ parallel model is problem-dependent.}
 \label{ch8:sec:challenges}
 
 Developing GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} is
+metaheuristics\index{metaheuristics!parallel metaheuristics} is
 not straightforward. The parallel models have to be rethought to
 meet the new requirements of the GPU architecture. Several major
 issues have to be taken into account both at design and
@@ -206,7 +203,7 @@ of tasks, and data transfer between the CPU and
 GPU~\cite{luongMultiStart}.
 
 \subsection{Data placement on a hierarchical memory}
-\index{GPU Challenges!data~placement} During the execution of
+\index{GPU!data placement} During the execution of
 metaheuristics on GPU, the different threads may access multiple
 data structures from multiple memory spaces. These memories have
 different sizes and access latencies. Nevertheless, faster
@@ -229,7 +226,7 @@ threads and the corresponding metaheuristic elements (one neighbor
 per thread, one individual per thread, single population per
 threads block, one ant per thread, etc.) must be defined to ensure
 a maximum occupancy of the GPU and
-to cover CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication} and memory access times.\\
+to cover CPU/GPU communication\index{GPU!CPU/GPU communication} and memory access times.\\
 
 According to the used metaheuristic and to the handled problem, the data
 values may have different types and different ranges of their values. The data
@@ -256,7 +253,7 @@ data structures (e.g., population of individuals for a CUDA thread
 block) on the shared memory.
 
 \subsection{Threads synchronization}
-\index{GPU Challenges!threads~synchronization} The thread
+\index{GPU!threads synchronization} The thread
 synchronization issue is caused by both the GPU architecture and
 the synchronization requirements of the implemented method.
 Indeed, GPUs are based on a multicore architecture organized into
@@ -277,7 +274,7 @@ some requirements related to data-dependent synchronizations.
 
 \subsection{Thread divergence}
 
-Thread divergence\index{GPU Challenges!thread~divergence} is
+Thread divergence\index{GPU!thread divergence} is
 another challenging issue in GPU-based
 metaheuristics~\cite{cecilia, pugace, audreyANT}. Generally,
 metaheuristics contain irregular loops and conditional
@@ -285,9 +282,9 @@ instructions when generating and evaluating the neighborhood
 (s-metaheuristics), and the population (p-metaheuristics) in the
 same block. In addition,  the decision to apply a crossover or a
 mutation on an individual in a genetic
-algorithm\index{Metaheuristics!genetic~algorithm} and the
+algorithm and the
 exploration of different paths using an ant
-colony\index{Metaheuristics!ant~colony~optimization} are random
+colony\index{metaheuristics!ant colony optimization} are random
 operations. Threads of the same warp have to execute
 instructions simultaneously  leading to different branches whereas
 in an SIMD model the threads of a same warp execute the same
@@ -302,15 +299,14 @@ divergences.
 
 The performance of GPU-based metaheuristics in terms of execution
 time could be improved by choosing the most appropriate parallel
-model (algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}, instruction-level,
-solution-level\index{Metaheuristics!solution-level~parallelism}).
+model (algorithmic-level, instruction-level,
+solution-level).
 Moreover, an efficient decomposition of the metaheuristic and an
 efficient assignment of code portions between the CPU and GPU
 should be adopted. The objective is to take benefit from the GPU
 computing power without affecting the efficiency and the behavior
 of the metaheuristic and without losing performance in CPU/GPU
-communication\index{GPU Challenges!CPU/GPU~communication} and
+communication\index{GPU!CPU/GPU communication} and
 memory accesses. In order to decide which part of the
 metaheuristic will be executed on which component, one should
 perform a careful analysis on the serial code of the
@@ -319,8 +315,7 @@ exploration of the neighborhood, evaluation of individuals), and
 then offload them to the GPU, while the remaining
 tasks still run on the CPU in a serial way. \\
 
-The CPU/GPU communication\index{GPU
-Challenges!CPU/GPU~communication} is done through the global
+The CPU/GPU communication is done through the global
 memory which is a slow memory making the memory transfer between
 the CPU and GPU time-consuming which can significantly degrade the
 performance of the application. Accesses to this memory should be
@@ -331,19 +326,19 @@ therefore, the whole execution time of the metaheuristic.
 \section{State-of-the-art parallel metaheuristics on GPUs}
 \label{ch8:sec:state}
 After more than two decades of research by the combinatorial optimisation
-community devoted to developing adequate parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} for different types of
+community devoted to developing adequate parallel metaheuristics for different types of
 parallel architectures (clusters, supercomputers and grids), the actual developement
-of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics\index{Metaheuristics!parallel~metaheuristics} on SIMD architectures.\\
+of General Perpose GPU (GPGPU) brings new challenges for parallel metaheuristics on SIMD architectures.\\
 
 The first works on metaheuristic algorithms implemented on GPUs
 started on old graphics cards before the appearance of modern GPUs
 equipped with high-level programming interfaces such as CUDA and
 OpenCL. Among these pioneering works we cite the work of Wong et al.~\cite{wongOldGPU2006} dealing with the
 implementation
-of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{Metaheuristics!ant~colony~optimization} algorithm
+of EAs on graphics processing cards and the work by Catala et al. in~\cite{catala2007} where the ACO\index{metaheuristics!ant colony optimization} algorithm
 is implemented on old GPU architectures. Yu et al.~\cite{yu2005} and
 Li et al.~\cite{li2007} proposed a full parallelization of genetic
-algorithm\index{Metaheuristics!genetic~algorithm}s  on old GPU architectures using
+algorithms  on old GPU architectures using
 shader libraries based on Direct3D and OpenGL.\\
 
 Such architectures are based on preconfigured pipelined stages
@@ -398,12 +393,12 @@ efficiently map the different neighborhoods on the device memory,
 more explicitly, how to calculate the memory index of each
 solution associated to each CUDA thread's \textit{id}.
 %For 1-Hamming neighborhoods, as there is exactly n solutions in the neighborhood, the mapping of this neighborhood to CUDA threads is obvious: the CPU host offloads to GPU exactly $n$ threads, and each thread id is associated to one index in the binary vector. In the case of 2-Hamming and 3-Hamming neighborhoods, each thread id should be mapped respectively to two and three indexes  in the candidate vector.
-The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{Metaheuristics!tabu~search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
+The three neighborhoods are  implemented and experimented on the Permuted Perceptron Problem (PPP) using a tabu search\index{metaheuristics!tabu search} algorithm (TS). Accelerations from $9.9 \times$ to $18.5 \times$ are obtained on different problem sizes.\\ % The experiments are performed on an Intel Xeon 8 cores 3GHz coupled with an NVIDIA GTX 280 card.\\
 
 In the same context, Deevacq et al.~\cite{audreyANT}
 proposed two parallelization strategies inspired by the multi-walk
 parallelization strategy, of a 3-opt iterated local
-search\index{Metaheuristics!iterated local search} algorithm (ILS)
+search algorithm (ILS)
 over a CPU/GPU architecture. In the first strategy, each Local
 Search (LS) is associated to a unique CUDA thread and improves a
 unique solution by generating its neighborhood. The second
@@ -420,45 +415,40 @@ the Traveling Salesman Problem (TSP) with sizes varying from $100$ to $3038$ cit
 The same strategy is used by Luong et al.
 in~\cite{luongMultiStart} to implement multistart parallel local
 search algorithms (a special case of the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} parallel model where several homogeneous LS
+algorithmic-level parallel model where several homogeneous LS
 algorithms are used). The multistart model is combined with
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
 parallelism: several LS algorithms are managed by the CPU and the
 neighborhood evaluation step of each algorithm is parallelized on
 the GPU (each GPU thread is associated with one neighbor and
 executes the same evaluation function kernel). The advantage of
 such a model is that it allows a  high occupancy of the GPU
-threads. Nevertheless, memory management\index{GPU
-Challenges!memory~management} causes new issues due to the
+threads. Nevertheless, memory management causes new issues due to the
 quantity of data to store and to communicate between CPU     and
 GPU. A second proposition for implementing the same model on GPU
 consists of implementing the whole LS processes on GPU with each
 GPU thread being associated to a unique LS algorithm. This solves
 the communication issue encountered in the first model. In
-addition, a memory management\index{GPU
-Challenges!memory~management} strategy is proposed to improve the
+addition, a memory management strategy is proposed to improve the
 efficiency of the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} model: texture memory is used to avoid memory latency
+algorithmic-level model: texture memory is used to avoid memory latency
 due to uncoalesced memory accesses. The proposed approaches are
 implemented on the quadratic assignment problem (QAP) using CUDA.
 The acceleration rates obtained for the
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} with usage of texture memory rise from $7.8\times$ to
+algorithmic-level with usage of texture memory rise from $7.8\times$ to
 $12\times$ (for different
 QAP benchmark sizes). \\
 
 Janiak et al.~\cite{Janiak_et_al_2008} implemented two
 algorithms for TSP and the flow-shop scheduling problem (FSP).
 These algorithms are based on a multistart tabu
-search\index{Metaheuristics!tabu~search} model. Both of the 
+search model. Both of the 
 algorithms exploit multicore CPU and GPU. A full parallelization
 on GPU is adopted using shader libraries where each thread is
-mapped with one tabu search\index{Metaheuristics!tabu~search}.
+mapped with one tabu search.
 However, even though their experiments report that the use of GPU
 speedups the serial execution almost $16 \times$, the mapping of
-one thread with one tabu search\index{Metaheuristics!tabu~search}
+one thread with one tabu search
 requires a large number of local search algorithms to
 cover the memory access latency. The same mapping policy is adopted by Zhu et al. in~\cite{zhu_et_al_2008} (one thread is associated to one local search) solving the quadratic assignment problem but using the CUDA toolkit instead of shader libraries.\\
 
@@ -468,15 +458,13 @@ architectures (multicore CPU  coupled to one GPU).  The challenge
 of using such a heterogeneous architecture is how to distribute
 tasks between the CPU cores and the GPU in such a way to have
 optimal performances. Among the three traditional parallel models
-(solution-level\index{Metaheuristics!solution-level~parallelism},
-iteration-level\index{Metaheuristics!iteration-level parallelism},
-and algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}), the authors point out that the most convenient
+(solution-level,
+iteration-level
+and algorithmic-level), the authors point out that the most convenient
 model for the considered heterogeneous architecture is a hybrid
 model combining
-iteration-level\index{Metaheuristics!iteration-level parallelism}
-and algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} models. Several CPU threads execute several instances
+iteration-level
+and algorithmic-level models. Several CPU threads execute several instances
 of the same S-metaheuristic in parallel while the GPU device is
 associated to one CPU core and  used to accelerate the
 neighborhood calculation of several S-metaheuristics at the same
@@ -488,7 +476,7 @@ All the previously noted works  exploit the same parallel models
 used on CPUs based on the task parallelism. A different
 implementation approach is proposed by Paul in~\cite{gerald2012}
 to implement a simulated
-annealing\index{Metaheuristics!simulated~annealing} (SA) algorithm
+annealing (SA) algorithm
 for the QAP on GPUs. Indeed, the author used a preinitialized
 matrix \emph{delta} in which the incremental evaluation of simple
 swap moves are calculated and stored relative to the initial
@@ -513,7 +501,7 @@ QAPLIB~\cite{burkard1991qaplib}. Speedups up to $10 \times$ are
 achieved by the GPU implementation compared
 to the same sequential implementation on CPU using SA-matrix.\\
 
-\subsection{Implementing population-based metaheuristics on GPUs}
+\subsection[Implementing population-based metaheuristics\hfill\break on GPUs]{Implementing population-based metaheuristics on GPUs}
 
 State-of-the-art works dealing with the implementation of
 p-metaheuristics on GPUs generally rely on parallel models and
@@ -522,10 +510,10 @@ p-metaheuristics over different types of parallel architectures:
 supercomputers, clusters, and computational grids. Three main
 classes of p-metaheuristics are considered in this section:
 evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}s (EAs), ant
-colony\index{Metaheuristics!ant~colony~optimization} optimization
-(ACO\index{Metaheuristics!ant~colony~optimization}), and particle
-swarm optimization\index{Metaheuristics!particle swarm
+algorithms (EAs), ant
+colony optimization
+(ACO), and particle
+swarm optimization\index{metaheuristics!particle swarm
 optimization} (PSO).
 
 \subsubsection*{Evolutionary algorithms}
@@ -543,7 +531,7 @@ optimization problems. The main chalenges to be raised when implementing the tra
 In~\cite{kannan}, Kannan and Ganji present a CUDA implementation
 of the drug discovery application Autodock (molecular docking
 application). Autodock uses a genetic
-algorithm\index{Metaheuristics!genetic~algorithm} to find optimal
+algorithm to find optimal
 docking positions of a ligand to a protein. The most
 time-consuming task in Autodock is the fitness function
 evaluation. The fitness function used for a docking problem
@@ -570,9 +558,9 @@ C-like metalanguage for easy development of EAs. The user writes a
 description  of the problem-specific components (fitness function,
 problem representation, etc) in EASEA. The code  is then compiled
 to obtain a ready-to-use evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm}. The EASEA
+algorithm. The EASEA
 compiler  uses genetic
-algorithm\index{Metaheuristics!genetic~algorithm} LIB and EO
+algorithm LIB and EO
 Libraries to produce C++ or JAVA written EA codes.
 In~\cite{maitre2009}, the authors  proposed an extension of EASEA
 to produce CUDA code from the EASEA files. This extension has been
@@ -583,7 +571,7 @@ during the experiments: a benchmark mathematical function and a
 real problem (molecular structure prediction). In order to
 maximize the GPU occupation, very large populations are used (from
 $2000$ to $20000$). Even though transferring such large
-populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management\index{GPU Challenges!memory~management} of the populations on GPU.\\
+populations from the CPU to the GPU device memory at every generation is very costly, the authors report important speedups on the two problems on a GTX260 card:  $105 \times$ is reported for the benchmark function while for the real problem the reported speedup is $60 \times$. This may be best explained by the complexity of the fitness functions. Nevertheless, there is no indication in the paper about the memory management of the populations on GPU.\\
 
 The master-slave  model is efficient when the fitness function is
 highly time intensive. Nevertheless, it requires the use of
@@ -596,11 +584,11 @@ model).\\
 
 The coarse-grained model is  used by Pospichal et al.
 in~\cite{pospichal10} to implement a parallel genetic
-algorithm\index{Metaheuristics!genetic~algorithm} on GPU. In this
+algorithm on GPU. In this
 work the entire genetic
-algorithm\index{Metaheuristics!genetic~algorithm} is implemented
+algorithm is implemented
 on GPU. This choice is motivated by the overhead engendered by the
-CPU/GPU communication\index{GPU Challenges!CPU/GPU~communication}
+CPU/GPU communication
 when only population evaluation is performed on GPU. Each
 population island is mapped with a CUDA thread block and each
 thread is responsible for a unique individual. Subpopulations are
@@ -612,7 +600,7 @@ asynchronously through the global memory. That is, after a given number of gener
 
 The same strategy is also adopted by Tsutsui and Fujimoto
 in~\cite{tsutsuiGAQAP}  to implement a coarse-grained genetic
-algorithm\index{Metaheuristics!genetic~algorithm} on GPU to solve
+algorithm on GPU to solve
 the QAP. Initially, several subpopulations are created on CPU and
 transferred to the global memory. The subpopulations are organized
 in the global memory into blocks of $8$ individuals in such a way
@@ -625,11 +613,11 @@ through the global memory. Experiments are performed on standard
 QAP benchmarks from the QAPLIB~\cite{burkard1991qaplib}. The GPU
 implementation reached speedups of $2.9\times$ to $12.6 \times$
 compared to a single core implementation of a coarse-grained
-genetic algorithm\index{Metaheuristics!genetic~algorithm} on a
+genetic algorithm on a
 Intel i7 processor.\\
 
 Nowotniak and Kucharski~\cite{nowotniak} proposed a GPU-based
-implementation of a Quantum Inspired Genetic Algorithm\index{Metaheuristics!genetic~algorithm} (QIGA). The 
+implementation of a Quantum Inspired Genetic Algorithm (QIGA). The 
 parallel model used is a hierarchical model based on two levels: each
 thread in a block transforms a unique individual and a different
 population is assigned to each block. The algorithm is run
@@ -653,7 +641,7 @@ to the SIMD architecture of the GPU. \\
 
 Pinel et al. in~\cite{pinel2012JPDC} developed a highly
 parallel synchronous cellular genetic
-algorithm\index{Metaheuristics!genetic~algorithm} (CGA), called
+algorithm (CGA), called
 GraphCell, to solve the independent task scheduling problem on GPU
 architectures. In CGAs, the population is arranged into a
 two-dimensional toroidal grid where only neighboring solutions are
@@ -695,7 +683,7 @@ version for all the tested  benchmarks when the size of the
 population is set to $32^2$. When the size of the population is
 too small, there are not enough computations to cover the overhead
 created by the call of kernel functions, CPU/GPU
-communication\index{GPU Challenges!CPU/GPU~communication}s,
+communications,
 synchronization, and access to global memory. Finally, an
 interesting review on GPU parallel computation in bio-inspired
 algorithms is proposed by Arenas et al.
@@ -704,17 +692,17 @@ in~\cite{arenas2011}. \\
 \subsubsection*{Ant colony optimization}
 
 Ant colony optimization
-(ACO\index{Metaheuristics!ant~colony~optimization}) is another
+(ACO) is another
 p-metaheuristic subject to parallelization on GPUs.
 State-of-the-art works on parallelizing
-ACO\index{Metaheuristics!ant~colony~optimization} focus on
+ACO focus on
 accelerating the tour construction step performed by each ant by
 taking a task-based parallelism approach, with pheromone
 deposition on the CPU.\\
 
 In~\cite{cecilia}, Cecilia et al.  present a GPU-based
 implementation of
-ACO\index{Metaheuristics!ant~colony~optimization} for TSP where
+ACO for TSP where
 the two steps (tour construction and pheromone update) are
 parallelized on the GPU. A data parallelism approach is used to
 enhance the performance of the tour  construction step. The
@@ -736,28 +724,27 @@ construction and  $20 \times$ for pheromone updates.\\
 
 In another work, Tsutsui and Fujimoto~\cite{tsutsui} propose  a
 hybrid algorithm combining
-ACO\index{Metaheuristics!ant~colony~optimization} metaheuristic
-and Tabu Search (TS)\index{Metaheuristics!tabu~search} implemented
+ACO metaheuristic
+and Tabu Search (TS) implemented
 on GPU to solve the QAP. A solution of QAP is represented as a
 permutation of ${1,2,..,n}$ with $n$ being the size of the
 problem. The TS algorithm is based on the 2-opt neighborhood
 (swapping of two elements $(i,j)$ in the permutation). The authors
 point out that the move cost of each neighbor depends on the
 couple $(i,j)$. Two groups of moves are formed according to the
-move cost. In order to avoid thread divergence\index{GPU
-Challenges!thread~divergence} within the same warp, the
+move cost. In order to avoid thread divergence\index{GPU!thread divergence} within the same warp, the
 neighborhood evaluation is parallelized in such a way to assign
 only moves of the same cost to each thread warp. This strategy is
 called MATA for Move-cost Adjusted Thread Assignment. Concerning
-the memory management\index{GPU Challenges!memory~management}, all
-the data of the ACO\index{Metaheuristics! ant~colony~optimization}
+the memory management\index{GPU!memory management}, all
+the data of the ACO\index{metaheuristics!ant colony optimization}
 (population, pheromone matrix), QAP matrices, and tabu list are
 placed on the global memory of the GPU. Shared memory is used only
 for working data common to all threads in a given block.
  All the
 steps of the hybrid algorithm
-ACO\index{Metaheuristics!ant~colony~optimization}-TS
-(ACO\index{Metaheuristics!ant~colony~optimization} initialization,
+ACO-TS
+(ACO initialization,
 pheromone update, construct solutions, applying TS) are
 implemented as kernel functions on the GPU. The GPU/CPU
 communications are only used to transfer the best-so-far solution
@@ -822,11 +809,10 @@ based on 2 levels: design level and implementation level as
 illustrated in Figure~\ref{ch8:fig:classification}. The design
 level regroups the three classes of parallel models used in
 metaheuristics
-(solution-level\index{Metaheuristics!solution-level~parallelism},
-iteration-level\index{Metaheuristics!iteration-level parallelism},
-algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism}) with examples for s-metaheuristics, EAs,
-ACO\index{Metaheuristics!ant~colony~optimization} and PCO. This
+(solution-level,
+iteration-level,
+algorithmic-level) with examples for s-metaheuristics, EAs,
+ACO and PCO. This
 classification  is principally built from the reviewed
 state-of-the-art works  in the previous section. The
 implementation level refers to the way these parallel design
@@ -845,7 +831,7 @@ are explained in the following sections.\\
 
 
 \subsubsection*{GPU thread mapping for solution-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping} Parallel
+\index{GPU!thread mapping} Parallel
 models at solution level consist of parallelizing a time intensive
 atomic task of the algorithm. Generally, it consists of the
 fitness evaluation~\cite{kannan}. Nevertheless, crossover
@@ -861,14 +847,14 @@ require the use of very large neighborhoods or populations.\\
 %data parallelism in SA-matrix to parallelize
 
 \subsubsection*{GPU thread mapping for iteration-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping}
+\index{GPU!thread mapping}
 Iteration-level parallelism consists of parallelizing the tasks
 performed independently on different solutions. Different mapping
 strategies are adopted in the reviewed works to implement these
 models.\\
 
 In Figure \ref{ch8:fig:classification}, the first example of
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
 parallelism is the parallel evaluation of neighborhoods in
 s-metaheuristics. In most of the reviewed works,  a per-thread
 mapping approach is used: each solution of the neighborhood is
@@ -889,8 +875,7 @@ memory.
 %pheromone update data parallelism matrices
 
 \subsubsection*{GPU thread mapping  for algorithmic-level parallelism}
-\index{GPU-based Metaheuristics!GPU-thread mapping}
-Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level\index{Metaheuristics!algorithmic-level parallelism} models have been used: the multistart model and the island model (parallel EAs).\\
+Algorithmic-level parallelism consists of launching several self-contained algorithms in parallel. In the previously reviewed works two algorithmic-level models have been used: the multistart model and the island model (parallel EAs).\\
 
 The implementation of the multistart model is based on two
 different mapping strategies~\cite{luongMultiStart, audreyANT}:
@@ -906,9 +891,8 @@ neighborhoods. In the second mapping strategy, the LS algorithms
 are placed on CPU and the neighborhood evaluation of each LS is
 parallelized on GPU using per-thread mapping strategy (one thread
 per solution). This consists of a hierarchical parallel model
-combining algorithmic-level\index{Metaheuristics!algorithmic-level
-parallelism} parallelism (multistart) with
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+combining algorithmic-level parallelism (multistart) with
+iteration-level
 parallelism (master-worker).\\
 
 In the island model, the same mapping is used in all the reviewed
@@ -950,18 +934,17 @@ frameworks!libCUDAOptimize} is a  library intended for the
 implementation of p-metaheuristics on GPU. The three frameworks
 are presented in more detail in the following.
 
-\subsection{PUGACE\index{GPU-based frameworks!PUGACE}: framework for implementing evolutionary computation on GPUs}
-PUGACE\index{GPU-based frameworks!PUGACE} is a generic  framework
+\subsection{PUGACE: framework for implementing evolutionary computation on GPUs}
+PUGACE is a generic  framework
 for easy implementation of cellular evolutionary algorithms on
 GPUs implemented using C and CUDA. It is based on the frameworks
 MALLBA and JCell (a framework for cellular algorithms). The
 authors justified the choice of cellular evolutionary
-algorithm\index{Metaheuristics!evolutionary~algorithm} by the good
+algorithm by the good
 feedback found in the literature concerning its efficient
 implementation on GPUs compared to other parallel models for EAs
 (island, master-slave). The main standard evolutionary operators
-are already implemented in PUGACE\index{GPU-based
-frameworks!PUGACE}: different selection strategies, standard
+are already implemented in PUGACE: different selection strategies, standard
 crossover, and mutation operators (PMX, swap, 2-exchange,
 etc.). Different problem encoding is also supported. The framework
 is organized as a set of modules in which the different
@@ -975,7 +958,7 @@ population is transferred to the GPU. On the GPU side, each
 individual is associated to a unique CUDA thread. The function
 evaluation and mutation are done on the GPU while selection and
 replacement are maintained on the CPU. In order to avoid thread
-divergence\index{GPU Challenges!thread~divergence} appearing in
+divergence\index{GPU!thread divergence} appearing in
 the same CUDA thread block at the crossover step (because of the
 probability of application which may give different results from
 one thread to the other), the decision of whether to apply a
@@ -1005,7 +988,7 @@ Melab et al.~\cite{paradiseoGPU} developed a reusable
 framework ParadisEO-MO-GPU\index{GPU-based
 frameworks!ParadisEO-MO-GPU} for parallel local search
 metaheuristics (s-metaheuristics) on GPUs. It focuses on the
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
 parallel model of s-metaheuristics which consists of exploring in
 parallel the neighborhood of a problem solution on GPU. The
 framework, implemented using C++ and CUDA, is an extension of the
@@ -1048,7 +1031,7 @@ LibCUDAOptimize~\cite{libcuda} is a C++/CUDA open source library
 for the design and implementation of metaheuristics on GPUs. Until
 now, the metaheuristics supported by LibCUDAOptimize are: scatter
 search, differential evolution, and particle swarm
-optimization\index{Metaheuristics!particle swarm optimization}.
+optimization\index{metaheuristics!particle swarm optimization}.
 Nevertheless, the library is designed in such a way to allow
 further extensions for other metaheuristics and it is still in
 development phase by the authors. The parallelization strategy
@@ -1060,12 +1043,12 @@ on CPU.
 \label{ch8:sec:case}
 
 In this case study, a large neighborhood GPU-based local
-search\index{GPU-based Metaheuristics!GPU-based~local~search}
+search\index{GPU!based local search}
 method for solving the Quadratic 3-dimensional Assignment Problem
 (Q3AP) will be presented. The local search method is an Iterated
-Local Search\index{Metaheuristics!iterated local search}
+Local Search\index{metaheuristics!iterated local search}
 (ILS)~\cite{stutzle2006ILSforQAP} using an embedded
-TS\index{Metaheuristics!tabu~search} algorithm. The ILS principle
+TS algorithm. The ILS principle
 consists of executing iteratively the embedded local search, each
 iteration which starts from a disrupted local optima reached by
 the previous local search process. The disruption heuristic is a
@@ -1182,9 +1165,8 @@ number of feasible solutions of an instance of size $n$ is $n! \times n!$.
 \subsection{Iterated tabu search algorithm for the Q3AP}
 
 To tackle large-sized instances of the Q3AP and speed up the
-search process, a parallel ILS\index{Metaheuristics!iterated local
-search} algorithm has been designed. The local search embedded in
-the ILS is a TS\index{Metaheuristics!tabu~search}. A TS
+search process, a parallel ILS algorithm has been designed. The local search embedded in
+the ILS is a TS. A TS
 procedure~\cite{Glover1989TS} starts from an initial feasible
 solution and tries, at each step, to move to a neighboring
 solution that minimizes the fitness (for a minimization case). If
@@ -1193,7 +1175,7 @@ fitness is chosen as a next move. This enables the TS process to
 escape local optima. However, this strategy may generate cycles,
 i.e., previous moves can be selected again. To avoid cycles, the
 TS manages a short-term memory that contains the moves that have
-been recently performed. A TS\index{Metaheuristics!tabu~search}
+been recently performed. A TS
 template is given by Algorithm \ref{TS_pseudo_code}.\\
 %
 %  \begin{algorithm}
@@ -1313,7 +1295,7 @@ methods. The main challenges that must be faced when designing a
 local search algorithm are the efficient distribution of the
 search process between the CPU and the GPU minimizing the data
 transfer between them, the hierarchical memory
-management\index{GPU Challenges!memory~management} and the
+management\index{GPU!memory management} and the
 capacity constraints of GPU memories, and the thread
 synchronization. All these issues must be regarded when designing
 parallel LS models to allow
@@ -1321,9 +1303,9 @@ solving of large scale optimization problems on GPU architectures.\\
 
 To go back to our problem (i.e., Q3AP), we propose in
 Algorithm~\ref{ch8:algoITS} an iterated tabu
-search\index{Metaheuristics!tabu~search} on GPU (GPU-ITS). The
+search on GPU (GPU-ITS). The
 parallel model is in agreement with the
-iteration-level\index{Metaheuristics!iteration-level parallelism}
+iteration-level
 parallel model of LS methods presented in Section
 \ref{ch8:sec:paraMeta} (Fig. \ref{ch8:fig:paraMeta}). This
 algorithm can be seen as a cooperative model between the CPU and
@@ -1342,7 +1324,7 @@ We notice that the input matrices are read-only structures and do
 not change during all the execution of the LS algorithm.
 Therefore, their associated memory is copied only once during all
 the execution. Third, comes the parallel
-iteration-level\index{Metaheuristics!iteration-level parallelism},
+iteration-level,
 in which each neighboring solution is generated, evaluated, and
 copied into the neighborhood fitnesses structure (from lines 10 to
 14). Fourth, since the order in which candidate neighbors are
@@ -1411,9 +1393,9 @@ given number of iterations has been reached.
 In this section, some experimental results related to the approach
 presented in Section \ref{ch8:ITS-Q3APSection} are reported. We
 recall that the approach is a GPU-based iterated tabu
-search\index{Metaheuristics!tabu~search} (GPU-ITS) method
+search (GPU-ITS) method
 consisting in an iterated local search (ILS) embedding a tabu
-search\index{Metaheuristics!tabu~search} (TS) and where the
+search (TS) and where the
 generation/evaluation step of the TS process is executed on GPU.
 The ILS is used to improve the quality of successive local optima
 provided by TS methods. This is achieved by perturbing the local
@@ -1467,7 +1449,7 @@ Nug22-d & $42476$ & $43282.1$ & $44140$ & $25$\% & $4506.5$ & $32.7$ & $137.8 \t
 & & \tiny{$529.6$} & & & \tiny{$341.1$} & \tiny{$6.6$} & & \tiny{$4.0$} \\ \hline
 \end{tabular}
 \caption{Results of the GPU-based iterated tabu search for
-different Q3AP instances.} \label{ch8:ITSQ3APResults} \center
+different Q3AP instances.} \label{ch8:ITSQ3APResults} % \center %Shashi
 \end{table}
 
 %\begin{table}
@@ -1506,18 +1488,17 @@ the robustness/quality of provided solutions.
 \label{ch8:conclusion}
 
 This chapter has presented state-of-the-art GPU-based parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} and
+metaheuristics and
 a case study on implementing large neighborhood local search
 methods on GPUs for solving large benchmarks of the quadratic
 three-dimensional assignment problem (Q3AP). \\
 
 Implementing parallel
-metaheuristics\index{Metaheuristics!parallel~metaheuristics} on
+metaheuristics on
 GPU architectures poses new issues and challenges such as memory
-management\index{GPU Challenges!memory~management};  finding
+management;  finding
 efficient mapping strategies between tasks to parallelize; and the
-GPU threads, thread divergence\index{GPU
-Challenges!thread~divergence}, and synchronization. Actually, most
+GPU threads, thread divergence, and synchronization. Actually, most
 of metaheuristics have been implemented on GPU using different
 implementation strategies. In this chapter, a two-level
 classification of the reviewed works has been proposed: design
@@ -1528,6 +1509,6 @@ This classification focuses mainly on  the mapping between the
 metaheuristic tasks to parallelize and the GPU threads. Indeed,
 the choice of a given mapping strategy strongly influences the
 other challenges (memory usage, communication, thread
-divergence\index{GPU Challenges!thread~divergence}).
+divergence).
 
 \putbib[Chapters/chapter9/biblio9]